Anonymous1223334444 committed
Commit c2e3cf5 · 1 Parent(s): 2d00ebd

Initial commit of multimodal multilingual PDF embedding pipeline

Files changed (7)
  1. .gitignore +28 -0
  2. LICENSE +21 -0
  3. README.md +183 -5
  4. requirements.txt +8 -0
  5. run_pipeline.py +222 -0
  6. src/embedding_utils.py +205 -0
  7. src/pdf_processor.py +261 -0
.gitignore ADDED
@@ -0,0 +1,28 @@
# Python
__pycache__/
*.pyc
*.pyo
*.pyd
.Python
env/
venv/
*.egg
*.egg-info/
.env

# Jupyter Notebook
.ipynb_checkpoints

# IDEs
.idea/
.vscode/

# Output files
output/
embeddings_statistiques_multimodal.json
extracted_graphs/
extracted_tables/

# API Keys (IMPORTANT!)
*.env
*.key
LICENSE CHANGED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 Andre Sarr

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md CHANGED
@@ -1,5 +1,183 @@
- ---
- license: other
- license_name: aslic
- license_link: LICENSE
- ---
# Multimodal & Multilingual PDF Embedding Pipeline

This repository hosts a Python pipeline that extracts text, tables, and images from PDF documents, generates multimodal descriptions for visual content (tables and images), and then creates multilingual text embeddings for all extracted information. The generated embeddings are stored in a JSON file, ready for use in Retrieval-Augmented Generation (RAG) systems or other downstream applications.

**Key Features:**
- **Multimodal:** Processes text, tables, and images from PDFs.
- **Multilingual:** Leverages Google's `text-multilingual-embedding-002` model for embeddings, supporting a wide range of languages.
- **Contextual Descriptions:** Uses Google Gemini (Gemini 1.5 Flash) to generate descriptive text for tables and images in French.
- **Structured Output:** Stores embeddings and metadata (PDF source, page number, content type, links to extracted assets) in a comprehensive JSON format.

## How it Works

1. **PDF Parsing:** Utilizes `PyMuPDF` to extract text blocks and images, and `Camelot` to accurately extract tabular data.
2. **Content Separation:** Distinguishes between plain text, tables, and non-table images.
3. **Multimodal Description (for Tables & Images):**
   - For tables, the pipeline captures an image of the table and also uses its text representation.
   - For standalone images (e.g., graphs, charts), it captures the image.
   - These images are then sent to the `gemini-1.5-flash-latest` model (via `google.generativeai`) with specific prompts to generate rich, descriptive text in French.
4. **Multilingual Text Embedding:**
   - The cleaned text content (original text chunks, or generated descriptions for tables/images) is then passed to the `text-multilingual-embedding-002` model (via Vertex AI).
   - This model generates a 768-dimensional embedding vector for each piece of content.
5. **JSON Output:** All generated embeddings, along with rich metadata (original PDF, page, content type, links to extracted assets), are compiled into a single JSON file.

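The token-based chunking in step 4 can be sketched as follows. The pipeline itself uses `tiktoken`'s `cl100k_base` encoding with a 500-token limit and prefers paragraph/sentence boundaries; in this simplified sketch, whitespace-separated words stand in for tokens so it runs without dependencies:

```python
def chunk_by_tokens(text, max_tokens):
    """Split text into chunks of at most max_tokens tokens.

    Whitespace words stand in for tiktoken tokens here; the repository's
    token_chunking() additionally prefers paragraph/sentence boundaries.
    """
    words = text.split()
    chunks = []
    for start in range(0, len(words), max_tokens):
        chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks

sample = "one two three four five six seven"
print(chunk_by_tokens(sample, 3))
# → ['one two three', 'four five six', 'seven']
```

Each resulting chunk is embedded separately, so `max_tokens` bounds the input size per embedding call.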
## Requirements & Setup

This pipeline relies on **Google Cloud Platform** services and specific Python libraries. You will need:

1. **A Google Cloud Project:**
   - Enable the **Vertex AI API**.
   - Enable the **Generative Language API** (for Gemini 1.5 Flash descriptions).
2. **Authentication:**
   - **Google Cloud Authentication:** The easiest way to run this in a Colab environment is `google.colab.auth.authenticate_user()`. For local execution, ensure your Google Cloud SDK is configured and authenticated (`gcloud auth application-default login`).
   - **Gemini API Key:** An API key for the Google AI Gemini models. You can get one from [Google AI Studio](https://aistudio.google.com/app/apikey). Set it as an environment variable rather than hard-coding it, for security.

### Local Setup

1. **Clone the repository:**
   ```bash
   git clone https://huggingface.co/Anonymous1223334444/pdf-multimodal-multilingual-embedding-pipeline
   cd pdf-multimodal-multilingual-embedding-pipeline
   ```
2. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```
   **System-level dependencies for Camelot/PyMuPDF (Linux/Colab):**
   You might need to install these system packages for `PyMuPDF` and `Camelot` to function correctly.
   ```bash
   # Update package list
   sudo apt-get update
   # Install Ghostscript (required by Camelot)
   sudo apt-get install -y ghostscript
   # Install python3-tk (required by some PyMuPDF functionalities)
   sudo apt-get install -y python3-tk
   # Install OpenCV (via apt, for camelot-py[cv])
   sudo apt-get install -y libopencv-dev python3-opencv
   ```
   *Note: If you are running on Windows or macOS, the installation steps for `camelot-py` might differ. Refer to the [Camelot documentation](https://camelot-py.readthedocs.io/en/master/user/install-deps.html) for more details.*

3. **Set up environment variables:**
   ```bash
   export GOOGLE_CLOUD_PROJECT="your-gcp-project-id"
   export VERTEX_AI_LOCATION="us-central1" # Or your preferred Vertex AI region (e.g., us-east4)
   export GENAI_API_KEY="your-gemini-api-key"
   ```
   Replace `your-gcp-project-id`, `us-central1`, and `your-gemini-api-key` with your actual values. Note that `GENAI_API_KEY` is the variable name `run_pipeline.py` reads.

4. **Place your PDF files:**
   Create a `docs` directory in the root of the repository and place your PDF documents inside it.
   ```
   pdf-multimodal-multilingual-embedding-pipeline/
   ├── docs/
   │   ├── your_document.pdf
   │   └── another_document.pdf
   ```

5. **Run the pipeline:**
   ```bash
   python run_pipeline.py
   ```
   The generated embedding file (`embeddings_statistiques_multimodal.json`) and extracted assets will be saved in the `output/` directory.

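Once a run completes, you may want to sanity-check the output before wiring it into a RAG system. A minimal sketch — the `summarize_embeddings` helper is illustrative, not part of the repository, and in practice `sample` would be `json.load(open("output/embeddings_statistiques_multimodal.json"))`:

```python
from collections import Counter

def summarize_embeddings(entries):
    """Count chunks per content_type and collect the embedding dimensions seen.

    entries: the list loaded from the pipeline's output JSON file.
    """
    dims = {len(e["embedding"]) for e in entries}
    return Counter(e["content_type"] for e in entries), dims

# Inline sample mirroring the pipeline's output shape
sample = [
    {"content_type": "text", "embedding": [0.1] * 768},
    {"content_type": "table", "embedding": [0.2] * 768},
    {"content_type": "text", "embedding": [0.3] * 768},
]
counts, dims = summarize_embeddings(sample)
print(counts)  # Counter({'text': 2, 'table': 1})
print(dims)    # {768}
```

Every entry should report 768 dimensions; any other value indicates a truncated or malformed record.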
### Google Colab Usage

A Colab notebook version of this pipeline is ideal for quick experimentation thanks to its pre-configured environment.

1. **Open a new Google Colab notebook.**
2. **Install system dependencies:**
   ```python
   !pip uninstall -y camelot camelot-py # Ensure a clean install
   !pip install PyMuPDF
   !apt-get update
   !apt-get install -y ghostscript python3-tk libopencv-dev python3-opencv
   !pip install camelot-py[cv] google-cloud-aiplatform google-generativeai tiktoken pandas beautifulsoup4 Pillow
   ```
3. **Authenticate:**
   ```python
   from google.colab import auth
   auth.authenticate_user()
   ```
4. **Set your API key and project/location:**
   ```python
   import os
   # Replace with your actual Gemini API key
   os.environ["GENAI_API_KEY"] = "YOUR_GEMINI_API_KEY_HERE"
   # Replace with your actual Google Cloud Project ID
   os.environ["GOOGLE_CLOUD_PROJECT"] = "YOUR_GCP_PROJECT_ID_HERE"
   # Set your preferred Vertex AI location (e.g., "us-central1", "us-east4")
   os.environ["VERTEX_AI_LOCATION"] = "us-central1"
   ```
5. **Upload your PDF files:**
   You can use the Colab file upload feature or mount Google Drive. Ensure your PDFs are in a directory named `docs` within `/content/`.
   ```python
   # Example for uploading
   from google.colab import files
   from pathlib import Path
   import os

   PDF_DIRECTORY = Path("/content/docs")
   PDF_DIRECTORY.mkdir(parents=True, exist_ok=True)
   uploaded = files.upload()
   for filename in uploaded.keys():
       os.rename(filename, PDF_DIRECTORY / filename)
   ```
6. **Copy the code from `run_pipeline.py` (and the `src/` files, if you are not using modules) into Colab cells and execute.**

## Output

The pipeline will generate:
- `output/embeddings_statistiques_multimodal.json`: A JSON file containing all generated embeddings and their metadata.
- `output/extracted_graphs/`: Directory containing extracted images (PNG format).
- `output/extracted_tables/`: Directory containing HTML representations of extracted tables.

## Example `embeddings_statistiques_multimodal.json` Entry

```json
[
  {
    "pdf_file": "sample.pdf",
    "page_number": 1,
    "chunk_id": "text_0",
    "content_type": "text",
    "text_content": "This is a chunk of text extracted from the first page of the document...",
    "embedding": [0.123, -0.456, ..., 0.789],
    "pdf_title": "Sample Document",
    "pdf_subject": "Data Analysis",
    "pdf_keywords": "statistics, report"
  },
  {
    "pdf_file": "sample.pdf",
    "page_number": 2,
    "chunk_id": "table_0",
    "content_type": "table",
    "text_content": "Description en français du tableau: Ce tableau présente les ventes mensuelles par région. Il inclut les colonnes Mois, Région, et Ventes. La région Nord a la plus forte croissance...",
    "embedding": [-0.987, 0.654, ..., 0.321],
    "table_html_url": "/static/extracted_tables/sample_p2_table0.html",
    "image_url": "/static/extracted_graphs/sample_p2_table0.png",
    "pdf_title": "Sample Document",
    "pdf_subject": "Data Analysis",
    "pdf_keywords": "statistics, report"
  },
  {
    "pdf_file": "sample.pdf",
    "page_number": 3,
    "chunk_id": "image_0",
    "content_type": "image",
    "text_content": "Description en français de l'image: Ce graphique est un histogramme montrant la répartition des âges dans la population. L'axe des X représente les tranches d'âge et l'axe des Y la fréquence. La majorité de la population se situe entre 25 et 40 ans.",
    "embedding": [0.456, -0.789, ..., 0.123],
    "image_url": "/static/extracted_graphs/sample_p3_img0.png",
    "pdf_title": "Sample Document",
    "pdf_subject": "Data Analysis",
    "pdf_keywords": "statistics, report"
  }
]
```
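Entries like these can back a simple retrieval step: embed the user's query with the same `text-multilingual-embedding-002` model, then rank chunks by cosine similarity. A minimal sketch with toy 3-dimensional vectors (`top_k` and the sample `entries` are illustrative; real embeddings are 768-dimensional):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(entries, query_embedding, k=3):
    """Rank embedding entries (shaped like the JSON above) against a query vector."""
    scored = [(cosine_similarity(e["embedding"], query_embedding), e) for e in entries]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [e for _, e in scored[:k]]

# Toy vectors standing in for 768-dimensional embeddings
entries = [
    {"chunk_id": "text_0", "embedding": [1.0, 0.0, 0.0]},
    {"chunk_id": "table_0", "embedding": [0.0, 1.0, 0.0]},
]
print([e["chunk_id"] for e in top_k(entries, [0.9, 0.1, 0.0], k=1)])
# → ['text_0']
```

The retrieved `text_content` fields (including the French table/image descriptions) can then be passed to a language model as RAG context.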

## Acknowledgments

This pipeline leverages:
- Google Cloud Vertex AI
- Google AI Gemini models
- PyMuPDF
- Camelot
- Tiktoken
- Pandas
- BeautifulSoup
requirements.txt ADDED
@@ -0,0 +1,8 @@
PyMuPDF
camelot-py[cv]
google-cloud-aiplatform
google-generativeai
tiktoken
pandas
beautifulsoup4
Pillow
run_pipeline.py ADDED
@@ -0,0 +1,222 @@
import os
import json
import traceback
from pathlib import Path
import tiktoken

import google.generativeai as genai  # Needed for the text-only table-description fallback below

# Import functions from your src directory
from src.pdf_processor import extract_page_data_pymupdf, clean_text
from src.embedding_utils import initialize_clients, token_chunking, generate_multimodal_description, generate_text_embedding, ENCODING_NAME, MAX_TOKENS_NORMAL

# --- Configuration ---
# You can set these directly or get them from environment variables (recommended)
PROJECT_ID = os.getenv("GOOGLE_CLOUD_PROJECT")
LOCATION = os.getenv("VERTEX_AI_LOCATION")
GENAI_API_KEY = os.getenv("GENAI_API_KEY")  # For the Gemini API

# Path configuration
BASE_DIR = Path.cwd()  # Current working directory of the script
PDF_DIRECTORY = BASE_DIR / "docs"
OUTPUT_DIR = BASE_DIR / "output"  # Output directory for generated files
EMBEDDINGS_FILE_PATH = OUTPUT_DIR / "embeddings_statistiques_multimodal.json"

# Directories for extracted images and table HTML (within output)
IMAGE_SAVE_SUBDIR = "extracted_graphs"
TABLE_SAVE_SUBDIR = "extracted_tables"
# Absolute paths for saving
IMAGE_SAVE_DIR = OUTPUT_DIR / IMAGE_SAVE_SUBDIR
TABLE_SAVE_DIR = OUTPUT_DIR / TABLE_SAVE_SUBDIR

# --- Main Processing Function ---

def process_pdfs_in_directory(directory):
    """Main processing pipeline for all PDFs in a directory."""
    all_embeddings_data = []
    processed_files = 0
    pdf_files = list(directory.glob("*.pdf"))
    total_files = len(pdf_files)

    if total_files == 0:
        print(f"Aucun fichier PDF trouvé dans le répertoire : {directory}")
        return []

    for pdf_file_path in pdf_files:
        processed_files += 1
        print(f"\nTraitement de {pdf_file_path.name} ({processed_files}/{total_files})...")

        page_data_list = extract_page_data_pymupdf(pdf_file_path, IMAGE_SAVE_DIR, TABLE_SAVE_DIR, IMAGE_SAVE_SUBDIR, TABLE_SAVE_SUBDIR)

        if not page_data_list:
            print(f"  Aucune donnée extraite de {pdf_file_path.name}.")
            continue

        for page_data in page_data_list:
            pdf_file = page_data['pdf_file']
            page_num = page_data['page_number']
            page_text = page_data['text']
            images = page_data['images']  # List of non-table image dicts
            tables = page_data['tables']  # List of table dicts
            pdf_title = page_data.get('pdf_title')
            pdf_subject = page_data.get('pdf_subject')
            pdf_keywords = page_data.get('pdf_keywords')

            print(f"  Génération des descriptions et embeddings pour la page {page_num}...")

            # Process tables: generate a description, then an embedding
            for table_idx, table in enumerate(tables):
                table_image_bytes = table.get('image_bytes')
                table_text_repr = table.get('table_text_representation', '')
                table_html_url = table.get('table_html_url')

                description = None
                if table_image_bytes:
                    prompt = "Décrivez en français le contenu et la structure de ce tableau. Mettez l'accent sur les données principales et les tendances si visibles."
                    print(f"  Page {page_num}: Génération de la description multimodale pour le tableau {table_idx}...")
                    description = generate_multimodal_description(table_image_bytes, prompt)
                elif table_text_repr:
                    prompt = f"Décrivez en français le contenu et la structure de ce tableau basé sur sa représentation textuelle:\n{table_text_repr[:1000]}..."
                    print(f"  Page {page_num}: Génération de la description textuelle pour le tableau {table_idx} (fallback)...")
                    # Use the multimodal model with text-only input (via google.generativeai)
                    if GENAI_API_KEY:
                        try:
                            model = genai.GenerativeModel("models/gemini-1.5-flash-latest")
                            response = model.generate_content(prompt)
                            description = response.text.strip()
                        except Exception as e:
                            print(f"  Erreur lors de la génération de description textuelle pour le tableau {table_idx}: {e}")
                            description = None
                    else:
                        print("  Skipping text description generation for table: GENAI_API_KEY is not set.")

                if description:
                    print(f"  Page {page_num}: Description générée pour le tableau {table_idx}.")
                    embedding_vector = generate_text_embedding(description)  # max_retries, delay are defaults

                    if embedding_vector is not None:
                        chunk_data = {
                            "pdf_file": pdf_file,
                            "page_number": page_num,
                            "chunk_id": f"table_{table_idx}",
                            "content_type": "table",
                            "text_content": description,
                            "embedding": embedding_vector,
                            "table_html_url": table_html_url,
                            "image_url": table.get('image_url'),
                            "pdf_title": pdf_title,
                            "pdf_subject": pdf_subject,
                            "pdf_keywords": pdf_keywords
                        }
                        all_embeddings_data.append(chunk_data)
                        print(f"  Page {page_num}: Embedding généré pour la description du tableau {table_idx}.")
                    else:
                        print(f"  Page {page_num}: Échec de la génération de l'embedding pour la description du tableau {table_idx}. Chunk ignoré.")
                else:
                    print(f"  Page {page_num}: Aucune description générée pour le tableau {table_idx}. Chunk ignoré.")

            # Process images (non-table): generate a description, then an embedding
            for img_idx, image in enumerate(images):
                image_bytes = image.get('image_bytes')
                image_url = image.get('image_url')

                if image_bytes:
                    prompt = "Décrivez en français le contenu de cette image. S'il s'agit d'un graphique, décrivez le type de graphique (histogramme, courbe, etc.), les axes, les légendes et les principales informations ou tendances visibles."
                    print(f"  Page {page_num}: Génération de la description multimodale pour l'image {img_idx}...")
                    description = generate_multimodal_description(image_bytes, prompt)

                    if description:
                        print(f"  Page {page_num}: Description générée pour l'image {img_idx}.")
                        embedding_vector = generate_text_embedding(description)  # max_retries, delay are defaults

                        if embedding_vector is not None:
                            chunk_data = {
                                "pdf_file": pdf_file,
                                "page_number": page_num,
                                "chunk_id": f"image_{img_idx}",
                                "content_type": "image",
                                "text_content": description,
                                "embedding": embedding_vector,
                                "image_url": image_url,
                                "pdf_title": pdf_title,
                                "pdf_subject": pdf_subject,
                                "pdf_keywords": pdf_keywords
                            }
                            all_embeddings_data.append(chunk_data)
                            print(f"  Page {page_num}: Embedding généré pour la description de l'image {img_idx}.")
                        else:
                            print(f"  Page {page_num}: Échec de la génération de l'embedding pour la description de l'image {img_idx}. Chunk ignoré.")
                    else:
                        print(f"  Page {page_num}: Aucune description générée pour l'image {img_idx}. Chunk ignoré.")

            # Process regular text: chunk, then generate embeddings
            text_chunks = []  # Initialized here so the page summary below is safe when the page has no text
            if page_text:
                try:
                    encoding = tiktoken.get_encoding(ENCODING_NAME)
                    text_chunks = token_chunking(page_text, MAX_TOKENS_NORMAL, encoding)
                except Exception as e:
                    print(f"Erreur lors du chunking du texte de la page {page_num} : {e}. Utilisation du chunking simple.")
                    text_chunks = [page_text]

                for chunk_idx, chunk_content in enumerate(text_chunks):
                    print(f"  Page {page_num}: Génération de l'embedding pour le chunk de texte {chunk_idx}...")
                    embedding_vector = generate_text_embedding(chunk_content)  # max_retries, delay are defaults

                    if embedding_vector is not None:
                        chunk_data = {
                            "pdf_file": pdf_file,
                            "page_number": page_num,
                            "chunk_id": f"text_{chunk_idx}",
                            "content_type": "text",
                            "text_content": chunk_content,
                            "embedding": embedding_vector,
                            "pdf_title": pdf_title,
                            "pdf_subject": pdf_subject,
                            "pdf_keywords": pdf_keywords
                        }
                        all_embeddings_data.append(chunk_data)
                        print(f"  Page {page_num}: Chunk de texte {chunk_idx} traité avec succès.")
                    else:
                        print(f"  Page {page_num}: Échec de la génération de l'embedding pour le chunk de texte {chunk_idx}. Chunk ignoré.")

            print(f"  Page {page_num} terminée. Éléments traités : {len(tables)} tableaux, {len(images)} images, {len(text_chunks)} chunks de texte.")

    return all_embeddings_data

# --- Main Execution ---
if __name__ == "__main__":
    print("Démarrage du traitement PDF multimodal avec génération de descriptions et embeddings textuels multilingues...")

    # Validate and create directories
    if not PDF_DIRECTORY.is_dir():
        print(f"❌ ERREUR: Répertoire PDF non trouvé ou n'est pas un répertoire : {PDF_DIRECTORY}")
        exit(1)

    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    IMAGE_SAVE_DIR.mkdir(parents=True, exist_ok=True)
    TABLE_SAVE_DIR.mkdir(parents=True, exist_ok=True)
    print(f"Répertoire de sortie : {OUTPUT_DIR}")
    print(f"Répertoire de sauvegarde des images : {IMAGE_SAVE_DIR}")
    print(f"Répertoire de sauvegarde des tableaux (HTML) : {TABLE_SAVE_DIR}")

    # Initialize clients for Vertex AI and GenAI
    initialize_clients(PROJECT_ID, LOCATION, GENAI_API_KEY)

    final_embeddings = process_pdfs_in_directory(PDF_DIRECTORY)

    if final_embeddings:
        print(f"\nTotal d'embeddings générés : {len(final_embeddings)}.")
        try:
            with EMBEDDINGS_FILE_PATH.open('w', encoding='utf-8') as f:
                json.dump(final_embeddings, f, indent=2, ensure_ascii=False)
            print(f"Embeddings sauvegardés avec succès dans : {EMBEDDINGS_FILE_PATH}")
        except Exception as e:
            print(f"\nErreur lors de la sauvegarde du fichier JSON d'embeddings : {e}")
            traceback.print_exc()
    else:
        print("\nAucun embedding n'a été généré.")
src/embedding_utils.py ADDED
@@ -0,0 +1,205 @@
import os
import re
import time
import random
import traceback
import tiktoken

import google.generativeai as genai
import vertexai
from vertexai.language_models import TextEmbeddingModel

# Configuration (initialized from run_pipeline.py via initialize_clients)
MULTIMODAL_MODEL_GENAI = "models/gemini-1.5-flash-latest"
TEXT_EMBEDDING_MODEL_VERTEXAI = "text-multilingual-embedding-002"
EMBEDDING_DIMENSION = 768  # text-multilingual-embedding-002 produces 768-dimensional vectors

MAX_TOKENS_NORMAL = 500
ENCODING_NAME = "cl100k_base"

# Global client for the Vertex AI text embedding model
text_embedding_model_client = None
# Tracks whether genai.configure() has been called with a valid key
genai_configured = False

def initialize_clients(project_id, location, genai_api_key):
    """Initializes Vertex AI and GenAI clients."""
    global text_embedding_model_client, genai_configured

    if genai_api_key:
        genai.configure(api_key=genai_api_key)
        genai_configured = True
        print("✓ Google Generative AI configured.")
    else:
        print("⚠️ AVERTISSEMENT: La clé API Gemini n'est pas définie. La génération de descriptions multimodales échouera.")

    if project_id and location:
        try:
            vertexai.init(project=project_id, location=location)
            print(f"✓ Vertex AI SDK initialisé pour le projet {project_id} dans la région {location}.")
            text_embedding_model_client = TextEmbeddingModel.from_pretrained(TEXT_EMBEDDING_MODEL_VERTEXAI)
            print(f"✓ Modèle d'embedding textuel Vertex AI '{TEXT_EMBEDDING_MODEL_VERTEXAI}' chargé avec succès.")
        except Exception as e:
            print(f"❌ ERREUR: Échec de l'initialisation du Vertex AI SDK ou du chargement du modèle d'embedding textuel : {str(e)}")
            print("⚠️ La génération d'embeddings textuels échouera.")
            text_embedding_model_client = None
    else:
        print("⚠️ Vertex AI SDK non initialisé car l'ID du projet Google Cloud ou la localisation sont manquants.")
        print("⚠️ La génération d'embeddings textuels échouera.")
        text_embedding_model_client = None


def token_chunking(text, max_tokens, encoding):
    """Chunk text by token count, preferring paragraph and sentence boundaries."""
    if not text:
        return []

    tokens = encoding.encode(text)
    chunks = []
    start_token_idx = 0

    while start_token_idx < len(tokens):
        end_token_idx = min(start_token_idx + max_tokens, len(tokens))

        if end_token_idx < len(tokens):
            # Only consider boundaries inside the max_tokens window
            window_text = encoding.decode(tokens[start_token_idx:end_token_idx])

            paragraph_break = window_text.rfind('\n\n')
            if paragraph_break != -1:
                tokens_up_to_break = encoding.encode(window_text[:paragraph_break])
                end_token_idx = start_token_idx + len(tokens_up_to_break)
            else:
                # Cut after the last sentence-ending punctuation in the window
                sentence_ends = list(re.finditer(r'[.!?](?=\s)', window_text))
                if sentence_ends:
                    char_index = sentence_ends[-1].start()
                    tokens_up_to_end = encoding.encode(window_text[:char_index + 1])
                    end_token_idx = start_token_idx + len(tokens_up_to_end)

        current_chunk_tokens = tokens[start_token_idx:end_token_idx]
        chunk_text = encoding.decode(current_chunk_tokens).strip()

        if chunk_text:
            chunks.append(chunk_text)

        if start_token_idx == end_token_idx:
            start_token_idx += 1  # Guarantee forward progress
        else:
            start_token_idx = end_token_idx

    return chunks


def generate_multimodal_description(image_bytes, prompt_text, multimodal_model_genai_name=MULTIMODAL_MODEL_GENAI, max_retries=5, delay=10):
    """
    Generate a text description for an image using a multimodal model (google.generativeai).
    Returns the description text, or None if all retries fail or the API key is missing.
    """
    if not genai_configured:
        print("  Skipping multimodal description generation: GENAI_API_KEY is not set.")
        return None

    for attempt in range(max_retries):
        try:
            time.sleep(delay + random.uniform(0, 5))

            content = [
                prompt_text,
                {
                    'mime_type': 'image/png',
                    'data': image_bytes
                }
            ]

            model = genai.GenerativeModel(multimodal_model_genai_name)
            response = model.generate_content(content)

            description = response.text.strip()

            if description:
                return description

            print(f"  Tentative {attempt+1}/{max_retries}: Réponse vide ou inattendue du modèle multimodal.")
            if attempt < max_retries - 1:
                retry_delay = delay * (2 ** attempt) + random.uniform(1, 5)
                print(f"  Réessai dans {retry_delay:.2f}s...")
                time.sleep(retry_delay)

        except Exception as e:
            error_msg = str(e)
            print(f"  Tentative {attempt+1}/{max_retries} échouée pour la description : {error_msg}")

            if "429" in error_msg or "quota" in error_msg.lower() or "rate limit" in error_msg.lower() or "unavailable" in error_msg.lower() or "internal error" in error_msg.lower():
                if attempt < max_retries - 1:
                    retry_delay = delay * (2 ** attempt) + random.uniform(1, 5)
                    print(f"  Erreur d'API retryable détectée. Réessai dans {retry_delay:.2f}s...")
                    time.sleep(retry_delay)
            else:
                print(f"  Erreur d'API non retryable détectée : {error_msg}")
                traceback.print_exc()
                return None

    print(f"  Toutes les {max_retries} tentatives ont échoué pour la description.")
    return None


def generate_text_embedding(text_content, max_retries=5, delay=5):
    """
    Generate a text embedding using the Vertex AI multilingual embedding model.
    Returns the embedding vector (list), or None if all retries fail or the client is not initialized.
    """
    global text_embedding_model_client  # Ensure we are using the global client

    if not text_embedding_model_client:
        print("  Skipping text embedding generation: Vertex AI embedding client is not initialized.")
        return None

    if not text_content or not text_content.strip():
        return None  # Cannot embed empty text

    for attempt in range(max_retries):
        try:
            time.sleep(delay + random.uniform(0, 2))

            embeddings = text_embedding_model_client.get_embeddings([text_content])

            if embeddings and len(embeddings) > 0 and hasattr(embeddings[0], 'values') and isinstance(embeddings[0].values, list) and len(embeddings[0].values) == EMBEDDING_DIMENSION:
                return embeddings[0].values
            else:
                print(f"  Tentative {attempt+1}/{max_retries}: Format d'embedding Vertex AI inattendu. Réponse : {embeddings}")
                return None

        except Exception as e:
            error_msg = str(e)
            print(f"  Tentative {attempt+1}/{max_retries} échouée pour l'embedding Vertex AI : {error_msg}")

            if "429" in error_msg or "quota" in error_msg.lower() or "rate limit" in error_msg.lower() or "unavailable" in error_msg.lower() or "internal error" in error_msg.lower():
                if attempt < max_retries - 1:
                    retry_delay = delay * (2 ** attempt) + random.uniform(1, 5)
                    print(f"  Erreur d'API Vertex AI retryable détectée. Réessai dans {retry_delay:.2f}s...")
                    time.sleep(retry_delay)
            else:
                print(f"  Erreur d'API Vertex AI non retryable détectée : {error_msg}")
                traceback.print_exc()
                return None

    print(f"  Toutes les {max_retries} tentatives ont échoué pour l'embedding Vertex AI.")
    return None
src/pdf_processor.py ADDED
@@ -0,0 +1,261 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+ import fitz  # PyMuPDF
+ import camelot  # For table extraction
+ import pandas as pd
+ from bs4 import BeautifulSoup
+ import re
+ from pathlib import Path
+ import traceback
+
+ # Path configuration: these should be passed in as arguments or imported from
+ # a shared config; the values below show the layout used by run_pipeline.py.
+ # BASE_DIR = Path("/content/")
+ # PDF_DIRECTORY = BASE_DIR / "docs"
+ # IMAGE_SAVE_SUBDIR = "extracted_graphs"
+ # TABLE_SAVE_SUBDIR = "extracted_tables"
+ # STATIC_DIR = BASE_DIR / "static"
+ # IMAGE_SAVE_DIR = STATIC_DIR / IMAGE_SAVE_SUBDIR
+ # TABLE_SAVE_DIR = STATIC_DIR / TABLE_SAVE_SUBDIR
+
+ # Size thresholds: ignore very small images (likely logos or icons).
+ # These too should be passed as arguments or configured at a higher level.
+ IMAGE_MIN_WIDTH = 100
+ IMAGE_MIN_HEIGHT = 100
+
+ def clean_text(text):
+     """Normalize whitespace while preserving paragraph breaks."""
+     if not text:
+         return ""
+     # Replace tabs with spaces
+     text = text.replace('\t', ' ')
+     # Collapse runs of spaces to a single space
+     text = re.sub(r' +', ' ', text)
+     # Collapse three or more newlines into a single paragraph break
+     text = re.sub(r'\n{3,}', '\n\n', text)
+     return text.strip()
+
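A quick, self-contained check of the cleaning rules (the function body is copied from `clean_text` above, so it can run without the module's other dependencies):

```python
import re

def clean_text(text):
    # Same rules as the module's clean_text: tabs -> spaces, collapse space
    # runs, cap consecutive newlines at one blank line, strip the ends.
    if not text:
        return ""
    text = text.replace('\t', ' ')
    text = re.sub(r' +', ' ', text)
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()

raw = "Titre\t2024\n\n\n\nCorps   du    texte.  \n"
assert clean_text(raw) == "Titre 2024\n\nCorps du texte."
```

Note that single and double newlines pass through untouched, so genuine paragraph structure survives for the downstream chunking.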
+ def extract_page_data_pymupdf(pdf_path, image_save_dir, table_save_dir, image_save_subdir, table_save_subdir):
+     """Extract text and tables, and save images, from each page using PyMuPDF and Camelot."""
+     page_data_list = []
+     try:
+         doc = fitz.open(pdf_path)
+         metadata = doc.metadata or {}
+         pdf_data = {
+             'pdf_title': metadata.get('title', pdf_path.name),
+             'pdf_subject': metadata.get('subject', 'Statistiques'),
+             'pdf_keywords': metadata.get('keywords', '')
+         }
+
+         for page_num in range(len(doc)):
+             page = doc.load_page(page_num)
+             page_index = page_num + 1  # 1-based index
+
+             print(f"  Extracting data from page {page_index}...")
+
+             # Extract tables first
+             table_data = extract_tables_and_images_from_page(pdf_path, page, page_index, table_save_dir, image_save_dir, image_save_subdir, table_save_subdir)
+
+             # Track table regions to avoid double-processing their text
+             table_regions = []
+             for item in table_data:
+                 if 'rect' in item and item['rect'] and len(item['rect']) == 4:
+                     table_regions.append(fitz.Rect(item['rect']))
+                 else:
+                     print(f"  Warning: invalid rect for table on page {page_index}")
+
+             # Extract text, excluding table regions
+             page_text = ""
+             if table_regions:
+                 # Keep only text blocks that do not overlap a table
+                 blocks = page.get_text("blocks")
+                 for block in blocks:
+                     block_rect = fitz.Rect(block[:4])
+                     is_in_table = any(block_rect.intersects(table_rect) for table_rect in table_regions)
+                     if not is_in_table:
+                         page_text += block[4] + "\n"  # block[4] holds the text content
+             else:
+                 # No tables on this page: take all text
+                 page_text = page.get_text("text")
+
+             page_text = clean_text(page_text)
+
+             # Extract and save images (excluding those inside table regions)
+             image_data = extract_images_from_page(pdf_path, page, page_index, image_save_dir, image_save_subdir, excluded_rects=table_regions)
+
+             page_data_list.append({
+                 'pdf_file': pdf_path.name,
+                 'page_number': page_index,
+                 'text': page_text,
+                 'images': image_data,  # non-table images only
+                 'tables': [item for item in table_data if item['content_type'] == 'table'],
+                 'pdf_title': pdf_data.get('pdf_title'),
+                 'pdf_subject': pdf_data.get('pdf_subject'),
+                 'pdf_keywords': pdf_data.get('pdf_keywords')
+             })
+         doc.close()
+     except Exception as e:
+         print(f"Error processing PDF {pdf_path.name} with PyMuPDF: {str(e)}")
+         traceback.print_exc()  # print traceback for debugging
+     return page_data_list
+
+ def extract_tables_and_images_from_page(pdf_path, page, page_num, table_save_dir, image_save_dir, image_save_subdir, table_save_subdir):
+     """Extract tables with Camelot and capture an image of each table area."""
+     table_and_image_data = []
+     try:
+         # Try lattice mode first (tables with ruled lines), then fall back to stream
+         tables = camelot.read_pdf(
+             str(pdf_path),
+             pages=str(page_num),
+             flavor='lattice',
+         )
+         if len(tables) == 0:
+             tables = camelot.read_pdf(
+                 str(pdf_path),
+                 pages=str(page_num),
+                 flavor='stream'
+             )
+
+         for i, table in enumerate(tables):
+             if table.accuracy < 70:
+                 print(f"  Skipping low-accuracy table ({table.accuracy:.2f}%) on page {page_num}")
+                 continue
+
+             # Camelot exposes the table bbox as `_bbox` in PDF coordinates
+             # (origin at the bottom-left); fitz.Rect uses a top-left origin,
+             # so flip the y-axis using the page height.
+             table_bbox = getattr(table, '_bbox', None)
+             if not table_bbox or len(table_bbox) != 4:
+                 print(f"  Warning: invalid bounding box for table {i} on page {page_num}. Skipping image capture.")
+                 table_rect = None
+             else:
+                 x1, y1, x2, y2 = table_bbox
+                 page_height = page.rect.height
+                 table_rect = fitz.Rect(x1, page_height - y2, x2, page_height - y1)
+
+             safe_pdf_name = "".join(c if c.isalnum() else "_" for c in pdf_path.stem)
+             table_html_filename = f"{safe_pdf_name}_p{page_num}_table{i}.html"
+             table_html_save_path = table_save_dir / table_html_filename
+             relative_html_url_path = f"/static/{table_save_subdir}/{table_html_filename}"
+
+             table_image_filename = f"{safe_pdf_name}_p{page_num}_table{i}.png"
+             table_image_save_path = image_save_dir / table_image_filename
+             relative_image_url_path = f"/static/{image_save_subdir}/{table_image_filename}"
+
+             df = table.df
+             soup = BeautifulSoup(df.to_html(index=False), 'html.parser')
+             table_tag = soup.find('table')
+             if table_tag:
+                 # Put the caption inside the table element so the HTML is valid
+                 caption_tag = soup.new_tag('caption')
+                 caption_tag.string = f"Table extracted from {pdf_path.name}, page {page_num}"
+                 table_tag.insert(0, caption_tag)
+
+                 table_tag['class'] = 'table table-bordered table-striped'
+                 table_tag['style'] = 'width:100%; border-collapse:collapse;'
+
+                 style_tag = soup.new_tag('style')
+                 style_tag.string = """
+                 .table { border-collapse: collapse; width: 100%; margin-bottom: 1rem; }
+                 .table caption { caption-side: top; padding: 0.5rem; text-align: left; font-weight: bold; }
+                 .table th, .table td { border: 1px solid #ddd; padding: 8px; text-align: left; }
+                 .table th { background-color: #f2f2f2; font-weight: bold; }
+                 .table-striped tbody tr:nth-of-type(odd) { background-color: rgba(0,0,0,.05); }
+                 .table-responsive { overflow-x: auto; margin-bottom: 1rem; }
+                 """
+                 soup.insert(0, style_tag)
+
+                 div = soup.new_tag('div')
+                 div['class'] = 'table-responsive'
+                 table_tag.wrap(div)
+
+                 with open(table_html_save_path, 'w', encoding='utf-8') as f:
+                     f.write(str(soup))
+             else:
+                 print(f"  Warning: could not find a table tag in the HTML for table {i} on page {page_num}. Skipping HTML save.")
+                 continue
+
+             table_image_bytes = None
+             if table_rect:
+                 try:
+                     pix = page.get_pixmap(clip=table_rect)
+                     table_image_bytes = pix.tobytes("png")
+                     with open(table_image_save_path, "wb") as img_file:
+                         img_file.write(table_image_bytes)
+                 except Exception as img_capture_e:
+                     print(f"  Error capturing image of table {i} on page {page_num}: {img_capture_e}")
+                     traceback.print_exc()
+                     table_image_bytes = None
+
+             table_and_image_data.append({
+                 'content_type': 'table',
+                 'table_html_url': relative_html_url_path,
+                 'table_text_representation': df.to_string(index=False),
+                 'rect': [table_rect.x0, table_rect.y0, table_rect.x1, table_rect.y1] if table_rect else None,
+                 'accuracy': table.accuracy,
+                 'image_bytes': table_image_bytes,
+                 'image_url': relative_image_url_path if table_image_bytes else None
+             })
+
+         return table_and_image_data
+
+     except Exception as e:
+         print(f"  Error extracting tables from page {page_num}: {str(e)}")
+         traceback.print_exc()
+         return []
+
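Camelot reports table bounding boxes in PDF coordinates, with the origin at the bottom-left of the page, while PyMuPDF's `fitz.Rect` expects a top-left origin, so a y-axis flip against the page height is needed before clipping a pixmap to the table area. A minimal sketch of that conversion (the A4 height of 842 pt is just an example value):

```python
def pdf_bbox_to_topleft(bbox, page_height):
    """Convert (x1, y1, x2, y2) from a bottom-left origin to a top-left origin."""
    x1, y1, x2, y2 = bbox
    # The old top edge (highest y) becomes the new y0, and vice versa
    return (x1, page_height - y2, x2, page_height - y1)

# A table reported at y in [700, 800] on an A4 page (842 pt tall)
# sits near the top of the page in top-left coordinates.
print(pdf_bbox_to_topleft((50, 700, 500, 800), 842))  # (50, 42, 500, 142)
```

Skipping this flip clips the pixmap to the wrong vertical band of the page, which is a common source of blank or mismatched table images.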
+ def extract_images_from_page(pdf_path, page, page_num, image_save_dir, image_save_subdir, excluded_rects=None):
+     """Extract and save images from a page, skipping excluded regions (e.g. tables)."""
+     if excluded_rects is None:  # avoid a mutable default argument
+         excluded_rects = []
+     image_data = []
+     image_list = page.get_images(full=True)
+
+     for img_index, img_info in enumerate(image_list):
+         xref = img_info[0]
+         try:
+             base_image = page.parent.extract_image(xref)
+             image_bytes = base_image["image"]
+             image_ext = base_image["ext"]
+             width = base_image["width"]
+             height = base_image["height"]
+
+             # Skip very small images (likely logos or icons)
+             if width < IMAGE_MIN_WIDTH or height < IMAGE_MIN_HEIGHT:
+                 continue
+
+             # Locate the image on the page
+             img_rect = None
+             img_rects = page.get_image_rects(xref)
+             if img_rects:
+                 img_rect = img_rects[0]
+
+             if img_rect is None:
+                 print(f"  Warning: could not find a rectangle for image {img_index} on page {page_num}. Skipping.")
+                 continue
+
+             # Skip images that fall inside an excluded region (e.g. a table)
+             if any(img_rect.intersects(excluded_rect) for excluded_rect in excluded_rects):
+                 print(f"  Image {img_index} on page {page_num} is within an excluded region. Skipping.")
+                 continue
+
+             safe_pdf_name = "".join(c if c.isalnum() else "_" for c in pdf_path.stem)
+             image_filename = f"{safe_pdf_name}_p{page_num}_img{img_index}.{image_ext}"
+             image_save_path = image_save_dir / image_filename
+             relative_url_path = f"/static/{image_save_subdir}/{image_filename}"
+
+             with open(image_save_path, "wb") as img_file:
+                 img_file.write(image_bytes)
+
+             image_data.append({
+                 'content_type': 'image',
+                 'image_url': relative_url_path,
+                 'rect': [img_rect.x0, img_rect.y0, img_rect.x1, img_rect.y1],
+                 'image_bytes': image_bytes
+             })
+
+         except Exception as img_save_e:
+             print(f"  Error processing image {img_index} on page {page_num}: {img_save_e}")
+             traceback.print_exc()
+
+     return image_data
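Both `extract_page_data_pymupdf` and `extract_images_from_page` rely on the same geometric test: drop anything whose rectangle intersects an excluded region. A dependency-free sketch of that filter, with plain `(x0, y0, x1, y1)` tuples standing in for `fitz.Rect` (the helper names here are illustrative):

```python
def rects_intersect(a, b):
    """Axis-aligned intersection test for (x0, y0, x1, y1) rectangles."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1

def filter_blocks(blocks, excluded_rects):
    """Keep only blocks whose rect does not overlap any excluded region."""
    return [b for b in blocks
            if not any(rects_intersect(b['rect'], r) for r in excluded_rects)]

blocks = [
    {'rect': (0, 0, 100, 20), 'text': 'heading'},
    {'rect': (0, 30, 100, 80), 'text': 'inside table'},
]
tables = [(0, 25, 100, 90)]
print([b['text'] for b in filter_blocks(blocks, tables)])  # ['heading']
```

This is why tables must be extracted before the text pass: without the table rects in hand, every cell value would be emitted a second time as free-floating page text.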