Anonymous1223334444 committed on
Commit 2721ce7
·
1 Parent(s): ad96b83
Files changed (3):
  1. README.md +36 -34
  2. requirements.txt +5 -2
  3. run_pipeline.py +485 -30
README.md CHANGED
@@ -7,48 +7,47 @@ tags:
  - rag
  - google-cloud
  - vertex-ai
- - gemini
  - python
  datasets:
- - any
  license: mit
  ---

- # Multimodal & Multilingual PDF Embedding Pipeline

- This repository hosts a Python pipeline that extracts text, tables, and images from PDF documents, generates multimodal descriptions for visual content (tables and images), and then creates multilingual text embeddings for all extracted information. The generated embeddings are stored in a JSON file, ready for use in Retrieval Augmented Generation (RAG) systems or other downstream applications.

  **Key Features:**
- - **Multimodal:** Processes text, tables, and images from PDFs.
- - **Multilingual:** Leverages Google's `text-multilingual-embedding-002` model for embeddings, supporting a wide range of languages.
- - **Contextual Descriptions:** Uses Google Gemini (Gemini 1.5 Flash) to generate descriptive text for tables and images in French.
  - **Structured Output:** Stores embeddings and metadata (PDF source, page number, content type, links to extracted assets) in a comprehensive JSON format.

  ## How it Works

  1. **PDF Parsing:** Utilizes `PyMuPDF` to extract text blocks and images, and `Camelot` to accurately extract tabular data.
  2. **Content Separation:** Distinguishes between plain text, tables, and non-table images.
- 3. **Multimodal Description (for Tables & Images):**
  - For tables, the pipeline captures an image of the table and also uses its text representation.
  - For standalone images (e.g., graphs, charts), it captures the image.
- - These images are then sent to the `gemini-1.5-flash-latest` model (via `google.generativeai`) with specific prompts to generate rich, descriptive text in French.
- 4. **Multilingual Text Embedding:**
  - The cleaned text content (original text chunks, or generated descriptions for tables/images) is then passed to the `text-multilingual-embedding-002` model (via Vertex AI).
- - This model generates a high-dimensional embedding vector (768 dimensions) for each piece of content.
  5. **JSON Output:** All generated embeddings, along with rich metadata (original PDF, page, content type, links to extracted assets), are compiled into a single JSON file.

  ## Requirements & Setup

- This pipeline relies on **Google Cloud Platform** services and specific Python libraries. You will need:

- 1. **A Google Cloud Project with Billing Enabled:**
- - **IMPORTANT:** Running this pipeline will incur costs on your Google Cloud Platform account for API usage (Vertex AI and Gemini API). Ensure you have an [active billing account](https://cloud.google.com/billing/docs/how-to/create-billing-account) linked to your project.
  - Enable the **Vertex AI API**.
- - Enable the **Generative Language API** (for Gemini 1.5 Flash descriptions).
- 2. **Authentication:**
- - **Google Cloud Authentication:** The easiest way to run this in a Colab environment is using `google.colab.auth.authenticate_user()`. For local execution, ensure your Google Cloud SDK is configured and authenticated (`gcloud auth application-default login`).
- - **Gemini API Key:** An API key for the Google AI Gemini models. You can get one from [Google AI Studio](https://aistudio.google.com/app/apikey). Set this as an environment variable or directly in the code (though environment variables are recommended for security).
-

  ### Local Setup

@@ -57,7 +56,7 @@ This pipeline relies on **Google Cloud Platform** services and specific Python l
  git clone https://huggingface.co/Anonymous1223334444/pdf-multimodal-multilingual-embedding-pipeline
  cd pdf-multimodal-multilingual-embedding-pipeline
  ```
- 2. **Install dependencies:**
  ```bash
  pip install -r requirements.txt
  ```
@@ -75,13 +74,12 @@ This pipeline relies on **Google Cloud Platform** services and specific Python l
  ```
  *Note: If you are running on Windows or macOS, the installation steps for `camelot-py` might differ. Refer to the [Camelot documentation](https://camelot-py.readthedocs.io/en/master/user/install-deps.html) for more details.*

- 3. **Set up Environment Variables:**
  ```bash
  export GOOGLE_CLOUD_PROJECT="your-gcp-project-id"
  export VERTEX_AI_LOCATION="us-central1" # Or your preferred Vertex AI region (e.g., us-east4)
- export GEMINI_API_KEY="your-gemini-api-key"
  ```
- Replace `your-gcp-project-id`, `us-central1`, and `your-gemini-api-key` with your actual values.

  4. **Place your PDF files:**
  Create a `docs` directory in the root of the repository and place your PDF documents inside it.
@@ -89,7 +87,7 @@ This pipeline relies on **Google Cloud Platform** services and specific Python l
  pdf-multimodal-multilingual-embedding-pipeline/
  ├── docs/
  │ └── your_document.pdf
- └── another_document.pdf
  ```

  5. **Run the pipeline:**
@@ -100,45 +98,49 @@ This pipeline relies on **Google Cloud Platform** services and specific Python l

  ### Google Colab Usage

- A Colab notebook version of this pipeline is ideal for quick experimentation due to pre-configured environments.

  1. **Open a new Google Colab notebook.**
- 2. **Install system dependencies:**
  ```python
  !pip uninstall -y camelot camelot-py # Ensure clean install
  !pip install PyMuPDF
  !apt-get update
  !apt-get install -y ghostscript python3-tk libopencv-dev python3-opencv
- !pip install camelot-py[cv] google-cloud-aiplatform tiktoken pandas beautifulsoup4 Pillow
  ```
- 3. **Authenticate:**
  ```python
  from google.colab import auth
  auth.authenticate_user()
  ```
- 4. **Set your API Key and Project/Location:**
  ```python
  import os
- # Replace with your actual Gemini API key
- os.environ["GENAI_API_KEY"] = "YOUR_GEMINI_API_KEY_HERE"
  # Replace with your actual Google Cloud Project ID
  os.environ["GOOGLE_CLOUD_PROJECT"] = "YOUR_GCP_PROJECT_ID_HERE"
  # Set your preferred Vertex AI location (e.g., "us-central1", "us-east4")
  os.environ["VERTEX_AI_LOCATION"] = "us-central1"
  ```
- 5. **Upload your PDF files:**
  You can use the Colab file upload feature or mount Google Drive. Ensure your PDFs are in a directory named `docs` within `/content/`.
  ```python
  # Example for uploading
  from google.colab import files
  import os
  PDF_DIRECTORY = Path("/content/docs")
  PDF_DIRECTORY.mkdir(parents=True, exist_ok=True)
  uploaded = files.upload()
  for filename in uploaded.keys():
      os.rename(filename, PDF_DIRECTORY / filename)
  ```
- 6. **Copy and paste the code from `run_pipeline.py` (and `src/` files if you don't use modules) into Colab cells and execute.**

  ## Output

@@ -192,7 +194,7 @@ The pipeline will generate:

  # Acknowledgments
  This pipeline leverages the power of:
- - Google Cloud Vertex AI
  - Google AI Gemini Models
  - PyMuPDF
  - Camelot

  - rag
  - google-cloud
  - vertex-ai
+ - gemma
  - python
  datasets:
+ - no_dataset
  license: mit
  ---

+ # Multimodal & Multilingual PDF Embedding Pipeline with Gemma and Vertex AI

+ This repository hosts a Python pipeline that extracts text, tables, and images from PDF documents, generates multimodal descriptions for visual content (tables and images) using **Google's Gemma model (running locally)**, and then creates multilingual text embeddings for all extracted information using **Google Cloud Vertex AI's `text-multilingual-embedding-002` model**. The generated embeddings are stored in a JSON file, ready for use in Retrieval Augmented Generation (RAG) systems or other downstream applications.

  **Key Features:**
+ - **Multimodal Descriptions (via Gemma):** Processes tables and images from PDFs, generating rich descriptive text in French using the open-source Gemma 3 4B-IT model, which runs locally on your machine or on a Colab GPU.
+ - **Multilingual Text Embeddings (via Vertex AI):** Leverages Google Cloud's `text-multilingual-embedding-002` model for embeddings, supporting a wide range of languages.
  - **Structured Output:** Stores embeddings and metadata (PDF source, page number, content type, links to extracted assets) in a comprehensive JSON format.

  ## How it Works

  1. **PDF Parsing:** Utilizes `PyMuPDF` to extract text blocks and images, and `Camelot` to accurately extract tabular data.
  2. **Content Separation:** Distinguishes between plain text, tables, and non-table images.
+ 3. **Multimodal Description (for Tables & Images, using Gemma):**
  - For tables, the pipeline captures an image of the table and also uses its text representation.
  - For standalone images (e.g., graphs, charts), it captures the image.
+ - These images (and optionally table text) are then passed to the **Gemma 3 4B-IT model** (via the `gemma` Python library) with specific prompts to generate rich, descriptive text in French. **This step runs locally and does not incur direct API costs.**
+ 4. **Multilingual Text Embedding (via Vertex AI):**
  - The cleaned text content (original text chunks, or generated descriptions for tables/images) is then passed to the `text-multilingual-embedding-002` model (via Vertex AI).
+ - This model generates a high-dimensional embedding vector (768 dimensions) for each piece of content. **This step connects to Google Cloud Vertex AI and will incur costs.**
  5. **JSON Output:** All generated embeddings, along with rich metadata (original PDF, page, content type, links to extracted assets), are compiled into a single JSON file.
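+
+ As an illustration, a single record in the output JSON has roughly the following shape (the field names here are a sketch based on the metadata described above; see `run_pipeline.py` for the exact schema):
+
+ ```python
+ # One illustrative record from the output JSON, loaded into Python.
+ # The exact keys come from chunk_data in run_pipeline.py; the embedding
+ # dimension depends on the embedding model used.
+ record = {
+     "pdf_file": "your_document.pdf",   # source PDF
+     "page_number": 3,                  # 1-based page index
+     "content_type": "table",           # "text", "table", or "image"
+     "image_url": "/static/extracted_graphs/your_document_p3_table0.png",
+     "content": "Description du tableau ...",  # text chunk or generated description
+     "embedding": [0.0123, -0.0456],    # embedding vector, truncated for display
+ }
+ ```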

  ## Requirements & Setup

+ This pipeline uses a combination of local models (Gemma) and **Google Cloud Platform** services (Vertex AI).

+ 1. **Google Cloud Project with Billing Enabled (for Text Embeddings):**
+ - **CRITICAL:** The text embedding generation step uses Google Cloud Vertex AI. This **will incur costs** on your Google Cloud Platform account. Ensure you have an [active billing account](https://cloud.google.com/billing/docs/how-to/create-billing-account) linked to your project.
  - Enable the **Vertex AI API**.
+ 2. **Authentication for Google Cloud (for Text Embeddings):**
+ - The easiest way to run this in a Colab environment is using `google.colab.auth.authenticate_user()`.
+ - For local execution, ensure your Google Cloud SDK is configured and authenticated (`gcloud auth application-default login`).
+ 3. **Hardware Requirements (for Gemma):**
+ - Running the Gemma 3 4B-IT model requires a **GPU with sufficient VRAM** (e.g., a Colab T4 or V100, or a local GPU with roughly 8-10 GB of VRAM or more). Without a GPU, Gemma will likely fall back to the CPU and run significantly slower; a quick way to check what JAX sees is shown below.
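+
+ A minimal sketch for verifying that JAX (which Gemma runs on) can actually see your GPU before you load the model:
+
+ ```python
+ # Lists the accelerators JAX can use; Gemma will run on the first one.
+ # On a GPU runtime this prints something like [CudaDevice(id=0)];
+ # [CpuDevice(id=0)] means everything will run on the CPU.
+ import jax
+ print(jax.devices())
+ ```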

  ### Local Setup

  git clone https://huggingface.co/Anonymous1223334444/pdf-multimodal-multilingual-embedding-pipeline
  cd pdf-multimodal-multilingual-embedding-pipeline
  ```
+ 2. **Install Python dependencies:**
  ```bash
  pip install -r requirements.txt
  ```

  ```
  *Note: If you are running on Windows or macOS, the installation steps for `camelot-py` might differ. Refer to the [Camelot documentation](https://camelot-py.readthedocs.io/en/master/user/install-deps.html) for more details.*

+ 3. **Set up Environment Variables (for Vertex AI Text Embeddings):**
  ```bash
  export GOOGLE_CLOUD_PROJECT="your-gcp-project-id"
  export VERTEX_AI_LOCATION="us-central1" # Or your preferred Vertex AI region (e.g., us-east4)
  ```
+ Replace `your-gcp-project-id` and `us-central1` with your actual Google Cloud Project ID and Vertex AI region.

  4. **Place your PDF files:**
  Create a `docs` directory in the root of the repository and place your PDF documents inside it.
  pdf-multimodal-multilingual-embedding-pipeline/
  ├── docs/
  │ └── your_document.pdf
+ └── another_document.pdf
  ```

  5. **Run the pipeline:**

  ### Google Colab Usage

+ A Colab notebook version of this pipeline is ideal for quick experimentation thanks to its pre-configured environment and GPU access.

  1. **Open a new Google Colab notebook.**
+ 2. **Change the runtime to GPU:** Go to `Runtime > Change runtime type` and select a GPU (e.g., `T4 GPU`).
+ 3. **Install system and Python dependencies:**
  ```python
  !pip uninstall -y camelot camelot-py # Ensure clean install
  !pip install PyMuPDF
  !apt-get update
  !apt-get install -y ghostscript python3-tk libopencv-dev python3-opencv
+ !pip install camelot-py[cv] google-cloud-aiplatform tiktoken pandas beautifulsoup4 Pillow gemma jax jaxlib numpy sentence-transformers
  ```
+ 4. **Authenticate to Google Cloud (for Vertex AI):**
  ```python
  from google.colab import auth
  auth.authenticate_user()
  ```
+ 5. **Set your Google Cloud Project ID and Location:**
  ```python
  import os

  # Replace with your actual Google Cloud Project ID
  os.environ["GOOGLE_CLOUD_PROJECT"] = "YOUR_GCP_PROJECT_ID_HERE"
  # Set your preferred Vertex AI location (e.g., "us-central1", "us-east4")
  os.environ["VERTEX_AI_LOCATION"] = "us-central1"
+
+ # Critical: adjust JAX memory allocation for Gemma
+ os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00"
  ```
+ 6. **Upload your PDF files:**
  You can use the Colab file upload feature or mount Google Drive. Ensure your PDFs are in a directory named `docs` within `/content/`.
  ```python
  # Example for uploading
  from google.colab import files
  import os
+ from pathlib import Path
+
  PDF_DIRECTORY = Path("/content/docs")
  PDF_DIRECTORY.mkdir(parents=True, exist_ok=True)
  uploaded = files.upload()
  for filename in uploaded.keys():
      os.rename(filename, PDF_DIRECTORY / filename)
  ```
+ 7. **Copy and paste the code from `src/pdf_processor.py`, `src/embedding_utils.py`, and `run_pipeline.py` into Colab cells and execute.** Run the `embedding_utils.py` content first, then `pdf_processor.py`, then `run_pipeline.py`, or combine them logically in your notebook.

  ## Output

  # Acknowledgments
  This pipeline leverages the power of:
+ - Google Gemma
  - Google AI Gemini Models
  - PyMuPDF
  - Camelot
requirements.txt CHANGED
@@ -1,8 +1,11 @@
  PyMuPDF
  camelot-py[cv]
  google-cloud-aiplatform
- google-generativeai
  tiktoken
  pandas
  beautifulsoup4
- Pillow

  PyMuPDF
  camelot-py[cv]
  google-cloud-aiplatform
  tiktoken
  pandas
  beautifulsoup4
+ Pillow
+ gemma
+ jax
+ jaxlib
+ numpy
+ sentence-transformers
run_pipeline.py CHANGED
@@ -1,32 +1,480 @@
  import os
  import json
  import traceback
  from pathlib import Path
  import tiktoken

- # Import functions from your src directory
- from src.pdf_processor import extract_page_data_pymupdf, clean_text
- from src.embedding_utils import initialize_clients, token_chunking, generate_multimodal_description, generate_text_embedding, ENCODING_NAME, MAX_TOKENS_NORMAL

  # --- Configuration ---
- # You can set these directly or get them from environment variables (recommended)
- PROJECT_ID = os.getenv("GOOGLE_CLOUD_PROJECT")
- LOCATION = os.getenv("VERTEX_AI_LOCATION")
- GENAI_API_KEY = os.getenv("GENAI_API_KEY")  # For the Gemini API

  # Path configuration
- BASE_DIR = Path.cwd()  # Current working directory of the script
  PDF_DIRECTORY = BASE_DIR / "docs"
- OUTPUT_DIR = BASE_DIR / "output"  # New output directory for generated files
- EMBEDDINGS_FILE_PATH = OUTPUT_DIR / "embeddings_statistiques_multimodal.json"

  # Directory to save extracted images and tables HTML (within output)
  IMAGE_SAVE_SUBDIR = "extracted_graphs"
  TABLE_SAVE_SUBDIR = "extracted_tables"
- # Absolute paths for saving
  IMAGE_SAVE_DIR = OUTPUT_DIR / IMAGE_SAVE_SUBDIR
  TABLE_SAVE_DIR = OUTPUT_DIR / TABLE_SAVE_SUBDIR

  # --- Main Processing Function ---

  def process_pdfs_in_directory(directory):
@@ -44,7 +492,7 @@ def process_pdfs_in_directory(directory):
          processed_files += 1
          print(f"\nTraitement de {pdf_file_path.name} ({processed_files}/{total_files})...")

-         page_data_list = extract_page_data_pymupdf(pdf_file_path, IMAGE_SAVE_DIR, TABLE_SAVE_DIR, IMAGE_SAVE_SUBDIR, TABLE_SAVE_SUBDIR)

          if not page_data_list:
              print(f"  Aucune donnée extraite de {pdf_file_path.name}.")
@@ -54,8 +502,8 @@ def process_pdfs_in_directory(directory):
              pdf_file = page_data['pdf_file']
              page_num = page_data['page_number']
              page_text = page_data['text']
-             images = page_data['images']  # List of non-table image dicts
-             tables = page_data['tables']  # List of table dicts
              pdf_title = page_data.get('pdf_title')
              pdf_subject = page_data.get('pdf_subject')
              pdf_keywords = page_data.get('pdf_keywords')
@@ -74,25 +522,25 @@ def process_pdfs_in_directory(directory):
                      print(f"  Page {page_num}: Génération de la description multimodale pour le tableau {table_idx}...")
                      description = generate_multimodal_description(table_image_bytes, prompt)
                  elif table_text_repr:
-                     prompt = f"Décrivez en français le contenu et la structure de ce tableau basé sur sa représentation textuelle:\n{table_text_repr[:1000]}..."
-                     print(f"  Page {page_num}: Génération de la description textuelle pour le tableau {table_idx} (fallback)...")
-                     # Use the multimodal model with text-only input (via google.generativeai)
-                     if GENAI_API_KEY:
                          try:
-                             model = genai.GenerativeModel("models/gemini-1.5-flash-latest")  # Explicitly use the model
-                             response = model.generate_content(prompt)
-                             description = response.text.strip()
                          except Exception as e:
-                             print(f"  Erreur lors de la génération de description textuelle pour le tableau {table_idx}: {e}")
                              description = None
                      else:
-                         print("  Skipping text description generation for table: GEMINI_API_KEY is not set.")
                          description = None


                  if description:
                      print(f"  Page {page_num}: Description générée pour le tableau {table_idx}.")
-                     embedding_vector = generate_text_embedding(description)  # max_retries, delay are defaults

                      if embedding_vector is not None:
                          chunk_data = {
@@ -128,7 +576,7 @@ def process_pdfs_in_directory(directory):

                  if description:
                      print(f"  Page {page_num}: Description générée pour l'image {img_idx}.")
-                     embedding_vector = generate_text_embedding(description)  # max_retries, delay are defaults

                      if embedding_vector is not None:
                          chunk_data = {
@@ -163,7 +611,7 @@ def process_pdfs_in_directory(directory):

              for chunk_idx, chunk_content in enumerate(text_chunks):
                  print(f"  Page {page_num}: Génération de l'embedding pour le chunk de texte {chunk_idx}...")
-                 embedding_vector = generate_text_embedding(chunk_content)  # max_retries, delay are defaults

                  if embedding_vector is not None:
                      chunk_data = {
@@ -190,11 +638,13 @@ def process_pdfs_in_directory(directory):

  # --- Main Execution ---
  if __name__ == "__main__":
-     print("Démarrage du traitement PDF multimodal avec génération de descriptions et embeddings textuels multilingues...")

      # Validate and create directories
      if not PDF_DIRECTORY.is_dir():
-         print(f"❌ ERREUR: Répertoire PDF non trouvé ou n'est pas un répertoire : {PDF_DIRECTORY}")
          exit(1)

      OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
@@ -204,8 +654,13 @@ if __name__ == "__main__":
      print(f"Répertoire de sauvegarde des images : {IMAGE_SAVE_DIR}")
      print(f"Répertoire de sauvegarde des tableaux (HTML) : {TABLE_SAVE_DIR}")

-     # Initialize clients for Vertex AI and GenAI
-     initialize_clients(PROJECT_ID, LOCATION, GENAI_API_KEY)

      final_embeddings = process_pdfs_in_directory(PDF_DIRECTORY)

  import os
  import json
  import traceback
+ import re
+ import time
+ import random
  from pathlib import Path
  import tiktoken
+ import numpy as np
+ from PIL import Image  # Pillow for image handling
+ import io  # To handle image bytes
+
+ # Gemma imports
+ import jax.numpy as jnp
+ # For Gemma models we need a specific setup before loading the model:
+ # set the JAX/GPU memory allocation fraction prior to importing gemma.
+ os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00"
+ from gemma import gm
+
+ # Sentence-Transformers for text embedding
+ from sentence_transformers import SentenceTransformer

  # --- Configuration ---
+ # Desired Gemma model (display label only; the checkpoint is selected in initialize_models)
+ GEMMA_MULTIMODAL_MODEL = "gemma-3-4b-it"  # You can choose other Gemma variants if available and suitable
+
+ # Desired Sentence-Transformers model for text embeddings.
+ # This is a good free, multilingual model.
+ SENTENCE_TRANSFORMER_MODEL = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
+ # The dimension of the embeddings this model produces
+ EMBEDDING_DIMENSION = 384  # MiniLM-L12-v2 produces 384-dimensional embeddings
+
+ MAX_TOKENS_NORMAL = 500
+ ENCODING_NAME = "cl100k_base"  # Used for token chunking, kept consistent

  # Path configuration
+ BASE_DIR = Path("/content/")  # Default for a Colab environment
  PDF_DIRECTORY = BASE_DIR / "docs"
+ OUTPUT_DIR = BASE_DIR / "output"
+ EMBEDDINGS_FILE_PATH = OUTPUT_DIR / "embeddings_statistiques_multimodal_gemma_st.json"

  # Directory to save extracted images and tables HTML (within output)
  IMAGE_SAVE_SUBDIR = "extracted_graphs"
  TABLE_SAVE_SUBDIR = "extracted_tables"
  IMAGE_SAVE_DIR = OUTPUT_DIR / IMAGE_SAVE_SUBDIR
  TABLE_SAVE_DIR = OUTPUT_DIR / TABLE_SAVE_SUBDIR

+
+ # Global models
+ gemma_sampler = None
+ text_embedding_model = None
+
+ def initialize_models():
+     """Initializes the Gemma and Sentence-Transformers models."""
+     global gemma_sampler, text_embedding_model
+
+     print("✓ Initializing Gemma Multimodal Model...")
+     try:
+         model = gm.nn.Gemma3_4B()  # Initialize the Gemma model
+         # Load the Gemma parameters
+         params = gm.ckpts.load_params(gm.ckpts.CheckpointPath.GEMMA3_4B_IT)
+         gemma_sampler = gm.text.ChatSampler(model=model, params=params)
+         print(f"✓ Gemma Multimodal Model '{GEMMA_MULTIMODAL_MODEL}' loaded successfully.")
+     except Exception as e:
+         print(f"❌ ERREUR: Échec du chargement du modèle multimodal Gemma : {str(e)}")
+         print("⚠️ La génération de descriptions multimodales échouera.")
+         gemma_sampler = None
+
+     print(f"✓ Initializing Sentence-Transformers Model '{SENTENCE_TRANSFORMER_MODEL}'...")
+     try:
+         text_embedding_model = SentenceTransformer(SENTENCE_TRANSFORMER_MODEL)
+         print(f"✓ Modèle d'embedding textuel Sentence-Transformers '{SENTENCE_TRANSFORMER_MODEL}' chargé avec succès.")
+     except Exception as e:
+         print(f"❌ ERREUR: Échec du chargement du modèle d'embedding textuel Sentence-Transformers : {str(e)}")
+         print("⚠️ La génération d'embeddings textuels échouera.")
+         text_embedding_model = None
+
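+ # Illustrative sanity check (assumes both models initialized successfully;
+ # get_sentence_embedding_dimension() is a standard SentenceTransformer method):
+ #   initialize_models()
+ #   assert text_embedding_model.get_sentence_embedding_dimension() == EMBEDDING_DIMENSION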
+
+ def clean_text(text):
+     """Normalize whitespace and clean text while preserving paragraph breaks."""
+     if not text:
+         return ""
+     text = text.replace('\t', ' ')
+     text = re.sub(r' +', ' ', text)
+     text = re.sub(r'\n{3,}', '\n\n', text)
+     return text.strip()
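+
+ # Illustrative behavior: tabs become spaces, space runs collapse, and runs of
+ # blank lines shrink to a single blank line:
+ #   clean_text("a\tb   c\n\n\n\nd") == "a b c\n\nd"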
+
+ # --- PDF Processing Functions (Mostly unchanged from previous version, but updated to use global paths) ---
+ import fitz  # PyMuPDF
+ import camelot  # For table extraction
+ import pandas as pd
+ from bs4 import BeautifulSoup
+
+ IMAGE_MIN_WIDTH = 100
+ IMAGE_MIN_HEIGHT = 100
+
+ def extract_page_data_pymupdf(pdf_path):
+     """Extract text, tables and save images from each page using PyMuPDF and Camelot."""
+     page_data_list = []
+     try:
+         doc = fitz.open(pdf_path)
+         metadata = doc.metadata or {}
+         pdf_data = {
+             'pdf_title': metadata.get('title', pdf_path.name),
+             'pdf_subject': metadata.get('subject', 'Statistiques'),
+             'pdf_keywords': metadata.get('keywords', '')
+         }
+
+         for page_num in range(len(doc)):
+             page = doc.load_page(page_num)
+             page_index = page_num + 1  # 1-based index
+
+             print(f"  Extraction des données de la page {page_index}...")
+
+             # Extract tables first
+             table_data = extract_tables_and_images_from_page(pdf_path, page, page_index)
+
+             # Track table regions to avoid double-processing text
+             table_regions = []
+             for item in table_data:
+                 if 'rect' in item and item['rect'] and len(item['rect']) == 4:
+                     table_regions.append(fitz.Rect(item['rect']))
+                 else:
+                     print(f"  Warning: Invalid rect for table on page {page_index}")
+
+             # Extract text excluding table regions
+             page_text = ""
+             if table_regions:
+                 blocks = page.get_text("blocks")
+                 for block in blocks:
+                     block_rect = fitz.Rect(block[:4])
+                     is_in_table = False
+                     for table_rect in table_regions:
+                         if block_rect.intersects(table_rect):
+                             is_in_table = True
+                             break
+                     if not is_in_table:
+                         page_text += block[4] + "\n"
+             else:
+                 page_text = page.get_text("text")
+
+             page_text = clean_text(page_text)
+
+             # Extract and save images (excluding those identified as tables)
+             image_data = extract_images_from_page(pdf_path, page, page_index, excluded_rects=table_regions)
+
+             page_data_list.append({
+                 'pdf_file': pdf_path.name,
+                 'page_number': page_index,
+                 'text': page_text,
+                 'images': image_data,
+                 'tables': [item for item in table_data if item['content_type'] == 'table'],
+                 'pdf_title': pdf_data.get('pdf_title'),
+                 'pdf_subject': pdf_data.get('pdf_subject'),
+                 'pdf_keywords': pdf_data.get('pdf_keywords')
+             })
+         doc.close()
+     except Exception as e:
+         print(f"Erreur lors du traitement du PDF {pdf_path.name} avec PyMuPDF : {str(e)}")
+         traceback.print_exc()
+     return page_data_list
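+
+ # Each element of the returned list is a dict shaped like (see the append above):
+ #   {'pdf_file': 'doc.pdf', 'page_number': 1, 'text': '...', 'images': [...],
+ #    'tables': [...], 'pdf_title': '...', 'pdf_subject': '...', 'pdf_keywords': '...'}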
+
+
+ def extract_tables_and_images_from_page(pdf_path, page, page_num):
+     """Extract tables using Camelot and capture images of table areas."""
+     table_and_image_data = []
+     try:
+         tables = camelot.read_pdf(
+             str(pdf_path),
+             pages=str(page_num),
+             flavor='lattice',
+         )
+
+         if len(tables) == 0:
+             tables = camelot.read_pdf(
+                 str(pdf_path),
+                 pages=str(page_num),
+                 flavor='stream'
+             )
+
+         for i, table in enumerate(tables):
+             if table.accuracy < 70:
+                 print(f"  Skipping low accuracy table ({table.accuracy:.2f}%) on page {page_num}")
+                 continue
+
+             # Camelot exposes the table bbox via its private `_bbox` attribute, in PDF
+             # coordinates (origin bottom-left); convert to PyMuPDF's top-left origin.
+             table_bbox = getattr(table, '_bbox', None)
+             if not table_bbox or len(table_bbox) != 4:
+                 print(f"  Warning: Invalid bounding box for table {i} on page {page_num}. Skipping image capture.")
+                 table_rect = None
+             else:
+                 x1, y1, x2, y2 = table_bbox
+                 page_height = page.rect.height
+                 table_rect = fitz.Rect(x1, page_height - y2, x2, page_height - y1)
+
+             safe_pdf_name = "".join(c if c.isalnum() else "_" for c in pdf_path.stem)
+             table_html_filename = f"{safe_pdf_name}_p{page_num}_table{i}.html"
+             table_html_save_path = TABLE_SAVE_DIR / table_html_filename
+             relative_html_url_path = f"/static/{TABLE_SAVE_SUBDIR}/{table_html_filename}"
+
+             table_image_filename = f"{safe_pdf_name}_p{page_num}_table{i}.png"
+             table_image_save_path = IMAGE_SAVE_DIR / table_image_filename
+             relative_image_url_path = f"/static/{IMAGE_SAVE_SUBDIR}/{table_image_filename}"
+
+             df = table.df
+             html = f"<caption>Tableau extrait de {pdf_path.name}, page {page_num}</caption>\n" + df.to_html(index=False)
+             soup = BeautifulSoup(html, 'html.parser')
+             table_tag = soup.find('table')
+             if table_tag:
+                 table_tag['class'] = 'table table-bordered table-striped'
+                 table_tag['style'] = 'width:100%; border-collapse:collapse;'
+
+                 style_tag = soup.new_tag('style')
+                 style_tag.string = """
+                 .table { border-collapse: collapse; width: 100%; margin-bottom: 1rem;}
+                 .table caption { caption-side: top; padding: 0.5rem; text-align: left; font-weight: bold; }
+                 .table th, .table td { border: 1px solid #ddd; padding: 8px; text-align: left; }
+                 .table th { background-color: #f2f2f2; font-weight: bold; }
+                 .table-striped tbody tr:nth-of-type(odd) { background-color: rgba(0,0,0,.05); }
+                 .table-responsive { overflow-x: auto; margin-bottom: 1rem; }
+                 """
+                 soup.insert(0, style_tag)
+
+                 div = soup.new_tag('div')
+                 div['class'] = 'table-responsive'
+                 table_tag.wrap(div)
+
+                 with open(table_html_save_path, 'w', encoding='utf-8') as f:
+                     f.write(str(soup))
+             else:
+                 print(f"  Warning: Could not find table tag in HTML for table on page {page_num}. Skipping HTML save.")
+                 continue
+
+             table_image_bytes = None
+             if table_rect:
+                 try:
+                     pix = page.get_pixmap(clip=table_rect)
+                     table_image_bytes = pix.tobytes("png")
+
+                     with open(table_image_save_path, "wb") as img_file:
+                         img_file.write(table_image_bytes)
+
+                 except Exception as img_capture_e:
+                     print(f"  Erreur lors de la capture d'image du tableau {i} page {page_num} : {img_capture_e}")
+                     traceback.print_exc()
+                     table_image_bytes = None
+
+             table_and_image_data.append({
+                 'content_type': 'table',
+                 'table_html_url': relative_html_url_path,
+                 'table_text_representation': df.to_string(index=False),
+                 'rect': [table_rect.x0, table_rect.y0, table_rect.x1, table_rect.y1] if table_rect else None,
+                 'accuracy': table.accuracy,
+                 'image_bytes': table_image_bytes,
+                 'image_url': relative_image_url_path if table_image_bytes else None
+             })
+
+         return table_and_image_data
+
+     except Exception as e:
+         print(f"  Erreur lors de l'extraction des tableaux de la page {page_num} : {str(e)}")
+         traceback.print_exc()
+         return []
+
+
+ def extract_images_from_page(pdf_path, page, page_num, excluded_rects=[]):
+     """Extract and save images from a page, excluding specified regions (like tables)."""
+     image_data = []
+     image_list = page.get_images(full=True)
+
+     for img_index, img_info in enumerate(image_list):
+         xref = img_info[0]
+         try:
+             base_image = page.parent.extract_image(xref)
+             image_bytes = base_image["image"]
+             image_ext = base_image["ext"]
+             width = base_image["width"]
+             height = base_image["height"]
+
+             if width < IMAGE_MIN_WIDTH or height < IMAGE_MIN_HEIGHT:
+                 continue
+
+             img_rect = None
+             img_rects = page.get_image_rects(xref)
+             if img_rects:
+                 img_rect = img_rects[0]
+
+             if img_rect is None:
+                 print(f"  Warning: Could not find rectangle for image {img_index} on page {page_num}. Skipping.")
+                 continue
+
+             is_excluded = False
+             for excluded_rect in excluded_rects:
+                 if img_rect.intersects(excluded_rect):
+                     is_excluded = True
+                     break
+             if is_excluded:
+                 print(f"  Image {img_index} on page {page_num} is within an excluded region (e.g., table). Skipping.")
+                 continue
+
+             safe_pdf_name = "".join(c if c.isalnum() else "_" for c in pdf_path.stem)
+             image_filename = f"{safe_pdf_name}_p{page_num}_img{img_index}.{image_ext}"
+             image_save_path = IMAGE_SAVE_DIR / image_filename
+             relative_url_path = f"/static/{IMAGE_SAVE_SUBDIR}/{image_filename}"
+
+             with open(image_save_path, "wb") as img_file:
+                 img_file.write(image_bytes)
+
+             image_data.append({
+                 'content_type': 'image',
+                 'image_url': relative_url_path,
+                 'rect': [img_rect.x0, img_rect.y0, img_rect.x1, img_rect.y1],
+                 'image_bytes': image_bytes
+             })
+
+         except Exception as img_save_e:
+             print(f"  Erreur lors du traitement de l'image {img_index} de la page {page_num} : {img_save_e}")
+             traceback.print_exc()
+
+     return image_data
+
+ # --- Embedding and Description Generation Functions (Modified for Gemma and Sentence-Transformers) ---
+
+ def token_chunking(text, max_tokens, encoding):
+     """Chunk text based on token count with smarter boundaries (sentences, paragraphs)."""
+     if not text:
+         return []
+
+     tokens = encoding.encode(text)
+     chunks = []
+     start_token_idx = 0
+
+     while start_token_idx < len(tokens):
+         end_token_idx = min(start_token_idx + max_tokens, len(tokens))
+
+         if end_token_idx < len(tokens):
+             look_ahead_limit = min(start_token_idx + max_tokens * 2, len(tokens))
+             text_segment_to_check = encoding.decode(tokens[start_token_idx:look_ahead_limit])
+
+             paragraph_break = text_segment_to_check.rfind('\n\n', 0, len(text_segment_to_check) - (look_ahead_limit - (start_token_idx + max_tokens)))
+             if paragraph_break != -1:
+                 tokens_up_to_break = encoding.encode(text_segment_to_check[:paragraph_break])
+                 end_token_idx = start_token_idx + len(tokens_up_to_break)
+             else:
+                 sentence_end = re.search(r'[.!?]\s+', text_segment_to_check[:len(text_segment_to_check) - (look_ahead_limit - (start_token_idx + max_tokens))][::-1])
+                 if sentence_end:
+                     char_index_in_segment = len(text_segment_to_check) - 1 - sentence_end.start()
+                     tokens_up_to_end = encoding.encode(text_segment_to_check[:char_index_in_segment + 1])
+                     end_token_idx = start_token_idx + len(tokens_up_to_end)
+
+         current_chunk_tokens = tokens[start_token_idx:end_token_idx]
+         chunk_text = encoding.decode(current_chunk_tokens).strip()
+
+         if chunk_text:
+             chunks.append(chunk_text)
+
+         if start_token_idx == end_token_idx:
+             start_token_idx += 1
+         else:
+             start_token_idx = end_token_idx
+
+     return chunks
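+
+ # Illustrative usage, mirroring the text-chunk loop in process_pdfs_in_directory:
+ #   enc = tiktoken.get_encoding(ENCODING_NAME)
+ #   text_chunks = token_chunking(page_text, MAX_TOKENS_NORMAL, enc)
+ # Each returned chunk holds at most ~MAX_TOKENS_NORMAL tokens, split
+ # preferentially at paragraph or sentence boundaries.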
+
+
+ def generate_multimodal_description(image_bytes, prompt_text, max_retries=5, delay=10):
+     """
+     Generate a text description for an image using the Gemma multimodal model.
+     Returns the description text, or None if all retries fail or the model is not initialized.
+     """
+     global gemma_sampler
+
+     if gemma_sampler is None:
+         print("  Skipping multimodal description generation: Gemma sampler is not initialized.")
+         return None
+
+     # Convert the image bytes to a PIL Image, then to a JAX NumPy array
+     try:
+         pil_image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
+         # Gemma expects an (H, W, C) array
+         image_np = np.asarray(pil_image)
+         gemma_image_input = jnp.asarray(image_np)
+         # Add a batch dimension
+         gemma_image_input = jnp.expand_dims(gemma_image_input, axis=0)  # Shape: (1, H, W, C)
+     except Exception as e:
+         print(f"  Erreur lors de la conversion de l'image pour Gemma : {e}")
+         return None
+
+     for attempt in range(max_retries):
+         try:
+             time.sleep(delay + random.uniform(0, 5))
+
+             # Gemma 3 inserts images at its `<start_of_image>` placeholder token
+             full_prompt = f"{prompt_text} <start_of_image>"
+
+             # Use sampler.chat for turn-based interaction. The `images` argument is
+             # passed here as (batch, num_images, H, W, C); with one image per prompt,
+             # reshape (1, H, W, C) to (1, 1, H, W, C).
+             final_gemma_image_input = jnp.expand_dims(gemma_image_input, axis=1)  # Shape: (1, 1, H, W, C)
+
+             out = gemma_sampler.chat(
+                 full_prompt,
+                 images=final_gemma_image_input,
+                 max_tokens=500  # Limit response length
+             )
+             description = out.strip()
+
+             if description:
+                 return description
+             else:
+                 print(f"  Tentative {attempt+1}/{max_retries}: Réponse vide ou inattendue du modèle multimodal Gemma.")
+                 if attempt < max_retries - 1:
+                     retry_delay = delay * (2 ** attempt) + random.uniform(1, 5)
+                     print(f"  Réessai dans {retry_delay:.2f}s...")
+                     time.sleep(retry_delay)
+                     continue
+
+         except Exception as e:
+             error_msg = str(e)
+             print(f"  Tentative {attempt+1}/{max_retries} échouée pour la description (Gemma) : {error_msg}")
+             # Gemma runs locally, so there are no API errors like 429; these are general errors.
+             if attempt < max_retries - 1:
+                 retry_delay = delay * (2 ** attempt) + random.uniform(1, 5)
+                 print(f"  Réessai dans {retry_delay:.2f}s...")
+                 time.sleep(retry_delay)
+                 continue
+             else:
+                 print(f"  Toutes les {max_retries} tentatives ont échoué pour la description Gemma.")
+                 return None
+     print(f"  Toutes les {max_retries} tentatives ont échoué pour la description (fin de boucle).")
+     return None
+
+
+ def generate_text_embedding(text_content, max_retries=5, delay=5):
+     """
+     Generate a text embedding using the Sentence-Transformers model.
+     Returns the embedding vector (list), or None if all retries fail or the model is not initialized.
+     """
+     global text_embedding_model
+
+     if text_embedding_model is None:
+         print("  Skipping text embedding generation: Sentence-Transformers model is not initialized.")
+         return None
+
+     if not text_content or not text_content.strip():
+         return None  # Cannot embed empty text
+
+     for attempt in range(max_retries):
+         try:
+             time.sleep(delay + random.uniform(0, 0.5))  # Shorter delay for the local model
+
+             # Sentence-Transformers encode method
+             embedding = text_embedding_model.encode(text_content, convert_to_numpy=True)
+             if embedding is not None and len(embedding) == EMBEDDING_DIMENSION:
+                 return embedding.tolist()  # Convert the numpy array to a list for JSON serialization
+             else:
+                 print(f"  Tentative {attempt+1}/{max_retries}: Format d'embedding Sentence-Transformers inattendu. Réponse : {embedding}")
+                 return None
+
+         except Exception as e:
+             error_msg = str(e)
+             print(f"  Tentative {attempt+1}/{max_retries} échouée pour l'embedding (Sentence-Transformers) : {error_msg}")
+             if attempt < max_retries - 1:
+                 retry_delay = delay * (2 ** attempt) + random.uniform(0.5, 2)
+                 print(f"  Réessai dans {retry_delay:.2f}s...")
+                 time.sleep(retry_delay)
+                 continue
+             else:
+                 print(f"  Toutes les {max_retries} tentatives ont échoué pour l'embedding (Sentence-Transformers).")
+                 return None
+     print(f"  Toutes les {max_retries} tentatives ont échoué pour l'embedding (fin de boucle).")
+     return None
+
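+ # Illustrative helper for downstream RAG code (not called by the pipeline itself):
+ # cosine similarity between two embedding vectors as stored in the output JSON.
+ def cosine_similarity(vec_a, vec_b):
+     a, b = np.asarray(vec_a), np.asarray(vec_b)
+     return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
+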
  # --- Main Processing Function ---

  def process_pdfs_in_directory(directory):

          processed_files += 1
          print(f"\nTraitement de {pdf_file_path.name} ({processed_files}/{total_files})...")

+         page_data_list = extract_page_data_pymupdf(pdf_file_path)

          if not page_data_list:
              print(f"  Aucune donnée extraite de {pdf_file_path.name}.")

              pdf_file = page_data['pdf_file']
              page_num = page_data['page_number']
              page_text = page_data['text']
+             images = page_data['images']
+             tables = page_data['tables']
              pdf_title = page_data.get('pdf_title')
              pdf_subject = page_data.get('pdf_subject')
              pdf_keywords = page_data.get('pdf_keywords')

                      print(f"  Page {page_num}: Génération de la description multimodale pour le tableau {table_idx}...")
                      description = generate_multimodal_description(table_image_bytes, prompt)
                  elif table_text_repr:
+                     # Fallback for a text-only table description, using Gemma's text capabilities
+                     if gemma_sampler:
+                         prompt = f"Décrivez en français le contenu et la structure de ce tableau basé sur sa représentation textuelle:\n{table_text_repr[:1000]}..."
+                         print(f"  Page {page_num}: Génération de la description textuelle pour le tableau {table_idx} (fallback via Gemma)...")
                          try:
+                             # Gemma text-only generation
+                             out = gemma_sampler.chat(prompt, max_tokens=500)
+                             description = out.strip()
                          except Exception as e:
+                             print(f"  Erreur lors de la génération de description textuelle pour le tableau {table_idx} via Gemma : {e}")
                              description = None
                      else:
+                         print("  Skipping text description generation for table: Gemma sampler not initialized.")
                          description = None


                  if description:
                      print(f"  Page {page_num}: Description générée pour le tableau {table_idx}.")
+                     embedding_vector = generate_text_embedding(description)

                      if embedding_vector is not None:
                          chunk_data = {

                  if description:
                      print(f"  Page {page_num}: Description générée pour l'image {img_idx}.")
+                     embedding_vector = generate_text_embedding(description)

                      if embedding_vector is not None:
                          chunk_data = {

              for chunk_idx, chunk_content in enumerate(text_chunks):
                  print(f"  Page {page_num}: Génération de l'embedding pour le chunk de texte {chunk_idx}...")
+                 embedding_vector = generate_text_embedding(chunk_content)

                  if embedding_vector is not None:
                      chunk_data = {

  # --- Main Execution ---
  if __name__ == "__main__":
+     print("Démarrage du traitement PDF multimodal avec génération de descriptions (Gemma) et embeddings textuels multilingues (Sentence-Transformers)...")

      # Validate and create directories
      if not PDF_DIRECTORY.is_dir():
+         print(f"❌ ERREUR: Répertoire PDF non trouvé ou n'est pas un répertoire : {PDF_DIRECTORY}. Veuillez créer un répertoire 'docs' et y placer vos PDFs.")
+         # The directory could be created here with PDF_DIRECTORY.mkdir(parents=True, exist_ok=True),
+         # but in Colab it is usually better to instruct the user to upload files.
          exit(1)

      OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

      print(f"Répertoire de sauvegarde des images : {IMAGE_SAVE_DIR}")
      print(f"Répertoire de sauvegarde des tableaux (HTML) : {TABLE_SAVE_DIR}")

+     # Initialize the Gemma and Sentence-Transformers models
+     initialize_models()
+
+     # If either model failed to initialize, exit
+     if gemma_sampler is None or text_embedding_model is None:
+         print("Impossible de continuer car un ou plusieurs modèles n'ont pas pu être initialisés.")
+         exit(1)

      final_embeddings = process_pdfs_in_directory(PDF_DIRECTORY)
666