Anonymous1223334444 committed
Commit 2721ce7 · Parent: ad96b83
Update

Files changed:
- README.md (+36 −34)
- requirements.txt (+5 −2)
- run_pipeline.py (+485 −30)
README.md
CHANGED
@@ -7,48 +7,47 @@ tags:
 - rag
 - google-cloud
 - vertex-ai
+- gemma
 - python
 datasets:
+- no_dataset
 license: mit
 ---
 
-# Multimodal & Multilingual PDF Embedding Pipeline
+# Multimodal & Multilingual PDF Embedding Pipeline with Gemma and Vertex AI
 
-This repository hosts a Python pipeline that extracts text, tables, and images from PDF documents, generates multimodal descriptions for visual content (tables and images)
+This repository hosts a Python pipeline that extracts text, tables, and images from PDF documents, generates multimodal descriptions for visual content (tables and images) using **Google's Gemma model (running locally)**, and then creates multilingual text embeddings for all extracted information using **Google Cloud Vertex AI's `text-multilingual-embedding-002` model**. The generated embeddings are stored in a JSON file, ready for use in Retrieval Augmented Generation (RAG) systems or other downstream applications.
 
 **Key Features:**
-- **Multimodal:** Processes
-- **Multilingual:** Leverages Google's `text-multilingual-embedding-002` model for embeddings, supporting a wide range of languages.
-- **Contextual Descriptions:** Uses Google Gemini (Gemini 1.5 Flash) to generate descriptive text for tables and images in French.
+- **Multimodal Descriptions (via Gemma):** Processes tables and images from PDFs, generating rich descriptive text in French using the open-source Gemma 3.4B-IT model, which runs locally on your machine/Colab GPU.
+- **Multilingual Text Embeddings (via Vertex AI):** Leverages Google Cloud's `text-multilingual-embedding-002` model for embeddings, supporting a wide range of languages.
 - **Structured Output:** Stores embeddings and metadata (PDF source, page number, content type, links to extracted assets) in a comprehensive JSON format.
 
 ## How it Works
 
 1. **PDF Parsing:** Utilizes `PyMuPDF` to extract text blocks and images, and `Camelot` to accurately extract tabular data.
 2. **Content Separation:** Distinguishes between plain text, tables, and non-table images.
-3. **Multimodal Description (for Tables & Images):**
+3. **Multimodal Description (for Tables & Images using Gemma):**
    - For tables, the pipeline captures an image of the table and also uses its text representation.
    - For standalone images (e.g., graphs, charts), it captures the image.
-   - These images are then
+   - These images (and optionally table text) are then passed to the **Gemma 3.4B-IT model** (via the `gemma` Python library) with specific prompts to generate rich, descriptive text in French. **This step runs locally and does not incur direct API costs.**
-4. **Multilingual Text Embedding:**
+4. **Multilingual Text Embedding (via Vertex AI):**
    - The cleaned text content (original text chunks, or generated descriptions for tables/images) is then passed to the `text-multilingual-embedding-002` model (via Vertex AI).
-   - This model generates a high-dimensional embedding vector (768 dimensions) for each piece of content.
+   - This model generates a high-dimensional embedding vector (768 dimensions) for each piece of content. **This step connects to Google Cloud Vertex AI and will incur costs.**
 5. **JSON Output:** All generated embeddings, along with rich metadata (original PDF, page, content type, links to extracted assets), are compiled into a single JSON file.
 
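For orientation, one record in that JSON file might look like the sketch below. The field names follow the metadata listed above but are illustrative, not a guaranteed schema:

```python
# Hypothetical shape of one output record (field names are illustrative):
record = {
    "pdf_file": "your_document.pdf",      # source PDF
    "page_number": 3,                     # 1-based page index
    "content_type": "table",              # e.g. "text", "table", or "image"
    "table_html_url": "/static/extracted_tables/your_document_p3_table0.html",
    "image_url": "/static/extracted_graphs/your_document_p3_table0.png",
    "embedding": [0.012, -0.045, 0.103],  # truncated; real vectors are much longer
}
```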
 ## Requirements & Setup
 
-This pipeline
+This pipeline uses a combination of local models (Gemma) and **Google Cloud Platform** services (Vertex AI).
 
-1. **
-- **
+1. **Google Cloud Project with Billing Enabled (for Text Embeddings):**
+   - **CRITICAL:** The text embedding generation step uses Google Cloud Vertex AI. This **will incur costs** on your Google Cloud Platform account. Ensure you have an [active billing account](https://cloud.google.com/billing/docs/how-to/create-billing-account) linked to your project.
 - Enable the **Vertex AI API**.
+2. **Authentication for Google Cloud (for Text Embeddings):**
+   - The easiest way to run this in a Colab environment is using `google.colab.auth.authenticate_user()`.
+   - For local execution, ensure your Google Cloud SDK is configured and authenticated (`gcloud auth application-default login`).
+3. **Hardware Requirements (for Gemma):**
+   - Running the Gemma 3.4B-IT model requires a **GPU with sufficient VRAM** (e.g., a Colab T4 or V100 GPU, or a local GPU with at least ~8-10GB VRAM is recommended). If a GPU is not available, Gemma will likely run on CPU but will be significantly slower.
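A minimal sketch for checking that an accelerator is actually visible before loading Gemma, assuming `jax` is already installed:

```python
import jax

# Expect a GPU entry (e.g. [CudaDevice(id=0)]) on a correctly configured runtime;
# a CPU-only list means Gemma will still run, just much more slowly.
print(jax.devices())
```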
 
 ### Local Setup
 
@@ -57,7 +56,7 @@ This pipeline relies on **Google Cloud Platform** services and specific Python libraries.
 git clone https://huggingface.co/Anonymous1223334444/pdf-multimodal-multilingual-embedding-pipeline
 cd pdf-multimodal-multilingual-embedding-pipeline
 ```
-2. **Install dependencies:**
+2. **Install Python dependencies:**
 ```bash
 pip install -r requirements.txt
 ```
 
@@ -75,13 +74,12 @@ This pipeline relies on **Google Cloud Platform** services and specific Python libraries.
 ```
 *Note: If you are running on Windows or macOS, the installation steps for `camelot-py` might differ. Refer to the [Camelot documentation](https://camelot-py.readthedocs.io/en/master/user/install-deps.html) for more details.*
 
-3. **Set up Environment Variables:**
+3. **Set up Environment Variables (for Vertex AI Text Embeddings):**
 ```bash
 export GOOGLE_CLOUD_PROJECT="your-gcp-project-id"
 export VERTEX_AI_LOCATION="us-central1" # Or your preferred Vertex AI region (e.g., us-east4)
-export GEMINI_API_KEY="your-gemini-api-key"
 ```
-Replace `your-gcp-project-id
+Replace `your-gcp-project-id` and `us-central1` with your actual Google Cloud Project ID and Vertex AI region.
 
 4. **Place your PDF files:**
 Create a `docs` directory in the root of the repository and place your PDF documents inside it.
 
@@ -89,7 +87,7 @@ This pipeline relies on **Google Cloud Platform** services and specific Python libraries.
 pdf-multimodal-multilingual-embedding-pipeline/
 ├── docs/
 │   └── your_document.pdf
+└── another_document.pdf
 ```
 
 5. **Run the pipeline:**
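At startup, the script is expected to pick up the variables from step 3; a minimal sketch of that lookup (the error message is illustrative):

```python
import os

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")
location = os.environ.get("VERTEX_AI_LOCATION", "us-central1")
if not project_id:
    raise RuntimeError("GOOGLE_CLOUD_PROJECT is not set; see step 3 above.")
```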
@@ -100,45 +98,49 @@ This pipeline relies on **Google Cloud Platform** services and specific Python libraries.
 
 ### Google Colab Usage
 
-A Colab notebook version of this pipeline is ideal for quick experimentation due to pre-configured environments.
+A Colab notebook version of this pipeline is ideal for quick experimentation due to pre-configured environments and GPU access.
 
 1. **Open a new Google Colab notebook.**
-2. **
+2. **Change runtime to GPU:** Go to `Runtime > Change runtime type` and select `T4 GPU` or `V100 GPU`.
+3. **Install system and Python dependencies:**
 ```python
 !pip uninstall -y camelot camelot-py # Ensure clean install
 !pip install PyMuPDF
 !apt-get update
 !apt-get install -y ghostscript python3-tk libopencv-dev python3-opencv
-!pip install camelot-py[cv] google-cloud-aiplatform tiktoken pandas beautifulsoup4 Pillow
+!pip install camelot-py[cv] google-cloud-aiplatform tiktoken pandas beautifulsoup4 Pillow gemma jax jaxlib numpy
 ```
+4. **Authenticate to Google Cloud (for Vertex AI):**
 ```python
 from google.colab import auth
 auth.authenticate_user()
 ```
+5. **Set your Google Cloud Project ID and Location:**
 ```python
 import os
-# Replace with your actual Gemini API key
-os.environ["GENAI_API_KEY"] = "YOUR_GEMINI_API_KEY_HERE"
 # Replace with your actual Google Cloud Project ID
 os.environ["GOOGLE_CLOUD_PROJECT"] = "YOUR_GCP_PROJECT_ID_HERE"
 # Set your preferred Vertex AI location (e.g., "us-central1", "us-east4")
 os.environ["VERTEX_AI_LOCATION"] = "us-central1"
+
+# Critical: Adjust JAX memory allocation for Gemma
+os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00"
 ```
+6. **Upload your PDF files:**
 You can use the Colab file upload feature or mount Google Drive. Ensure your PDFs are in a directory named `docs` within `/content/`.
 ```python
 # Example for uploading
 from google.colab import files
 import os
+from pathlib import Path
+
 PDF_DIRECTORY = Path("/content/docs")
 PDF_DIRECTORY.mkdir(parents=True, exist_ok=True)
 uploaded = files.upload()
 for filename in uploaded.keys():
     os.rename(filename, PDF_DIRECTORY / filename)
 ```
+7. **Copy and paste the code from `src/pdf_processor.py`, `src/embedding_utils.py` and `run_pipeline.py` into Colab cells and execute.** Make sure to execute `embedding_utils.py` content first, then `pdf_processor.py` content, then `run_pipeline.py` content, or combine them logically into your notebook.
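After step 7, a final cell along these lines drives the whole run; this is a sketch based on the `__main__` block of `run_pipeline.py` and assumes the pasted definitions are in scope:

```python
# Load Gemma and the text embedding model once, then process every PDF in docs/.
initialize_models()
final_embeddings = process_pdfs_in_directory(PDF_DIRECTORY)
print(f"Generated {len(final_embeddings)} embedding records")
```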
 
 ## Output
 
@@ -192,7 +194,7 @@ The pipeline will generate:
 
 # Acknowledgments
 This pipeline leverages the power of:
+- Gemma AI
 - Google AI Gemini Models
 - PyMuPDF
 - Camelot
requirements.txt
CHANGED
@@ -1,8 +1,11 @@
 PyMuPDF
 camelot-py[cv]
 google-cloud-aiplatform
-google-generativeai
 tiktoken
 pandas
 beautifulsoup4
 Pillow
+gemma
+jax
+jaxlib
+numpy
run_pipeline.py
CHANGED
@@ -1,32 +1,480 @@
 import os
 import json
 import traceback
+import re
+import time
+import random
 from pathlib import Path
 import tiktoken
+import numpy as np
+from PIL import Image  # Pillow for image handling
+import io  # To handle image bytes
+
+# Gemma imports
+import jax.numpy as jnp
+# For Gemma models, we need a specific setup to load the model.
+# For JAX/GPU memory allocation:
+os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00"
+from gemma import gm
+
+# Sentence-Transformers for text embedding
+from sentence_transformers import SentenceTransformer
 
-# Import functions from your src directory
-from src.pdf_processor import extract_page_data_pymupdf, clean_text
-from src.embedding_utils import initialize_clients, token_chunking, generate_multimodal_description, generate_text_embedding, ENCODING_NAME, MAX_TOKENS_NORMAL
 
 # --- Configuration ---
+# Set the desired Gemma model
+GEMMA_MULTIMODAL_MODEL = "gemma-3.4b-it"  # You can choose other Gemma variants if available and suitable
+
+# Set the desired Sentence-Transformers model for text embeddings.
+# This is a good free, multilingual model.
+SENTENCE_TRANSFORMER_MODEL = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
+# The dimension of embeddings for this model
+EMBEDDING_DIMENSION = 384  # MiniLM-L12-v2 produces 384-dimensional embeddings
+
+MAX_TOKENS_NORMAL = 500
+ENCODING_NAME = "cl100k_base"  # Used for token chunking, consistent
 
 # Path configuration
-BASE_DIR = Path
+BASE_DIR = Path("/content/")  # Default for Colab environment
 PDF_DIRECTORY = BASE_DIR / "docs"
 OUTPUT_DIR = BASE_DIR / "output"
-EMBEDDINGS_FILE_PATH = OUTPUT_DIR / "
+EMBEDDINGS_FILE_PATH = OUTPUT_DIR / "embeddings_statistiques_multimodal_gemma_st.json"
 
 # Directory to save extracted images and tables HTML (within output)
 IMAGE_SAVE_SUBDIR = "extracted_graphs"
 TABLE_SAVE_SUBDIR = "extracted_tables"
-# Absolute paths for saving
 IMAGE_SAVE_DIR = OUTPUT_DIR / IMAGE_SAVE_SUBDIR
 TABLE_SAVE_DIR = OUTPUT_DIR / TABLE_SAVE_SUBDIR
 
+# Global models
+gemma_sampler = None
+text_embedding_model = None
+
+def initialize_models():
+    """Initializes Gemma and Sentence-Transformers models."""
+    global gemma_sampler, text_embedding_model
+
+    print("✓ Initializing Gemma Multimodal Model...")
+    try:
+        model = gm.nn.Gemma3_4B()  # Initialize Gemma model
+        # Load Gemma parameters
+        params = gm.ckpts.load_params(gm.ckpts.CheckpointPath.GEMMA3_4B_IT)
+        gemma_sampler = gm.text.ChatSampler(model=model, params=params)
+        print(f"✓ Gemma Multimodal Model '{GEMMA_MULTIMODAL_MODEL}' loaded successfully.")
+    except Exception as e:
+        print(f"❌ ERREUR: Échec du chargement du modèle multimodal Gemma : {str(e)}")
+        print("⚠️ La génération de descriptions multimodales échouera.")
+        gemma_sampler = None
+
+    print(f"✓ Initializing Sentence-Transformers Model '{SENTENCE_TRANSFORMER_MODEL}'...")
+    try:
+        text_embedding_model = SentenceTransformer(SENTENCE_TRANSFORMER_MODEL)
+        print(f"✓ Modèle d'embedding textuel Sentence-Transformers '{SENTENCE_TRANSFORMER_MODEL}' chargé avec succès.")
+    except Exception as e:
+        print(f"❌ ERREUR: Échec du chargement du modèle d'embedding textuel Sentence-Transformers : {str(e)}")
+        print("⚠️ La génération d'embeddings textuels échouera.")
+        text_embedding_model = None
+
+
+def clean_text(text):
+    """Normalize whitespace and clean text while preserving paragraph breaks"""
+    if not text:
+        return ""
+    text = text.replace('\t', ' ')
+    text = re.sub(r' +', ' ', text)
+    text = re.sub(r'\n{3,}', '\n\n', text)
+    return text.strip()
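For example, tabs and repeated spaces collapse to single spaces, and runs of three or more newlines collapse to one blank line:

```python
assert clean_text("a\tb   c\n\n\n\nd") == "a b c\n\nd"
```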
+
+# --- PDF Processing Functions (Mostly unchanged from previous version, but updated to use global paths) ---
+import fitz  # PyMuPDF
+import camelot  # For table extraction
+import pandas as pd
+from bs4 import BeautifulSoup
+
+IMAGE_MIN_WIDTH = 100
+IMAGE_MIN_HEIGHT = 100
+
+def extract_page_data_pymupdf(pdf_path):
+    """Extract text, tables and save images from each page using PyMuPDF and Camelot."""
+    page_data_list = []
+    try:
+        doc = fitz.open(pdf_path)
+        metadata = doc.metadata or {}
+        pdf_data = {
+            'pdf_title': metadata.get('title', pdf_path.name),
+            'pdf_subject': metadata.get('subject', 'Statistiques'),
+            'pdf_keywords': metadata.get('keywords', '')
+        }
+
+        for page_num in range(len(doc)):
+            page = doc.load_page(page_num)
+            page_index = page_num + 1  # 1-based index
+
+            print(f"  Extraction des données de la page {page_index}...")
+
+            # Extract tables first
+            table_data = extract_tables_and_images_from_page(pdf_path, page, page_index)
+
+            # Track table regions to avoid double-processing text
+            table_regions = []
+            for item in table_data:
+                if 'rect' in item and item['rect'] and len(item['rect']) == 4:
+                    table_regions.append(fitz.Rect(item['rect']))
+                else:
+                    print(f"  Warning: Invalid rect for table on page {page_index}")
+
+            # Extract text excluding table regions
+            page_text = ""
+            if table_regions:
+                blocks = page.get_text("blocks")
+                for block in blocks:
+                    block_rect = fitz.Rect(block[:4])
+                    is_in_table = False
+                    for table_rect in table_regions:
+                        if block_rect.intersects(table_rect):
+                            is_in_table = True
+                            break
+                    if not is_in_table:
+                        page_text += block[4] + "\n"
+            else:
+                page_text = page.get_text("text")
+
+            page_text = clean_text(page_text)
+
+            # Extract and save images (excluding those identified as tables)
+            image_data = extract_images_from_page(pdf_path, page, page_index, excluded_rects=table_regions)
+
+            page_data_list.append({
+                'pdf_file': pdf_path.name,
+                'page_number': page_index,
+                'text': page_text,
+                'images': image_data,
+                'tables': [item for item in table_data if item['content_type'] == 'table'],
+                'pdf_title': pdf_data.get('pdf_title'),
+                'pdf_subject': pdf_data.get('pdf_subject'),
+                'pdf_keywords': pdf_data.get('pdf_keywords')
+            })
+        doc.close()
+    except Exception as e:
+        print(f"Erreur lors du traitement du PDF {pdf_path.name} avec PyMuPDF : {str(e)}")
+        traceback.print_exc()
+    return page_data_list
+
+
+def extract_tables_and_images_from_page(pdf_path, page, page_num):
+    """Extract tables using Camelot and capture images of table areas."""
+    table_and_image_data = []
+    try:
+        tables = camelot.read_pdf(
+            str(pdf_path),
+            pages=str(page_num),
+            flavor='lattice',
+        )
+
+        if len(tables) == 0:
+            tables = camelot.read_pdf(
+                str(pdf_path),
+                pages=str(page_num),
+                flavor='stream'
+            )
+
+        for i, table in enumerate(tables):
+            if table.accuracy < 70:
+                print(f"  Skipping low accuracy table ({table.accuracy:.2f}%) on page {page_num}")
+                continue
+
+            table_bbox = table.parsing_report.get('page_bbox', [0, 0, 0, 0])
+            if not table_bbox or len(table_bbox) != 4:
+                print(f"  Warning: Invalid bounding box for table {i} on page {page_num}. Skipping image capture.")
+                table_rect = None
+            else:
+                table_rect = fitz.Rect(table_bbox)
+
+            safe_pdf_name = "".join(c if c.isalnum() else "_" for c in pdf_path.stem)
+            table_html_filename = f"{safe_pdf_name}_p{page_num}_table{i}.html"
+            table_html_save_path = TABLE_SAVE_DIR / table_html_filename
+            relative_html_url_path = f"/static/{TABLE_SAVE_SUBDIR}/{table_html_filename}"
+
+            table_image_filename = f"{safe_pdf_name}_p{page_num}_table{i}.png"
+            table_image_save_path = IMAGE_SAVE_DIR / table_image_filename
+            relative_image_url_path = f"/static/{IMAGE_SAVE_SUBDIR}/{table_image_filename}"
+
+            df = table.df
+            html = f"<caption>Table extrait de {pdf_path.name}, page {page_num}</caption>\n" + df.to_html(index=False)
+            soup = BeautifulSoup(html, 'html.parser')
+            table_tag = soup.find('table')
+            if table_tag:
+                table_tag['class'] = 'table table-bordered table-striped'
+                table_tag['style'] = 'width:100%; border-collapse:collapse;'
+
+                style_tag = soup.new_tag('style')
+                style_tag.string = """
+                .table { border-collapse: collapse; width: 100%; margin-bottom: 1rem;}
+                .table caption { caption-side: top; padding: 0.5rem; text-align: left; font-weight: bold; }
+                .table th, .table td { border: 1px solid #ddd; padding: 8px; text-align: left; }
+                .table th { background-color: #f2f2f2; font-weight: bold; }
+                .table-striped tbody tr:nth-of-type(odd) { background-color: rgba(0,0,0,.05); }
+                .table-responsive { overflow-x: auto; margin-bottom: 1rem; }
+                """
+                soup.insert(0, style_tag)
+
+                div = soup.new_tag('div')
+                div['class'] = 'table-responsive'
+                table_tag.wrap(div)
+
+                with open(table_html_save_path, 'w', encoding='utf-8') as f:
+                    f.write(str(soup))
+            else:
+                print(f"  Warning: Could not find table tag in HTML for table on page {page_num}. Skipping HTML save.")
+                continue
+
+            table_image_bytes = None
+            if table_rect:
+                try:
+                    pix = page.get_pixmap(clip=table_rect)
+                    table_image_bytes = pix.tobytes(format='png')
+
+                    with open(table_image_save_path, "wb") as img_file:
+                        img_file.write(table_image_bytes)
+
+                except Exception as img_capture_e:
+                    print(f"  Erreur lors de la capture d'image du tableau {i} page {page_num} : {img_capture_e}")
+                    traceback.print_exc()
+                    table_image_bytes = None
+
+            table_and_image_data.append({
+                'content_type': 'table',
+                'table_html_url': relative_html_url_path,
+                'table_text_representation': df.to_string(index=False),
+                'rect': [table_rect.x0, table_rect.y0, table_rect.x1, table_rect.y1] if table_rect else None,
+                'accuracy': table.accuracy,
+                'image_bytes': table_image_bytes,
+                'image_url': relative_image_url_path if table_image_bytes else None
+            })
+
+        return table_and_image_data
+
+    except Exception as e:
+        print(f"  Erreur lors de l'extraction des tableaux de la page {page_num} : {str(e)}")
+        traceback.print_exc()
+        return []
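The lattice-then-stream order above matches how Camelot's two parsers differ: `lattice` needs ruled cell borders drawn in the PDF, while `stream` infers columns from whitespace. The same fallback as a standalone sketch (the file name is illustrative):

```python
import camelot

tables = camelot.read_pdf("docs/your_document.pdf", pages="1", flavor="lattice")
if len(tables) == 0:
    tables = camelot.read_pdf("docs/your_document.pdf", pages="1", flavor="stream")
print(len(tables))
```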
+
+def extract_images_from_page(pdf_path, page, page_num, excluded_rects=[]):
+    """Extract and save images from a page, excluding specified regions (like tables)."""
+    image_data = []
+    image_list = page.get_images(full=True)
+
+    for img_index, img_info in enumerate(image_list):
+        xref = img_info[0]
+        try:
+            base_image = page.parent.extract_image(xref)
+            image_bytes = base_image["image"]
+            image_ext = base_image["ext"]
+            width = base_image["width"]
+            height = base_image["height"]
+
+            if width < IMAGE_MIN_WIDTH or height < IMAGE_MIN_HEIGHT:
+                continue
+
+            img_rect = None
+            img_rects = page.get_image_rects(xref)
+            if img_rects:
+                img_rect = img_rects[0]
+
+            if img_rect is None:
+                print(f"  Warning: Could not find rectangle for image {img_index} on page {page_num}. Skipping.")
+                continue
+
+            is_excluded = False
+            for excluded_rect in excluded_rects:
+                if img_rect.intersects(excluded_rect):
+                    is_excluded = True
+                    break
+            if is_excluded:
+                print(f"  Image {img_index} on page {page_num} is within an excluded region (e.g., table). Skipping.")
+                continue
+
+            safe_pdf_name = "".join(c if c.isalnum() else "_" for c in pdf_path.stem)
+            image_filename = f"{safe_pdf_name}_p{page_num}_img{img_index}.{image_ext}"
+            image_save_path = IMAGE_SAVE_DIR / image_filename
+            relative_url_path = f"/static/{IMAGE_SAVE_SUBDIR}/{image_filename}"
+
+            with open(image_save_path, "wb") as img_file:
+                img_file.write(image_bytes)
+
+            image_data.append({
+                'content_type': 'image',
+                'image_url': relative_url_path,
+                'rect': [img_rect.x0, img_rect.y0, img_rect.x1, img_rect.y1],
+                'image_bytes': image_bytes
+            })
+
+        except Exception as img_save_e:
+            print(f"  Erreur lors du traitement de l'image {img_index} de la page {page_num} : {img_save_e}")
+            traceback.print_exc()
+
+    return image_data
+
+# --- Embedding and Description Generation Functions (Modified for Gemma and Sentence-Transformers) ---
+
+def token_chunking(text, max_tokens, encoding):
+    """Chunk text based on token count with smarter boundaries (sentences, paragraphs)"""
+    if not text:
+        return []
+
+    tokens = encoding.encode(text)
+    chunks = []
+    start_token_idx = 0
+
+    while start_token_idx < len(tokens):
+        end_token_idx = min(start_token_idx + max_tokens, len(tokens))
+
+        if end_token_idx < len(tokens):
+            look_ahead_limit = min(start_token_idx + max_tokens * 2, len(tokens))
+            text_segment_to_check = encoding.decode(tokens[start_token_idx:look_ahead_limit])
+
+            paragraph_break = text_segment_to_check.rfind('\n\n', 0, len(text_segment_to_check) - (look_ahead_limit - (start_token_idx + max_tokens)))
+            if paragraph_break != -1:
+                tokens_up_to_break = encoding.encode(text_segment_to_check[:paragraph_break])
+                end_token_idx = start_token_idx + len(tokens_up_to_break)
+            else:
+                sentence_end = re.search(r'[.!?]\s+', text_segment_to_check[:len(text_segment_to_check) - (look_ahead_limit - (start_token_idx + max_tokens))][::-1])
+                if sentence_end:
+                    char_index_in_segment = len(text_segment_to_check) - 1 - sentence_end.start()
+                    tokens_up_to_end = encoding.encode(text_segment_to_check[:char_index_in_segment + 1])
+                    end_token_idx = start_token_idx + len(tokens_up_to_end)
+
+        current_chunk_tokens = tokens[start_token_idx:end_token_idx]
+        chunk_text = encoding.decode(current_chunk_tokens).strip()
+
+        if chunk_text:
+            chunks.append(chunk_text)
+
+        if start_token_idx == end_token_idx:
+            start_token_idx += 1
+        else:
+            start_token_idx = end_token_idx
+
+    return chunks
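`token_chunking` scans up to `2 * max_tokens` ahead and prefers to cut at a paragraph break (`\n\n`), then at a sentence end, before falling back to a hard token cut. Usage sketch:

```python
enc = tiktoken.get_encoding(ENCODING_NAME)  # "cl100k_base"
text = "Premier paragraphe. Encore une phrase.\n\nDeuxième paragraphe."
print(token_chunking(text, 10, enc))  # likely splits at the blank line
```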
+
+def generate_multimodal_description(image_bytes, prompt_text, max_retries=5, delay=10):
+    """
+    Generate a text description for an image using the Gemma multimodal model.
+    Returns description text or None if all retries fail or model is not initialized.
+    """
+    global gemma_sampler
+
+    if gemma_sampler is None:
+        print("  Skipping multimodal description generation: Gemma sampler is not initialized.")
+        return None
+
+    # Convert image bytes to PIL Image and then to JAX NumPy array
+    try:
+        pil_image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
+        # Gemma expects an (H, W, C) numpy array, converted to JAX numpy
+        image_np = np.asarray(pil_image)
+        gemma_image_input = jnp.asarray(image_np)
+        # Gemma also expects a batch dimension, so add it
+        gemma_image_input = jnp.expand_dims(gemma_image_input, axis=0)  # Shape: (1, H, W, C)
+    except Exception as e:
+        print(f"  Erreur lors de la conversion de l'image pour Gemma : {e}")
+        return None
+
+    for attempt in range(max_retries):
+        try:
+            time.sleep(delay + random.uniform(0, 5))
+
+            # Gemma chat expects the <img> special token to mark the image insertion point
+            full_prompt = f"{prompt_text} <img>"
+
+            # Use sampler.chat for turn-based interaction. The images argument takes a
+            # JAX numpy array of shape (batch, num_images, H, W, C); gemma_image_input
+            # is (1, H, W, C), so add the num_images axis.
+            final_gemma_image_input = jnp.expand_dims(gemma_image_input, axis=1)  # Shape: (1, 1, H, W, C)
+
+            out = gemma_sampler.chat(
+                full_prompt,
+                images=final_gemma_image_input,
+                max_tokens=500  # Limit response length
+            )
+            description = out.strip()
+
+            if description:
+                return description
+            else:
+                print(f"  Tentative {attempt+1}/{max_retries}: Réponse vide ou inattendue du modèle multimodal Gemma.")
+                if attempt < max_retries - 1:
+                    retry_delay = delay * (2 ** attempt) + random.uniform(1, 5)
+                    print(f"  Réessai dans {retry_delay:.2f}s...")
+                    time.sleep(retry_delay)
+                    continue
+
+        except Exception as e:
+            error_msg = str(e)
+            print(f"  Tentative {attempt+1}/{max_retries} échouée pour la description (Gemma) : {error_msg}")
+            # Gemma is local, so no API errors like 429. Focus on general errors.
+            if attempt < max_retries - 1:
+                retry_delay = delay * (2 ** attempt) + random.uniform(1, 5)
+                print(f"  Réessai dans {retry_delay:.2f}s...")
+                time.sleep(retry_delay)
+                continue
+            else:
+                print(f"  Toutes les {max_retries} tentatives ont échoué pour la description Gemma.")
+                return None
+    print(f"  Toutes les {max_retries} tentatives ont échoué pour la description (fin de boucle).")
+    return None
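A usage sketch for the function above, assuming `initialize_models()` has already run and `chart.png` exists:

```python
with open("chart.png", "rb") as f:
    description = generate_multimodal_description(f.read(), "Décris ce graphique en français.")
print(description)  # None if Gemma failed to load or every retry failed
```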
+
+def generate_text_embedding(text_content, max_retries=5, delay=5):
+    """
+    Generate text embedding using the Sentence-Transformers model.
+    Returns embedding vector (list) or None if all retries fail or model is not initialized.
+    """
+    global text_embedding_model
+
+    if text_embedding_model is None:
+        print("  Skipping text embedding generation: Sentence-Transformers model is not initialized.")
+        return None
+
+    if not text_content or not text_content.strip():
+        return None  # Cannot embed empty text
+
+    for attempt in range(max_retries):
+        try:
+            time.sleep(delay + random.uniform(0, 0.5))  # Shorter delay for local model
+
+            # Sentence-Transformers encode method
+            embedding = text_embedding_model.encode(text_content, convert_to_numpy=True)
+            if embedding is not None and len(embedding) == EMBEDDING_DIMENSION:
+                return embedding.tolist()  # Convert numpy array to list for JSON serialization
+            else:
+                print(f"  Tentative {attempt+1}/{max_retries}: Format d'embedding Sentence-Transformers inattendu. Réponse : {embedding}")
+                return None
+
+        except Exception as e:
+            error_msg = str(e)
+            print(f"  Tentative {attempt+1}/{max_retries} échouée pour l'embedding (Sentence-Transformers) : {error_msg}")
+            if attempt < max_retries - 1:
+                retry_delay = delay * (2 ** attempt) + random.uniform(0.5, 2)
+                print(f"  Réessai dans {retry_delay:.2f}s...")
+                time.sleep(retry_delay)
+                continue
+            else:
+                print(f"  Toutes les {max_retries} tentatives ont échoué pour l'embedding (Sentence-Transformers).")
+                return None
+    print(f"  Toutes les {max_retries} tentatives ont échoué pour l'embedding (fin de boucle).")
+    return None
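And the expected contract of the embedding helper, as a sketch:

```python
vec = generate_text_embedding("Bonjour le monde")
assert vec is None or len(vec) == EMBEDDING_DIMENSION  # plain list of 384 floats on success
```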
+
+
 
 # --- Main Processing Function ---
 
 def process_pdfs_in_directory(directory):
 
@@ -44,7 +492,7 @@ def process_pdfs_in_directory(directory):
         processed_files += 1
         print(f"\nTraitement de {pdf_file_path.name} ({processed_files}/{total_files})...")
 
-        page_data_list = extract_page_data_pymupdf(pdf_file_path
+        page_data_list = extract_page_data_pymupdf(pdf_file_path)
 
         if not page_data_list:
             print(f"  Aucune donnée extraite de {pdf_file_path.name}.")
 
@@ -54,8 +502,8 @@ def process_pdfs_in_directory(directory):
             pdf_file = page_data['pdf_file']
             page_num = page_data['page_number']
             page_text = page_data['text']
-            images = page_data['images']
-            tables = page_data['tables']
+            images = page_data['images']
+            tables = page_data['tables']
             pdf_title = page_data.get('pdf_title')
             pdf_subject = page_data.get('pdf_subject')
             pdf_keywords = page_data.get('pdf_keywords')
 
@@ -74,25 +522,25 @@ def process_pdfs_in_directory(directory):
                         print(f"  Page {page_num}: Génération de la description multimodale pour le tableau {table_idx}...")
                         description = generate_multimodal_description(table_image_bytes, prompt)
                     elif table_text_repr:
+                        # Fallback for text-only table description, using Gemma's text capabilities
+                        if gemma_sampler:
+                            prompt = f"Décrivez en français le contenu et la structure de ce tableau basé sur sa représentation textuelle:\n{table_text_repr[:1000]}..."
+                            print(f"  Page {page_num}: Génération de la description textuelle pour le tableau {table_idx} (fallback via Gemma)...")
                             try:
-                                description =
+                                # Gemma text-only generation
+                                out = gemma_sampler.chat(prompt, max_tokens=500)
+                                description = out.strip()
                             except Exception as e:
-                                print(f"  Erreur lors de la génération de description textuelle pour le tableau {table_idx}: {e}")
+                                print(f"  Erreur lors de la génération de description textuelle pour le tableau {table_idx} via Gemma: {e}")
                                 description = None
                         else:
-                            print("  Skipping text description generation for table:
+                            print("  Skipping text description generation for table: Gemma sampler not initialized.")
                             description = None
 
                     if description:
                         print(f"  Page {page_num}: Description générée pour le tableau {table_idx}.")
-                        embedding_vector = generate_text_embedding(description)
+                        embedding_vector = generate_text_embedding(description)
 
                         if embedding_vector is not None:
                             chunk_data = {
 
@@ -128,7 +576,7 @@ def process_pdfs_in_directory(directory):
 
                     if description:
                         print(f"  Page {page_num}: Description générée pour l'image {img_idx}.")
-                        embedding_vector = generate_text_embedding(description)
+                        embedding_vector = generate_text_embedding(description)
 
                         if embedding_vector is not None:
                             chunk_data = {
 
@@ -163,7 +611,7 @@ def process_pdfs_in_directory(directory):
 
                 for chunk_idx, chunk_content in enumerate(text_chunks):
                     print(f"  Page {page_num}: Génération de l'embedding pour le chunk de texte {chunk_idx}...")
-                    embedding_vector = generate_text_embedding(chunk_content)
+                    embedding_vector = generate_text_embedding(chunk_content)
 
                     if embedding_vector is not None:
                         chunk_data = {
 
@@ -190,11 +638,13 @@
 
 # --- Main Execution ---
 if __name__ == "__main__":
-    print("Démarrage du traitement PDF multimodal avec génération de descriptions et embeddings textuels multilingues...")
+    print("Démarrage du traitement PDF multimodal avec génération de descriptions (Gemma) et embeddings textuels multilingues (Sentence-Transformers)...")
 
     # Validate and create directories
     if not PDF_DIRECTORY.is_dir():
-        print(f"❌ ERREUR: Répertoire PDF non trouvé ou n'est pas un répertoire : {PDF_DIRECTORY}")
+        print(f"❌ ERREUR: Répertoire PDF non trouvé ou n'est pas un répertoire : {PDF_DIRECTORY}. Veuillez créer un répertoire 'docs' et y placer vos PDFs.")
+        # Create it if it doesn't exist, for example PDF_DIRECTORY.mkdir(parents=True, exist_ok=True)
+        # But for Colab, it's often better to instruct the user to upload.
         exit(1)
 
     OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
 
@@ -204,8 +654,13 @@
     print(f"Répertoire de sauvegarde des images : {IMAGE_SAVE_DIR}")
     print(f"Répertoire de sauvegarde des tableaux (HTML) : {TABLE_SAVE_DIR}")
 
-    # Initialize
+    # Initialize Gemma and Sentence-Transformers models
+    initialize_models()
+
+    # If models failed to initialize, exit
+    if gemma_sampler is None or text_embedding_model is None:
+        print("Impossible de continuer car un ou plusieurs modèles n'ont pas pu être initialisés.")
+        exit(1)
 
     final_embeddings = process_pdfs_in_directory(PDF_DIRECTORY)
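The visible hunks end right after `process_pdfs_in_directory` is called, so the final serialization step is not shown; presumably the collected records are written to `EMBEDDINGS_FILE_PATH`, roughly like this sketch:

```python
import json

with open(EMBEDDINGS_FILE_PATH, "w", encoding="utf-8") as f:
    json.dump(final_embeddings, f, ensure_ascii=False, indent=2)
```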