Anonymous1223334444 committed
Commit: c2e3cf5
Parent(s): 2d00ebd
Initial commit of multimodal multilingual PDF embedding pipeline
Browse files
- .gitignore +28 -0
- LICENSE +21 -0
- README.md +183 -5
- requirements.txt +8 -0
- run_pipeline.py +222 -0
- src/embedding_utils.py +205 -0
- src/pdf_processor.py +261 -0
.gitignore
ADDED
@@ -0,0 +1,28 @@
```
# Python
__pycache__/
*.pyc
*.pyo
*.pyd
.Python
env/
venv/
*.egg
*.egg-info/
.env

# Jupyter Notebook
.ipynb_checkpoints

# IDEs
.idea/
.vscode/

# Output files
output/
embeddings_statistiques_multimodal.json
extracted_graphs/
extracted_tables/

# API Keys (IMPORTANT!)
*.env
*.key
```
LICENSE
CHANGED
@@ -0,0 +1,21 @@
```
MIT License

Copyright (c) 2025 Andre Sarr

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```
README.md
CHANGED
@@ -1,5 +1,183 @@
# Multimodal & Multilingual PDF Embedding Pipeline

This repository hosts a Python pipeline that extracts text, tables, and images from PDF documents, generates multimodal descriptions for visual content (tables and images), and creates multilingual text embeddings for all extracted information. The embeddings are stored in a JSON file, ready for use in Retrieval-Augmented Generation (RAG) systems or other downstream applications.

**Key Features:**
- **Multimodal:** Processes text, tables, and images from PDFs.
- **Multilingual:** Leverages Google's `text-multilingual-embedding-002` model for embeddings, supporting a wide range of languages.
- **Contextual Descriptions:** Uses Google Gemini (Gemini 1.5 Flash) to generate descriptive text for tables and images in French.
- **Structured Output:** Stores embeddings and metadata (PDF source, page number, content type, links to extracted assets) in a comprehensive JSON format.

## How it Works

1. **PDF Parsing:** Uses `PyMuPDF` to extract text blocks and images, and `Camelot` to extract tabular data.
2. **Content Separation:** Distinguishes between plain text, tables, and non-table images.
3. **Multimodal Description (for Tables & Images):**
   - For tables, the pipeline captures an image of the table and also keeps its text representation.
   - For standalone images (e.g., graphs, charts), it captures the image.
   - These images are sent to the `gemini-1.5-flash-latest` model (via `google.generativeai`) with specific prompts to generate rich, descriptive text in French.
4. **Multilingual Text Embedding:**
   - The cleaned text content (original text chunks, or generated descriptions for tables/images) is passed to the `text-multilingual-embedding-002` model (via Vertex AI).
   - This model generates a 768-dimensional embedding vector for each piece of content (see the sketch after this list).
5. **JSON Output:** All embeddings, along with rich metadata (original PDF, page, content type, links to extracted assets), are compiled into a single JSON file.
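
The core of steps 3-4 comes down to two model calls. The following is a minimal sketch mirroring the calls in `src/embedding_utils.py`; the image path is illustrative, and the clients must first be configured with your own key and project:

```python
from pathlib import Path

import google.generativeai as genai
import vertexai
from vertexai.language_models import TextEmbeddingModel

genai.configure(api_key="YOUR_GEMINI_API_KEY")  # Gemini API key
vertexai.init(project="your-gcp-project-id", location="us-central1")

# Step 3: describe an extracted table/graph image (PNG bytes) in French
image_bytes = Path("output/extracted_graphs/sample_p3_img0.png").read_bytes()
model = genai.GenerativeModel("models/gemini-1.5-flash-latest")
description = model.generate_content(
    ["Décrivez en français le contenu de cette image.",
     {"mime_type": "image/png", "data": image_bytes}]
).text.strip()

# Step 4: embed the description with the multilingual model
embedder = TextEmbeddingModel.from_pretrained("text-multilingual-embedding-002")
vector = embedder.get_embeddings([description])[0].values
print(len(vector))  # 768
```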

## Requirements & Setup

This pipeline relies on **Google Cloud Platform** services and specific Python libraries. You will need:

1. **A Google Cloud Project:**
   - Enable the **Vertex AI API**.
   - Enable the **Generative Language API** (for Gemini 1.5 Flash descriptions).
2. **Authentication:**
   - **Google Cloud Authentication:** The easiest way to run this in a Colab environment is `google.colab.auth.authenticate_user()`. For local execution, ensure your Google Cloud SDK is configured and authenticated (`gcloud auth application-default login`).
   - **Gemini API Key:** An API key for the Google AI Gemini models. You can get one from [Google AI Studio](https://aistudio.google.com/app/apikey). Set it as an environment variable rather than hard-coding it, for security.

### Local Setup

1. **Clone the repository:**
   ```bash
   git clone https://huggingface.co/Anonymous1223334444/pdf-multimodal-multilingual-embedding-pipeline
   cd pdf-multimodal-multilingual-embedding-pipeline
   ```
2. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```
   **System-level dependencies for Camelot/PyMuPDF (Linux/Colab):**
   You may need to install these system packages for `PyMuPDF` and `Camelot` to function correctly.
   ```bash
   # Update package list
   sudo apt-get update
   # Install Ghostscript (required by Camelot)
   sudo apt-get install -y ghostscript
   # Install python3-tk (required by some PyMuPDF functionality)
   sudo apt-get install -y python3-tk
   # Install OpenCV (via apt, for camelot-py[cv])
   sudo apt-get install -y libopencv-dev python3-opencv
   ```
   *Note: On Windows or macOS, the installation steps for `camelot-py` differ. Refer to the [Camelot documentation](https://camelot-py.readthedocs.io/en/master/user/install-deps.html) for details.*

3. **Set up Environment Variables:**
   ```bash
   export GOOGLE_CLOUD_PROJECT="your-gcp-project-id"
   export VERTEX_AI_LOCATION="us-central1"  # Or your preferred Vertex AI region (e.g., us-east4)
   export GENAI_API_KEY="your-gemini-api-key"
   ```
   Replace `your-gcp-project-id`, `us-central1`, and `your-gemini-api-key` with your actual values. Note that the code reads the Gemini key from `GENAI_API_KEY`.
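   To confirm the variables are visible to Python before running the pipeline, a quick check (a minimal sketch, not part of the repo):
   ```python
   import os

   for var in ("GOOGLE_CLOUD_PROJECT", "VERTEX_AI_LOCATION", "GENAI_API_KEY"):
       print(var, "is set" if os.getenv(var) else "is MISSING")
   ```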
4. **Place your PDF files:**
   Create a `docs` directory in the root of the repository and place your PDF documents inside it.
   ```
   pdf-multimodal-multilingual-embedding-pipeline/
   ├── docs/
   │   ├── your_document.pdf
   │   └── another_document.pdf
   ```
5. **Run the pipeline:**
   ```bash
   python run_pipeline.py
   ```
   The generated embedding file (`embeddings_statistiques_multimodal.json`) and extracted assets will be saved in the `output/` directory.

### Google Colab Usage

A Colab notebook version of this pipeline is convenient for quick experimentation, since the environment comes pre-configured.

1. **Open a new Google Colab notebook.**
2. **Install system dependencies:**
   ```python
   !pip uninstall -y camelot camelot-py  # Ensure a clean install
   !pip install PyMuPDF
   !apt-get update
   !apt-get install -y ghostscript python3-tk libopencv-dev python3-opencv
   !pip install camelot-py[cv] google-cloud-aiplatform google-generativeai tiktoken pandas beautifulsoup4 Pillow
   ```
3. **Authenticate:**
   ```python
   from google.colab import auth
   auth.authenticate_user()
   ```
4. **Set your API Key and Project/Location:**
   ```python
   import os
   # Replace with your actual Gemini API key
   os.environ["GENAI_API_KEY"] = "YOUR_GEMINI_API_KEY_HERE"
   # Replace with your actual Google Cloud Project ID
   os.environ["GOOGLE_CLOUD_PROJECT"] = "YOUR_GCP_PROJECT_ID_HERE"
   # Set your preferred Vertex AI location (e.g., "us-central1", "us-east4")
   os.environ["VERTEX_AI_LOCATION"] = "us-central1"
   ```
5. **Upload your PDF files:**
   You can use the Colab file upload feature or mount Google Drive. Ensure your PDFs are in a directory named `docs` within `/content/`.
   ```python
   # Example for uploading
   import os
   from pathlib import Path
   from google.colab import files

   PDF_DIRECTORY = Path("/content/docs")
   PDF_DIRECTORY.mkdir(parents=True, exist_ok=True)
   uploaded = files.upload()
   for filename in uploaded.keys():
       os.rename(filename, PDF_DIRECTORY / filename)
   ```
6. **Copy and paste the code from `run_pipeline.py` (and the `src/` files if you don't use modules) into Colab cells and execute.**

## Output

The pipeline will generate:
- `output/embeddings_statistiques_multimodal.json`: A JSON file containing all generated embeddings and their metadata.
- `output/extracted_graphs/`: Directory containing extracted images (PNG format).
- `output/extracted_tables/`: Directory containing HTML representations of extracted tables.

## Example `embeddings_statistiques_multimodal.json` Entry

```json
[
  {
    "pdf_file": "sample.pdf",
    "page_number": 1,
    "chunk_id": "text_0",
    "content_type": "text",
    "text_content": "This is a chunk of text extracted from the first page of the document...",
    "embedding": [0.123, -0.456, ..., 0.789],
    "pdf_title": "Sample Document",
    "pdf_subject": "Data Analysis",
    "pdf_keywords": "statistics, report"
  },
  {
    "pdf_file": "sample.pdf",
    "page_number": 2,
    "chunk_id": "table_0",
    "content_type": "table",
    "text_content": "Description en français du tableau: Ce tableau présente les ventes mensuelles par région. Il inclut les colonnes Mois, Région, et Ventes. La région Nord a la plus forte croissance...",
    "embedding": [-0.987, 0.654, ..., 0.321],
    "table_html_url": "/static/extracted_tables/sample_p2_table0.html",
    "image_url": "/static/extracted_graphs/sample_p2_table0.png",
    "pdf_title": "Sample Document",
    "pdf_subject": "Data Analysis",
    "pdf_keywords": "statistics, report"
  },
  {
    "pdf_file": "sample.pdf",
    "page_number": 3,
    "chunk_id": "image_0",
    "content_type": "image",
    "text_content": "Description en français de l'image: Ce graphique est un histogramme montrant la répartition des âges dans la population. L'axe des X représente les tranches d'âge et l'axe des Y la fréquence. La majorité de la population se situe entre 25 et 40 ans.",
    "embedding": [0.456, -0.789, ..., 0.123],
    "image_url": "/static/extracted_graphs/sample_p3_img0.png",
    "pdf_title": "Sample Document",
    "pdf_subject": "Data Analysis",
    "pdf_keywords": "statistics, report"
  }
]
```
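
To use these entries for retrieval, embed the query with the same model and rank entries by cosine similarity. A minimal sketch (the query string and top-3 cutoff are illustrative; assumes `vertexai.init(...)` was called as in the setup above):

```python
import json

import numpy as np
from vertexai.language_models import TextEmbeddingModel

# Load the pipeline output
with open("output/embeddings_statistiques_multimodal.json", encoding="utf-8") as f:
    entries = json.load(f)

# Embed the query with the same multilingual model the pipeline used
embedder = TextEmbeddingModel.from_pretrained("text-multilingual-embedding-002")
query_vec = np.array(embedder.get_embeddings(["ventes mensuelles par région"])[0].values)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank all chunks (text, table, and image descriptions) against the query
ranked = sorted(entries, key=lambda e: cosine(query_vec, np.array(e["embedding"])), reverse=True)
for entry in ranked[:3]:
    print(entry["content_type"], entry["page_number"], entry["text_content"][:80])
```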

## Acknowledgments

This pipeline leverages:
- Google Cloud Vertex AI
- Google AI Gemini Models
- PyMuPDF
- Camelot
- Tiktoken
- Pandas
- BeautifulSoup
requirements.txt
ADDED
@@ -0,0 +1,8 @@
```
PyMuPDF
camelot-py[cv]
google-cloud-aiplatform
google-generativeai
tiktoken
pandas
beautifulsoup4
Pillow
```
run_pipeline.py
ADDED
@@ -0,0 +1,222 @@
```python
import os
import sys
import json
import traceback
from pathlib import Path

import tiktoken
import google.generativeai as genai  # Needed for the text-only table fallback below

# Import functions from the src directory
from src.pdf_processor import extract_page_data_pymupdf
from src.embedding_utils import (
    initialize_clients,
    token_chunking,
    generate_multimodal_description,
    generate_text_embedding,
    ENCODING_NAME,
    MAX_TOKENS_NORMAL,
)

# --- Configuration ---
# You can set these directly or read them from environment variables (recommended)
PROJECT_ID = os.getenv("GOOGLE_CLOUD_PROJECT")
LOCATION = os.getenv("VERTEX_AI_LOCATION")
GENAI_API_KEY = os.getenv("GENAI_API_KEY")  # For the Gemini API

# Path configuration
BASE_DIR = Path.cwd()  # Current working directory of the script
PDF_DIRECTORY = BASE_DIR / "docs"
OUTPUT_DIR = BASE_DIR / "output"  # Output directory for generated files
EMBEDDINGS_FILE_PATH = OUTPUT_DIR / "embeddings_statistiques_multimodal.json"

# Subdirectories (within output/) for extracted images and table HTML
IMAGE_SAVE_SUBDIR = "extracted_graphs"
TABLE_SAVE_SUBDIR = "extracted_tables"
# Absolute paths for saving
IMAGE_SAVE_DIR = OUTPUT_DIR / IMAGE_SAVE_SUBDIR
TABLE_SAVE_DIR = OUTPUT_DIR / TABLE_SAVE_SUBDIR

# --- Main Processing Function ---

def process_pdfs_in_directory(directory):
    """Main processing pipeline for all PDFs in a directory."""
    all_embeddings_data = []
    processed_files = 0
    pdf_files = list(directory.glob("*.pdf"))
    total_files = len(pdf_files)

    if total_files == 0:
        print(f"No PDF files found in directory: {directory}")
        return []

    for pdf_file_path in pdf_files:
        processed_files += 1
        print(f"\nProcessing {pdf_file_path.name} ({processed_files}/{total_files})...")

        page_data_list = extract_page_data_pymupdf(
            pdf_file_path, IMAGE_SAVE_DIR, TABLE_SAVE_DIR, IMAGE_SAVE_SUBDIR, TABLE_SAVE_SUBDIR
        )

        if not page_data_list:
            print(f"  No data extracted from {pdf_file_path.name}.")
            continue

        for page_data in page_data_list:
            pdf_file = page_data['pdf_file']
            page_num = page_data['page_number']
            page_text = page_data['text']
            images = page_data['images']  # List of non-table image dicts
            tables = page_data['tables']  # List of table dicts
            pdf_title = page_data.get('pdf_title')
            pdf_subject = page_data.get('pdf_subject')
            pdf_keywords = page_data.get('pdf_keywords')

            print(f"  Generating descriptions and embeddings for page {page_num}...")

            # Process tables: generate a description, then an embedding
            for table_idx, table in enumerate(tables):
                table_image_bytes = table.get('image_bytes')
                table_text_repr = table.get('table_text_representation', '')
                table_html_url = table.get('table_html_url')

                description = None
                if table_image_bytes:
                    # Prompts stay in French: the pipeline generates French descriptions
                    prompt = "Décrivez en français le contenu et la structure de ce tableau. Mettez l'accent sur les données principales et les tendances si visibles."
                    print(f"    Page {page_num}: generating multimodal description for table {table_idx}...")
                    description = generate_multimodal_description(table_image_bytes, prompt)
                elif table_text_repr:
                    prompt = f"Décrivez en français le contenu et la structure de ce tableau basé sur sa représentation textuelle:\n{table_text_repr[:1000]}..."
                    print(f"    Page {page_num}: generating text-based description for table {table_idx} (fallback)...")
                    # Use the multimodal model with text-only input (via google.generativeai)
                    if GENAI_API_KEY:
                        try:
                            model = genai.GenerativeModel("models/gemini-1.5-flash-latest")
                            response = model.generate_content(prompt)
                            description = response.text.strip()
                        except Exception as e:
                            print(f"    Error generating text-based description for table {table_idx}: {e}")
                            description = None
                    else:
                        print("    Skipping text description generation for table: GENAI_API_KEY is not set.")
                        description = None

                if description:
                    print(f"    Page {page_num}: description generated for table {table_idx}.")
                    embedding_vector = generate_text_embedding(description)  # max_retries, delay are defaults

                    if embedding_vector is not None:
                        all_embeddings_data.append({
                            "pdf_file": pdf_file,
                            "page_number": page_num,
                            "chunk_id": f"table_{table_idx}",
                            "content_type": "table",
                            "text_content": description,
                            "embedding": embedding_vector,
                            "table_html_url": table_html_url,
                            "image_url": table.get('image_url'),
                            "pdf_title": pdf_title,
                            "pdf_subject": pdf_subject,
                            "pdf_keywords": pdf_keywords,
                        })
                        print(f"    Page {page_num}: embedding generated for table {table_idx} description.")
                    else:
                        print(f"    Page {page_num}: failed to embed table {table_idx} description. Chunk skipped.")
                else:
                    print(f"    Page {page_num}: no description generated for table {table_idx}. Chunk skipped.")

            # Process non-table images: generate a description, then an embedding
            for img_idx, image in enumerate(images):
                image_bytes = image.get('image_bytes')
                image_url = image.get('image_url')

                if image_bytes:
                    prompt = "Décrivez en français le contenu de cette image. S'il s'agit d'un graphique, décrivez le type de graphique (histogramme, courbe, etc.), les axes, les légendes et les principales informations ou tendances visibles."
                    print(f"    Page {page_num}: generating multimodal description for image {img_idx}...")
                    description = generate_multimodal_description(image_bytes, prompt)

                    if description:
                        print(f"    Page {page_num}: description generated for image {img_idx}.")
                        embedding_vector = generate_text_embedding(description)  # max_retries, delay are defaults

                        if embedding_vector is not None:
                            all_embeddings_data.append({
                                "pdf_file": pdf_file,
                                "page_number": page_num,
                                "chunk_id": f"image_{img_idx}",
                                "content_type": "image",
                                "text_content": description,
                                "embedding": embedding_vector,
                                "image_url": image_url,
                                "pdf_title": pdf_title,
                                "pdf_subject": pdf_subject,
                                "pdf_keywords": pdf_keywords,
                            })
                            print(f"    Page {page_num}: embedding generated for image {img_idx} description.")
                        else:
                            print(f"    Page {page_num}: failed to embed image {img_idx} description. Chunk skipped.")
                    else:
                        print(f"    Page {page_num}: no description generated for image {img_idx}. Chunk skipped.")

            # Process regular text: chunk, then generate embeddings
            text_chunks = []  # Initialized here so the page summary below works even when the page has no text
            if page_text:
                try:
                    encoding = tiktoken.get_encoding(ENCODING_NAME)
                    text_chunks = token_chunking(page_text, MAX_TOKENS_NORMAL, encoding)
                except Exception as e:
                    print(f"Error chunking text on page {page_num}: {e}. Falling back to a single chunk.")
                    text_chunks = [page_text]

            for chunk_idx, chunk_content in enumerate(text_chunks):
                print(f"    Page {page_num}: generating embedding for text chunk {chunk_idx}...")
                embedding_vector = generate_text_embedding(chunk_content)  # max_retries, delay are defaults

                if embedding_vector is not None:
                    all_embeddings_data.append({
                        "pdf_file": pdf_file,
                        "page_number": page_num,
                        "chunk_id": f"text_{chunk_idx}",
                        "content_type": "text",
                        "text_content": chunk_content,
                        "embedding": embedding_vector,
                        "pdf_title": pdf_title,
                        "pdf_subject": pdf_subject,
                        "pdf_keywords": pdf_keywords,
                    })
                    print(f"    Page {page_num}: text chunk {chunk_idx} processed successfully.")
                else:
                    print(f"    Page {page_num}: failed to embed text chunk {chunk_idx}. Chunk skipped.")

            print(f"  Page {page_num} done. Items processed: {len(tables)} tables, {len(images)} images, {len(text_chunks)} text chunks.")

    return all_embeddings_data

# --- Main Execution ---
if __name__ == "__main__":
    print("Starting multimodal PDF processing with description generation and multilingual text embeddings...")

    # Validate and create directories
    if not PDF_DIRECTORY.is_dir():
        print(f"❌ ERROR: PDF directory not found or not a directory: {PDF_DIRECTORY}")
        sys.exit(1)

    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    IMAGE_SAVE_DIR.mkdir(parents=True, exist_ok=True)
    TABLE_SAVE_DIR.mkdir(parents=True, exist_ok=True)
    print(f"Output directory: {OUTPUT_DIR}")
    print(f"Image save directory: {IMAGE_SAVE_DIR}")
    print(f"Table (HTML) save directory: {TABLE_SAVE_DIR}")

    # Initialize clients for Vertex AI and GenAI
    initialize_clients(PROJECT_ID, LOCATION, GENAI_API_KEY)

    final_embeddings = process_pdfs_in_directory(PDF_DIRECTORY)

    if final_embeddings:
        print(f"\nTotal embeddings generated: {len(final_embeddings)}.")
        try:
            with EMBEDDINGS_FILE_PATH.open('w', encoding='utf-8') as f:
                json.dump(final_embeddings, f, indent=2, ensure_ascii=False)
            print(f"Embeddings successfully saved to: {EMBEDDINGS_FILE_PATH}")
        except Exception as e:
            print(f"\nError saving the embeddings JSON file: {e}")
            traceback.print_exc()
    else:
        print("\nNo embeddings were generated.")
```
src/embedding_utils.py
ADDED
@@ -0,0 +1,205 @@
```python
import re
import time
import random
import traceback

import google.generativeai as genai
import vertexai
from vertexai.language_models import TextEmbeddingModel

# Configuration (project ID, location, API key) is injected from run_pipeline.py
# via initialize_clients().

MULTIMODAL_MODEL_GENAI = "models/gemini-1.5-flash-latest"
TEXT_EMBEDDING_MODEL_VERTEXAI = "text-multilingual-embedding-002"
EMBEDDING_DIMENSION = 768  # text-multilingual-embedding-002 produces 768-dimensional vectors

MAX_TOKENS_NORMAL = 500
ENCODING_NAME = "cl100k_base"

# Global client for the Vertex AI text embedding model
text_embedding_model_client = None
# Whether genai.configure() was called (genai exposes no public api_key attribute to probe)
_genai_configured = False


def initialize_clients(project_id, location, genai_api_key):
    """Initialize the Vertex AI and GenAI clients."""
    global text_embedding_model_client, _genai_configured

    if genai_api_key:
        genai.configure(api_key=genai_api_key)
        _genai_configured = True
        print("✓ Google Generative AI configured.")
    else:
        print("⚠️ WARNING: The Gemini API key is not set. Multimodal description generation will fail.")

    if project_id and location:
        try:
            vertexai.init(project=project_id, location=location)
            print(f"✓ Vertex AI SDK initialized for project {project_id} in region {location}.")
            text_embedding_model_client = TextEmbeddingModel.from_pretrained(TEXT_EMBEDDING_MODEL_VERTEXAI)
            print(f"✓ Vertex AI text embedding model '{TEXT_EMBEDDING_MODEL_VERTEXAI}' loaded successfully.")
        except Exception as e:
            print(f"❌ ERROR: Failed to initialize the Vertex AI SDK or load the text embedding model: {str(e)}")
            print("⚠️ Text embedding generation will fail.")
            text_embedding_model_client = None
    else:
        print("⚠️ Vertex AI SDK not initialized: the Google Cloud project ID or location is missing.")
        print("⚠️ Text embedding generation will fail.")
        text_embedding_model_client = None


def token_chunking(text, max_tokens, encoding):
    """Chunk text by token count, preferring paragraph and sentence boundaries."""
    if not text:
        return []

    tokens = encoding.encode(text)
    chunks = []
    start_token_idx = 0

    while start_token_idx < len(tokens):
        end_token_idx = min(start_token_idx + max_tokens, len(tokens))

        if end_token_idx < len(tokens):
            look_ahead_limit = min(start_token_idx + max_tokens * 2, len(tokens))
            text_segment_to_check = encoding.decode(tokens[start_token_idx:look_ahead_limit])

            # Prefer to break at the last paragraph boundary within the window
            paragraph_break = text_segment_to_check.rfind('\n\n', 0, len(text_segment_to_check) - (look_ahead_limit - (start_token_idx + max_tokens)))
            if paragraph_break != -1:
                tokens_up_to_break = encoding.encode(text_segment_to_check[:paragraph_break])
                end_token_idx = start_token_idx + len(tokens_up_to_break)
            else:
                # Otherwise, break at the last sentence end (search the reversed segment)
                sentence_end = re.search(r'[.!?]\s+', text_segment_to_check[:len(text_segment_to_check) - (look_ahead_limit - (start_token_idx + max_tokens))][::-1])
                if sentence_end:
                    char_index_in_segment = len(text_segment_to_check) - 1 - sentence_end.start()
                    tokens_up_to_end = encoding.encode(text_segment_to_check[:char_index_in_segment + 1])
                    end_token_idx = start_token_idx + len(tokens_up_to_end)

        current_chunk_tokens = tokens[start_token_idx:end_token_idx]
        chunk_text = encoding.decode(current_chunk_tokens).strip()

        if chunk_text:
            chunks.append(chunk_text)

        if start_token_idx == end_token_idx:
            start_token_idx += 1  # Guard against an empty step causing an infinite loop
        else:
            start_token_idx = end_token_idx

    return chunks


def generate_multimodal_description(image_bytes, prompt_text, multimodal_model_genai_name=MULTIMODAL_MODEL_GENAI, max_retries=5, delay=10):
    """
    Generate a text description for an image using a multimodal model (google.generativeai).
    Returns the description text, or None if all retries fail or the API key is missing.
    """
    if not _genai_configured:
        print("  Skipping multimodal description generation: GENAI_API_KEY is not set.")
        return None

    for attempt in range(max_retries):
        try:
            time.sleep(delay + random.uniform(0, 5))

            content = [
                prompt_text,
                {
                    'mime_type': 'image/png',
                    'data': image_bytes
                }
            ]

            model = genai.GenerativeModel(multimodal_model_genai_name)
            response = model.generate_content(content)

            description = response.text.strip()

            if description:
                return description
            print(f"  Attempt {attempt+1}/{max_retries}: empty or unexpected response from the multimodal model.")
            if attempt < max_retries - 1:
                retry_delay = delay * (2 ** attempt) + random.uniform(1, 5)
                print(f"  Retrying in {retry_delay:.2f}s...")
                time.sleep(retry_delay)
                continue

        except Exception as e:
            error_msg = str(e)
            print(f"  Attempt {attempt+1}/{max_retries} failed for the description: {error_msg}")

            if "429" in error_msg or "quota" in error_msg.lower() or "rate limit" in error_msg.lower() or "unavailable" in error_msg.lower() or "internal error" in error_msg.lower():
                if attempt < max_retries - 1:
                    retry_delay = delay * (2 ** attempt) + random.uniform(1, 5)
                    print(f"  Retryable API error detected. Retrying in {retry_delay:.2f}s...")
                    time.sleep(retry_delay)
                    continue
            else:
                print(f"  Non-retryable API error detected: {error_msg}")
                traceback.print_exc()
                return None

    print(f"  All {max_retries} attempts to generate the description failed.")
    return None


def generate_text_embedding(text_content, max_retries=5, delay=5):
    """
    Generate a text embedding using the Vertex AI multilingual embedding model.
    Returns the embedding vector (list), or None if all retries fail or the client is not initialized.
    """
    if not text_embedding_model_client:
        print("  Skipping text embedding generation: the Vertex AI embedding client is not initialized.")
        return None

    if not text_content or not text_content.strip():
        return None  # Cannot embed empty text

    for attempt in range(max_retries):
        try:
            time.sleep(delay + random.uniform(0, 2))

            embeddings = text_embedding_model_client.get_embeddings([text_content])

            if embeddings and len(embeddings) > 0 and hasattr(embeddings[0], 'values') and isinstance(embeddings[0].values, list) and len(embeddings[0].values) == EMBEDDING_DIMENSION:
                return embeddings[0].values
            print(f"  Attempt {attempt+1}/{max_retries}: unexpected Vertex AI embedding format. Response: {embeddings}")
            return None

        except Exception as e:
            error_msg = str(e)
            print(f"  Attempt {attempt+1}/{max_retries} failed for the Vertex AI embedding: {error_msg}")

            if "429" in error_msg or "quota" in error_msg.lower() or "rate limit" in error_msg.lower() or "unavailable" in error_msg.lower() or "internal error" in error_msg.lower():
                if attempt < max_retries - 1:
                    retry_delay = delay * (2 ** attempt) + random.uniform(1, 5)
                    print(f"  Retryable Vertex AI API error detected. Retrying in {retry_delay:.2f}s...")
                    time.sleep(retry_delay)
                    continue
            else:
                print(f"  Non-retryable Vertex AI API error detected: {error_msg}")
                traceback.print_exc()
                return None

    print(f"  All {max_retries} attempts at the Vertex AI embedding failed.")
    return None
```
src/pdf_processor.py
ADDED
@@ -0,0 +1,261 @@
```python
import re
import traceback

import fitz  # PyMuPDF
import camelot  # For table extraction
from bs4 import BeautifulSoup

# Directory paths are passed in as arguments from run_pipeline.py.

# Ignore very small images (likely logos/icons)
IMAGE_MIN_WIDTH = 100
IMAGE_MIN_HEIGHT = 100


def clean_text(text):
    """Normalize whitespace and clean text while preserving paragraph breaks."""
    if not text:
        return ""
    # Replace tabs with spaces, but preserve paragraph breaks
    text = text.replace('\t', ' ')
    # Normalize runs of spaces to single spaces
    text = re.sub(r' +', ' ', text)
    # Preserve paragraph breaks but normalize them
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()


def extract_page_data_pymupdf(pdf_path, image_save_dir, table_save_dir, image_save_subdir, table_save_subdir):
    """Extract text and tables, and save images, from each page using PyMuPDF and Camelot."""
    page_data_list = []
    try:
        doc = fitz.open(pdf_path)
        metadata = doc.metadata or {}
        pdf_data = {
            'pdf_title': metadata.get('title') or pdf_path.name,    # 'or' also covers empty strings
            'pdf_subject': metadata.get('subject') or 'Statistiques',
            'pdf_keywords': metadata.get('keywords') or ''
        }

        for page_num in range(len(doc)):
            page = doc.load_page(page_num)
            page_index = page_num + 1  # 1-based index

            print(f"  Extracting data from page {page_index}...")

            # Extract tables first
            table_data = extract_tables_and_images_from_page(
                pdf_path, page, page_index, table_save_dir, image_save_dir, image_save_subdir, table_save_subdir
            )

            # Track table regions to avoid double-processing their text
            table_regions = []
            for item in table_data:
                if 'rect' in item and item['rect'] and len(item['rect']) == 4:
                    table_regions.append(fitz.Rect(item['rect']))
                else:
                    print(f"    Warning: invalid rect for table on page {page_index}")

            # Extract text, excluding table regions
            page_text = ""
            if table_regions:
                blocks = page.get_text("blocks")
                for block in blocks:
                    block_rect = fitz.Rect(block[:4])
                    is_in_table = any(block_rect.intersects(table_rect) for table_rect in table_regions)
                    if not is_in_table:
                        page_text += block[4] + "\n"  # block[4] is the text content
            else:
                # If there are no tables, take all text
                page_text = page.get_text("text")

            page_text = clean_text(page_text)

            # Extract and save images (excluding those inside table regions)
            image_data = extract_images_from_page(
                pdf_path, page, page_index, image_save_dir, image_save_subdir, excluded_rects=table_regions
            )

            page_data_list.append({
                'pdf_file': pdf_path.name,
                'page_number': page_index,
                'text': page_text,
                'images': image_data,  # Non-table images only
                'tables': [item for item in table_data if item['content_type'] == 'table'],
                'pdf_title': pdf_data.get('pdf_title'),
                'pdf_subject': pdf_data.get('pdf_subject'),
                'pdf_keywords': pdf_data.get('pdf_keywords')
            })
        doc.close()
    except Exception as e:
        print(f"Error processing PDF {pdf_path.name} with PyMuPDF: {str(e)}")
        traceback.print_exc()  # Print traceback for debugging
    return page_data_list


def extract_tables_and_images_from_page(pdf_path, page, page_num, table_save_dir, image_save_dir, image_save_subdir, table_save_subdir):
    """Extract tables with Camelot and capture an image of each table area."""
    table_and_image_data = []
    try:
        tables = camelot.read_pdf(
            str(pdf_path),
            pages=str(page_num),
            flavor='lattice',
        )

        if len(tables) == 0:
            tables = camelot.read_pdf(
                str(pdf_path),
                pages=str(page_num),
                flavor='stream'
            )

        for i, table in enumerate(tables):
            if table.accuracy < 70:
                print(f"    Skipping low-accuracy table ({table.accuracy:.2f}%) on page {page_num}")
                continue

            # Camelot stores the table's bounding box on the (private) _bbox attribute,
            # in PDF coordinates (origin bottom-left); parsing_report has no bbox field.
            bbox = getattr(table, '_bbox', None)
            if not bbox or len(bbox) != 4:
                print(f"    Warning: no bounding box for table {i} on page {page_num}. Skipping image capture.")
                table_rect = None
            else:
                x1, y1, x2, y2 = bbox
                page_height = page.rect.height
                # Convert to PyMuPDF coordinates (origin top-left)
                table_rect = fitz.Rect(x1, page_height - y2, x2, page_height - y1)

            safe_pdf_name = "".join(c if c.isalnum() else "_" for c in pdf_path.stem)
            table_html_filename = f"{safe_pdf_name}_p{page_num}_table{i}.html"
            table_html_save_path = table_save_dir / table_html_filename
            # URL paths as served by a downstream app; the files themselves live under output/
            relative_html_url_path = f"/static/{table_save_subdir}/{table_html_filename}"

            table_image_filename = f"{safe_pdf_name}_p{page_num}_table{i}.png"
            table_image_save_path = image_save_dir / table_image_filename
            relative_image_url_path = f"/static/{image_save_subdir}/{table_image_filename}"

            df = table.df
            soup = BeautifulSoup(df.to_html(index=False), 'html.parser')
            table_tag = soup.find('table')
            if table_tag:
                # The <caption> must be the first child of <table> to render correctly
                caption = soup.new_tag('caption')
                caption.string = f"Tableau extrait de {pdf_path.name}, page {page_num}"
                table_tag.insert(0, caption)

                table_tag['class'] = 'table table-bordered table-striped'
                table_tag['style'] = 'width:100%; border-collapse:collapse;'

                style_tag = soup.new_tag('style')
                style_tag.string = """
                .table { border-collapse: collapse; width: 100%; margin-bottom: 1rem;}
                .table caption { caption-side: top; padding: 0.5rem; text-align: left; font-weight: bold; }
                .table th, .table td { border: 1px solid #ddd; padding: 8px; text-align: left; }
                .table th { background-color: #f2f2f2; font-weight: bold; }
                .table-striped tbody tr:nth-of-type(odd) { background-color: rgba(0,0,0,.05); }
                .table-responsive { overflow-x: auto; margin-bottom: 1rem; }
                """
                soup.insert(0, style_tag)

                div = soup.new_tag('div')
                div['class'] = 'table-responsive'
                table_tag.wrap(div)

                with open(table_html_save_path, 'w', encoding='utf-8') as f:
                    f.write(str(soup))
            else:
                print(f"    Warning: could not find table tag in HTML for table on page {page_num}. Skipping HTML save.")
                continue

            table_image_bytes = None
            if table_rect:
                try:
                    pix = page.get_pixmap(clip=table_rect)
                    table_image_bytes = pix.tobytes("png")  # Pixmap.tobytes takes the output format positionally

                    with open(table_image_save_path, "wb") as img_file:
                        img_file.write(table_image_bytes)

                except Exception as img_capture_e:
                    print(f"    Error capturing image of table {i} on page {page_num}: {img_capture_e}")
                    traceback.print_exc()
                    table_image_bytes = None

            table_and_image_data.append({
                'content_type': 'table',
                'table_html_url': relative_html_url_path,
                'table_text_representation': df.to_string(index=False),
                'rect': [table_rect.x0, table_rect.y0, table_rect.x1, table_rect.y1] if table_rect else None,
                'accuracy': table.accuracy,
                'image_bytes': table_image_bytes,
                'image_url': relative_image_url_path if table_image_bytes else None
            })

        return table_and_image_data

    except Exception as e:
        print(f"    Error extracting tables from page {page_num}: {str(e)}")
        traceback.print_exc()
        return []


def extract_images_from_page(pdf_path, page, page_num, image_save_dir, image_save_subdir, excluded_rects=None):
    """Extract and save images from a page, excluding specified regions (such as tables)."""
    image_data = []
    excluded_rects = excluded_rects or []  # Avoid a mutable default argument
    image_list = page.get_images(full=True)

    for img_index, img_info in enumerate(image_list):
        xref = img_info[0]
        try:
            base_image = page.parent.extract_image(xref)
            image_bytes = base_image["image"]
            image_ext = base_image["ext"]
            width = base_image["width"]
            height = base_image["height"]

            if width < IMAGE_MIN_WIDTH or height < IMAGE_MIN_HEIGHT:
                continue

            img_rect = None
            img_rects = page.get_image_rects(xref)
            if img_rects:
                img_rect = img_rects[0]

            if img_rect is None:
                print(f"    Warning: could not find rectangle for image {img_index} on page {page_num}. Skipping.")
                continue

            is_excluded = any(img_rect.intersects(excluded_rect) for excluded_rect in excluded_rects)
            if is_excluded:
                print(f"    Image {img_index} on page {page_num} is within an excluded region (e.g., a table). Skipping.")
                continue

            safe_pdf_name = "".join(c if c.isalnum() else "_" for c in pdf_path.stem)
            image_filename = f"{safe_pdf_name}_p{page_num}_img{img_index}.{image_ext}"
            image_save_path = image_save_dir / image_filename
            relative_url_path = f"/static/{image_save_subdir}/{image_filename}"

            with open(image_save_path, "wb") as img_file:
                img_file.write(image_bytes)

            image_data.append({
                'content_type': 'image',
                'image_url': relative_url_path,
                'rect': [img_rect.x0, img_rect.y0, img_rect.x1, img_rect.y1],
                'image_bytes': image_bytes
            })

        except Exception as img_save_e:
            print(f"    Error processing image {img_index} on page {page_num}: {img_save_e}")
            traceback.print_exc()

    return image_data
```