---

tags:
- multimodal
- multilingual
- pdf
- embeddings
- rag
- google-cloud
- vertex-ai
- gemma
- python
datasets:
  - no_dataset
license: mit
---


# Multimodal & Multilingual PDF Embedding Pipeline with Gemma and Vertex AI

This repository hosts a Python pipeline that extracts text, tables, and images from PDF documents, generates multimodal descriptions for visual content (tables and images) using **Google's Gemma model (running locally)**, and then creates multilingual text embeddings for all extracted information using **Google Cloud Vertex AI's `text-multilingual-embedding-002` model**. The generated embeddings are stored in a JSON file, ready for use in Retrieval Augmented Generation (RAG) systems or other downstream applications.

**Key Features:**
- **Multimodal Descriptions (via Gemma):** Processes tables and images from PDFs, generating rich descriptive text in French using the open-source Gemma 3 4B-IT model, which runs locally on your machine or Colab GPU.
- **Multilingual Text Embeddings (via Vertex AI):** Leverages Google Cloud's `text-multilingual-embedding-002` model for embeddings, supporting a wide range of languages.
- **Structured Output:** Stores embeddings and metadata (PDF source, page number, content type, links to extracted assets) in a comprehensive JSON format.

## How it Works

1.  **PDF Parsing:** Utilizes `PyMuPDF` to extract text blocks and images, and `Camelot` to accurately extract tabular data.
2.  **Content Separation:** Distinguishes between plain text, tables, and non-table images.
3.  **Multimodal Description (for Tables & Images using Gemma):**
    - For tables, the pipeline captures an image of the table and also uses its text representation.
    - For standalone images (e.g., graphs, charts), it captures the image.
    - These images (and optionally table text) are then passed to the **Gemma 3 4B-IT model** (via the `gemma` Python library) with specific prompts to generate rich, descriptive text in French. **This step runs locally and does not incur direct API costs.**
4.  **Multilingual Text Embedding (via Vertex AI):**
    - The cleaned text content (original text chunks, or generated descriptions for tables/images) is then passed to the `text-multilingual-embedding-002` model (via Vertex AI).
    - This model generates a high-dimensional embedding vector (768 dimensions) for each piece of content. **This step connects to Google Cloud Vertex AI and will incur costs.**
5.  **JSON Output:** All generated embeddings, along with rich metadata (original PDF, page, content type, links to extracted assets), are compiled into a single JSON file.
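The per-chunk record described in steps 3–5 can be sketched as a plain dictionary. This is an illustrative sketch only; the `make_record` helper is hypothetical, but the field names match the example output shown later in this README:

```python
from typing import Optional

def make_record(pdf_file: str, page_number: int, chunk_id: str,
                content_type: str, text_content: str,
                embedding: list[float],
                image_url: Optional[str] = None) -> dict:
    """Assemble one embedding record (hypothetical helper).

    content_type is "text", "table", or "image"; embedding is the
    768-dimensional vector from text-multilingual-embedding-002.
    """
    record = {
        "pdf_file": pdf_file,
        "page_number": page_number,
        "chunk_id": chunk_id,
        "content_type": content_type,
        "text_content": text_content,
        "embedding": embedding,
    }
    if image_url is not None:
        record["image_url"] = image_url  # link to the extracted asset
    return record

rec = make_record("sample.pdf", 1, "text_0", "text",
                  "Extracted paragraph...", [0.0] * 768)
```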

## Requirements & Setup

This pipeline uses a combination of local models (Gemma) and **Google Cloud Platform** services (Vertex AI).

1.  **Google Cloud Project with Billing Enabled (for Text Embeddings):**
    -   **CRITICAL:** The text embedding generation step uses Google Cloud Vertex AI. This **will incur costs** on your Google Cloud Platform account. Ensure you have an [active billing account](https://cloud.google.com/billing/docs/how-to/create-billing-account) linked to your project.
    -   Enable the **Vertex AI API**.
2.  **Authentication for Google Cloud (for Text Embeddings):**
    -   The easiest way to run this in a Colab environment is using `google.colab.auth.authenticate_user()`.
    -   For local execution, ensure your Google Cloud SDK is configured and authenticated (`gcloud auth application-default login`).
3.  **Hardware Requirements (for Gemma):**
    -   Running the Gemma 3 4B-IT model requires a **GPU with sufficient VRAM** (a Colab T4 or V100, or a local GPU with at least ~8-10 GB of VRAM, is recommended). Without a GPU, Gemma will likely fall back to CPU and run significantly slower.

### Local Setup

1.  **Clone the repository:**
    ```bash
    git clone https://huggingface.co/Anonymous1223334444/pdf-multimodal-multilingual-embedding-pipeline
    cd pdf-multimodal-multilingual-embedding-pipeline
    ```

2.  **Install Python dependencies:**

    ```bash
    pip install -r requirements.txt
    ```

    **System-level dependencies for Camelot/PyMuPDF (Linux/Colab):**

    You might need to install these system packages for `PyMuPDF` and `Camelot` to function correctly.

    ```bash
    # Update package list
    sudo apt-get update

    # Install Ghostscript (required by Camelot)
    sudo apt-get install -y ghostscript

    # Install python3-tk (required by some PyMuPDF functionalities)
    sudo apt-get install -y python3-tk

    # Install OpenCV (via apt, for camelot-py[cv])
    sudo apt-get install -y libopencv-dev python3-opencv
    ```

    *Note: If you are running on Windows or macOS, the installation steps for `camelot-py` might differ. Refer to the [Camelot documentation](https://camelot-py.readthedocs.io/en/master/user/install-deps.html) for more details.*


3.  **Set up Environment Variables (for Vertex AI Text Embeddings):**
    ```bash
    export GOOGLE_CLOUD_PROJECT="your-gcp-project-id"
    export VERTEX_AI_LOCATION="us-central1"  # Or your preferred Vertex AI region (e.g., us-east4)
    ```

    Replace `your-gcp-project-id` and `us-central1` with your actual Google Cloud Project ID and Vertex AI region.
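    Inside the pipeline, these variables would be consumed roughly as follows. This is a hedged sketch; the `load_vertex_config` helper is hypothetical, not part of the repository:

    ```python
    import os

    def load_vertex_config() -> tuple[str, str]:
        """Read the variables exported above, defaulting the region."""
        project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")
        if not project_id:
            raise RuntimeError("Set GOOGLE_CLOUD_PROJECT before running the pipeline")
        location = os.environ.get("VERTEX_AI_LOCATION", "us-central1")
        return project_id, location
    ```

    The resulting pair would typically be passed to `vertexai.init(project=..., location=...)` before any embedding calls are made.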


4.  **Place your PDF files:**
    Create a `docs` directory in the root of the repository and place your PDF documents inside it.

    ```
    pdf-multimodal-multilingual-embedding-pipeline/
    └── docs/
        ├── your_document.pdf
        └── another_document.pdf
    ```


5.  **Run the pipeline:**
    ```bash
    python run_pipeline.py
    ```

    The generated embedding file (`embeddings_statistiques_multimodal.json`) and extracted assets will be saved in the `output/` directory.


### Google Colab Usage

Running this pipeline in a Colab notebook is ideal for quick experimentation, thanks to Colab's pre-configured environment and GPU access.

1.  **Open a new Google Colab notebook.**
2.  **Change runtime to GPU:** Go to `Runtime > Change runtime type` and select `T4 GPU` or `V100 GPU`.
3.  **Install system and Python dependencies:**
    ```python
    !pip uninstall -y camelot camelot-py  # Ensure a clean install
    !pip install PyMuPDF
    !apt-get update
    !apt-get install -y ghostscript python3-tk libopencv-dev python3-opencv
    !pip install camelot-py[cv] google-cloud-aiplatform tiktoken pandas beautifulsoup4 Pillow gemma jax jaxlib numpy
    ```

4.  **Authenticate to Google Cloud (for Vertex AI):**

    ```python
    from google.colab import auth

    auth.authenticate_user()
    ```

5.  **Set your Google Cloud Project ID and Location:**

    ```python
    import os

    # Replace with your actual Google Cloud Project ID
    os.environ["GOOGLE_CLOUD_PROJECT"] = "YOUR_GCP_PROJECT_ID_HERE"

    # Set your preferred Vertex AI location (e.g., "us-central1", "us-east4")
    os.environ["VERTEX_AI_LOCATION"] = "us-central1"

    # Critical: adjust JAX memory allocation for Gemma
    os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00"
    ```

6.  **Upload your PDF files:**

    You can use the Colab file upload feature or mount Google Drive. Ensure your PDFs are in a directory named `docs` within `/content/`.

    ```python
    # Example for uploading
    from google.colab import files
    import os
    from pathlib import Path

    PDF_DIRECTORY = Path("/content/docs")
    PDF_DIRECTORY.mkdir(parents=True, exist_ok=True)

    uploaded = files.upload()
    for filename in uploaded.keys():
        os.rename(filename, PDF_DIRECTORY / filename)
    ```

7.  **Copy the code from `src/pdf_processor.py`, `src/embedding_utils.py`, and `run_pipeline.py` into Colab cells and execute them.** Run the `embedding_utils.py` content first, then `pdf_processor.py`, then `run_pipeline.py`, or combine them logically in your notebook.


## Output

The pipeline will generate:
- `output/embeddings_statistiques_multimodal.json`: A JSON file containing all generated embeddings and their metadata.
- `output/extracted_graphs/`: Directory containing extracted images (PNG format).
- `output/extracted_tables/`: Directory containing HTML representations of extracted tables.
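
A quick way to sanity-check the generated file is to load it and count chunks per content type. A small sketch (the `summarize_embeddings` helper is hypothetical; the path matches the default output location above):

```python
import json
from collections import Counter
from pathlib import Path

def summarize_embeddings(records: list[dict]) -> dict:
    """Count chunks per content type and collect embedding dimensions."""
    by_type = Counter(r["content_type"] for r in records)
    dims = sorted({len(r["embedding"]) for r in records})
    return {"by_type": dict(by_type), "dims": dims}

# Typical usage once the pipeline has run:
#   path = Path("output/embeddings_statistiques_multimodal.json")
#   records = json.loads(path.read_text(encoding="utf-8"))
#   print(summarize_embeddings(records))
```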

## Example `embeddings_statistiques_multimodal.json` Entry

```json
[
  {
    "pdf_file": "sample.pdf",
    "page_number": 1,
    "chunk_id": "text_0",
    "content_type": "text",
    "text_content": "This is a chunk of text extracted from the first page of the document...",
    "embedding": [0.123, -0.456, ..., 0.789],
    "pdf_title": "Sample Document",
    "pdf_subject": "Data Analysis",
    "pdf_keywords": "statistics, report"
  },
  {
    "pdf_file": "sample.pdf",
    "page_number": 2,
    "chunk_id": "table_0",
    "content_type": "table",
    "text_content": "Description en français du tableau: Ce tableau présente les ventes mensuelles par région. Il inclut les colonnes Mois, Région, et Ventes. La région Nord a la plus forte croissance...",
    "embedding": [-0.987, 0.654, ..., 0.321],
    "table_html_url": "/static/extracted_tables/sample_p2_table0.html",
    "image_url": "/static/extracted_graphs/sample_p2_table0.png",
    "pdf_title": "Sample Document",
    "pdf_subject": "Data Analysis",
    "pdf_keywords": "statistics, report"
  },
  {
    "pdf_file": "sample.pdf",
    "page_number": 3,
    "chunk_id": "image_0",
    "content_type": "image",
    "text_content": "Description en français de l'image: Ce graphique est un histogramme montrant la répartition des âges dans la population. L'axe des X représente les tranches d'âge et l'axe des Y la fréquence. La majorité de la population se situe entre 25 et 40 ans.",
    "embedding": [0.456, -0.789, ..., 0.123],
    "image_url": "/static/extracted_graphs/sample_p3_img0.png",
    "pdf_title": "Sample Document",
    "pdf_subject": "Data Analysis",
    "pdf_keywords": "statistics, report"
  }
]
```
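
For use in a RAG system, these records can be ranked against a query embedding with cosine similarity. A minimal, dependency-free sketch (in practice the query embedding would come from the same `text-multilingual-embedding-002` model; `top_k` is a hypothetical helper):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_embedding: list[float], records: list[dict], k: int = 3):
    """Return the k records whose 'embedding' field is closest to the query."""
    scored = [(cosine_similarity(query_embedding, r["embedding"]), r)
              for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```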

## Acknowledgments

This pipeline leverages the power of:
- Google's Gemma models
- Google Cloud Vertex AI (`text-multilingual-embedding-002`)
- PyMuPDF
- Camelot
- Tiktoken
- Pandas
- BeautifulSoup