---
tags:
- multimodal
- multilingual
- pdf
- embeddings
- rag
- google-cloud
- vertex-ai
- gemma
- python
datasets:
- no_dataset
license: mit
---
# Multimodal & Multilingual PDF Embedding Pipeline with Gemma and Vertex AI

This repository hosts a Python pipeline that extracts text, tables, and images from PDF documents, generates multimodal descriptions for visual content (tables and images) using **Google's Gemma model (running locally)**, and creates multilingual text embeddings for all extracted content using **Google Cloud Vertex AI's `text-multilingual-embedding-002` model**. The generated embeddings are stored in a JSON file, ready for use in Retrieval-Augmented Generation (RAG) systems or other downstream applications.

**Key Features:**

- **Multimodal Descriptions (via Gemma):** Processes tables and images from PDFs, generating rich descriptive text in French using the open-source Gemma 3 4B-IT model, which runs locally on your machine or Colab GPU.
- **Multilingual Text Embeddings (via Vertex AI):** Leverages Google Cloud's `text-multilingual-embedding-002` model, which supports a wide range of languages.
- **Structured Output:** Stores embeddings and metadata (PDF source, page number, content type, links to extracted assets) in a comprehensive JSON format.
## How it Works

1. **PDF Parsing:** Uses `PyMuPDF` to extract text blocks and images, and `Camelot` to extract tabular data.
2. **Content Separation:** Distinguishes between plain text, tables, and non-table images.
3. **Multimodal Description (tables & images, using Gemma):**
   - For tables, the pipeline captures an image of the table and also keeps its text representation.
   - For standalone images (e.g., graphs, charts), it captures the image.
   - These images (and optionally the table text) are passed to the **Gemma 3 4B-IT model** (via the `gemma` Python library) with specific prompts to generate rich, descriptive text in French. **This step runs locally and does not incur direct API costs.**
4. **Multilingual Text Embedding (via Vertex AI):**
   - The cleaned text content (original text chunks, or generated descriptions for tables and images) is passed to the `text-multilingual-embedding-002` model via Vertex AI.
   - The model generates a 768-dimensional embedding vector for each piece of content. **This step connects to Google Cloud Vertex AI and will incur costs.**
5. **JSON Output:** All generated embeddings, along with metadata (original PDF, page, content type, links to extracted assets), are compiled into a single JSON file.
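The record assembled in step 5 can be sketched as follows. This is a minimal illustration: the field names follow the example output shown later in this README, while the helper name itself is hypothetical.

```python
def make_chunk_record(pdf_file, page_number, chunk_id, content_type,
                      text_content, embedding, **asset_urls):
    """Assemble one entry of the output JSON (step 5).

    asset_urls holds optional keys such as image_url or table_html_url,
    which are present for table and image chunks.
    """
    record = {
        "pdf_file": pdf_file,
        "page_number": page_number,
        "chunk_id": chunk_id,
        "content_type": content_type,
        "text_content": text_content,
        "embedding": embedding,
    }
    record.update(asset_urls)
    return record
```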
## Requirements & Setup

This pipeline combines a local model (Gemma) with **Google Cloud Platform** services (Vertex AI).

1. **Google Cloud Project with Billing Enabled (for Text Embeddings):**
   - **CRITICAL:** The text embedding step uses Google Cloud Vertex AI and **will incur costs** on your Google Cloud Platform account. Ensure you have an [active billing account](https://cloud.google.com/billing/docs/how-to/create-billing-account) linked to your project.
   - Enable the **Vertex AI API**.
2. **Authentication for Google Cloud (for Text Embeddings):**
   - In a Colab environment, the easiest option is `google.colab.auth.authenticate_user()`.
   - For local execution, ensure your Google Cloud SDK is configured and authenticated (`gcloud auth application-default login`).
3. **Hardware Requirements (for Gemma):**
   - Running the Gemma 3 4B-IT model requires a **GPU with sufficient VRAM** (e.g., a Colab T4 or V100, or a local GPU with at least ~8-10 GB of VRAM). Without a GPU, Gemma will likely run on CPU but significantly slower.
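Before launching the pipeline you can check which device JAX (the backend used by the `gemma` library here) will run on. A small sketch, assuming only an optional `jax` install; the function name is illustrative:

```python
def describe_accelerator():
    """Return 'gpu' if JAX sees a GPU, 'cpu' otherwise (Gemma runs much slower on CPU)."""
    try:
        import jax
    except ImportError:
        return "jax not installed"
    platforms = {d.platform for d in jax.devices()}
    return "gpu" if "gpu" in platforms else "cpu"

print(describe_accelerator())
```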
### Local Setup

1. **Clone the repository:**

   ```bash
   git clone https://huggingface.co/Anonymous1223334444/pdf-multimodal-multilingual-embedding-pipeline
   cd pdf-multimodal-multilingual-embedding-pipeline
   ```
2. **Install Python dependencies:**

   ```bash
   pip install -r requirements.txt
   ```

   **System-level dependencies for Camelot/PyMuPDF (Linux/Colab):**

   You might need to install these system packages for `PyMuPDF` and `Camelot` to function correctly.

   ```bash
   # Update package list
   sudo apt-get update

   # Install Ghostscript (required by Camelot)
   sudo apt-get install -y ghostscript

   # Install python3-tk (required by some PyMuPDF functionality)
   sudo apt-get install -y python3-tk

   # Install OpenCV (via apt, for camelot-py[cv])
   sudo apt-get install -y libopencv-dev python3-opencv
   ```

   *Note: If you are running on Windows or macOS, the installation steps for `camelot-py` might differ. Refer to the [Camelot documentation](https://camelot-py.readthedocs.io/en/master/user/install-deps.html) for more details.*
3. **Set up Environment Variables (for Vertex AI Text Embeddings):**

   ```bash
   export GOOGLE_CLOUD_PROJECT="your-gcp-project-id"
   export VERTEX_AI_LOCATION="us-central1"  # Or your preferred Vertex AI region (e.g., us-east4)
   ```

   Replace `your-gcp-project-id` and `us-central1` with your actual Google Cloud Project ID and Vertex AI region.
4. **Place your PDF files:**

   Create a `docs` directory in the root of the repository and place your PDF documents inside it:

   ```
   pdf-multimodal-multilingual-embedding-pipeline/
   └── docs/
       ├── your_document.pdf
       └── another_document.pdf
   ```
5. **Run the pipeline:**

   ```bash
   python run_pipeline.py
   ```

   The generated embedding file (`embeddings_statistiques_multimodal.json`) and the extracted assets will be saved in the `output/` directory.
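Internally, the embedding step depends on the two environment variables from step 3. A minimal sketch of how they would be read and handed to the Vertex AI SDK; the helper name is hypothetical, while `vertexai.init` is the standard SDK entry point:

```python
import os

def init_vertex_from_env():
    """Read the pipeline's environment variables and return (project, location).

    In the pipeline this would be followed by:
        import vertexai
        vertexai.init(project=project, location=location)
    """
    project = os.environ.get("GOOGLE_CLOUD_PROJECT")
    location = os.environ.get("VERTEX_AI_LOCATION", "us-central1")
    if not project:
        raise RuntimeError("Set GOOGLE_CLOUD_PROJECT before running the pipeline")
    return project, location
```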
### Google Colab Usage

A Colab notebook version of this pipeline is ideal for quick experimentation thanks to the pre-configured environment and GPU access.

1. **Open a new Google Colab notebook.**
2. **Change the runtime to GPU:** Go to `Runtime > Change runtime type` and select `T4 GPU` or `V100 GPU`.
3. **Install system and Python dependencies:**

   ```python
   !pip uninstall -y camelot camelot-py  # Ensure a clean install
   !pip install PyMuPDF
   !apt-get update
   !apt-get install -y ghostscript python3-tk libopencv-dev python3-opencv
   !pip install camelot-py[cv] google-cloud-aiplatform tiktoken pandas beautifulsoup4 Pillow gemma jax jaxlib numpy
   ```
4. **Authenticate to Google Cloud (for Vertex AI):**

   ```python
   from google.colab import auth
   auth.authenticate_user()
   ```

5. **Set your Google Cloud Project ID and Location:**

   ```python
   import os

   # Replace with your actual Google Cloud Project ID
   os.environ["GOOGLE_CLOUD_PROJECT"] = "YOUR_GCP_PROJECT_ID_HERE"

   # Set your preferred Vertex AI location (e.g., "us-central1", "us-east4")
   os.environ["VERTEX_AI_LOCATION"] = "us-central1"

   # Critical: adjust JAX memory allocation for Gemma
   os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00"
   ```
6. **Upload your PDF files:**

   Use the Colab file upload feature or mount Google Drive. Ensure your PDFs end up in a directory named `docs` within `/content/`:

   ```python
   # Example for uploading
   from google.colab import files
   import os
   from pathlib import Path

   PDF_DIRECTORY = Path("/content/docs")
   PDF_DIRECTORY.mkdir(parents=True, exist_ok=True)
   uploaded = files.upload()
   for filename in uploaded.keys():
       os.rename(filename, PDF_DIRECTORY / filename)
   ```

7. **Copy the code from `src/embedding_utils.py`, `src/pdf_processor.py`, and `run_pipeline.py` into Colab cells and execute them in that order** (or combine them logically into your notebook).
## Output

The pipeline generates:

- `output/embeddings_statistiques_multimodal.json`: A JSON file containing all generated embeddings and their metadata.
- `output/extracted_graphs/`: Extracted images (PNG format).
- `output/extracted_tables/`: HTML representations of extracted tables.
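To sanity-check a run, you can tally the embedded chunks per content type. A small stdlib-only sketch; the helper name is hypothetical:

```python
import json
from collections import Counter

def summarize_output(path="output/embeddings_statistiques_multimodal.json"):
    """Count embedded chunks per content type (text / table / image)."""
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    return Counter(entry["content_type"] for entry in entries)
```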

## Example `embeddings_statistiques_multimodal.json` Entry

```json
[
  {
    "pdf_file": "sample.pdf",
    "page_number": 1,
    "chunk_id": "text_0",
    "content_type": "text",
    "text_content": "This is a chunk of text extracted from the first page of the document...",
    "embedding": [0.123, -0.456, ..., 0.789],
    "pdf_title": "Sample Document",
    "pdf_subject": "Data Analysis",
    "pdf_keywords": "statistics, report"
  },
  {
    "pdf_file": "sample.pdf",
    "page_number": 2,
    "chunk_id": "table_0",
    "content_type": "table",
    "text_content": "Description en français du tableau: Ce tableau présente les ventes mensuelles par région. Il inclut les colonnes Mois, Région, et Ventes. La région Nord a la plus forte croissance...",
    "embedding": [-0.987, 0.654, ..., 0.321],
    "table_html_url": "/static/extracted_tables/sample_p2_table0.html",
    "image_url": "/static/extracted_graphs/sample_p2_table0.png",
    "pdf_title": "Sample Document",
    "pdf_subject": "Data Analysis",
    "pdf_keywords": "statistics, report"
  },
  {
    "pdf_file": "sample.pdf",
    "page_number": 3,
    "chunk_id": "image_0",
    "content_type": "image",
    "text_content": "Description en français de l'image: Ce graphique est un histogramme montrant la répartition des âges dans la population. L'axe des X représente les tranches d'âge et l'axe des Y la fréquence. La majorité de la population se situe entre 25 et 40 ans.",
    "embedding": [0.456, -0.789, ..., 0.123],
    "image_url": "/static/extracted_graphs/sample_p3_img0.png",
    "pdf_title": "Sample Document",
    "pdf_subject": "Data Analysis",
    "pdf_keywords": "statistics, report"
  }
]
```
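For RAG retrieval, entries from this file can be ranked against a query embedding (produced with the same `text-multilingual-embedding-002` model) by cosine similarity. A minimal stdlib sketch; the function names are illustrative:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_chunks(query_embedding, entries, k=3):
    """Return the k stored chunks most similar to the query embedding."""
    ranked = sorted(entries,
                    key=lambda e: cosine_similarity(query_embedding, e["embedding"]),
                    reverse=True)
    return ranked[:k]
```

For large collections, a vector database would replace this linear scan, but the JSON output is directly usable for small corpora.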

## Acknowledgments

This pipeline builds on:

- Gemma
- Google Cloud Vertex AI (text embeddings)
- PyMuPDF
- Camelot
- Tiktoken
- Pandas
- BeautifulSoup