---
language: en
tags:
- vqa
- engineering-drawing
- visual-question-answering
license: mit
metrics:
- accuracy
- f1
model_categories:
- image-to-text
base_model: microsoft/Florence-2-base-ft
task: Visual Question Answering (VQA)
architecture: Causal Language Model (CLM)
framework: Hugging Face Transformers
---

# Florence 2 VQA - Engineering Drawings

## Model Overview

The **Florence 2 VQA** model is fine-tuned for visual question answering (VQA) on **engineering drawings**. It takes both an **image** (e.g., a technical drawing) and a **textual question** as input, and generates a text-based answer about the content of the image.

---

## Model Details

- **Base Model**: [microsoft/Florence-2-base-ft](https://huggingface.co/microsoft/Florence-2-base-ft)
- **Task**: Visual Question Answering (VQA)
- **Architecture**: Causal Language Model (CLM)
- **Framework**: Hugging Face Transformers

---

## How to Use the Model

### **Install Dependencies**

Make sure you have the required libraries installed:

```bash
pip install transformers torch datasets pillow gradio
```

### **Load the Model**

To load the model for inference, use the following code:

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

# Use a GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the configuration from the base model
config = AutoConfig.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)

# Load the fine-tuned model using the base model's configuration
model = AutoModelForCausalLM.from_pretrained(
    "fauzail/Florence-2-VQA",
    config=config,
    trust_remote_code=True
).to(device)
```

### **Load the Processor**

```python
from transformers import AutoProcessor

# Load the processor (tokenizer + image processor) for the model
processor = AutoProcessor.from_pretrained("fauzail/Florence-2-VQA", trust_remote_code=True)
```

### **Define the Prediction Function**

Once the model and processor are loaded, define a prediction function that takes an image path and a question as input:

```python
from PIL import Image

def predict(image_path, question):
    # Load and preprocess the image
    image = Image.open(image_path).convert("RGB")

    # Prepare model inputs using the processor
    inputs = processor(text=[question], images=[image], return_tensors="pt", padding=True).to(device)

    # Generate the answer tokens
    outputs = model.generate(**inputs)

    # Decode the output tokens into a human-readable answer
    answer = processor.tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer
```
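The function above calls `model.generate(**inputs)` with the default generation settings. If the answers come out truncated or too terse, you can pass explicit generation parameters; the values below are illustrative assumptions, not settings confirmed for this fine-tune:

```python
# Optional: inside predict(), replace model.generate(**inputs) with an
# explicitly parameterized call. These values are illustrative, not the
# settings used to evaluate this model.
outputs = model.generate(
    **inputs,
    max_new_tokens=256,  # allow longer answers than the default limit
    num_beams=3,         # beam search often produces more complete answers
)
```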
### **Test It on an Example**

Now test the model with an image and a question:

```python
image_path = "test.png"  # Replace with your image path
question = "Tell me in detail about the image."

# Call the prediction function
answer = predict(image_path, question)
print("Answer:", answer)
```

### **Alternative: Use Gradio for an Interactive Web Interface**

If you prefer an interactive interface, you can deploy the model with Gradio:

```python
import gradio as gr

# Define the prediction function for Gradio
# (the uploaded image is passed in directly, not as a file path)
def predict(image, question):
    inputs = processor(text=[question], images=[image], return_tensors="pt", padding=True).to(device)
    outputs = model.generate(**inputs)
    return processor.tokenizer.decode(outputs[0], skip_special_tokens=True)

# Create the Gradio interface
interface = gr.Interface(
    fn=predict,
    inputs=["image", "text"],
    outputs="text",
    title="Florence 2 VQA - Engineering Drawings",
    description="Upload an engineering drawing and ask a related question."
)

# Launch the Gradio interface
interface.launch()
```

---

## Training Details

- **Preprocessing**:
  - Images were resized and normalized.
  - Text data (questions and answers) was tokenized using the Florence tokenizer.
- **Hyperparameters**:
  - **Learning Rate**: `1e-6`
  - **Batch Size**: `2`
  - **Gradient Accumulation Steps**: `4`
  - **Epochs**: `10`

Training was performed using mixed precision for efficiency; a fine-tuning sketch using these settings is shown below.
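For reference, here is a minimal fine-tuning sketch built around the Hugging Face `Trainer` with the hyperparameters listed above. The dataset name and the image/question/answer column layout in the collator are placeholders (the actual training script and dataset are not published with this card):

```python
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
    Trainer,
    TrainingArguments,
)

# Start from the base checkpoint, as this model did
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)

# Placeholder dataset: assumed to have "image", "question", and "answer" columns
dataset = load_dataset("your-vqa-dataset")

def collate_fn(batch):
    # Tokenize questions together with images; use tokenized answers as labels
    inputs = processor(
        text=[ex["question"] for ex in batch],
        images=[ex["image"].convert("RGB") for ex in batch],
        return_tensors="pt",
        padding=True,
    )
    inputs["labels"] = processor.tokenizer(
        [ex["answer"] for ex in batch],
        return_tensors="pt",
        padding=True,
    ).input_ids
    return inputs

# Hyperparameters from the "Training Details" section above
training_args = TrainingArguments(
    output_dir="florence2-vqa-finetune",
    learning_rate=1e-6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=10,
    fp16=torch.cuda.is_available(),  # mixed precision, as noted above
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    data_collator=collate_fn,
)
trainer.train()
```

Note that with a per-device batch size of 2 and 4 gradient-accumulation steps, the effective batch size is 8.

---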