---
language: en
tags:
- vqa
- engineering-drawing
- visual-question-answering
license: mit
metrics:
- accuracy
- f1
model_categories:
- image-to-text
base_model: microsoft/Florence-2-base-ft
task: Visual Question Answering (VQA)
architecture: Causal Language Model (CLM)
framework: Hugging Face Transformers
---

# Florence 2 VQA - Engineering Drawings

## Model Overview

The **Florence 2 VQA** model is fine-tuned for visual question answering (VQA) on **engineering drawings**. It takes both an **image** (e.g., a technical drawing) and a **textual question** as input, and generates a text-based answer about the content of the image.

---

## Model Details

- **Base Model**: [microsoft/Florence-2-base-ft](https://huggingface.co/microsoft/Florence-2-base-ft)
- **Task**: Visual Question Answering (VQA)
- **Architecture**: Causal Language Model (CLM)
- **Framework**: Hugging Face Transformers

---

## How to Use the Model

### **Install Dependencies**

Make sure you have the required libraries installed:

```bash
pip install transformers torch datasets pillow gradio
```

### **Load the Model**

To load the model for inference, use the following code:

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

# Use a GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the configuration from the base model
config = AutoConfig.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)

# Load the fine-tuned model using the base model's configuration
model = AutoModelForCausalLM.from_pretrained(
    "fauzail/Florence-2-VQA",
    config=config,
    trust_remote_code=True
).to(device)
```

### **Load the Processor**

```python
from transformers import AutoProcessor

# Load the processor (tokenizer + image processor) for the model
processor = AutoProcessor.from_pretrained("fauzail/Florence-2-VQA", trust_remote_code=True)
```

### **Define the Prediction Function**

Once the model and processor are loaded, define a prediction function that takes an image path and a question as input:

```python
from PIL import Image

def predict(image_path, question):
    # Load and preprocess the image
    image = Image.open(image_path).convert("RGB")

    # Prepare model inputs using the processor
    inputs = processor(text=[question], images=[image], return_tensors="pt", padding=True).to(device)

    # Generate the answer tokens
    outputs = model.generate(**inputs)

    # Decode the output tokens into a human-readable answer
    answer = processor.tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer
```
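The function above calls `model.generate(**inputs)` with the default generation settings. If the answers come out truncated or too terse, you can pass explicit generation parameters; the values below are illustrative assumptions, not settings confirmed for this fine-tune:

```python
# Optional: inside predict(), replace model.generate(**inputs) with an
# explicitly parameterized call. These values are illustrative, not the
# settings used to evaluate this model.
outputs = model.generate(
    **inputs,
    max_new_tokens=256,  # allow longer answers than the default limit
    num_beams=3,         # beam search often produces more complete answers
)
```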
### **Test It on an Example**

Now test the model with an image and a question:

```python
image_path = "test.png"  # Replace with your image path
question = "Tell me in detail about the image."

# Call the prediction function
answer = predict(image_path, question)
print("Answer:", answer)
```

### **Alternative: Use Gradio for an Interactive Web Interface**

If you prefer an interactive interface, you can deploy the model with Gradio:

```python
import gradio as gr

# Define the prediction function for Gradio
# (the uploaded image is passed in directly, not as a file path)
def predict(image, question):
    inputs = processor(text=[question], images=[image], return_tensors="pt", padding=True).to(device)
    outputs = model.generate(**inputs)
    return processor.tokenizer.decode(outputs[0], skip_special_tokens=True)

# Create the Gradio interface
interface = gr.Interface(
    fn=predict,
    inputs=["image", "text"],
    outputs="text",
    title="Florence 2 VQA - Engineering Drawings",
    description="Upload an engineering drawing and ask a related question."
)

# Launch the Gradio interface
interface.launch()
```

---

## Training Details

- **Preprocessing**:
  - Images were resized and normalized.
  - Text data (questions and answers) was tokenized using the Florence tokenizer.
- **Hyperparameters**:
  - **Learning Rate**: `1e-6`
  - **Batch Size**: `2`
  - **Gradient Accumulation Steps**: `4`
  - **Epochs**: `10`

Training was performed using mixed precision for efficiency; a fine-tuning sketch using these settings is shown below.
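For reference, here is a minimal fine-tuning sketch built around the Hugging Face `Trainer` with the hyperparameters listed above. The dataset name and the image/question/answer column layout in the collator are placeholders (the actual training script and dataset are not published with this card):

```python
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
    Trainer,
    TrainingArguments,
)

# Start from the base checkpoint, as this model did
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)

# Placeholder dataset: assumed to have "image", "question", and "answer" columns
dataset = load_dataset("your-vqa-dataset")

def collate_fn(batch):
    # Tokenize questions together with images; use tokenized answers as labels
    inputs = processor(
        text=[ex["question"] for ex in batch],
        images=[ex["image"].convert("RGB") for ex in batch],
        return_tensors="pt",
        padding=True,
    )
    inputs["labels"] = processor.tokenizer(
        [ex["answer"] for ex in batch],
        return_tensors="pt",
        padding=True,
    ).input_ids
    return inputs

# Hyperparameters from the "Training Details" section above
training_args = TrainingArguments(
    output_dir="florence2-vqa-finetune",
    learning_rate=1e-6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=10,
    fp16=torch.cuda.is_available(),  # mixed precision, as noted above
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    data_collator=collate_fn,
)
trainer.train()
```

Note that with a per-device batch size of 2 and 4 gradient-accumulation steps, the effective batch size is 8.

---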