---
license: llama3.2
language:
- en
---
# Llama-3.2-11B-Vision-Instruct
This model is based on Meta's Llama-3.2-11B-Vision-Instruct and has been fine-tuned for multimodal generation.
## Model Description
This model is a vision-language model capable of generating text from a given image and text prompt. It's based on the Llama 3.2 architecture and has been instruction-tuned for improved performance on a variety of tasks, including:
* **Image captioning:** Generating descriptive captions for images.
* **Visual question answering:** Answering questions about the content of images.
* **Image-based dialogue:** Engaging in conversations based on visual input.
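As a concrete illustration, these tasks differ only in the text prompt; the multimodal chat format stays the same. A minimal sketch (the prompt strings are just examples, following the chat template used in the usage section below):
```python
# One chat turn pairs an image placeholder with a task-specific prompt;
# only the text changes between captioning, question answering, and dialogue.
caption_request = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Write a short caption for this image."}
]}]
vqa_request = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "How many people are in this photo?"}
]}]
```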
## Intended Uses & Limitations
This model is intended for research purposes and should be used responsibly. It may generate incorrect or misleading information and should not be used to make critical decisions.
**Limitations:**
* The model may not always accurately interpret the content of images.
* It may be biased towards certain types of images or concepts.
* It may generate inappropriate or offensive content.
## How to Use
Here's an example of how to use this model in Python with the `transformers` library (the demo below also requires `torch` and `gradio`):
```python
import torch
import gradio as gr
from transformers import AutoProcessor, MllamaForConditionalGeneration

# Use GPU if available, otherwise CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and processor
model_name = "ruslanmv/Llama-3.2-11B-Vision-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
model = MllamaForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Function to generate a model response from a text prompt and an image
def predict(message, image):
    # Pair the image with the text prompt in the chat format
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": message}
    ]}]
    input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(image, input_text, return_tensors="pt").to(device)
    output = model.generate(**inputs, max_new_tokens=100)
    # Decode only the newly generated tokens, not the echoed prompt
    generated = output[0][inputs["input_ids"].shape[-1]:]
    return processor.decode(generated, skip_special_tokens=True)

# Gradio interface
with gr.Blocks() as demo:
    gr.Markdown("# Simple Multimodal Chatbot")
    with gr.Row():
        with gr.Column():  # Message input on the left
            text_input = gr.Textbox(label="Message")
            submit_button = gr.Button("Send")
        with gr.Column():  # Image input on the right
            image_input = gr.Image(type="pil", label="Upload an Image")
    chatbot = gr.Chatbot()  # Chatbot output at the bottom

    def respond(message, image, history):
        # Append the new turn, then fill in the model's reply
        history = history + [(message, "")]
        response = predict(message, image)
        history[-1] = (message, response)
        return history

    submit_button.click(
        fn=respond,
        inputs=[text_input, image_input, chatbot],
        outputs=chatbot
    )

demo.launch()
```
This code provides a simple Gradio interface for interacting with the model. You can upload an image and type a message, and the model will generate a response based on both inputs.
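To sanity-check the model without launching the UI, you can also call `predict` directly. A small sketch (`example.jpg` is a placeholder path for any local image):
```python
from PIL import Image

# Run a single query against a local image, bypassing Gradio
image = Image.open("example.jpg")  # placeholder path
print(predict("What is shown in this picture?", image))
```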
## More Information
For more details and examples, please visit [ruslanmv.com](https://ruslanmv.com).
## License
This model is licensed under the [Llama 3.2 Community License Agreement](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct).