---
license: llama3.2
language:
- en
---

# Llama-3.2-11B-Vision-Instruct

This model is based on Meta's Llama-3.2-11B-Vision-Instruct and has been fine-tuned for multimodal generation.

## Model Description

This is a vision-language model that generates text from a given image and text prompt. It is based on the Llama 3.2 architecture and has been instruction-tuned for improved performance on a variety of tasks, including the following (a minimal single-turn example is shown after the list):

* **Image captioning:** Generating descriptive captions for images.
* **Visual question answering:** Answering questions about the content of images.
* **Image-based dialogue:** Engaging in conversations based on visual input.
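
As a quick illustration of single-turn use (for example, captioning), the sketch below loads the model directly with `transformers` and asks for a caption. This is a minimal sketch, not a definitive recipe: the image URL and prompt text are placeholders, and it assumes a recent `transformers` release with Mllama support plus `accelerate` installed for `device_map="auto"`.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_name = "ruslanmv/Llama-3.2-11B-Vision-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
model = MllamaForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Placeholder image URL -- replace with your own image
url = "https://example.com/sample.jpg"
image = Image.open(requests.get(url, stream=True).raw)

messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image in one sentence."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=60)
# Slice off the prompt tokens so only the generated caption is decoded
caption = processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(caption)
```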

## Intended Uses & Limitations

This model is intended for research purposes and should be used responsibly. It may generate incorrect or misleading information, and should not be used for making critical decisions.

**Limitations:**

* The model may not always accurately interpret the content of images.
* It may be biased towards certain types of images or concepts.
* It may generate inappropriate or offensive content.

## How to Use

Here's an example of how to use this model in Python with the `transformers` library, wrapped in a simple Gradio chat interface:

```python
import gradio as gr
import torch
from transformers import AutoProcessor, MllamaForConditionalGeneration

# Use GPU if available, otherwise CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and processor
model_name = "ruslanmv/Llama-3.2-11B-Vision-Instruct" 
processor = AutoProcessor.from_pretrained(model_name)
model = MllamaForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Function to generate model response
def predict(message, image):
    messages = [{"role": "user", "content": [
        {"type": "image"}, 
        {"type": "text", "text": message}
    ]}]
    input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(image, input_text, return_tensors="pt").to(device)
    output = model.generate(**inputs, max_new_tokens=100)
    # Decode only the newly generated tokens so the reply does not repeat the prompt
    return processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

# Gradio interface
with gr.Blocks() as demo:
    gr.Markdown("# Simple Multimodal Chatbot")
    with gr.Row():
        with gr.Column():  # Message input on the left
            text_input = gr.Textbox(label="Message")
            submit_button = gr.Button("Send") 
        with gr.Column():  # Image input on the right
            image_input = gr.Image(type="pil", label="Upload an Image") 
    chatbot = gr.Chatbot()  # Chatbot output at the bottom

    def respond(message, image, history):
        history = history + [(message, "")]
        response = predict(message, image)
        history[-1] = (message, response)
        return history

    submit_button.click(
        fn=respond, 
        inputs=[text_input, image_input, chatbot], 
        outputs=chatbot
    )

demo.launch()
```

This code provides a simple Gradio interface for interacting with the model. You can upload an image and type a message, and the model will generate a response based on both inputs.
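
Apart from `max_new_tokens=100`, the demo leaves generation parameters at their defaults. If you want longer or more varied replies, the usual `generate` arguments can be passed as well; the values below are illustrative rather than tuned for this model.

```python
# Illustrative sampling settings (not tuned for this model); adjust to taste
output = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
```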

## More Information

For more details and examples, please visit [ruslanmv.com](https://ruslanmv.com).

## License

This model is licensed under the [Llama 3.2 Community License Agreement](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct).