---
license: apache-2.0
language:
- en
- de
- es
- fr
- it
- pt
- pl
- nl
- tr
- sv
- cs
- el
- hu
- ro
- fi
- uk
- sl
- sk
- da
- lt
- lv
- et
- bg
- 'no'
- ca
- hr
- ga
- mt
- gl
- zh
- ru
- ko
- ja
- ar
- hi
library_name: transformers
---

# Model Card for EuroVLM-1.7B-Instruct

**⚠️ PREVIEW RELEASE**: *This is a preview version of EuroVLM-1.7B. The model is still under development and may have limitations in performance and stability. Use with caution in production environments.*

This is the model card for EuroVLM-1.7B-Preview, a multimodal vision-language model based on the long-context version of EuroLLM-1.7B.

- **Developed by:** Unbabel, Instituto Superior Técnico, Instituto de Telecomunicações, University of Edinburgh, Aveni, University of Paris-Saclay, University of Amsterdam, Naver Labs, Sorbonne Université.
- **Funded by:** European Union.
- **Model type:** A 1.7B+400M parameter multilingual multimodal transformer VLM (Vision-Language Model).
- **Language(s) (NLP):** Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Arabic, Catalan, Chinese, Galician, Hindi, Japanese, Korean, Norwegian, Russian, Turkish, and Ukrainian.
- **Modalities:** Text and Vision (images).
- **License:** Apache License 2.0.

## Model Details

EuroVLM-1.7B is a 1.7B+400M parameter vision-language model that combines the multilingual capabilities of EuroLLM-1.7B with vision encoding components.

EuroVLM-1.7B was (visually) instruction-tuned on a combination of multilingual vision-language datasets, including image captioning, visual question answering, and multimodal reasoning tasks across the supported languages.

### Model Description

EuroVLM uses a multimodal architecture combining a vision encoder with the EuroLLM language model:

**Language Model Component:**
- Based on the standard, dense Transformer architecture from EuroLLM-1.7B
- Grouped query attention (GQA) with 8 key-value heads for efficient inference
- Pre-layer normalization with RMSNorm for training stability
- SwiGLU activation function for strong downstream performance
- Rotary positional embeddings (RoPE) in every layer
- Extended context size supporting up to 32K tokens

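The key-value-head sharing behind GQA can be illustrated with a short sketch. Only the 8 key-value heads come from this card; the 16 query heads below are a hypothetical choice for illustration, not EuroVLM's actual configuration:

```python
# Illustrative sketch of grouped-query attention (GQA) head sharing.
# The card states 8 key-value heads; 16 query heads is an assumption.
num_q_heads = 16
num_kv_heads = 8
group_size = num_q_heads // num_kv_heads  # query heads per shared KV head

def kv_head_for_query(q_head: int) -> int:
    """Return the KV head whose keys/values this query head reuses."""
    return q_head // group_size

# Consecutive query heads share one KV head, shrinking the KV cache
# by a factor of group_size relative to standard multi-head attention.
mapping = [kv_head_for_query(h) for h in range(num_q_heads)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7]
```

With this grouping, the KV cache stores activations for 8 heads instead of 16, which is the inference-efficiency benefit the bullet above refers to.
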
**Vision Component:**
- Vision Transformer (ViT) encoder, based on [google/siglip2-so400m-patch14-384](https://huggingface.co/google/siglip2-so400m-patch14-384)
- Multimodal projector mapping vision representations to token embeddings
- Support for high-resolution image inputs

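A multimodal projector of this kind can be sketched as a small MLP that maps each vision patch feature into the language model's token-embedding space. All dimensions, weights, and the patch count below are illustrative assumptions, not EuroVLM's actual sizes:

```python
import numpy as np

# Hypothetical dimensions for illustration only.
vision_dim = 1152  # assumed vision-encoder hidden size
text_dim = 2048    # assumed LLM token-embedding size

rng = np.random.default_rng(0)
w1 = 0.02 * rng.standard_normal((vision_dim, text_dim))
w2 = 0.02 * rng.standard_normal((text_dim, text_dim))

def project(patch_features: np.ndarray) -> np.ndarray:
    """Map [num_patches, vision_dim] patch features into LLM-embedding space."""
    hidden = np.maximum(patch_features @ w1, 0.0)  # real projectors often use GELU
    return hidden @ w2

patches = rng.standard_normal((729, vision_dim))  # e.g. a 27x27 patch grid
image_tokens = project(patches)
print(image_tokens.shape)  # (729, 2048)
```

The projected rows are then interleaved with text-token embeddings, which is how the image enters the language model's input sequence.
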
## Run the model

To use the model with Hugging Face's [Transformers](https://huggingface.co/docs/transformers/en/index) library:

```python
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "utter-project/EuroVLM-1.7B-Preview"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(model_id)

# Load an image
image = Image.open("/path/to/image.jpg")

messages = [
    {
        "role": "system",
        "content": "You are EuroVLM --- a multimodal AI assistant specialized in European languages that provides safe, educational and helpful answers about images and text.",
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do you see in this image? Please describe it in Portuguese."},
        ],
    },
]

# Build the prompt and run generation
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=1024)
print(processor.decode(outputs[0], skip_special_tokens=True))
```

You can also run EuroVLM with [vLLM](https://docs.vllm.ai/en/latest/)!

```python
from vllm import LLM, SamplingParams

# Initialize the model
model_id = "utter-project/EuroVLM-1.7B-Preview"
llm = LLM(model=model_id)

# Set up sampling parameters
sampling_params = SamplingParams(temperature=0.7, max_tokens=1024)

# Image and prompt
image_url = "/url/of/image.jpg"

messages = [
    {
        "role": "system",
        "content": "You are EuroVLM --- a multimodal AI assistant specialized in European languages that provides safe, educational and helpful answers about images and text.",
    },
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": "What do you see in this image? Please describe it in Portuguese in one sentence."},
        ],
    },
]

# Generate response
outputs = llm.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```

## Capabilities

EuroVLM-1.7B-Instruct supports a wide range of vision-language tasks across multiple languages:

- **Multilingual Image Captioning:** Generate detailed descriptions of images in any of the supported languages
- **Visual Question Answering:** Answer questions about image content in multilingual contexts
- **Visual Instruction Following:** Execute complex instructions that involve both visual analysis and text generation
- **Multimodal Translation:** Translate image captions and descriptions between supported languages
- **Document Understanding:** Process and analyze documents, charts, and diagrams with multilingual text

## Bias, Risks, and Limitations

EuroVLM-1.7B has not been fully aligned to human preferences, so the model may generate problematic outputs in both text and image understanding contexts (e.g., hallucinations about image content, harmful content, biased interpretations, or false statements about visual information).

Additional considerations for multimodal models include:
- Potential biases in visual interpretation across different cultural contexts
- Limitations in understanding complex visual scenes or unusual image compositions
- Possible inconsistencies between visual understanding and textual generation across languages
- Privacy considerations when processing images that may contain personal information

Users should exercise caution and implement appropriate safety measures when deploying this model in production environments.