Introducing Command A Vision: Multimodal AI built for Business

Today we introduce Command A Vision: a cutting-edge vision-language model with open weights. Command A Vision delivers exceptional performance across multimodal vision tasks while maintaining the strong text capabilities of Command A. As Cohere's latest flagship model, Command A Vision is a 112B-parameter dense model built on top of Command A. We are proud to release its weights to the community here.
Command A Vision empowers businesses to automate tedious tasks, unlock valuable insights from visual data, and make highly accurate, data-driven decisions through document optical character recognition (OCR) and image analysis. Whether it's interpreting product manuals with complex diagrams or analyzing photographs of real-world scenes for risk detection, Command A Vision excels at tackling the most demanding enterprise vision challenges.
Metrics
Command A Vision delivers leading multimodal performance, surpassing models such as GPT-4.1, Llama 4 Maverick, Mistral Medium 3, and Pixtral Large across standard vision benchmarks. We selected a diverse set of nine benchmarks that represent both typical enterprise use cases and generalist academic evaluations. Command A Vision is particularly strong in chart, document, and OCR analysis, while also performing well on mathematical and reasoning-oriented evaluations such as MathVista (73.5%). Overall, it surpasses the leading non-thinking vision-language models, as shown in the table below. (Note: where numbers were unavailable from other providers' reports or public leaderboards, we filled them in with best-effort internal evaluations, typically run via VLMEvalKit.)
Model | ChartQA | InfoVQA | AI2D | MMMU (CoT) | MathVista | DocVQA | TextVQA | OCRBench | RealWorldQA | Avg |
---|---|---|---|---|---|---|---|---|---|---|
Command A Vision | 90.9% | 82.9% | 94.0% | 65.3% | 73.5% | 95.9% | 84.8% | 86.9% | 73.6% | 83.1% |
GPT-4.1 (2025-04-14) | 82.7% | 70.0% | 86.5% | 74.8% | 72.2% | 88.6% | 71.1% | 83.4% | 78.0% | 78.6% |
Pixtral Large | 88.1% | 59.9% | 93.8% | 64.0% | 69.4% | 93.3% | 79.3% | 74.1% | 69.3% | 76.8% |
Mistral Medium 3 | 82.6% | 71.5% | 93.7% | 65.0% | 70.5% | 95.3% | 83.5% | 75.7% | 67.2% | 78.3% |
Llama 3.2V 90B | 85.8% | 56.8% | 92.3% | 60.6% | 57.3% | 90.1% | 83.4% | 78.3% | 69.8% | 74.9% |
Llama 4 Maverick | 90.0% | 77.1% | 84.4% | 73.4% | 73.7% | 94.4% | 81.6% | 80.0% | 70.4% | 80.5% |
Training process and architectural details
Our model follows the Llava architecture: an MLP connector maps visual features from the SigLIP2-patch16-512 vision encoder into (soft) vision tokens. Each image is split into up to 12 tiles of 512x512 pixels, chosen to match the image's nearest supported aspect ratio, plus a single 512x512 global summary thumbnail. After pixel shuffle and the MLP connector, each tile corresponds to 256 tokens, and the resulting features are passed into the Command A text tower, a dense 111B-parameter language model. A single image therefore consumes up to 3,328 tokens (13 tiles x 256 tokens).
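As a rough illustration of this token budget, the sketch below estimates how many vision tokens a single image would consume under the tiling scheme described above. The tiling heuristic here (simply counting the 512x512 tiles needed to cover the image) is a simplification of the actual nearest-aspect-ratio logic, and the helper name is ours, not part of any Cohere API.

import math

TILE_SIZE = 512          # each tile is 512x512 pixels
MAX_TILES = 12           # at most 12 tiles per image, plus one global thumbnail
TOKENS_PER_TILE = 256    # after pixel shuffle + MLP, each tile maps to 256 tokens

def estimate_vision_tokens(width: int, height: int) -> int:
    """Rough estimate of the vision tokens for one image (illustrative only)."""
    # Number of 512x512 tiles needed to cover the image, capped at MAX_TILES.
    tiles_w = math.ceil(width / TILE_SIZE)
    tiles_h = math.ceil(height / TILE_SIZE)
    n_tiles = min(tiles_w * tiles_h, MAX_TILES)
    # +1 accounts for the global 512x512 summary thumbnail.
    return (n_tiles + 1) * TOKENS_PER_TILE

print(estimate_vision_tokens(1024, 1024))   # 1280 tokens (4 tiles + thumbnail)
print(estimate_vision_tokens(4000, 3000))   # 3328 tokens (capped at 12 tiles + thumbnail)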
We trained Command A Vision in three stages: vision-language alignment, supervised fine-tuning (SFT), and post-training with reinforcement learning (RL). In the first (alignment) stage, the vision encoder and language model weights remain frozen, so only the connector learns to map image-encoder features into the language model's embedding space. During the SFT stage, by contrast, we trained the vision encoder, the vision adapter, and the language model jointly on a diverse set of instruction-following multimodal tasks. We then merged several multimodal expert models, as we did for Command A, weighting them to balance the different parts of our data mixture and to reflect the relative importance of each expert and enterprise use case. Finally, in the post-training stage, we applied regularization methods and multiple RLHF algorithms, such as online Contrastive Policy Gradient, to align the model with enterprise and safety needs while further improving its performance.
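For intuition, the snippet below sketches the simplest form of weight-space model merging: a weighted average of expert checkpoints. The actual merging recipe and coefficients used for Command A Vision are not described here, so treat this purely as an illustration of the idea, with hypothetical expert names.

import torch

def merge_experts(state_dicts, weights):
    """Weighted average of expert checkpoints (illustrative sketch only).

    state_dicts: list of model state_dicts with identical keys and shapes.
    weights: merging coefficients, assumed to sum to 1.
    """
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# Hypothetical usage: three experts weighted by the importance of their data mixture.
# merged_sd = merge_experts([sd_docs, sd_charts, sd_general], weights=[0.4, 0.3, 0.3])
# model.load_state_dict(merged_sd)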
Capabilities and efficiency suited for enterprise
Command A Vision was built to serve enterprises across the capabilities that matter most to them. It preserves many of the text capabilities of Command A and inherits its key enterprise-specific text features, such as advanced retrieval-augmented generation (RAG) and multilingual performance across several key business languages. In addition, Command A Vision can be deployed privately on as few as two GPUs: it requires only two A100s, or a single H100 with 4-bit quantization.
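As an example of the smaller-footprint path, the snippet below shows one way to load the model in 4-bit precision using the bitsandbytes integration in transformers. This is our illustration of a standard Hugging Face quantization workflow, not an officially documented deployment recipe, and actual memory usage will depend on your setup.

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig

model_id = "CohereLabs/command-a-vision-07-2025"

# 4-bit NF4 quantization via bitsandbytes (requires `pip install bitsandbytes`).
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quant_config,
)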
Getting Started with Command A Vision
Try Command A Vision using our Hugging Face Space or on the Cohere platform.
To run the model locally, install transformers and run:
# pip install "transformers[dev-torch]@git+https://github.com/huggingface/transformers.git"
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "CohereLabs/command-a-vision-07-2025"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

# Format message with the Command-A-Vision chat template
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://images.pexels.com/photos/1108099/pexels-photo-1108099.jpeg",
            },
            {"type": "text", "text": "what is in this image?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    padding=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

gen_tokens = model.generate(**inputs, max_new_tokens=300)

print(
    processor.tokenizer.decode(
        gen_tokens[0][inputs.input_ids.shape[1] :], skip_special_tokens=True
    )
)
See CohereLabs/command-a-vision-07-2025 on the Hugging Face Hub for more information.
You can also use the model through Hugging Face Inference Providers:
import os
from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="cohere",
    api_key=os.environ["HF_TOKEN"],
)

completion = client.chat.completions.create(
    model="CohereLabs/command-a-vision-07-2025",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this image in one sentence.",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    },
                },
            ],
        }
    ],
)

print(completion.choices[0].message)
Finally, this work was made possible by the core multimodal team at Cohere, including Alexis Chevalier, Bharat Venkitesh, Evgenia Rusak, Hugo Dalla-Torre, Julian Mack, Kyle Duffy, Sebastian Hofstätter, Victor Machado, Viraat Aryabumi, Vlad Shmyhlo, Yongshuo Zong, Cassie Cao, and Pierre Harvey Richemond.
References
[1] Command A: An Enterprise-Ready Large Language Model
[2] SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
[3] Visual Instruction Tuning
[4] How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
[5] Contrastive Policy Gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion