Introducing Command A Vision: Multimodal AI built for Business

Community Article · Published July 31, 2025


Today we introduce Command A Vision: a cutting-edge vision-language model with open weights. Command A Vision delivers exceptional performance across multimodal vision tasks while maintaining the strong text capabilities of Command A. As Cohere's latest flagship model, Command A Vision is a 112B dense model built upon Command A. We are proud to release its weights to the community on Hugging Face.

Command A Vision empowers businesses to automate tedious tasks, unlock valuable insights from visual data, and make highly accurate, data-driven decisions through document optical character recognition (OCR) and image analysis. Whether it's interpreting product manuals with complex diagrams or analyzing photographs of real-world scenes for risk detection, Command A Vision excels at tackling the most demanding enterprise vision challenges.

Metrics

Command A Vision delivers leading multimodal performance, surpassing models such as GPT-4.1, Llama 4 Maverick, Mistral Medium 3, and Pixtral Large across standard vision benchmarks. We selected a diverse set of nine benchmarks to represent both typical enterprise use cases and generalist, standard academic evaluations. Command A Vision demonstrates particular strength in chart, document, and OCR analysis, while also excelling in mathematical and proto-reasoning evaluations such as MathVista (73.5%). Overall, it surpasses the leading non-thinking vision-language models, as shown in the table below. (Note: when data was unavailable from other providers' reports or public leaderboards, missing numbers were filled in using best-effort internal evaluations, typically via VLMEvalKit.)

| Model | ChartQA | InfoVQA | AI2D | MMMU (CoT) | MathVista | DocVQA | TextVQA | OCRBench | RealWorldQA | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Command A Vision | 90.9% | 82.9% | 94.0% | 65.3% | 73.5% | 95.9% | 84.8% | 86.9% | 73.6% | 83.1% |
| GPT-4.1 (2025-04-14) | 82.7% | 70.0% | 86.5% | 74.8% | 72.2% | 88.6% | 71.1% | 83.4% | 78.0% | 78.6% |
| Pixtral Large | 88.1% | 59.9% | 93.8% | 64.0% | 69.4% | 93.3% | 79.3% | 74.1% | 69.3% | 76.8% |
| Mistral Medium 3 | 82.6% | 71.5% | 93.7% | 65.0% | 70.5% | 95.3% | 83.5% | 75.7% | 67.2% | 78.3% |
| Llama 3.2V 90B | 85.8% | 56.8% | 92.3% | 60.6% | 57.3% | 90.1% | 83.4% | 78.3% | 69.8% | 74.9% |
| Llama 4 Maverick | 90.0% | 77.1% | 84.4% | 73.4% | 73.7% | 94.4% | 81.6% | 80.0% | 70.4% | 80.5% |

Training process and architectural details

Our model follows the LLaVA architecture: an MLP connector maps visual features from the SigLIP2-patch16-512 vision encoder into (soft) vision tokens. Based on its dimensions, each image is divided into up to 12 tiles of 512x512 pixels (targeting the nearest aspect ratio), plus a single 512x512 global summary thumbnail. After the MLP and a pixel-shuffle step that reduces each tile to 256 tokens, the resulting features are passed into the Command A text tower, a dense 111B-parameter text LLM. A single image therefore consumes at most (12 + 1) x 256 = 3328 tokens.
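As a concrete illustration of this budget, here is a minimal sketch that estimates the vision-token count for an image under the tiling scheme described above. The tile-counting heuristic is an assumption for illustration only; the released processor selects tiles by nearest aspect ratio and may count them differently.

# Hypothetical illustration of the image token budget described above.
# The exact tiling heuristic used by the released processor may differ.
import math

TILE = 512             # tile resolution (512x512)
MAX_TILES = 12         # at most 12 tiles per image
TOKENS_PER_TILE = 256  # tokens per tile after the MLP connector and pixel shuffle

def num_image_tokens(width: int, height: int) -> int:
    """Approximate vision-token count for one image: tiles + global thumbnail."""
    tiles_w = math.ceil(width / TILE)
    tiles_h = math.ceil(height / TILE)
    tiles = min(tiles_w * tiles_h, MAX_TILES)
    return (tiles + 1) * TOKENS_PER_TILE  # +1 for the 512x512 summary thumbnail

print(num_image_tokens(1024, 1024))  # 4 tiles + thumbnail -> 1280 tokens
print(num_image_tokens(4096, 2048))  # capped at 12 tiles + thumbnail -> 3328 tokens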

We trained Command A Vision in three stages: vision-language alignment, supervised fine-tuning (SFT), and post-training with reinforcement learning (RL). In the first stage (alignment), the vision encoder and language model weights remain frozen, so only the connector is trained; this maps image-encoder features into the language model's embedding space. During the SFT stage, by contrast, we simultaneously trained the vision encoder, the vision adapter, and the language model on a diverse set of instruction-following multimodal tasks. We then performed multimodal model merging over several experts, as with Command A, to balance the different parts of our data mixture and reflect the relative importance of each expert and its enterprise use cases. Finally, in the post-training stage, we employed regularization methods as well as multiple RLHF algorithms, such as online Contrastive Policy Gradient, to align the model with enterprise and safety needs while further enhancing its performance.
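To make the merging step concrete, the sketch below averages several expert checkpoints in weight space with per-parameter weighted averaging. This is only an illustrative assumption: the exact merging recipe used for Command A Vision is not described in this post, and the checkpoint paths and weights shown are hypothetical.

# Minimal sketch of weight-space model merging across expert checkpoints.
# Assumes simple per-parameter weighted averaging; the actual recipe used
# for Command A Vision is not specified here.
import torch

def merge_state_dicts(expert_state_dicts, weights):
    """Average several expert checkpoints with the given importance weights."""
    assert len(expert_state_dicts) == len(weights)
    total = sum(weights)
    merged = {}
    for name in expert_state_dicts[0]:
        merged[name] = sum(
            (w / total) * sd[name].float()
            for sd, w in zip(expert_state_dicts, weights)
        )
    return merged

# Usage (hypothetical expert checkpoints and weights):
# experts = [torch.load(p, map_location="cpu") for p in ["ocr.pt", "charts.pt", "general.pt"]]
# merged = merge_state_dicts(experts, weights=[0.4, 0.3, 0.3])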

Capabilities and efficiency suited for enterprise

Command A Vision was built to serve enterprises across the capabilities that matter most to them. It preserves much of the text performance of Command A and inherits its key enterprise-specific text features, such as advanced retrieval-augmented generation (RAG) and multilingual performance across several key business languages. In addition, Command A Vision can be deployed privately on two GPUs or fewer: it requires only two A100s, or a single H100 with 4-bit quantization.
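As an illustration of the quantized path, here is a minimal sketch of loading the weights in 4-bit with bitsandbytes through transformers. This assumes the released checkpoint is compatible with standard transformers quantization and is not an official deployment recipe; the full getting-started example follows below.

# Hedged sketch: 4-bit loading via bitsandbytes (assumes the checkpoint works
# with standard transformers quantization; not an official deployment recipe).
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig

model_id = "CohereLabs/command-a-vision-07-2025"
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",  # aims to fit on a single H100 in 4-bit
    quantization_config=quant_config,
)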

Getting Started with Command A Vision

Try Command A Vision using our Hugging Face Space or on the Cohere platform.

To run locally, install transformers and run:

# pip install "transformers[dev-torch]@git+https://github.com/huggingface/transformers.git"

import torch

from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "CohereLabs/command-a-vision-07-2025"

# Load the processor and the model in half precision across available GPUs
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

# Format message with the Command-A-Vision chat template
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://images.pexels.com/photos/1108099/pexels-photo-1108099.jpeg",
            },
            {"type": "text", "text": "what is in this image?"},
        ],
    },
]

# Apply the chat template, tokenize the text and image, and move inputs to the model's device
inputs = processor.apply_chat_template(
    messages,
    padding=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate a response and decode only the newly generated tokens
gen_tokens = model.generate(**inputs, max_new_tokens=300)

print(
    processor.tokenizer.decode(
        gen_tokens[0][inputs.input_ids.shape[1] :], skip_special_tokens=True
    )
)

See CohereLabs/command-a-vision-07-2025 on the Hugging Face Hub for more information.

You can also use the model through Hugging Face Inference Providers:

import os
from huggingface_hub import InferenceClient

# Route the request through the Cohere provider on Hugging Face Inference Providers
client = InferenceClient(
    provider="cohere",
    api_key=os.environ["HF_TOKEN"],
)

completion = client.chat.completions.create(
    model="CohereLabs/command-a-vision-07-2025",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this image in one sentence."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    }
                }
            ]
        }
    ],
)

print(completion.choices[0].message)

Finally, this work was made possible by the core Multimodal team at Cohere, including: Alexis Chevalier, Bharat Venkitesh, Evgenia Rusak, Hugo Dalla-Torre, Julian Mack, Kyle Duffy, Sebastian Hofstätter, Victor Machado, Viraat Aryabumi, Vlad Shmyhlo, Yongshuo Zong, Cassie Cao, and Pierre Harvey Richemond.

References

[1] Command A: An Enterprise-Ready Large Language Model
[2] SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
[3] Visual Instruction Tuning
[4] How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
[5] Contrastive Policy Gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion

