---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: image-text-to-text
arxiv: 2503.02597
---

# AKI Model Card

`AKI` is the official checkpoint for the paper "[Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs](https://arxiv.org/abs/2503.02597)".

AKI is a multimodal foundation model that unlocks the causal attention in the LLM into modality-mutual attention (MMA). This enables the earlier modality (images) to incorporate information from the later modality (text), addressing vision-language misalignment without introducing additional parameters or increasing training time.

## Model Details

### Model Descriptions

- Vision Encoder: [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)
- Vision-Language Connector: [Perceiver Resampler](https://arxiv.org/abs/2204.14198)
- Language Decoder (LLM): [microsoft/Phi-3.5-mini-instruct](https://huggingface.co/microsoft/Phi-3.5-mini-instruct)
- Pretraining Datasets: [BLIP3-KALE](https://huggingface.co/datasets/Salesforce/blip3-kale) and [BLIP3-OCR-200M](https://huggingface.co/datasets/Salesforce/blip3-ocr-200m)
- SFT Datasets: VQAv2, GQA, VSR, OCR-VQA, A-OKVQA, ScienceQA, RefCOCO, RefCOCOg, RefCOCO+, Visual Genome, LLaVA-150k

### Model Sources

- Repository: [GitHub](https://github.com/sony/aki)
- Paper: [arXiv](https://arxiv.org/abs/2503.02597)

## How to Use

### Input Format

Given the nature of the training data, the AKI model is best suited to prompts in the following chat format:

```
<|system|>
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.<|end|>
<|user|>
Describe the scene of this image.<|end|>
<|assistant|>
```

> The image captures a beautiful autumn day in a park, with a pathway covered in a vibrant carpet of fallen leaves. The leaves are in various shades of red, orange, yellow, and brown, creating a warm and colorful atmosphere. The path is lined with trees displaying beautiful autumn foliage, adding to the picturesque setting. ...

### Inference Example

Please refer to the [notebook](demo.ipynb) for zero-shot inference. To build a local demo website, please refer to [local_demo.py](https://github.com/sony/aki/blob/main/codes/open_flamingo/local_demo.py).

> For the training scripts, please refer to the [GitHub repo](https://github.com/sony/aki).
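For orientation, a minimal sketch of what a zero-shot call can look like is shown below. This is not the repository's exact API: how the model, image processor, and tokenizer are obtained, and the exact `generate` argument names (written here in the OpenFlamingo style the codebase builds on), are assumptions; the notebook and `local_demo.py` linked above are authoritative.

```python
# Illustrative sketch only -- see demo.ipynb and local_demo.py in the GitHub repo
# for the authoritative loading and inference code. Argument names below follow an
# OpenFlamingo-style interface and may differ in detail from the actual repository.
import torch
from PIL import Image


def describe_image(model, image_processor, tokenizer, image_path: str, question: str) -> str:
    """Run one zero-shot query against an AKI model loaded as in the repo's notebook."""
    # Build the chat-format prompt shown in "Input Format" above.
    prompt = (
        "<|system|>\nA chat between a curious user and an artificial intelligence assistant. "
        "The assistant gives helpful, detailed, and polite answers to the user's questions.<|end|>\n"
        f"<|user|>\n{question}<|end|>\n"
        "<|assistant|>\n"
    )

    # Preprocess the image and tokenize the prompt.
    image = Image.open(image_path).convert("RGB")
    vision_x = image_processor(image).unsqueeze(0)   # batching/shape depend on the repo's processor
    lang_x = tokenizer(prompt, return_tensors="pt")

    # Placeholder generation settings, not the paper's evaluation configuration.
    with torch.no_grad():
        output_ids = model.generate(
            vision_x=vision_x,
            lang_x=lang_x["input_ids"],
            attention_mask=lang_x["attention_mask"],
            max_new_tokens=256,
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```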
## Evaluation Results

### Main Comparisons with the Same Configurations (Table 1)

Bold marks the best and italics the second-best result in each column.

| | MME<sup>P</sup> | MME<sup>C</sup> | MMB | SEED<sup>I</sup> | LLaVA<sup>W</sup> | MMMU | MathV<sup>mini</sup> | POPE | MM-Vet | RealWorldQA | CV-Bench<sup>2D</sup> | CV-Bench<sup>3D</sup> |
|--------------------------|------------|------------|------|-------------|---------------|------|----------------|------|-------|------------|-----------------|-----------------|
| (I&T)PT + (I&T)SFT | 1226.3 | 258.2 | 64.9 | 64.1 | 47.0 | 31.1 | 24.2 | 79.8 | 24.3 | 50.6 | 45.2 | 54.3 |
| CCA [Xing et al., 2024] | 1212.7 | 243.6 | _67.4_ | _65.3_ | _54.0_ | _34.6_ | _25.6_ | _81.9_ | _29.0_ | **52.7** | _56.0_ | 62.8 |
| (w/o T&I)PT | 1046.3 | 226.4 | 31.7 | 45.1 | 38.1 | 27.2 | 23.8 | 65.0 | 17.2 | 40.1 | 53.2 | 54.8 |
| (w/o I&T)PT | 1013.2 | 208.6 | 32.0 | 43.3 | 37.9 | 27.7 | 22.4 | 70.4 | 20.6 | 39.5 | 55.4 | 53.0 |
| (w/o T&I)SFT | 1194.8 | _289.3_ | 58.5 | 61.1 | 40.2 | 28.0 | 21.9 | 79.0 | 22.8 | 47.8 | 41.4 | _63.0_ |
| (w/o I&T)SFT | 1166.2 | 264.3 | 58.4 | 60.8 | 36.9 | 26.7 | 23.1 | 76.8 | 20.4 | 46.9 | 43.3 | 61.2 |
| DOT (Ours) | _1267.8_ | 251.4 | 43.8 | 54.7 | 47.5 | 30.7 | _25.6_ | **82.7** | 25.0 | 50.5 | 52.2 | 58.1 |
| MMA (Ours) | **1363.7** | **315.4** | **71.8** | **67.1** | **59.6** | **37.3** | **26.4** | **82.7** | **30.2** | _52.3_ | **57.8** | **64.1** |
| **Improvements** | 10.9% | 29.5% | 4.3% | 2.8% | 10.4% | 7.8% | 3.1% | 1.0% | 4.1% | - | 3.2% | 2.1% |

### AKI-4B (Table 2)

| | MME<sup>P</sup> | MME<sup>C</sup> | MMB | SEED<sup>I</sup> | LLaVA<sup>W</sup> | MMMU | MathV<sup>mini</sup> | POPE | MM-Vet | RealWorldQA | CV-Bench<sup>2D</sup> | CV-Bench<sup>3D</sup> |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AKI-4B | **1491.9** | **362.9** | **73.1** | **69.4** | **74.6** | **38.7** | **32.1** | **86.9** | **40.8** | **58.9** | **62.1** | **71.8** |

## Ethical Considerations

_Note: This section is mainly adapted from the [xgen-mm](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-base-r-v1/blob/main/README.md) model card._

This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety.

## License

Our code and weights are released under the CC-BY-NC 4.0 license. The copyright of the pre-training and fine-tuning data remains with the original data owners.

## Citations

```bibtex
@misc{wywang2025AKI,
    title={Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs},
    author={Wei-Yao Wang and Zhao Wang and Helen Suzuki and Yoshiyuki Kobayashi},
    year={2025},
    eprint={2503.02597},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2503.02597},
}
```