Sony /

SwyWang committed (verified)
Commit ed1788a · 1 Parent(s): 900060a

Upload 5 files

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+ example_image/aki_compressed.jpg filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -7,13 +7,15 @@ pipeline_tag: image-text-to-text
 
 # AKI Model Card
 `AKI` is the official checkpoint for the paper "[Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs](https://arxiv.org/abs/2503.02597)".
- AKI is a multimodal foundation model that unlocks causal attention in the LLM into modality-mutual attention (MMA), which enables the earlier modality (images) to incorporate information from the latter modality (text) without introducing additional parameters and increasing training time.
+ AKI is a multimodal foundation model that unlocks causal attention in the LLM into modality-mutual attention (MMA), which enables the earlier modality (images) to incorporate information from the later modality (text) to address vision-language misalignment, without introducing additional parameters or increasing training time.
 
 ## Model Details
 ### Model Descriptions
 - Vision Encoder: [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)
- - Vision-Language Connector: Perceiver Resampler
+ - Vision-Language Connector: [Perceiver Resampler](https://arxiv.org/abs/2204.14198)
 - Language Decoder (LLM): [microsoft/Phi-3.5-mini-instruct](https://huggingface.co/microsoft/Phi-3.5-mini-instruct)
+ - Pretraining Datasets: [Blip3-kale](https://huggingface.co/datasets/Salesforce/blip3-kale) and [Blip3-OCR-200m](https://huggingface.co/datasets/Salesforce/blip3-ocr-200m)
+ - SFT Datasets: VQAv2, GQA, VSR, OCRVQA, A-OKVQA, ScienceQA, RefCOCO, RefCOCOg, RefCOCO+, VisualGnome, LLaVA-150k
 
 ### Model Sources
 - Repository: [GitHub](https://github.com/sony/aki)
@@ -35,11 +37,10 @@ Describe the scene of this image.
 > : The image captures a beautiful autumn day in a park, with a pathway covered in a vibrant carpet of fallen leaves. The leaves are in various shades of red, orange, yellow, and brown, creating a warm and colorful atmosphere. The path is lined with trees displaying beautiful autumn foliage, adding to the picturesque setting. ...
 
 ### Inference Example
- > Please refer to the [GitHub repo](https://github.com/sony/aki) for the training scripts.
-
- ```=python
-
- ```
+ Please refer to the [notebook](demo.ipynb) for zero-shot inference.
+ To build a local demo website, please refer to [local_demo.py](https://github.com/sony/aki/blob/main/codes/open_flamingo/local_demo.py).
+
+ > For the training scripts, please refer to the [GitHub repo](https://github.com/sony/aki).
 
 ## Evaluation Results
 ### Main Comparisons with the Same Configurations (Table 1)
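
The modality-mutual attention described in the README can be pictured as relaxing the causal mask for the image positions. The sketch below is illustrative only, not the repository's implementation; `mma_mask` and its exact masking rule are assumptions based on the paper's description. Image tokens, which precede the text in the sequence, are additionally allowed to attend to the text tokens that follow them, while the text itself stays causal.

```python
import torch

def mma_mask(num_image_tokens: int, num_text_tokens: int) -> torch.Tensor:
    """Boolean attention mask (True = position may be attended to).

    A purely causal mask would stop the image tokens (which come first)
    from ever seeing the text prompt; modality-mutual attention unlocks
    those rows so the visual features can also condition on the text.
    """
    total = num_image_tokens + num_text_tokens
    mask = torch.tril(torch.ones(total, total, dtype=torch.bool))  # causal baseline
    mask[:num_image_tokens, :] = True  # image rows may attend to every position
    return mask

# 3 image tokens followed by 4 text tokens
print(mma_mask(3, 4).int())
```

In the released code, attention-mask handling during generation appears to be adjusted through the `_aki_update_model_kwargs_for_generation` hook that demo.ipynb installs on `GenerationMixin`.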
config.json CHANGED
@@ -7,5 +7,8 @@
 "num_vision_tokens": 144,
 "pad_token_id": 32011,
 "tokenizer": null,
- "vision_encoder_path": "google/siglip-so400m-patch14-384"
+ "vision_encoder_path": "google/siglip-so400m-patch14-384",
+ "n_px": 384,
+ "norm_mean": 0.5,
+ "norm_std": 0.5
 }
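
The three new keys describe the image preprocessing: `n_px` is the square input resolution and `norm_mean`/`norm_std` the per-channel normalization. A minimal sketch of how they can be consumed, mirroring what demo.ipynb does; the checkpoint path is a placeholder, not an actual repo id:

```python
from transformers import AutoConfig
from torchvision.transforms import Compose, Resize, Lambda, ToTensor, Normalize, InterpolationMode

ckpt_path = "path/to/this/checkpoint"  # placeholder: local clone of this repo
config = AutoConfig.from_pretrained(ckpt_path)

n_px = getattr(config, "n_px", 384)            # square input resolution
norm_mean = getattr(config, "norm_mean", 0.5)  # per-channel mean
norm_std = getattr(config, "norm_std", 0.5)    # per-channel std

image_processor = Compose([
    Resize((n_px, n_px), interpolation=InterpolationMode.BICUBIC, antialias=True),
    Lambda(lambda im: im.convert("RGB")),
    ToTensor(),
    Normalize(mean=(norm_mean,) * 3, std=(norm_std,) * 3),
])
```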
demo.ipynb ADDED
@@ -0,0 +1,217 @@
+ {
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/Weiyao.Wang/virtualenvs/Kanzo/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+ "  from .autonotebook import tqdm as notebook_tqdm\n"
+ ]
+ }
+ ],
+ "source": [
+ "from src.aki import AKI\n",
+ "from transformers import AutoTokenizer, AutoConfig\n",
+ "from torchvision.transforms import Compose, Resize, Lambda, ToTensor, Normalize\n",
+ "from PIL import Image\n",
+ "try:\n",
+ "    from torchvision.transforms import InterpolationMode\n",
+ "    BICUBIC = InterpolationMode.BICUBIC\n",
+ "except ImportError:\n",
+ "    BICUBIC = Image.BICUBIC"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def apply_prompt_template(query: str) -> str:\n",
+ "    SYSTEM_BASE = \"A chat between a curious user and an artificial intelligence assistant.\"\n",
+ "    SYSTEM_DETAIL = \"The assistant gives helpful, detailed, and polite answers to the user's questions.\"\n",
+ "    SYSTEM_MESSAGE = SYSTEM_BASE + \" \" + SYSTEM_DETAIL\n",
+ "    SYSTEM_MESSAGE_ROLE = '<|system|>' + '\\n' + SYSTEM_MESSAGE + '<|end|>\\n'\n",
+ "\n",
+ "    s = (\n",
+ "        f'<s> {SYSTEM_MESSAGE_ROLE}'\n",
+ "        f'<|user|>\\n<image>\\n{query}<|end|>\\n<|assistant|>\\n'\n",
+ "    )\n",
+ "    return s\n",
+ "\n",
+ "\n",
+ "def load_model_and_processor(ckpt_path, config):\n",
+ "    n_px = getattr(config, \"n_px\", 384)\n",
+ "    norm_mean = getattr(config, \"norm_mean\", 0.5)\n",
+ "    norm_std = getattr(config, \"norm_std\", 0.5)\n",
+ "\n",
+ "    # replace GenerationMixin to modify attention mask handling\n",
+ "    from transformers.generation.utils import GenerationMixin\n",
+ "    from open_flamingo import _aki_update_model_kwargs_for_generation\n",
+ "    GenerationMixin._update_model_kwargs_for_generation = _aki_update_model_kwargs_for_generation\n",
+ "    \n",
+ "    tokenizer = AutoTokenizer.from_pretrained(ckpt_path)\n",
+ "    model = AKI.from_pretrained(ckpt_path, tokenizer=tokenizer)\n",
+ "    image_processor = Compose([\n",
+ "        Resize((n_px, n_px), interpolation=InterpolationMode.BICUBIC, antialias=True),\n",
+ "        Lambda(lambda x: x.convert('RGB')),\n",
+ "        ToTensor(),\n",
+ "        Normalize(mean=(norm_mean, norm_mean, norm_mean), std=(norm_std, norm_std, norm_std))\n",
+ "    ])\n",
+ "\n",
+ "    model.eval().cuda()\n",
+ "    print(\"Model initialization is done.\")\n",
+ "    return model, image_processor, tokenizer"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n",
+ "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n",
+ "`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.\n",
+ "Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.\n",
+ "Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00, 1.52s/it]\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Loading weights from local directory\n",
+ "Model initialization is done.\n"
+ ]
+ }
+ ],
+ "source": [
+ "model_path = \"/home/Weiyao.Wang/projects/Multimodal-Foundation-Models/codes/open_flamingo/aki-phi3.5-mini-4b\"\n",
+ "config = AutoConfig.from_pretrained(model_path)\n",
+ "# Load model, image_processor, tokenizer\n",
+ "model, image_processor, tokenizer = load_model_and_processor(model_path, config=config)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def process_input(image_path: str, text_input: str) -> str:\n",
+ "    \"\"\"\n",
+ "    Processes the input image and text prompt to generate a response from the AKI model.\n",
+ "    \n",
+ "    Args:\n",
+ "        image_path (str): The path of the image.\n",
+ "        text_input (str): The text prompt to accompany the image.\n",
+ "    \n",
+ "    Returns:\n",
+ "        str: The generated text from the model.\n",
+ "    \"\"\"\n",
+ "\n",
+ "    image = Image.open(image_path).convert('RGB')\n",
+ "    \n",
+ "    # tokenize text input with the chat template\n",
+ "    prompt = apply_prompt_template(text_input)\n",
+ "    lang_x = tokenizer([prompt], return_tensors='pt', add_special_tokens=False)\n",
+ "\n",
+ "    print(\"Prompt:\", prompt)\n",
+ "    \n",
+ "    # Preprocess inputs for the model\n",
+ "    vision_x = image_processor(image)[None, None, None, ...].cuda()\n",
+ "\n",
+ "    generation_kwargs = {\n",
+ "        'max_new_tokens': 256,\n",
+ "        'do_sample': False,\n",
+ "    }\n",
+ "    \n",
+ "    # Generate the model's output based on the inputs\n",
+ "    output = model.generate(\n",
+ "        vision_x=vision_x.cuda(),\n",
+ "        lang_x=lang_x['input_ids'].cuda(),\n",
+ "        attention_mask=lang_x['attention_mask'].cuda(),\n",
+ "        **generation_kwargs\n",
+ "    )\n",
+ "    \n",
+ "    # Decode the generated output into readable text\n",
+ "    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)\n",
+ "    \n",
+ "    return generated_text"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Prompt: <s> <|system|>\n",
+ "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.<|end|>\n",
+ "<|user|>\n",
+ "<image>\n",
+ "Describe the scene of this image.<|end|>\n",
+ "<|assistant|>\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Response:\n",
+ " The image captures a beautiful autumn day in a park, with a pathway covered in a vibrant carpet of fallen leaves. The leaves are in various shades of red, orange, yellow, and brown, creating a warm and colorful atmosphere. The path is lined with trees displaying beautiful autumn foliage, adding to the picturesque setting.\n",
+ "\n",
+ "A few benches are scattered along the path, providing visitors with a place to sit and enjoy the view of the falling leaves and the surrounding trees. The overall scene is serene and inviting, making it an ideal spot for relaxation and appreciating the beauty of the season.\n"
+ ]
+ }
+ ],
+ "source": [
+ "image_path = \"example_image/aki_compressed.jpg\"\n",
+ "text_input = \"Describe the scene of this image.\"\n",
+ "response = process_input(image_path, text_input)\n",
+ "print(\"Response:\\n\", response)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Kanzo",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.12.6"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+ }
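
One detail of the demo worth noting: `process_input` indexes the preprocessed image with `[None, None, None, ...]`. Following the OpenFlamingo-style interface the model builds on, this appears to expand a single image to the (batch, num_media, num_frames, channels, height, width) layout that `model.generate` expects; the dimension names are an assumption, while the shapes are simply what the indexing produces:

```python
import torch

# A preprocessed image from demo.ipynb's image_processor has shape (3, n_px, n_px).
img = torch.zeros(3, 384, 384)
vision_x = img[None, None, None, ...]  # add batch, media, and frame dimensions
print(vision_x.shape)                  # torch.Size([1, 1, 1, 3, 384, 384])
```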
example_image/aki_compressed.jpg ADDED

Git LFS Details

  • SHA256: 38bc61b8e8915ef74770baa7136fc5fc9228e409fdb12f05f967a537b4fed49d
  • Pointer size: 132 Bytes
  • Size of remote file: 6.28 MB
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1154b183974b8ab07bd8e5f36a093cb50a37751825caacd648b0acd92e5cfc4a
+ size 17323922632