{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "1682fb28-95ae-4966-834c-84304531006f", "metadata": {}, "source": [ "# Visual-language assistant with Video-LLaVA and OpenVINO\n", "\n", "Video-LLaVA (Learning United Visual Representation by Alignment Before Projection, [paper](https://arxiv.org/pdf/2311.10122.pdf)) is a Large Vision-Language Model (LVLM) that breaks new ground by understanding both images and videos through a single, unified visual representation. While LLaVA excels at image-based tasks, Video-LLaVA expands this fluency to the dynamic world of videos, enabling seamless comprehension and reasoning across both visual domains. This means it can answer questions, generate text, and perform other tasks with equal ease, regardless of whether it's presented with a still image or a moving scene.\n", "\n", "In this tutorial we consider how to use Video-LLaVA model to build multimodal chatbot. For demonstration purposes we will use [Video-LLaVA-7B](https://huggingface.co/LanguageBind/Video-LLaVA-7B) model for conversion.\n", "\n", "The tutorial consists from following steps:\n", "\n", "- Install prerequisites\n", "- Prepare input processor and tokenizer\n", "- Download original model\n", "- Compress model weights to 4 and 8 bits using NNCF\n", "- Convert model to OpenVINO Intermediate Representation (IR) format\n", "- Prepare OpenVINO-based inference pipeline\n", "- Run OpenVINO model\n", "\n", "\n", "#### Table of contents:\n", "\n", "- [About model](#About-model)\n", "- [Prerequisites](#Prerequisites)\n", "- [Build model and convert it to OpenVINO IR format](#Build-model-and-convert-it-to-OpenVINO-IR-format)\n", " - [Prepare helpers for model conversion](#Prepare-helpers-for-model-conversion)\n", " - [Convert and Optimize Model](#Convert-and-Optimize-Model)\n", " - [Instantiate PyTorch model $\\Uparrow$(#Table-of-content:)](#Instantiate-PyTorch-model-\\Uparrow(#Table-of-content:))\n", " - [Compress Model weights to 4 and 8 bits using NNCF $\\Uparrow$(#Table-of-content:)](#Compress-Model-weights-to-4-and-8-bits-using-NNCF-\\Uparrow(#Table-of-content:))\n", " - [Convert model to OpenVINO IR format $\\Uparrow$(#Table-of-content:)](#Convert-model-to-OpenVINO-IR-format-\\Uparrow(#Table-of-content:))\n", "- [Prepare OpenVINO based inference pipeline](#Prepare-OpenVINO-based-inference-pipeline)\n", "- [Run model inference](#Run-model-inference)\n", " - [Select inference device](#Select-inference-device)\n", " - [Load OpenVINO model](#Load-OpenVINO-model)\n", " - [Prepare input data](#Prepare-input-data)\n", " - [Test model inference](#Test-model-inference)\n", "- [Interactive demo](#Interactive-demo)\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "b802ee93-aae9-45e8-839b-eb0beeb5f15b", "metadata": {}, "source": [ "## About model\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "Video-LLaVA connects pre-trained [CLIP ViT-L/14](https://openai.com/research/clip) visual encoders and large language model using a simple projection matrix\n", "\n", "\n", "\n", "\n", "\n", "More details about model can be found in original [paper](https://arxiv.org/pdf/2311.10122.pdf) and [repo](https://github.com/PKU-YuanGroup/Video-LLaVA)." 
] }, { "attachments": {}, "cell_type": "markdown", "id": "5ae5da98-0d3d-424b-afe2-0907fdb849da", "metadata": {}, "source": [ "## Prerequisites\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "Install required dependencies" ] }, { "cell_type": "code", "execution_count": 1, "id": "1917249a-e452-46c5-ba03-8c417e7bace4", "metadata": { "ExecuteTime": { "end_time": "2023-11-03T14:15:21.466964700Z", "start_time": "2023-11-03T14:15:21.231032100Z" } }, "outputs": [], "source": [ "%pip install -q torch \"torchvision<0.17.0\" \"transformers>=4.31.0,<4.35.0\" \"pytorchvideo\" \"einops\" \"peft==0.6.2\" --extra-index-url https://download.pytorch.org/whl/cpu\n", "%pip install -q opencv_python decord sentencepiece protobuf \"openvino>=2023.2.0\" \"nncf>=2.7.0\" \"gradio>=4.19\"" ] }, { "cell_type": "code", "execution_count": 2, "id": "5b8128a3-bc08-43de-804c-7242c5fda869", "metadata": { "ExecuteTime": { "end_time": "2023-11-03T14:15:21.466964700Z", "start_time": "2023-11-03T14:15:21.231032100Z" } }, "outputs": [], "source": [ "from pathlib import Path\n", "import sys\n", "\n", "repo_dir = Path(\"Video-LLaVA\")\n", "\n", "if not repo_dir.exists():\n", " !git clone https://github.com/PKU-YuanGroup/Video-LLaVA.git\n", "\n", "sys.path.insert(0, str(repo_dir.resolve()))" ] }, { "attachments": {}, "cell_type": "markdown", "id": "460904f4-c902-40b7-8b4c-244d86d0a670", "metadata": {}, "source": [ "
ffmpeg
package. To install it for your system, visit the [official FFmpeg download page](https://ffmpeg.org/download.html).\n",
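"\n", "A quick way to check whether `ffmpeg` is already available on the current machine is shown below. This is only a convenience sketch using the Python standard library; install the package with your system package manager if it is missing.\n", "\n", "```python\n", "import shutil\n", "\n", "# Returns the path to the ffmpeg executable, or None if it is not on PATH.\n", "ffmpeg_path = shutil.which(\"ffmpeg\")\n", "print(ffmpeg_path or \"ffmpeg was not found - please install it before running the video examples\")\n", "```\n",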
"\n", "\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:nncf:Statistics of the bitwidth distribution:\n", "+--------------+-----------------+--------------------+\n", "| Num bits (N) | % all weight | % internal weights |\n", "+==============+=================+====================+\n", "| 8 | 22% (58 / 225) | 20% (56 / 223) |\n", "+--------------+-----------------+--------------------+\n", "| 4 | 78% (167 / 225) | 80% (167 / 223) |\n", "+--------------+-----------------+--------------------+\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "3cc1be2cf3e746d5bb9fe24d533eed19", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Output()" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n" ], "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Applying weight compression to second stage Video-LLaVA model\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "85211010a53343a098445c97ea953848", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Output()" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n" ], "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:nncf:Statistics of the bitwidth distribution:\n", "+--------------+-----------------+--------------------+\n", "| Num bits (N) | % all weight | % internal weights |\n", "+==============+=================+====================+\n", "| 8 | 23% (58 / 226) | 20% (56 / 224) |\n", "+--------------+-----------------+--------------------+\n", "| 4 | 77% (168 / 226) | 80% (168 / 224) |\n", "+--------------+-----------------+--------------------+\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "066d12b12fba4b8c928cd40cb051a397", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Output()" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n" ], "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Video-LLaVA model successfully converted\n" ] } ], "source": [ "if compression_mode.value == \"INT4\":\n", " compressed_model_dir = Path(\"videollava/INT4_compressed_weights\")\n", " videollava_wc_parameters = dict(mode=nncf.CompressWeightsMode.INT4_ASYM, group_size=128, ratio=0.8)\n", "else:\n", " compressed_model_dir = Path(\"videollava/INT8_compressed_weights\")\n", " videollava_wc_parameters = dict(mode=nncf.CompressWeightsMode.INT8)\n", "\n", "if not compressed_model_dir.exists():\n", " compressed_model_dir.mkdir(exist_ok=True, parents=True)\n", " model = LlavaLlamaForCausalLM.from_pretrained(model_id)\n", " model.resize_token_embeddings(len(tokenizer))\n", "\n", " if hasattr(config, \"max_sequence_length\"):\n", " context_len = config.max_sequence_length\n", " else:\n", " context_len = 2048\n", " image_tower = model.get_image_tower()\n", " if not image_tower.is_loaded:\n", " image_tower.load_model()\n", " video_tower = model.get_video_tower()\n", " if not video_tower.is_loaded:\n", " video_tower.load_model()\n", "\n", " model.eval()\n", " with torch.no_grad():\n", " convert_videollava(\n", " model,\n", " compressed_model_dir,\n", " videollava_wc_parameters=videollava_wc_parameters,\n", " )\n", " del model\n", " gc.collect();" ] }, { "attachments": {}, "cell_type": "markdown", "id": "a1675134-3f4a-46c9-9e21-f162c155ebf5", "metadata": {}, "source": [ "## Prepare OpenVINO based inference pipeline\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "`OVLlavaLlamaForCausalLM` class provides ease-to-use interface for using model in generation scenario. It is based on `transformers.generation.GenerationMixin` that gives us opportunity to reuse all reach capabilities for generation implemented in HuggingFace Transformers library. 
More details about this interface can be found in [HuggingFace documentation](https://huggingface.co/docs/transformers/main_classes/text_generation).\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "99c35e39-2748-4500-b3c9-e793c36a8d0b", "metadata": { "ExecuteTime": { "end_time": "2023-11-03T14:29:15.077528700Z", "start_time": "2023-11-03T14:29:15.033990300Z" } }, "outputs": [], "source": [ "from transformers.generation import GenerationConfig, GenerationMixin\n", "from transformers.modeling_outputs import CausalLMOutputWithPast\n", "import numpy as np\n", "import torch\n", "\n", "\n", "class OVLlavaLlamaForCausalLM(GenerationMixin):\n", " def __init__(self, core, model_dir, device):\n", " self.model = core.read_model(model_dir / \"videollava_with_past.xml\")\n", " self.model_input_embed = core.compile_model(model_dir / \"videollava_input_embed.xml\", device)\n", " self.input_names = {key.get_any_name(): idx for idx, key in enumerate(self.model.inputs)}\n", " self.output_names = {key.get_any_name(): idx for idx, key in enumerate(self.model.outputs)}\n", " self.key_value_input_names = [key for key in self.input_names if \"key_values\" in key]\n", " self.key_value_output_names = [key for key in self.output_names if \"present\" in key]\n", " compiled_model = core.compile_model(self.model, device)\n", " self.request = compiled_model.create_infer_request()\n", " self.config = transformers.AutoConfig.from_pretrained(model_dir)\n", " self.generation_config = GenerationConfig.from_model_config(config)\n", " self.main_input_name = \"input_ids\"\n", " self.device = torch.device(\"cpu\")\n", " self.num_pkv = 2\n", "\n", " def can_generate(self):\n", " \"\"\"Returns True to validate the check that the model using `GenerationMixin.generate()` can indeed generate.\"\"\"\n", " return True\n", "\n", " def __call__(\n", " self,\n", " input_ids: torch.LongTensor,\n", " images: torch.Tensor,\n", " attention_mask: Optional[torch.LongTensor] = None,\n", " prefix_mask: Optional[torch.LongTensor] = None,\n", " past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,\n", " **kwargs,\n", " ) -> CausalLMOutputWithPast:\n", " return self.forward(input_ids, images, attention_mask, prefix_mask, past_key_values)\n", "\n", " def forward(\n", " self,\n", " input_ids: torch.LongTensor,\n", " images: torch.Tensor,\n", " attention_mask: Optional[torch.LongTensor] = None,\n", " prefix_mask: Optional[torch.LongTensor] = None,\n", " past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,\n", " **kwargs,\n", " ) -> CausalLMOutputWithPast:\n", " \"\"\"General inference method\"\"\"\n", " inputs = {}\n", " if past_key_values is not None:\n", " # Flatten the past_key_values\n", " attention_mask = torch.ones(\n", " (input_ids.shape[0], past_key_values[-1][-1].shape[-2] + 1),\n", " dtype=input_ids.dtype,\n", " )\n", " past_key_values = (past_key_value for pkv_per_layer in past_key_values for past_key_value in pkv_per_layer)\n", " # Add the past_key_values to the decoder inputs\n", " inputs = dict(zip(self.key_value_input_names, past_key_values))\n", "\n", " else:\n", " return self.forward_with_image(input_ids, images, attention_mask)\n", " inputs[\"input_ids\"] = np.array(input_ids)\n", "\n", " if \"attention_mask\" in self.input_names:\n", " inputs[\"attention_mask\"] = np.array(attention_mask)\n", "\n", " # Run inference\n", " self.request.start_async(inputs, share_inputs=True)\n", " self.request.wait()\n", "\n", " logits = torch.from_numpy(self.request.get_tensor(\"logits\").data)\n", "\n", " # 
Tuple of length equal to: number of layers * number of past_key_values per decoder layer (2 corresponds to the self-attention layer)\n", " past_key_values = tuple(self.request.get_tensor(key).data for key in self.key_value_output_names)\n", " # Tuple of tuple of length `n_layers`, with each tuple of length equal to 2 (k/v of self-attention)\n", "\n", " past_key_values = tuple(past_key_values[i : i + self.num_pkv] for i in range(0, len(past_key_values), self.num_pkv))\n", " return CausalLMOutputWithPast(logits=logits, past_key_values=past_key_values)\n", "\n", " def forward_with_image(self, input_ids, images, attention_mask):\n", " \"\"\"First step inference method that resolves multimodal data\"\"\"\n", " _, _, attention_mask, _, input_embeds, _ = preprocess_fn(\n", " input_ids=input_ids,\n", " position_ids=None,\n", " attention_mask=attention_mask,\n", " past_key_values=None,\n", " labels=None,\n", " images=images,\n", " )\n", " outs = self.model_input_embed({\"inputs_embeds\": input_embeds, \"attention_mask\": attention_mask})\n", " logits = outs[0]\n", " pkv = list(outs.values())[1:]\n", " pkv = tuple(pkv[i : i + self.num_pkv] for i in range(0, len(pkv), self.num_pkv))\n", " return CausalLMOutputWithPast(logits=torch.from_numpy(logits), past_key_values=pkv)\n", "\n", " def prepare_inputs_for_generation(self, input_ids, past_key_values=None, **kwargs):\n", " \"\"\"\n", " This function is used while running GenerationMixin.generate to prepare model-specific inputs for\n", " each generation step\n", " \"\"\"\n", " past_len = 0\n", " if past_key_values is not None:\n", " input_ids = input_ids[:, -1].unsqueeze(-1)\n", " past_len = past_key_values[-1][-1].shape[-2]\n", " attention_mask = kwargs.get(\n", " \"attention_mask\",\n", " torch.ones(input_ids.shape[0], input_ids.shape[1] + past_len),\n", " )\n", " if not kwargs.get(\"use_cache\", True):\n", " raise NotImplementedError(\"use_cache=False is not supported.\")\n", " else:\n", " prefix_mask = None\n", " return {\n", " \"input_ids\": input_ids,\n", " \"attention_mask\": attention_mask,\n", " \"prefix_mask\": prefix_mask,\n", " \"past_key_values\": past_key_values,\n", " \"images\": kwargs.get(\"images\", None),\n", " }\n", "\n", " def _reorder_cache(self, past_key_values: Tuple[Tuple[torch.Tensor]], beam_idx: torch.Tensor) -> Tuple[Tuple[torch.Tensor]]:\n", " \"\"\"\n", " This function is used to re-order the `past_key_values` cache if [`~PreTrainedModel.beam_search`] or\n", " [`~PreTrainedModel.beam_sample`] is called.\n", " This is required to match `past_key_values` with the correct beam_idx at every generation step.\n", " \"\"\"\n", "\n", " # from transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel._reorder_cache\n", " return tuple(tuple(np.take(past_state, beam_idx, 0) for past_state in layer_past) for layer_past in past_key_values)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "0a55e773-4a15-4497-afbe-b56fa22f7ee3", "metadata": {}, "source": [ "## Run model inference\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "Now that we have the model and defined the generation pipeline, we can run model inference.\n", "\n", "### Select inference device\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "Select the device from the dropdown list for running inference using OpenVINO.\n", "\n", ">**Note**: There is no speedup for INT4 compressed models on dGPU."
] }, { "cell_type": "code", "execution_count": 8, "id": "c8497081-24e6-49b8-83d9-f4aced2d690a", "metadata": { "ExecuteTime": { "end_time": "2023-11-03T14:29:15.379698600Z", "start_time": "2023-11-03T14:29:15.373212200Z" } }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "37d4c7744020409bbc9fb97c85a47bdd", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import ipywidgets as widgets\n", "\n", "core = ov.Core()\n", "\n", "device = widgets.Dropdown(\n", " options=core.available_devices + [\"AUTO\"],\n", " value=\"AUTO\",\n", " description=\"Device:\",\n", " disabled=False,\n", ")\n", "\n", "device" ] }, { "attachments": {}, "cell_type": "markdown", "id": "75301dbc-d10a-413d-89f3-3a2ed065bf96", "metadata": {}, "source": [ "### Load OpenVINO model\n", "[back to top ⬆️](#Table-of-contents:)\n" ] }, { "cell_type": "code", "execution_count": 9, "id": "10e5dcb5-ae8a-4ae6-95ca-8bb2ad976a32", "metadata": { "ExecuteTime": { "end_time": "2023-11-03T14:29:16.852049700Z", "start_time": "2023-11-03T14:29:15.382225100Z" } }, "outputs": [], "source": [ "ov_model = OVLlavaLlamaForCausalLM(core, compressed_model_dir, device.value)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "035e62ea-ea37-43b1-b47b-63a12ff4ca51", "metadata": {}, "source": [ "### Prepare input data\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "For preparing input data, we will use tokenizer and image processor defined in the begging of our tutorial. For alignment with original PyTorch implementation we will use PyTorch tensors as input." ] }, { "cell_type": "code", "execution_count": 10, "id": "36a31244-384e-4ee9-8d60-6fdd281eb7e0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Question: Are the instruments in the pictures used in the video?\n" ] }, { "data": { "text/html": [ "" ], "text/plain": [ "