{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "1682fb28-95ae-4966-834c-84304531006f",
   "metadata": {},
   "source": [
    "# Visual-language assistant with Video-LLaVA and OpenVINO\n",
    "\n",
    "Video-LLaVA (Learning United Visual Representation by Alignment Before Projection, [paper](https://arxiv.org/pdf/2311.10122.pdf)) is a Large Vision-Language Model (LVLM) that understands both images and videos through a single, unified visual representation. While LLaVA excels at image-based tasks, Video-LLaVA extends these capabilities to videos, enabling comprehension and reasoning across both visual domains. This means it can answer questions, generate text, and perform other tasks whether it is presented with a still image or a video.\n",
    "\n",
    "In this tutorial, we consider how to use the Video-LLaVA model to build a multimodal chatbot. For demonstration purposes, we will use the [Video-LLaVA-7B](https://huggingface.co/LanguageBind/Video-LLaVA-7B) model for conversion.\n",
    "\n",
    "The tutorial consists of the following steps:\n",
    "\n",
    "- Install prerequisites\n",
    "- Prepare input processor and tokenizer\n",
    "- Download original model\n",
    "- Compress model weights to 4 and 8 bits using NNCF\n",
    "- Convert model to OpenVINO Intermediate Representation (IR) format\n",
    "- Prepare OpenVINO-based inference pipeline\n",
    "- Run OpenVINO model\n",
    "\n",
    "\n",
    "#### Table of contents:\n",
    "\n",
    "- [About model](#About-model)\n",
    "- [Prerequisites](#Prerequisites)\n",
    "- [Build model and convert it to OpenVINO IR format](#Build-model-and-convert-it-to-OpenVINO-IR-format)\n",
    "    - [Prepare helpers for model conversion](#Prepare-helpers-for-model-conversion)\n",
    "    - [Convert and Optimize Model](#Convert-and-Optimize-Model)\n",
    "        - [Instantiate PyTorch model](#Instantiate-PyTorch-model)\n",
    "        - [Compress Model weights to 4 and 8 bits using NNCF](#Compress-Model-weights-to-4-and-8-bits-using-NNCF)\n",
    "        - [Convert model to OpenVINO IR format](#Convert-model-to-OpenVINO-IR-format)\n",
    "- [Prepare OpenVINO based inference pipeline](#Prepare-OpenVINO-based-inference-pipeline)\n",
    "- [Run model inference](#Run-model-inference)\n",
    "    - [Select inference device](#Select-inference-device)\n",
    "    - [Load OpenVINO model](#Load-OpenVINO-model)\n",
    "    - [Prepare input data](#Prepare-input-data)\n",
    "    - [Test model inference](#Test-model-inference)\n",
    "- [Interactive demo](#Interactive-demo)\n",
    "\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "b802ee93-aae9-45e8-839b-eb0beeb5f15b",
   "metadata": {},
   "source": [
    "## About model\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "Video-LLaVA connects pre-trained [CLIP ViT-L/14](https://openai.com/research/clip) visual encoders to a large language model using a simple projection matrix.\n",
    "\n",
    "![](https://github.com/itrushkin/openvino_notebooks/assets/76161256/193f6bc4-b3c5-4508-8fe5-c5e5036aab12)\n",
    "\n",
    "\n",
    "\n",
    "More details about the model can be found in the original [paper](https://arxiv.org/pdf/2311.10122.pdf) and [repo](https://github.com/PKU-YuanGroup/Video-LLaVA)."
] }, { "attachments": {}, "cell_type": "markdown", "id": "5ae5da98-0d3d-424b-afe2-0907fdb849da", "metadata": {}, "source": [ "## Prerequisites\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "Install required dependencies" ] }, { "cell_type": "code", "execution_count": 1, "id": "1917249a-e452-46c5-ba03-8c417e7bace4", "metadata": { "ExecuteTime": { "end_time": "2023-11-03T14:15:21.466964700Z", "start_time": "2023-11-03T14:15:21.231032100Z" } }, "outputs": [], "source": [ "%pip install -q torch \"torchvision<0.17.0\" \"transformers>=4.31.0,<4.35.0\" \"pytorchvideo\" \"einops\" \"peft==0.6.2\" --extra-index-url https://download.pytorch.org/whl/cpu\n", "%pip install -q opencv_python decord sentencepiece protobuf \"openvino>=2023.2.0\" \"nncf>=2.7.0\" \"gradio>=4.19\"" ] }, { "cell_type": "code", "execution_count": 2, "id": "5b8128a3-bc08-43de-804c-7242c5fda869", "metadata": { "ExecuteTime": { "end_time": "2023-11-03T14:15:21.466964700Z", "start_time": "2023-11-03T14:15:21.231032100Z" } }, "outputs": [], "source": [ "from pathlib import Path\n", "import sys\n", "\n", "repo_dir = Path(\"Video-LLaVA\")\n", "\n", "if not repo_dir.exists():\n", " !git clone https://github.com/PKU-YuanGroup/Video-LLaVA.git\n", "\n", "sys.path.insert(0, str(repo_dir.resolve()))" ] }, { "attachments": {}, "cell_type": "markdown", "id": "460904f4-c902-40b7-8b4c-244d86d0a670", "metadata": {}, "source": [ "
>**Warning**: this tutorial requires the `ffmpeg` package. To install it for your system, visit the [official FFmpeg download page](https://ffmpeg.org/download.html).\n", "
" ] }, { "cell_type": "code", "execution_count": 3, "id": "d1e7b52f-bcfd-4180-82cf-6beb125bdb5c", "metadata": { "ExecuteTime": { "end_time": "2023-11-03T14:15:25.083172Z", "start_time": "2023-11-03T14:15:21.231032100Z" }, "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/itrushkin/.virtualenvs/videollava/lib/python3.10/site-packages/torch/cuda/__init__.py:611: UserWarning: Can't initialize NVML\n", " warnings.warn(\"Can't initialize NVML\")\n", "/home/itrushkin/.virtualenvs/videollava/lib/python3.10/site-packages/torch/cuda/__init__.py:740: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)\n", " return torch._C._cuda_getDeviceCount() if nvml_count < 0 else nvml_count\n", "/home/itrushkin/.virtualenvs/videollava/lib/python3.10/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.\n", " warn(\"The installed version of bitsandbytes was compiled without GPU support. 
\"\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "/home/itrushkin/.virtualenvs/videollava/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/itrushkin/.virtualenvs/videollava/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: /home/itrushkin/.virtualenvs/videollava/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator12recordStreamERKNS_7DataPtrENS0_10CUDAStreamE\n", " warn(f\"Failed to load image Python extension: {e}\")\n", "/home/itrushkin/.virtualenvs/videollava/lib/python3.10/site-packages/torchvision/transforms/_functional_video.py:6: UserWarning: The 'torchvision.transforms._functional_video' module is deprecated since 0.12 and will be removed in 0.14. Please use the 'torchvision.transforms.functional' module instead.\n", " warnings.warn(\n", "/home/itrushkin/.virtualenvs/videollava/lib/python3.10/site-packages/torchvision/transforms/_transforms_video.py:25: UserWarning: The 'torchvision.transforms._transforms_video' module is deprecated since 0.12 and will be removed in 0.14. Please use the 'torchvision.transforms' module instead.\n", " warnings.warn(\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "554596cb371042ebb9df938d5ab6f1b7", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Loading checkpoint shards: 0%| | 0/2 [00:00**Note**: There is no speedup for INT4 compressed models on dGPU.\n", "\n", "#### Convert model to OpenVINO IR format [$\\Uparrow$](#Table-of-content:)\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "Convert model to OpenVINO format using conversion helper function defined above.\n", "\n", "Please select below whether you would like to run INT4 weight compression instead of INT8 weight compression." 
] }, { "cell_type": "code", "execution_count": 5, "id": "aca80991", "metadata": { "ExecuteTime": { "end_time": "2023-11-03T14:37:00.167129100Z", "start_time": "2023-11-03T14:37:00.141353600Z" }, "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "a994439f981b4d78a0e6b9123d3994d1", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Dropdown(description='Compression mode:', options=('INT4', 'INT8'), value='INT4')" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import ipywidgets as widgets\n", "\n", "compression_mode = widgets.Dropdown(\n", " options=[\"INT4\", \"INT8\"],\n", " value=\"INT4\",\n", " description=\"Compression mode:\",\n", " disabled=False,\n", ")\n", "\n", "compression_mode" ] }, { "cell_type": "code", "execution_count": 6, "id": "001bab2b-b36b-4e95-b454-593dd71fb596", "metadata": { "ExecuteTime": { "end_time": "2023-11-03T14:44:15.693843800Z", "start_time": "2023-11-03T14:37:01.826679700Z" }, "scrolled": true }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "dce4b5bb4dcf4bb7898c269384fdf8b4", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Loading checkpoint shards: 0%| | 0/2 [00:00\n" ], "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n",
       "
\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:nncf:Statistics of the bitwidth distribution:\n", "+--------------+-----------------+--------------------+\n", "| Num bits (N) | % all weight | % internal weights |\n", "+==============+=================+====================+\n", "| 8 | 22% (58 / 225) | 20% (56 / 223) |\n", "+--------------+-----------------+--------------------+\n", "| 4 | 78% (167 / 225) | 80% (167 / 223) |\n", "+--------------+-----------------+--------------------+\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "3cc1be2cf3e746d5bb9fe24d533eed19", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Output()" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n"
      ],
      "text/plain": []
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "
\n",
       "
\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Applying weight compression to second stage Video-LLaVA model\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "85211010a53343a098445c97ea953848", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Output()" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n"
      ],
      "text/plain": []
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "
\n",
       "
\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:nncf:Statistics of the bitwidth distribution:\n", "+--------------+-----------------+--------------------+\n", "| Num bits (N) | % all weight | % internal weights |\n", "+==============+=================+====================+\n", "| 8 | 23% (58 / 226) | 20% (56 / 224) |\n", "+--------------+-----------------+--------------------+\n", "| 4 | 77% (168 / 226) | 80% (168 / 224) |\n", "+--------------+-----------------+--------------------+\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "066d12b12fba4b8c928cd40cb051a397", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Output()" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n"
      ],
      "text/plain": []
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "
\n",
       "
\n"
      ],
      "text/plain": [
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Video-LLaVA model successfully converted\n"
     ]
    }
   ],
   "source": [
    "# Pick the output directory and NNCF weight-compression settings for the selected mode\n",
    "if compression_mode.value == \"INT4\":\n",
    "    compressed_model_dir = Path(\"videollava/INT4_compressed_weights\")\n",
    "    videollava_wc_parameters = dict(mode=nncf.CompressWeightsMode.INT4_ASYM, group_size=128, ratio=0.8)\n",
    "else:\n",
    "    compressed_model_dir = Path(\"videollava/INT8_compressed_weights\")\n",
    "    videollava_wc_parameters = dict(mode=nncf.CompressWeightsMode.INT8)\n",
    "\n",
    "# Convert only if the compressed model has not been produced yet\n",
    "if not compressed_model_dir.exists():\n",
    "    compressed_model_dir.mkdir(exist_ok=True, parents=True)\n",
    "    model = LlavaLlamaForCausalLM.from_pretrained(model_id)\n",
    "    model.resize_token_embeddings(len(tokenizer))\n",
    "\n",
    "    if hasattr(config, \"max_sequence_length\"):\n",
    "        context_len = config.max_sequence_length\n",
    "    else:\n",
    "        context_len = 2048\n",
    "    # Make sure both visual towers are loaded before conversion\n",
    "    image_tower = model.get_image_tower()\n",
    "    if not image_tower.is_loaded:\n",
    "        image_tower.load_model()\n",
    "    video_tower = model.get_video_tower()\n",
    "    if not video_tower.is_loaded:\n",
    "        video_tower.load_model()\n",
    "\n",
    "    model.eval()\n",
    "    with torch.no_grad():\n",
    "        convert_videollava(\n",
    "            model,\n",
    "            compressed_model_dir,\n",
    "            videollava_wc_parameters=videollava_wc_parameters,\n",
    "        )\n",
    "    # Free the memory held by the PyTorch model\n",
    "    del model\n",
    "    gc.collect();"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "a1675134-3f4a-46c9-9e21-f162c155ebf5",
   "metadata": {},
   "source": [
    "## Prepare OpenVINO based inference pipeline\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "The `OVLlavaLlamaForCausalLM` class provides an easy-to-use interface for using the model in a generation scenario. It is based on `transformers.generation.GenerationMixin`, which lets us reuse the rich generation capabilities implemented in the Hugging Face Transformers library. 
More details about this interface can be found in the [Hugging Face documentation](https://huggingface.co/docs/transformers/main_classes/text_generation).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "99c35e39-2748-4500-b3c9-e793c36a8d0b",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-11-03T14:29:15.077528700Z",
     "start_time": "2023-11-03T14:29:15.033990300Z"
    }
   },
   "outputs": [],
   "source": [
    "from typing import Optional, Tuple\n",
    "\n",
    "import transformers\n",
    "from transformers.generation import GenerationConfig, GenerationMixin\n",
    "from transformers.modeling_outputs import CausalLMOutputWithPast\n",
    "import numpy as np\n",
    "import torch\n",
    "\n",
    "\n",
    "class OVLlavaLlamaForCausalLM(GenerationMixin):\n",
    "    def __init__(self, core, model_dir, device):\n",
    "        self.model = core.read_model(model_dir / \"videollava_with_past.xml\")\n",
    "        self.model_input_embed = core.compile_model(model_dir / \"videollava_input_embed.xml\", device)\n",
    "        self.input_names = {key.get_any_name(): idx for idx, key in enumerate(self.model.inputs)}\n",
    "        self.output_names = {key.get_any_name(): idx for idx, key in enumerate(self.model.outputs)}\n",
    "        self.key_value_input_names = [key for key in self.input_names if \"key_values\" in key]\n",
    "        self.key_value_output_names = [key for key in self.output_names if \"present\" in key]\n",
    "        compiled_model = core.compile_model(self.model, device)\n",
    "        self.request = compiled_model.create_infer_request()\n",
    "        self.config = transformers.AutoConfig.from_pretrained(model_dir)\n",
    "        # Build the generation config from the config loaded above\n",
    "        self.generation_config = GenerationConfig.from_model_config(self.config)\n",
    "        self.main_input_name = \"input_ids\"\n",
    "        self.device = torch.device(\"cpu\")\n",
    "        self.num_pkv = 2\n",
    "\n",
    "    def can_generate(self):\n",
    "        \"\"\"Returns True to validate the check that the model using `GenerationMixin.generate()` can indeed generate.\"\"\"\n",
    "        return True\n",
    "\n",
    "    def __call__(\n",
    "        self,\n",
    "        input_ids: torch.LongTensor,\n",
    "        images: torch.Tensor,\n",
    "        attention_mask: Optional[torch.LongTensor] = None,\n",
    "        prefix_mask: Optional[torch.LongTensor] = None,\n",
    "        past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,\n",
    "        **kwargs,\n",
    "    ) -> CausalLMOutputWithPast:\n",
    "        return self.forward(input_ids, images, attention_mask, prefix_mask, past_key_values)\n",
    "\n",
    "    def forward(\n",
    "        self,\n",
    "        input_ids: torch.LongTensor,\n",
    "        images: torch.Tensor,\n",
    "        attention_mask: Optional[torch.LongTensor] = None,\n",
    "        prefix_mask: Optional[torch.LongTensor] = None,\n",
    "        past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,\n",
    "        **kwargs,\n",
    "    ) -> CausalLMOutputWithPast:\n",
    "        \"\"\"General inference method\"\"\"\n",
    "        inputs = {}\n",
    "        if past_key_values is not None:\n",
    "            # The attention mask covers all cached positions plus the new token\n",
    "            attention_mask = torch.ones(\n",
    "                (input_ids.shape[0], past_key_values[-1][-1].shape[-2] + 1),\n",
    "                dtype=input_ids.dtype,\n",
    "            )\n",
    "            # Flatten the past_key_values\n",
    "            past_key_values = (past_key_value for pkv_per_layer in past_key_values for past_key_value in pkv_per_layer)\n",
    "            # Add the past_key_values to the decoder inputs\n",
    "            inputs = dict(zip(self.key_value_input_names, past_key_values))\n",
    "\n",
    "        else:\n",
    "            # First step: no cache yet, so the multimodal inputs must be resolved\n",
    "            return self.forward_with_image(input_ids, images, attention_mask)\n",
    "        inputs[\"input_ids\"] = np.array(input_ids)\n",
    "\n",
    "        if \"attention_mask\" in self.input_names:\n",
    "            inputs[\"attention_mask\"] = np.array(attention_mask)\n",
    "\n",
    "        # Run inference\n",
    "        self.request.start_async(inputs, share_inputs=True)\n",
    "        self.request.wait()\n",
    "\n",
    "        logits = torch.from_numpy(self.request.get_tensor(\"logits\").data)\n",
    "\n",
    "        # Flat tuple of length: number of layers * number of past key/values per decoder layer (2 for self-attention)\n",
    "        past_key_values = tuple(self.request.get_tensor(key).data for key in self.key_value_output_names)\n",
    "        # Regroup into a tuple of `n_layers` tuples, each of length 2 (k/v of self-attention)\n",
    "\n",
    "        past_key_values = tuple(past_key_values[i : i + self.num_pkv] for i in range(0, len(past_key_values), self.num_pkv))\n",
    "        return CausalLMOutputWithPast(logits=logits, past_key_values=past_key_values)\n",
    "\n",
    "    def forward_with_image(self, input_ids, images, attention_mask):\n",
    "        \"\"\"First step inference method, that resolves multimodal data\"\"\"\n",
    "        _, _, attention_mask, _, input_embeds, _ = preprocess_fn(\n",
    "            input_ids=input_ids,\n",
    "            position_ids=None,\n",
    "            attention_mask=attention_mask,\n",
    "            past_key_values=None,\n",
    "            labels=None,\n",
    "            images=images,\n",
    "        )\n",
    "        outs = self.model_input_embed({\"inputs_embeds\": input_embeds, \"attention_mask\": attention_mask})\n",
    "        logits = outs[0]\n",
    "        pkv = list(outs.values())[1:]\n",
    "        pkv = tuple(pkv[i : i + self.num_pkv] for i in range(0, len(pkv), self.num_pkv))\n",
    "        return CausalLMOutputWithPast(logits=torch.from_numpy(logits), past_key_values=pkv)\n",
    "\n",
    "    def prepare_inputs_for_generation(self, input_ids, past_key_values=None, **kwargs):\n",
    "        \"\"\"\n",
    "        This function is used during running GenerationMixin.generate for preparing model specific inputs for\n",
    "        each generation step\n",
    "        \"\"\"\n",
    "        past_len = 0\n",
    "        if past_key_values is not None:\n",
    "            # With a cache, only the last generated token is fed to the model\n",
    "            input_ids = input_ids[:, -1].unsqueeze(-1)\n",
    "            past_len = past_key_values[-1][-1].shape[-2]\n",
    "        attention_mask = kwargs.get(\n",
    "            \"attention_mask\",\n",
    "            torch.ones(input_ids.shape[0], input_ids.shape[1] + past_len),\n",
    "        )\n",
    "        if not kwargs.get(\"use_cache\", True):\n",
    "            raise NotImplementedError(\"This pipeline does not support use_cache=False.\")\n",
    "        prefix_mask = None\n",
    "        return {\n",
    "            \"input_ids\": input_ids,\n",
    "            \"attention_mask\": attention_mask,\n",
    "            \"prefix_mask\": prefix_mask,\n",
    "            \"past_key_values\": past_key_values,\n",
    "            \"images\": kwargs.get(\"images\", None),\n",
    "        }\n",
    "\n",
    "    def _reorder_cache(self, past_key_values: Tuple[Tuple[torch.Tensor]], beam_idx: torch.Tensor) -> Tuple[Tuple[torch.Tensor]]:\n",
    "        \"\"\"\n",
    "        This function is used to re-order the `past_key_values` cache if [`~PreTrainedModel.beam_search`] or\n",
    "        [`~PreTrainedModel.beam_sample`] is called.\n",
    "        This is required to match `past_key_values` with the correct beam_idx at every generation step.\n",
    "        \"\"\"\n",
    "\n",
    "        # from transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel._reorder_cache\n",
    "        return tuple(tuple(np.take(past_state, beam_idx, 0) for past_state in layer_past) for layer_past in past_key_values)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "0a55e773-4a15-4497-afbe-b56fa22f7ee3",
   "metadata": {},
   "source": [
    "## Run model inference\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "Now that we have the model and a defined generation pipeline, we can run model inference.\n",
    "\n",
    "### Select inference device\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "Select a device from the dropdown list for running inference using OpenVINO.\n",
    "\n",
    ">**Note**: There is no speedup for INT4 compressed models on dGPU."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "c8497081-24e6-49b8-83d9-f4aced2d690a",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-11-03T14:29:15.379698600Z",
     "start_time": "2023-11-03T14:29:15.373212200Z"
    }
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "37d4c7744020409bbc9fb97c85a47bdd",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import ipywidgets as widgets\n",
    "import openvino as ov\n",
    "\n",
    "core = ov.Core()\n",
    "\n",
    "device = widgets.Dropdown(\n",
    "    options=core.available_devices + [\"AUTO\"],\n",
    "    value=\"AUTO\",\n",
    "    description=\"Device:\",\n",
    "    disabled=False,\n",
    ")\n",
    "\n",
    "device"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "75301dbc-d10a-413d-89f3-3a2ed065bf96",
   "metadata": {},
   "source": [
    "### Load OpenVINO model\n",
    "[back to top ⬆️](#Table-of-contents:)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "10e5dcb5-ae8a-4ae6-95ca-8bb2ad976a32",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-11-03T14:29:16.852049700Z",
     "start_time": "2023-11-03T14:29:15.382225100Z"
    }
   },
   "outputs": [],
   "source": [
    "ov_model = OVLlavaLlamaForCausalLM(core, compressed_model_dir, device.value)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "035e62ea-ea37-43b1-b47b-63a12ff4ca51",
   "metadata": {},
   "source": [
    "### Prepare input data\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "To prepare input data, we will use the tokenizer and image processor defined at the beginning of the tutorial. For alignment with the original PyTorch implementation, we will use PyTorch tensors as input."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "36a31244-384e-4ee9-8d60-6fdd281eb7e0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Question: Are the instruments in the pictures used in the video?\n"
     ]
    },
    {
     "data": {
      "text/html": [
       ""
      ],
      "text/plain": [
       ""
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "image/png": "",
      "text/plain": [
       ""
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from IPython.display import display, Video, Image\n",
    "\n",
    "\n",
    "examples_dir = Path(\"Video-LLaVA/videollava/serve/examples\")\n",
    "video_file = examples_dir / \"sample_demo_22.mp4\"\n",
    "image_file = examples_dir / \"sample_img_22.png\"\n",
    "\n",
    "\n",
    "video_tensor = video_processor.preprocess(str(video_file), return_tensors=\"pt\")[\"pixel_values\"][0]\n",
    "image_tensor = image_processor.preprocess(str(image_file), return_tensors=\"pt\")[\"pixel_values\"][0]\n",
    "images_tensor = [video_tensor, image_tensor]\n",
    "\n",
    "text_message = \"Are the instruments in the pictures used in the video?\"\n",
    "print(f\"Question: {text_message}\")\n",
    "display(Video(video_file, embed=True))\n",
    "Image(image_file, embed=True)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "2030c809-e74c-45ad-bd5e-3f4177213a22",
   "metadata": {},
   "source": [
    "### Test model inference\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "Generating a long response may be time-consuming. To access partial results as soon as they are generated, without waiting for the whole process to finish, the Streaming API can be used. Token streaming is the mode in which the generative system returns tokens one by one as the model generates them. This lets you show the response to the user progressively rather than waiting for the whole generation. Streaming is an essential aspect of the end-user experience because it reduces latency, one of the most critical aspects of a smooth experience. You can find more details about how streaming works in the [Hugging Face documentation](https://huggingface.co/docs/text-generation-inference/conceptual/streaming).\n",
    "\n",
    "To simplify input preparation in conversational mode, we will also use the Conversation Template helper provided by the model authors, which accumulates the history of provided messages and images."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "b990ec8a-e69a-4b2f-a3b0-e8d4acd27af1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Answer:\n",
      "['video', 'image']\n",
      "Yes, the instruments in the pictures are used in the video. The man is playing a drum set, which includes a bass drum, snare drum, and cymbals. The cymbals are used to produce different sounds, such as crashes and hi-hats. 
The man is also seen playing a guitar, which is another instrument used in the video.\n"
     ]
    }
   ],
   "source": [
    "from videollava.mm_utils import tokenizer_image_token, KeywordsStoppingCriteria\n",
    "from videollava.constants import IMAGE_TOKEN_INDEX\n",
    "from transformers import TextStreamer\n",
    "from videollava.conversation import conv_templates, SeparatorStyle\n",
    "\n",
    "# Prepare the conversation template\n",
    "conv_mode = \"llava_v1\"\n",
    "\n",
    "conv = conv_templates[conv_mode].copy()\n",
    "\n",
    "# Prepend the visual placeholder tokens (8 per video) to the question\n",
    "if mm_use_im_start_end:\n",
    "    inp = DEFAULT_VIDEO_START_TOKEN + DEFAULT_IMAGE_TOKEN * 8 + DEFAULT_VIDEO_END_TOKEN + \"\\n\" + text_message\n",
    "else:\n",
    "    inp = DEFAULT_IMAGE_TOKEN * 8 + \"\\n\" + text_message\n",
    "conv.append_message(conv.roles[0], inp)\n",
    "conv.append_message(conv.roles[1], None)\n",
    "\n",
    "prompt = conv.get_prompt()\n",
    "input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors=\"pt\").unsqueeze(0)\n",
    "\n",
    "\n",
    "# Stop generation at the conversation separator\n",
    "stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2\n",
    "keywords = [stop_str]\n",
    "stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)\n",
    "# Print tokens as soon as they are generated\n",
    "streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)\n",
    "print(\"Answer:\")\n",
    "\n",
    "output_ids = ov_model.generate(\n",
    "    input_ids,\n",
    "    images=images_tensor,\n",
    "    do_sample=True,\n",
    "    temperature=0.2,\n",
    "    max_new_tokens=1024,\n",
    "    streamer=streamer,\n",
    "    use_cache=True,\n",
    "    stopping_criteria=[stopping_criteria],\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "bcdd3765-868b-4697-861c-34affc929016",
   "metadata": {},
   "source": [
    "## Interactive demo\n",
    "[back to top ⬆️](#Table-of-contents:)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "f1ce960e-8b89-44ab-a600-19d836f5dc3e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "import gradio as gr\n",
    "\n",
    "from videollava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX\n",
    "from videollava.conversation import conv_templates, SeparatorStyle\n",
    "from videollava.mm_utils import tokenizer_image_token, KeywordsStoppingCriteria\n",
    "\n",
    "\n",
    "def generate(image, video, textbox_in):\n",
    "    # Add the visual placeholder tokens that match the provided inputs\n",
    "    if video is not None:\n",
    "        textbox_in = DEFAULT_IMAGE_TOKEN * 8 + \"\\n\" + textbox_in\n",
    "        if image is not None:\n",
    "            textbox_in += \"\\n\" + DEFAULT_IMAGE_TOKEN\n",
    "    elif image is not None:\n",
    "        textbox_in = DEFAULT_IMAGE_TOKEN + \"\\n\" + textbox_in\n",
    "\n",
    "    conv_mode = \"llava_v1\"\n",
    "    conv = conv_templates[conv_mode].copy()\n",
    "    conv.append_message(conv.roles[0], textbox_in)\n",
    "    conv.append_message(conv.roles[1], None)\n",
    "    prompt = conv.get_prompt()\n",
    "    images_tensor = []\n",
    "    if image is not None:\n",
    "        images_tensor.append(image_processor(image, return_tensors=\"pt\")[\"pixel_values\"][0])\n",
    "    if video is not None:\n",
    "        images_tensor.append(video_processor(video, return_tensors=\"pt\")[\"pixel_values\"][0])\n",
    "    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors=\"pt\").unsqueeze(0)\n",
    "\n",
    "    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2\n",
    "    keywords = [stop_str]\n",
    "    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)\n",
    "\n",
    "    generate_kwargs = dict(\n",
    "        input_ids=input_ids,\n",
    "        images=images_tensor,\n",
    "        max_new_tokens=1024,\n",
    "        temperature=0.2,\n",
    "        do_sample=True,\n",
    "        use_cache=True,\n",
    "        stopping_criteria=[stopping_criteria],\n",
    "    )\n",
    "\n",
    "    output_ids = ov_model.generate(**generate_kwargs)\n",
    "\n",
    "    # Decode only the newly generated tokens and strip the stop string\n",
    "    input_token_len = input_ids.shape[1]\n",
    "    outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]\n",
    "    outputs = outputs.strip()\n",
    "    if outputs.endswith(stop_str):\n",
    "        outputs = outputs[: -len(stop_str)]\n",
    "    outputs = outputs.strip()\n",
    "\n",
    "    return outputs\n",
    "\n",
    "\n",
    "demo = gr.Interface(\n",
    "    generate,\n",
    "    [\n",
    "        gr.Image(label=\"Input Image\", type=\"filepath\"),\n",
    "        gr.Video(label=\"Input Video\"),\n",
    "        gr.Textbox(label=\"Question\"),\n",
    "    ],\n",
    "    gr.Textbox(lines=10),\n",
    "    examples=[\n",
    "        [\n",
    "            f\"{examples_dir}/extreme_ironing.jpg\",\n",
    "            None,\n",
    "            \"What is unusual about this image?\",\n",
    "        ],\n",
    "        [\n",
    "            f\"{examples_dir}/waterview.jpg\",\n",
    "            None,\n",
    "            \"What are the things I should be cautious about when I visit here?\",\n",
    "        ],\n",
    "        [\n",
    "            f\"{examples_dir}/desert.jpg\",\n",
    "            None,\n",
    "            \"If there are factual errors in the questions, point it out; if not, proceed answering the question. What’s happening in the desert?\",\n",
    "        ],\n",
    "        [\n",
    "            None,\n",
    "            f\"{examples_dir}/sample_demo_1.mp4\",\n",
    "            \"Why is this video funny?\",\n",
    "        ],\n",
    "        [\n",
    "            None,\n",
    "            f\"{examples_dir}/sample_demo_3.mp4\",\n",
    "            \"Can you identify any safety hazards in this video?\",\n",
    "        ],\n",
    "        [\n",
    "            None,\n",
    "            f\"{examples_dir}/sample_demo_9.mp4\",\n",
    "            \"Describe the video.\",\n",
    "        ],\n",
    "        [\n",
    "            None,\n",
    "            f\"{examples_dir}/sample_demo_22.mp4\",\n",
    "            \"Describe the activity in the video.\",\n",
    "        ],\n",
    "        [\n",
    "            f\"{examples_dir}/sample_img_22.png\",\n",
    "            f\"{examples_dir}/sample_demo_22.mp4\",\n",
    "            \"Are the instruments in the pictures used in the video?\",\n",
    "        ],\n",
    "        [\n",
    "            f\"{examples_dir}/sample_img_13.png\",\n",
    "            f\"{examples_dir}/sample_demo_13.mp4\",\n",
    "            \"Does the flag in the image appear in the video?\",\n",
    "        ],\n",
    "        [\n",
    "            f\"{examples_dir}/sample_img_8.png\",\n",
    "            f\"{examples_dir}/sample_demo_8.mp4\",\n",
    "            \"Are the image and the video depicting the same place?\",\n",
    "        ],\n",
    "    ],\n",
    "    title=\"Video-LLaVA🚀\",\n",
    "    allow_flagging=\"never\",\n",
    ")\n",
    "try:\n",
    "    demo.queue().launch(debug=True)\n",
    "except Exception:\n",
    "    demo.queue().launch(share=True, debug=True)\n",
    "# if you are launching remotely, specify server_name and server_port\n",
    "# demo.launch(server_name='your server name', server_port='server port in int')\n",
    "# Read more in the docs: https://gradio.app/docs/"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  },
  "openvino_notebooks": {
   "imageUrl": "https://camo.githubusercontent.com/ef232f43135222dc7cfc6e27ae26ac64edf6918512a8a4f78077e4f86c27883c/68747470733a2f2f7a312e617831782e636f6d2f323032332f31312f30372f70696c347371482e706e67",
   "tags": {
    "categories": [
     "Model Demos",
     "AI Trends"
    ],
    "libraries": [],
    "other": [],
    "tasks": [
     "Visual Question Answering",
     "Video-to-Text",
     "Text Generation"
    ]
   }
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}