Update README.md

README.md CHANGED

@@ -3,15 +3,18 @@ library_name: transformers
 tags: []
 ---
 
-[📃Paper](https://arxiv.org/abs/2406.15252) | [🌐Website](https://tiger-ai-lab.github.io/VideoScore/) | [💻Github](https://github.com/TIGER-AI-Lab/VideoScore) | [🛢️Datasets](https://huggingface.co/datasets/TIGER-Lab/VideoFeedback) | [🤗Model](https://huggingface.co/TIGER-Lab/VideoScore) | [🤗Demo](https://huggingface.co/spaces/TIGER-Lab/VideoScore)
+[📃Paper](https://arxiv.org/abs/2406.15252) | [🌐Website](https://tiger-ai-lab.github.io/VideoScore/) | [💻Github](https://github.com/TIGER-AI-Lab/VideoScore) | [🛢️Datasets](https://huggingface.co/datasets/TIGER-Lab/VideoFeedback) | [🤗Model (VideoScore)](https://huggingface.co/TIGER-Lab/VideoScore) | [🤗Demo](https://huggingface.co/spaces/TIGER-Lab/VideoScore)
 
 
 
 
 ## Introduction
-- VideoScore is a
+- 🧐🧐[VideoScore-Qwen2-VL](https://huggingface.co/TIGER-Lab/VideoScore-Qwen2-VL) is a variant of [VideoScore](https://huggingface.co/TIGER-Lab/VideoScore),
+taking [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) as the base model and trained on the [VideoFeedback](https://huggingface.co/datasets/TIGER-Lab/VideoFeedback) dataset.
+
+- The [VideoScore](https://huggingface.co/TIGER-Lab/VideoScore) series is a family of video quality evaluation models, taking [Mantis-8B-Idefics2](https://huggingface.co/TIGER-Lab/Mantis-8B-Idefics2) or [Qwen/Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) as the base model
 and trained on [VideoFeedback](https://huggingface.co/datasets/TIGER-Lab/VideoFeedback),
-a large video evaluation dataset with multi-aspect human scores.
+a large video evaluation dataset with multi-aspect human scores.
 
 - VideoScore can reach 75+ Spearman correlation with humans on VideoEval-test, surpassing all the MLLM-prompting methods and feature-based metrics.
 
@@ -21,19 +24,31 @@ a large video evaluation dataset with multi-aspect human scores.
 
 ## Evaluation Results
 
-We test
-For the first two benchmarks, we take Spearman corrleation between model's output and human ratings
+We test VideoScore-Qwen2-VL on VideoFeedback-test and take the Spearman correlation between the model's output and human ratings
 averaged among all the evaluation aspects as indicator.
-For GenAI-Bench and VBench, which include human preference data among two or more videos,
-we employ the model's output to predict preferences and use pairwise accuracy as the performance indicator.
-
-- We use [VideoScore](https://huggingface.co/TIGER-Lab/VideoScore) trained on the entire VideoFeedback dataset
-for VideoFeedback-test set, while for other three benchmarks.
 
+The evaluation results are shown below:
+
+| metric | VideoFeedback-test |
+|:-----------------:|:------------------:|
+| VideoScore-Qwen2-VL | **74.9** |
+| Gemini-1.5-Pro | 22.1 |
+| Gemini-1.5-Flash | 20.8 |
+| GPT-4o | <u>23.1</u> |
+| CLIP-sim | 8.9 |
+| DINO-sim | 7.5 |
+| SSIM-sim | 13.4 |
+| CLIP-Score | -7.2 |
+| LLaVA-1.5-7B | 8.5 |
+| LLaVA-1.6-7B | -3.1 |
+| X-CLIP-Score | -1.9 |
+| PIQE | -10.1 |
+| BRISQUE | -20.3 |
+| Idefics2 | 6.5 |
+| MSE-dyn | -5.5 |
+| SSIM-dyn | -12.9 |
+
+The best result in the VideoScore series is in bold and the best among the baselines is underlined.
 
 ## Usage
 ### Installation
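The VideoFeedback-test figures in the table above are Spearman correlations (reported on a 0–100 scale) between the model's predicted scores and human ratings, averaged over the evaluation aspects. As a rough illustration of that metric only, here is a minimal sketch with made-up variable names; it is not the repository's evaluation script and assumes `scipy` is installed.

```python
# Illustrative sketch of the reported metric: per-aspect Spearman correlation
# between predicted and human scores, averaged over aspects. Variable names
# and the dummy data below are hypothetical, not from the VideoScore repo.
from scipy.stats import spearmanr

aspects = ["visual quality", "temporal consistency", "dynamic degree",
           "text-to-video alignment", "factual consistency"]

def avg_spearman(preds: dict, humans: dict) -> float:
    rhos = []
    for a in aspects:
        rho, _ = spearmanr(preds[a], humans[a])  # correlation over all test videos
        rhos.append(rho)
    return sum(rhos) / len(rhos)

# Dummy scores for three videos, just to make the snippet runnable.
preds = {a: [2.5, 1.0, 3.0] for a in aspects}
humans = {a: [3.0, 1.0, 2.0] for a in aspects}
print(round(100 * avg_spearman(preds, humans), 1))  # scaled by 100, as in the table
```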
@@ -57,7 +72,6 @@ from mantis.models.qwen2_vl import Qwen2VLForSequenceClassification
 from transformers import Qwen2VLProcessor
 from qwen_vl_utils import process_vision_info
 
-MAX_NUM_FRAMES=16
 ROUND_DIGIT=3
 REGRESSION_QUERY_PROMPT = """
 Suppose you are an expert in judging and evaluating the quality of AI-generated videos,
@@ -81,9 +95,9 @@ factual consistency: 1.8
 
 For this video, the text prompt is "{text_prompt}",
 all the frames of video are as follows:
-"""
+"""
 
-model_name="
+model_name="TIGER-Lab/VideoScore-Qwen2-VL"
 video_path="video1.mp4"
 video_prompt="Near the Elephant Gate village, they approach the haunted house at night. Rajiv feels anxious, but Bhavesh encourages him. As they reach the house, a mysterious sound in the air adds to the suspense."
 
@@ -96,6 +110,10 @@ model = Qwen2VLForSequenceClassification.from_pretrained(
 processor = Qwen2VLProcessor.from_pretrained(model_name)
 
 # Messages containing an image list as a video and a text query
+response = ""
+label_names = ["visual quality", "temporal consistency", "dynamic degree", "text-to-video alignment", "factual consistency"]
+for i in range(len(label_names)):
+    response += f"The score for {label_names[i]} is {model.config.label_special_tokens[i]}. "
 messages = [
     {
         "role": "user",
@@ -107,12 +125,18 @@ messages = [
             },
             {"type": "text", "text": REGRESSION_QUERY_PROMPT.format(text_prompt=video_prompt)},
         ],
+    },
+    {
+        "role": "assistant",
+        "content": [
+            {"type": "text", "text": response},
+        ],
     }
 ]
 
 # Preparation for inference
 text = processor.apply_chat_template(
-    messages, tokenize=False, add_generation_prompt=
+    messages, tokenize=False, add_generation_prompt=False
 )
 image_inputs, video_inputs = process_vision_info(messages)
 inputs = processor(
@@ -123,7 +147,6 @@ inputs = processor(
     return_tensors="pt",
 )
 inputs = inputs.to("cuda")
-print(inputs['input_ids'].shape)
 
 # Inference
 with torch.no_grad():
@@ -140,8 +163,11 @@ print(aspect_scores)
 """
 model output on visual quality, temporal consistency, dynamic degree,
 text-to-video alignment, factual consistency, respectively
-
+VideoScore:
+[2.297, 2.469, 2.906, 2.766, 2.516]
 
+VideoScore-Qwen2-VL:
+[2.297, 2.531, 2.766, 2.312, 2.547]
 """
 ```
 
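The two score lists in the final hunk are the per-aspect outputs of VideoScore and VideoScore-Qwen2-VL, in the order listed above them. As a small illustrative aside (not part of the model card's code), the snippet below shows one way such a list could be paired with its aspect names, reusing the `label_names` and `ROUND_DIGIT` values from the usage example; the post-processing is an assumption for readability, not the library's documented API.

```python
# Illustrative only: pair the five per-aspect outputs with their names.
# `aspect_scores` here is a hard-coded copy of the example output above.
label_names = ["visual quality", "temporal consistency", "dynamic degree",
               "text-to-video alignment", "factual consistency"]
ROUND_DIGIT = 3

aspect_scores = [2.297, 2.531, 2.766, 2.312, 2.547]
named_scores = {name: round(score, ROUND_DIGIT)
                for name, score in zip(label_names, aspect_scores)}
print(named_scores)
# {'visual quality': 2.297, 'temporal consistency': 2.531, 'dynamic degree': 2.766,
#  'text-to-video alignment': 2.312, 'factual consistency': 2.547}
```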