---

license: apache-2.0
tags:
- video LLM
---



# Tarsier2-Recap-7b Model Card
## Introduction
Tarsier2-Recap-7b is built upon [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) by distilling the video description capabilities of Tarsier2-7b. Specifically, we fine-tuned Qwen2-VL-7B-Instruct on [Tarsier2-Recap-585K](https://huggingface.co/datasets/omni-research/Tarsier2-Recap-585K) for 2 epochs with a learning rate of 2e-5. Tarsier2-Recap-7b has video captioning ability comparable to Tarsier2-7b, reaching an overall F1 score of 40.7% on [DREAM-1K](https://tarsier-vlm.github.io/), second only to Tarsier2-7b (42.0%) and ahead of GPT-4o (39.2%). See the [Tarsier2 technical report](https://arxiv.org/abs/2501.07888) for more details.

## Model details
- Base Model: [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)
- Training Data: [Tarsier2-Recap-585K](https://huggingface.co/datasets/omni-research/Tarsier2-Recap-585K)

**Model date:**
Tarsier2-Recap-7b was trained in December 2024.

**Paper or resources for more information:**
- github repo: https://github.com/bytedance/tarsier/tree/tarsier2
- paper link: https://arxiv.org/abs/2501.07888
- leaderboard: https://tarsier-vlm.github.io/

## License
Tarsier2-Recap-7b follows the license of [Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).

## Intended use
**Primary intended uses:**
The primary use of Tarsier is research on large multimodal models, especially video description.

**Primary intended users:**
The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

## Model Performance
### Video Description
We evaluate Tarsier2-Recap-7b on DREAM-1K, a detailed video description benchmark featuring dynamic and diverse videos, assessing the model’s ability to describe fine-grained actions and events. Here is the evaluation result:
![images](./assets/dream-1k_results.png)
_Note: The results of Tarsier2-Recap-7b differ from those reported in Table 11 of the [Tarsier2 technical report](https://arxiv.org/abs/2501.07888), as Tarsier2-Recap-7b is more fully trained here (2 epochs vs. 1 epoch)._

### Video Question-Answering
We evaluate Tarsier2-Recap-7b on [TVBench](https://paperswithcode.com/sota/video-question-answering-on-tvbench), a multiple-choice video question-answering benchmark that requires a high level of temporal understanding. As Tarsier2-Recap-7b is trained only on video caption data, it needs an additional prompt to induce it to perform multiple-choice question answering; see the [TVBench](https://github.com/bytedance/tarsier/blob/tarsier2/data/annotations/TVBench.jsonl) annotation samples for the exact format (a rough illustration also follows the table below). Here is the evaluation result:

  | Task    | Tarsier2-Recap-7b | Tarsier2-7b |
  | ------- | :--------: | :-------: |
  |    Action Antonym   |   91.2   |   94.1   |
  |    Action Count   |   43.1   |   40.5   |
  |    Action Localization   |   42.5   |   37.5   |
  |    Action Sequence   |   70.5   |   72.3   |
  |    Egocentric Sequence   |   22.0   |   24.5   |
  |    Moving Direction   |   37.1   |   33.2   |
  |    Object Count   |   46.6   |   62.8   |
  |    Object Shuffle   |   36.9  |   31.6   |
  |    Scene Transition   |   85.9   |   88.1   |
  |    Unexpected Action  |   28.0   |   41.5   |
  | OVERALL |   54.0   |   54.7   |
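The exact prompt used for multiple-choice evaluation is defined in the linked TVBench annotation file. As a rough, hypothetical illustration only (the placeholders below are not taken from TVBench), such a prompt appends the candidate options to the question and asks the model to answer with an option letter:

```
<question about the video>
(A) <option 1>
(B) <option 2>
(C) <option 3>
(D) <option 4>
Answer with the letter of the correct option.
```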


## How to Use
See the usage instructions in the official repo (make sure to use the `tarsier2` branch): https://github.com/bytedance/tarsier/tree/tarsier2?tab=readme-ov-file#usage
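
Since Tarsier2-Recap-7b is fine-tuned from Qwen2-VL-7B-Instruct, it is expected to load with the standard Qwen2-VL interface in `transformers`. The snippet below is a minimal sketch under that assumption; the model repo id and the video path are placeholders, and the usage code in the official `tarsier2` branch remains the authoritative reference.

```python
# Minimal sketch: assumes Tarsier2-Recap-7b loads via the standard Qwen2-VL
# interface in transformers (it is fine-tuned from Qwen2-VL-7B-Instruct).
# The repo id and video path are placeholders/assumptions; see the official
# tarsier2 branch for the authoritative usage code.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "omni-research/Tarsier2-Recap-7b"  # assumed repo id

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Ask for a detailed description of a local video file.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "path/to/your_video.mp4"},
            {"type": "text", "text": "Describe the video in detail."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
_, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=512)
# Strip the prompt tokens before decoding the generated description.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```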

**Where to send questions or comments about the model:**
https://github.com/bytedance/tarsier/issues