---
datasets:
- lmms-lab/LLaVA-NeXT-Video-SFT-Data
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
tags:
- multimodal
model-index:
- name: LLaVA-NeXT-Video-72B-Qwen2
  results:
  - task:
      type: multimodal
    dataset:
      name: ActNet-QA
      type: actnet-qa
    metrics:
    - type: accuracy
      value: 63.4
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: EgoSchema
      type: egoschema
    metrics:
    - type: accuracy
      value: 65.6
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: MLVU
      type: mlvu
    metrics:
    - type: accuracy
      value: 74.4
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: MVBench
      type: mvbench
    metrics:
    - type: accuracy
      value: 64.1
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: NextQA
      type: nextqa
    metrics:
    - type: accuracy
      value: 85.4
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: PercepTest
      type: percepTest
    metrics:
    - type: accuracy
      value: 74.3
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: VideoChatGPT
      type: videochatgpt
    metrics:
    - type: score
      value: 3.62
      name: score
      verified: true
  - task:
      type: multimodal
    dataset:
      name: VideoDC
      type: videodc
    metrics:
    - type: score
      value: 3.73
      name: score
      verified: true
  - task:
      type: multimodal
    dataset:
      name: LongVideoBench
      type: longvideobench
    metrics:
    - type: accuracy
      value: 61.9
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: VideoMME
      type: videomme
    metrics:
    - type: accuracy
      value: 58.2
      name: accuracy
      verified: true
---

# LLaVA-NeXT-Video

## Table of Contents

1. [Model Summary](#model-summary)
2. [Use](#use)
3. [Limitations](#limitations)
4. [Training](#training)
5. [License](#license)
6. [Citation](#citation)

## Model Summary

The LLaVA-NeXT-Video models are 7B/72B-parameter models trained on [LLaVA-NeXT-Video-SFT](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Video-SFT-Data), built on the Qwen2 language model with a context window of 32K tokens.

- **Repository:** [LLaVA-VL/LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT?tab=readme-ov-file)
- **Paper:** [LLaVA-OneVision](https://arxiv.org/abs/2408.03326)
- **Point of Contact:** [Yuanhan Zhang](mailto:[email protected])
- **Languages:** English, Chinese

## Use

### Intended use

The model was trained on [LLaVA-NeXT-Video-SFT](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Video-SFT-Data) and can interact with images, multi-image inputs, and videos, with a particular focus on videos (an image-input sketch follows the video example below).

**Feel free to share your generations in the Community tab!**

### Generation

We provide a simple generation example below; for more details, please refer to the [GitHub repository](https://github.com/LLaVA-VL/LLaVA-NeXT).

```python
# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from PIL import Image
import requests
import copy
import torch
import warnings
from decord import VideoReader, cpu
import numpy as np

warnings.filterwarnings("ignore")


def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    # Return sampled frames, their timestamps as a string, and the video duration in seconds.
    if max_frames_num == 0:
        return np.zeros((1, 336, 336, 3)), "", 0
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    video_time = total_frame_num / vr.get_avg_fps()
    # Sample roughly `fps` frames per second of video.
    stride = round(vr.get_avg_fps() / fps)
    frame_idx = [i for i in range(0, total_frame_num, stride)]
    # If that yields too many frames (or force_sample is set), sample max_frames_num frames uniformly instead.
    if len(frame_idx) > max_frames_num or force_sample:
        uniform_sampled_frames = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int)
        frame_idx = uniform_sampled_frames.tolist()
    frame_time = ",".join([f"{i / vr.get_avg_fps():.2f}s" for i in frame_idx])
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames, frame_time, video_time


pretrained = "lmms-lab/LLaVA-NeXT-Video-72B-Qwen2"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # add any other llava_model_args here
model.eval()

video_path = "XXXX"  # replace with the path to your video
max_frames_num = 64
video, frame_time, video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().bfloat16()
conv_template = "qwen_1_5"  # make sure you use the correct chat template for different models
question = DEFAULT_IMAGE_TOKEN + "\nPlease describe this video in detail."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
cont = model.generate(
    input_ids,
    images=[video],
    modalities=["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)
```
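`load_video` also returns `frame_time` and `video_time`, which the snippet above does not use. As an optional, hypothetical extension (the model does not require it), you can continue from the snippet and fold this sampling context into the prompt:

```python
# Hypothetical extension of the snippet above: tell the model how the frames were sampled.
# The wording of the hint is illustrative only.
time_hint = (
    f"The video lasts {video_time:.2f} seconds; "
    f"{max_frames_num} frames were uniformly sampled at {frame_time}."
)
question = DEFAULT_IMAGE_TOKEN + f"\n{time_hint}\nPlease describe this video in detail."
# Rebuild the conversation and rerun generate() exactly as before with this richer question.
```

Note that in bfloat16 the 72B weights alone occupy roughly 145 GB, so `device_map="auto"` will shard the model across several GPUs.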
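The same checkpoint also accepts single images. The sketch below mirrors the image path used elsewhere in the LLaVA-NeXT codebase and reuses the model, tokenizer, and image processor loaded in the video example above; the URL is a placeholder, and the exact preprocessing and `generate` arguments should be checked against the repository.

```python
# Hypothetical single-image sketch, reusing the imports and objects from the video example above.
url = "https://example.com/sample.jpg"  # placeholder image URL
image = Image.open(requests.get(url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [t.to(dtype=torch.bfloat16, device=device) for t in image_tensor]

question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)

cont = model.generate(
    input_ids,
    images=image_tensor,
    image_sizes=[image.size],
    do_sample=False,
    temperature=0,
    max_new_tokens=512,
)
print(tokenizer.batch_decode(cont, skip_special_tokens=True))
```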
## Training

### Model

- **Architecture:** SO400M + Qwen2
- **Initialized from:** lmms-lab/llava-onevision-qwen2-72b-si
- **Data:** a mixture of 1.6M single-image, multi-image, and video samples; 1 epoch; full model trained
- **Precision:** bfloat16

### Hardware & Software

- **GPUs:** 256 × NVIDIA A100 (for training the whole model series)
- **Orchestration:** [Hugging Face Trainer](https://huggingface.co/docs/transformers/main_classes/trainer)
- **Neural networks:** [PyTorch](https://github.com/pytorch/pytorch)

## Citation