wjpoom committed · Commit 122b84c · verified · 1 Parent(s): 0d1e383

Update README.md

Files changed (1): README.md (+165, -169)
README.md CHANGED
@@ -1,167 +1,166 @@
----
-license: llama2
-datasets:
-- Inst-IT/Inst-IT-Dataset
-- lmms-lab/LLaVA-NeXT-Data
-language:
-- en
-metrics:
-- accuracy
-base_model:
-- liuhaotian/llava-v1.6-vicuna-7b
-pipeline_tag: video-text-to-text
-tags:
-- multimodal
-- fine-grained
-- instance-understanding
-model-index:
-- name: LLaVA-Next-Inst-It-Vicuna-7B
-  results:
-  - task:
-      type: multimodal
-    dataset:
-      name: Inst-IT-Bench-I-OE
-      type: Open-Ended
-    metrics:
-    - type: accuracy
-      value: 68.6
-      name: accuracy
-      verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: Inst-IT-Bench-I-MC
-      type: Multi-Choice
-    metrics:
-    - type: accuracy
-      value: 63.0
-      name: accuracy
-      verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: AI2D
-      type: ai2d
-    metrics:
-    - type: accuracy
-      value: 71.0
-      name: accuracy
-      verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: MMMU
-      type: mmmu
-    metrics:
-    - type: accuracy
-      value: 37.4
-      name: accuracy
-      verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: POPE
-      type: pope
-    metrics:
-    - type: accuracy
-      value: 87.2
-      name: accuracy
-      verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: GQA
-      type: gqa
-    metrics:
-    - type: accuracy
-      value: 65.9
-      name: accuracy
-      verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: MM-Vet
-      type: mm-vet
-    metrics:
-    - type: accuracy
-      value: 38.1
-      name: accuracy
-      verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: Inst-IT-Bench-V-OE
-      type: Open-Ended
-    metrics:
-    - type: accuracy
-      value: 49.3
-      name: accuracy
-      verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: Inst-IT-Bench-V-MC
-      type: Multi-Choice
-    metrics:
-    - type: accuracy
-      value: 42.1
-      name: accuracy
-      verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: ActNet-QA
-      type: actnet-qa
-    metrics:
-    - type: accuracy
-      value: 53.7
-      name: accuracy
-      verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: EgoSchema
-      type: egoschema
-    metrics:
-    - type: accuracy
-      value: 57.8
-      name: accuracy
-      verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: NextQA
-      type: nextqa
-    metrics:
-    - type: accuracy
-      value: 70.2
-      name: accuracy
-      verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: VideoMME
-      type: videomme
-    metrics:
-    - type: accuracy
-      value: 44.3
-      name: accuracy
-      verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: TempoCompass
-      type: tempocompass
-    metrics:
-    - type: accuracy
-      value: 59.8
-      name: accuracy
-      verified: true
-
----
+---
+license: apache-2.0
+datasets:
+- Inst-IT/Inst-IT-Dataset
+- lmms-lab/LLaVA-NeXT-Data
+language:
+- en
+metrics:
+- accuracy
+base_model:
+- liuhaotian/llava-v1.6-vicuna-7b
+pipeline_tag: video-text-to-text
+tags:
+- multimodal
+- fine-grained
+- instance-understanding
+model-index:
+- name: LLaVA-Next-Inst-It-Vicuna-7B
+  results:
+  - task:
+      type: multimodal
+    dataset:
+      name: Inst-IT-Bench-I-OE
+      type: Open-Ended
+    metrics:
+    - type: accuracy
+      value: 68.6
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: Inst-IT-Bench-I-MC
+      type: Multi-Choice
+    metrics:
+    - type: accuracy
+      value: 63
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: AI2D
+      type: ai2d
+    metrics:
+    - type: accuracy
+      value: 71
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: MMMU
+      type: mmmu
+    metrics:
+    - type: accuracy
+      value: 37.4
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: POPE
+      type: pope
+    metrics:
+    - type: accuracy
+      value: 87.2
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: GQA
+      type: gqa
+    metrics:
+    - type: accuracy
+      value: 65.9
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: MM-Vet
+      type: mm-vet
+    metrics:
+    - type: accuracy
+      value: 38.1
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: Inst-IT-Bench-V-OE
+      type: Open-Ended
+    metrics:
+    - type: accuracy
+      value: 49.3
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: Inst-IT-Bench-V-MC
+      type: Multi-Choice
+    metrics:
+    - type: accuracy
+      value: 42.1
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: ActNet-QA
+      type: actnet-qa
+    metrics:
+    - type: accuracy
+      value: 53.7
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: EgoSchema
+      type: egoschema
+    metrics:
+    - type: accuracy
+      value: 57.8
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: NextQA
+      type: nextqa
+    metrics:
+    - type: accuracy
+      value: 70.2
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: VideoMME
+      type: videomme
+    metrics:
+    - type: accuracy
+      value: 44.3
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: TempoCompass
+      type: tempocompass
+    metrics:
+    - type: accuracy
+      value: 59.8
+      name: accuracy
+      verified: true
+---
 
 # LLaVA-Next-Inst-It-Vicuna-7B
-[**🌐 Homepage**](https://inst-it.github.io/) | [**Code**](https://github.com/inst-it/inst-it) | [**🤗 Paper**](https://huggingface.co/papers/2412.03565) | [**📖 arXiv**](https://arxiv.org/abs/2412.03565)
+[**Homepage**](https://inst-it.github.io/) | [**Code**](https://github.com/inst-it/inst-it) | [**Paper**](https://huggingface.co/papers/2412.03565) | [**arXiv**](https://arxiv.org/abs/2412.03565)
 
 LLaVA-Next-Inst-It-Vicuna-7B is a multimodal model that excels at instance-level understanding,
 which is introduced in the paper [Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning](https://huggingface.co/papers/2412.03565)
@@ -217,11 +216,10 @@ tokenizer, model, image_processor, max_length = load_pretrained_model(
 ```
 **Image Inference**
 
+Our model can perform inference on images without [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts, in this case, it can be used in the same way as its base mode [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT).
 <details>
 <summary>Inference without SoMs</summary>
 
-Our model can perform inference on images without [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts, in this case, it can be used in the same way as its base mode [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT).
-
 ```python
 import torch
 import requests
@@ -267,14 +265,12 @@ print(pred)
 ```
 </details>
 
-
-<details>
-<summary>Inference with SoMs</summary>
-
 Our model performs even better when [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts are provided.
 Compared to the previous inference code, the following code has no modifications except for the input image, which is visual prompted with Set-of-Marks.
 You can refer to [this link](https://github.com/microsoft/SoM) to learn how to generate SoMs for an image.
-
+<details>
+<summary>Inference with SoMs</summary>
+
 ```python
 import torch
 import requests
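
The second hunk above opens with the truncated call `tokenizer, model, image_processor, max_length = load_pretrained_model(`. For orientation only, the sketch below shows one plausible way that loading step is completed with the LLaVA-NeXT codebase the README builds on; the repository id and the keyword argument used here are assumptions, not text from this commit.

```python
# Hedged sketch (not part of the commit): completing the load_pretrained_model
# call whose opening line appears in the second diff hunk header above.
# Assumes the LLaVA-NeXT codebase (https://github.com/LLaVA-VL/LLaVA-NeXT) is
# installed and that the checkpoint is published under the assumed repo id below.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

model_path = "Inst-IT/LLaVA-Next-Inst-It-Vicuna-7B"  # assumed Hugging Face repo id
model_name = get_model_name_from_path(model_path)

# Returns the tokenizer, the multimodal model, its image processor, and the
# maximum context length, matching the tuple unpacked in the diff above.
tokenizer, model, image_processor, max_length = load_pretrained_model(
    model_path,       # path or hub id of the fine-tuned checkpoint
    None,             # model_base: None, since this is a full checkpoint rather than a LoRA delta
    model_name,       # lets the builder pick the matching model class
    device_map="auto",
)
```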