wjpoom committed · Commit 9ba7a33 · verified · 1 Parent(s): 122b84c

Update README.md

Files changed (1): README.md (+167 -165)
README.md CHANGED
@@ -1,163 +1,163 @@
- ---
- license: apache-2.0
- datasets:
- - Inst-IT/Inst-IT-Dataset
- - lmms-lab/LLaVA-NeXT-Data
- language:
- - en
- metrics:
- - accuracy
- base_model:
- - liuhaotian/llava-v1.6-vicuna-7b
- pipeline_tag: video-text-to-text
- tags:
- - multimodal
- - fine-grained
- - instance-understanding
- model-index:
- - name: LLaVA-Next-Inst-It-Vicuna-7B
-   results:
-   - task:
-       type: multimodal
-     dataset:
-       name: Inst-IT-Bench-I-OE
-       type: Open-Ended
-     metrics:
-     - type: accuracy
-       value: 68.6
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: Inst-IT-Bench-I-MC
-       type: Multi-Choice
-     metrics:
-     - type: accuracy
-       value: 63
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: AI2D
-       type: ai2d
-     metrics:
-     - type: accuracy
-       value: 71
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: MMMU
-       type: mmmu
-     metrics:
-     - type: accuracy
-       value: 37.4
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: POPE
-       type: pope
-     metrics:
-     - type: accuracy
-       value: 87.2
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: GQA
-       type: gqa
-     metrics:
-     - type: accuracy
-       value: 65.9
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: MM-Vet
-       type: mm-vet
-     metrics:
-     - type: accuracy
-       value: 38.1
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: Inst-IT-Bench-V-OE
-       type: Open-Ended
-     metrics:
-     - type: accuracy
-       value: 49.3
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: Inst-IT-Bench-V-MC
-       type: Multi-Choice
-     metrics:
-     - type: accuracy
-       value: 42.1
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: ActNet-QA
-       type: actnet-qa
-     metrics:
-     - type: accuracy
-       value: 53.7
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: EgoSchema
-       type: egoschema
-     metrics:
-     - type: accuracy
-       value: 57.8
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: NextQA
-       type: nextqa
-     metrics:
-     - type: accuracy
-       value: 70.2
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: VideoMME
-       type: videomme
-     metrics:
-     - type: accuracy
-       value: 44.3
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: TempoCompass
-       type: tempocompass
-     metrics:
-     - type: accuracy
-       value: 59.8
-       name: accuracy
-       verified: true
- ---
+ ---
+ license: apache-2.0
+ datasets:
+ - Inst-IT/Inst-IT-Dataset
+ - lmms-lab/LLaVA-NeXT-Data
+ language:
+ - en
+ metrics:
+ - accuracy
+ base_model:
+ - liuhaotian/llava-v1.6-vicuna-7b
+ pipeline_tag: video-text-to-text
+ tags:
+ - multimodal
+ - fine-grained
+ - instance-understanding
+ model-index:
+ - name: LLaVA-Next-Inst-It-Vicuna-7B
+   results:
+   - task:
+       type: multimodal
+     dataset:
+       name: Inst-IT-Bench-I-OE
+       type: Open-Ended
+     metrics:
+     - type: accuracy
+       value: 68.6
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: Inst-IT-Bench-I-MC
+       type: Multi-Choice
+     metrics:
+     - type: accuracy
+       value: 63
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: AI2D
+       type: ai2d
+     metrics:
+     - type: accuracy
+       value: 71
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: MMMU
+       type: mmmu
+     metrics:
+     - type: accuracy
+       value: 37.4
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: POPE
+       type: pope
+     metrics:
+     - type: accuracy
+       value: 87.2
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: GQA
+       type: gqa
+     metrics:
+     - type: accuracy
+       value: 65.9
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: MM-Vet
+       type: mm-vet
+     metrics:
+     - type: accuracy
+       value: 38.1
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: Inst-IT-Bench-V-OE
+       type: Open-Ended
+     metrics:
+     - type: accuracy
+       value: 49.3
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: Inst-IT-Bench-V-MC
+       type: Multi-Choice
+     metrics:
+     - type: accuracy
+       value: 42.1
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: ActNet-QA
+       type: actnet-qa
+     metrics:
+     - type: accuracy
+       value: 53.7
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: EgoSchema
+       type: egoschema
+     metrics:
+     - type: accuracy
+       value: 57.8
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: NextQA
+       type: nextqa
+     metrics:
+     - type: accuracy
+       value: 70.2
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: VideoMME
+       type: videomme
+     metrics:
+     - type: accuracy
+       value: 44.3
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: TempoCompass
+       type: tempocompass
+     metrics:
+     - type: accuracy
+       value: 59.8
+       name: accuracy
+       verified: true
+ ---
  
  # LLaVA-Next-Inst-It-Vicuna-7B
  [**Homepage**](https://inst-it.github.io/) | [**Code**](https://github.com/inst-it/inst-it) | [**Paper**](https://huggingface.co/papers/2412.03565) | [**arXiv**](https://arxiv.org/abs/2412.03565)
@@ -225,7 +225,7 @@ import torch
  import requests
  from PIL import Image
  
- img_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
+ img_url = "https://github.com/inst-it/inst-it/blob/main/assets/demo/image.jpg?raw=true"
  image = Image.open(requests.get(img_url, stream=True).raw)
  image_tensor = process_images([image], image_processor, model.config).bfloat16()
  image_sizes = [image.size]
@@ -265,9 +265,10 @@ print(pred)
  ```
  </details>
  
- Our model performs even better when [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts are provided.
+ Our model achieves more fine-grained understanding when [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts are provided.
+ You can refer to the instances you are interested in by their IDs.
  Compared to the previous inference code, the following code is unchanged except for the input image, which is visually prompted with Set-of-Marks.
- You can refer to [this link](https://github.com/microsoft/SoM) to learn how to generate SoMs for an image.
+ Refer to [this link](https://github.com/microsoft/SoM) to learn how to generate SoMs for an image.
  <details>
  <summary>Inference with SoMs</summary>
  
@@ -276,12 +277,13 @@ import torch
  import requests
  from PIL import Image
  
- img_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
+ img_url = "https://github.com/inst-it/inst-it/blob/main/assets/demo/image_som.jpg?raw=true"
  image = Image.open(requests.get(img_url, stream=True).raw)
  image_tensor = process_images([image], image_processor, model.config).bfloat16()
  image_sizes = [image.size]
  
- question = "Describe this image."
+ # You can use [id] to refer to the instances that you are interested in
+ question = "Describe [8] in detail."
  question = DEFAULT_IMAGE_TOKEN + "\n" + question
  
  conv_template = 'vicuna_v1'
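
The hunks above show only the fragments of the README's inference examples that changed. For reference, here is a minimal end-to-end sketch of how the updated Set-of-Marks snippet could be run, assuming the LLaVA-NeXT-style code layout that the snippets appear to use (`llava.model.builder.load_pretrained_model`, `llava.mm_utils`, `llava.conversation`, `llava.constants`); the repo id, model-loading arguments, `max_new_tokens`, and exact module paths are assumptions rather than text taken from this commit.

```python
# Hedged sketch, not verbatim from the README: module paths, loading arguments,
# and generation settings are assumed from the standard LLaVA-NeXT codebase.
import torch
import requests
from PIL import Image

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.model.builder import load_pretrained_model

model_path = "Inst-IT/LLaVA-Next-Inst-It-Vicuna-7B"  # assumed Hugging Face repo id
tokenizer, model, image_processor, _ = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
    device_map="auto",
    torch_dtype="bfloat16",  # assumption: matches the .bfloat16() image tensors in the README
)

# The SoM demo image already has instance IDs drawn on it.
img_url = "https://github.com/inst-it/inst-it/blob/main/assets/demo/image_som.jpg?raw=true"
image = Image.open(requests.get(img_url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config).bfloat16()
image_sizes = [image.size]

# Refer to a marked instance by its ID, as in the updated README.
question = DEFAULT_IMAGE_TOKEN + "\n" + "Describe [8] in detail."

# Build the vicuna_v1 conversation prompt used by the README snippets.
conv = conv_templates["vicuna_v1"].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = (
    tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
    .unsqueeze(0)
    .to(model.device)
)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor.to(model.device),
        image_sizes=image_sizes,
        do_sample=False,
        max_new_tokens=512,  # assumed value
    )

pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
```

For the plain (non-SoM) example in the first hunk, only `img_url` (the `.../assets/demo/image.jpg?raw=true` file) and the question text differ; the rest of the flow is the same.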