shubhrapandit committed on
Commit ac8d74d · verified · 1 Parent(s): 66d9c70

Update README.md

Files changed (1): README.md (+107 -1)

README.md CHANGED
@@ -119,18 +119,124 @@ oneshot(

## Evaluation

The model was evaluated with [mistral-evals](https://github.com/neuralmagic/mistral-evals) on vision-related tasks and with [lm_evaluation_harness](https://github.com/neuralmagic/lm-evaluation-harness) on select text-based benchmarks, using the following commands:

<details>
<summary>Evaluation Commands</summary>

### Vision Tasks
- vqav2
- docvqa
- mathvista
- mmmu
- chartqa

```
vllm serve neuralmagic/Qwen2.5-VL-7B-Instruct-FP8-Dynamic --tensor_parallel_size 1 --max_model_len 25000 --trust_remote_code --max_num_seqs 8 --gpu_memory_utilization 0.9 --dtype float16 --limit_mm_per_prompt image=7

python -m eval.run eval_vllm \
  --model_name neuralmagic/Qwen2.5-VL-7B-Instruct-FP8-Dynamic \
  --url http://0.0.0.0:8000 \
  --output_dir ~/tmp \
  --eval_name <vision_task_name>
```
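
All five vision benchmarks can be run back to back against the same server. A minimal convenience loop (a sketch, not part of the original card), assuming the `vllm serve` process above is already running and that `eval.run` accepts the task names listed earlier:

```
# Sketch: evaluate each vision task in turn against the running server,
# writing results to a separate directory per benchmark.
for task in vqav2 docvqa mathvista mmmu chartqa; do
  python -m eval.run eval_vllm \
    --model_name neuralmagic/Qwen2.5-VL-7B-Instruct-FP8-Dynamic \
    --url http://0.0.0.0:8000 \
    --output_dir ~/tmp/$task \
    --eval_name $task
done
```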

### Text-based Tasks
#### MMLU

```
lm_eval \
  --model vllm \
  --model_args pretrained="<model_name>",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=<n>,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path output_dir
```

#### MGSM

```
lm_eval \
  --model vllm \
  --model_args pretrained="<model_name>",dtype=auto,max_model_len=4096,max_gen_toks=2048,max_num_seqs=128,tensor_parallel_size=<n>,gpu_memory_utilization=0.9 \
  --tasks mgsm_cot_native \
  --num_fewshot 0 \
  --batch_size auto \
  --output_path output_dir
```
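
In the two commands above, `<model_name>` is a Hugging Face model ID and `<n>` is the number of GPUs to shard across. As an illustrative instantiation only (not an additional benchmark), evaluating this card's checkpoint on MGSM with a single GPU would look like:

```
# Placeholders filled in for this checkpoint on one GPU (illustrative).
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Qwen2.5-VL-7B-Instruct-FP8-Dynamic",dtype=auto,max_model_len=4096,max_gen_toks=2048,max_num_seqs=128,tensor_parallel_size=1,gpu_memory_utilization=0.9 \
  --tasks mgsm_cot_native \
  --num_fewshot 0 \
  --batch_size auto \
  --output_path output_dir
```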
</details>

### Accuracy

<table>
  <thead>
    <tr>
      <th>Category</th>
      <th>Metric</th>
      <th>Qwen/Qwen2.5-VL-7B-Instruct</th>
      <th>neuralmagic/Qwen2.5-VL-7B-Instruct-FP8-Dynamic</th>
      <th>Recovery (%)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="6"><b>Vision</b></td>
      <td>MMMU (val, CoT)<br><i>explicit_prompt_relaxed_correctness</i></td>
      <td>52.00</td>
      <td>52.55</td>
      <td>101.06%</td>
    </tr>
    <tr>
      <td>VQAv2 (val)<br><i>vqa_match</i></td>
      <td>75.59</td>
      <td>75.79</td>
      <td>100.26%</td>
    </tr>
    <tr>
      <td>DocVQA (val)<br><i>anls</i></td>
      <td>94.27</td>
      <td>94.27</td>
      <td>100.00%</td>
    </tr>
    <tr>
      <td>ChartQA (test, CoT)<br><i>anywhere_in_answer_relaxed_correctness</i></td>
      <td>86.44</td>
      <td>86.80</td>
      <td>100.42%</td>
    </tr>
    <tr>
      <td>MathVista (testmini, CoT)<br><i>explicit_prompt_relaxed_correctness</i></td>
      <td>69.47</td>
      <td>71.07</td>
      <td>102.31%</td>
    </tr>
    <tr>
      <td><b>Average Score</b></td>
      <td><b>75.95</b></td>
      <td><b>76.50</b></td>
      <td><b>100.73%</b></td>
    </tr>
    <tr>
      <td rowspan="2"><b>Text</b></td>
      <td>MGSM (CoT)</td>
      <td>58.72</td>
      <td>55.34</td>
      <td>94.24%</td>
    </tr>
    <tr>
      <td>MMLU (5-shot)</td>
      <td>71.09</td>
      <td>70.98</td>
      <td>99.85%</td>
    </tr>
  </tbody>
</table>
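
Recovery is the quantized model's score expressed as a percentage of the unquantized baseline's score. A quick sanity check of the first row, using the values from the table above:

```
# Recovery (%) = 100 * quantized_score / baseline_score (MMMU row as example)
awk 'BEGIN { printf "%.2f%%\n", 100 * 52.55 / 52.00 }'   # -> 101.06%
```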

## Inference Performance