shubhrapandit committed · Commit a0de4a5 · verified · 1 Parent(s): 4d00594

Update README.md

Files changed (1): README.md +105 -1
README.md CHANGED
@@ -183,17 +183,121 @@ oneshot(
 
 ## Evaluation
 
-The model was evaluated on OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard), OpenLLM Leaderboard [V2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/) and on [HumanEval](https://github.com/neuralmagic/evalplus), using the following commands:
+The model was evaluated with [mistral-evals](https://github.com/neuralmagic/mistral-evals) for vision-related tasks and with [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness) for select text-based benchmarks, using the following commands:
 
 <details>
 <summary>Evaluation Commands</summary>
+
+### Vision Tasks
+- vqav2
+- docvqa
+- mathvista
+- mmmu
+- chartqa
+
+```
+vllm serve neuralmagic/Qwen2.5-VL-3B-Instruct-quantized.w4a16 --tensor_parallel_size 1 --max_model_len 25000 --trust_remote_code --max_num_seqs 8 --gpu_memory_utilization 0.9 --dtype float16 --limit_mm_per_prompt image=7
 
+python -m eval.run eval_vllm \
+  --model_name neuralmagic/Qwen2.5-VL-3B-Instruct-quantized.w4a16 \
+  --url http://0.0.0.0:8000 \
+  --output_dir ~/tmp \
+  --eval_name <vision_task_name>
 ```
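
For convenience, the five vision benchmarks can be scripted against the running server. The sketch below is a minimal wrapper, assuming the `eval.run` entry point from mistral-evals behaves exactly as in the command above; the per-task output directories are a hypothetical convention, not part of the original instructions.

```python
# Run every vision benchmark listed above against an already-running
# vLLM server. Assumes `python -m eval.run eval_vllm` works as shown.
import os
import subprocess

VISION_TASKS = ["vqav2", "docvqa", "mathvista", "mmmu", "chartqa"]

for task in VISION_TASKS:
    output_dir = os.path.expanduser(f"~/tmp/{task}")  # hypothetical layout
    subprocess.run(
        [
            "python", "-m", "eval.run", "eval_vllm",
            "--model_name", "neuralmagic/Qwen2.5-VL-3B-Instruct-quantized.w4a16",
            "--url", "http://0.0.0.0:8000",
            "--output_dir", output_dir,
            "--eval_name", task,
        ],
        check=True,  # stop on the first failing benchmark
    )
```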
+
+### Text-based Tasks
+#### MMLU
+
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="<model_name>",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=<n>,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
+  --tasks mmlu \
+  --num_fewshot 5 \
+  --batch_size auto \
+  --output_path output_dir
+
+```
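
As a concrete instantiation of the MMLU command, the sketch below fills the `<model_name>` and `<n>` placeholders with this model card's checkpoint and a single GPU; those two values are illustrative assumptions, while the remaining flags are exactly those documented above.

```python
# Invoke lm_eval on MMLU for this checkpoint (single GPU assumed).
import subprocess

model_args = ",".join([
    "pretrained=neuralmagic/Qwen2.5-VL-3B-Instruct-quantized.w4a16",
    "dtype=auto",
    "add_bos_token=True",
    "max_model_len=4096",
    "tensor_parallel_size=1",  # assumption: replace <n> with your GPU count
    "gpu_memory_utilization=0.8",
    "enable_chunked_prefill=True",
    "trust_remote_code=True",
])

subprocess.run(
    [
        "lm_eval",
        "--model", "vllm",
        "--model_args", model_args,
        "--tasks", "mmlu",
        "--num_fewshot", "5",
        "--batch_size", "auto",
        "--output_path", "output_dir",
    ],
    check=True,
)
```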
+
+#### MGSM
+
 ```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="<model_name>",dtype=auto,max_model_len=4096,max_gen_toks=2048,max_num_seqs=128,tensor_parallel_size=<n>,gpu_memory_utilization=0.9 \
+  --tasks mgsm_cot_native \
+  --num_fewshot 0 \
+  --batch_size auto \
+  --output_path output_dir
 
+```
 </details>
 
 ### Accuracy
+<table>
+<thead>
+<tr>
+<th>Category</th>
+<th>Metric</th>
+<th>Qwen/Qwen2.5-VL-3B-Instruct</th>
+<th>neuralmagic/Qwen2.5-VL-3B-Instruct-quantized.w4a16</th>
+<th>Recovery (%)</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td rowspan="6"><b>Vision</b></td>
+<td>MMMU (val, CoT)<br><i>explicit_prompt_relaxed_correctness</i></td>
+<td>44.56</td>
+<td>45.67</td>
+<td>102.49%</td>
+</tr>
+<tr>
+<td>VQAv2 (val)<br><i>vqa_match</i></td>
+<td>75.94</td>
+<td>75.55</td>
+<td>99.49%</td>
+</tr>
+<tr>
+<td>DocVQA (val)<br><i>anls</i></td>
+<td>92.53</td>
+<td>92.32</td>
+<td>99.77%</td>
+</tr>
+<tr>
+<td>ChartQA (test, CoT)<br><i>anywhere_in_answer_relaxed_correctness</i></td>
+<td>81.20</td>
+<td>78.80</td>
+<td>97.04%</td>
+</tr>
+<tr>
+<td>MathVista (testmini, CoT)<br><i>explicit_prompt_relaxed_correctness</i></td>
+<td>54.15</td>
+<td>53.85</td>
+<td>99.45%</td>
+</tr>
+<tr>
+<td><b>Average Score</b></td>
+<td><b>69.28</b></td>
+<td><b>69.24</b></td>
+<td><b>99.94%</b></td>
+</tr>
+<tr>
+<td rowspan="2"><b>Text</b></td>
+<td>MGSM (CoT)</td>
+<td>52.49</td>
+<td>50.42</td>
+<td>96.05%</td>
+</tr>
+<tr>
+<td>MMLU (5-shot)</td>
+<td>65.32</td>
+<td>64.83</td>
+<td>99.25%</td>
+</tr>
+</tbody>
+</table>
+
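
The Recovery column is the quantized model's score expressed as a percentage of the baseline model's score. A minimal check of the vision rows, using only the values reported in the table above:

```python
# Recovery (%) = 100 * quantized_score / baseline_score for each benchmark.
baseline = {"MMMU": 44.56, "VQAv2": 75.94, "DocVQA": 92.53,
            "ChartQA": 81.20, "MathVista": 54.15}
quantized = {"MMMU": 45.67, "VQAv2": 75.55, "DocVQA": 92.32,
             "ChartQA": 78.80, "MathVista": 53.85}

for task, base_score in baseline.items():
    recovery = 100 * quantized[task] / base_score
    print(f"{task}: {recovery:.2f}%")  # e.g. MMMU: 102.49%
```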
 
 ## Inference Performance