## Evaluation

The model was evaluated with [mistral-evals](https://github.com/neuralmagic/mistral-evals) for vision-related tasks and [lm_evaluation_harness](https://github.com/neuralmagic/lm-evaluation-harness) for select text-based benchmarks, using the commands below:

<details>
<summary>Evaluation Commands</summary>

### Vision Tasks
- vqav2
- docvqa
- mathvista
- mmmu
- chartqa

Start a vLLM server for the model, then run each task against it (a loop over all five tasks is sketched after this block):

```
vllm serve neuralmagic/Qwen2-VL-72B-Instruct-FP8-Dynamic --tensor_parallel_size <n> --max_model_len 25000 --trust_remote_code --max_num_seqs 8 --gpu_memory_utilization 0.9 --dtype float16 --limit_mm_per_prompt image=7

python -m eval.run eval_vllm \
  --model_name neuralmagic/Qwen2-VL-72B-Instruct-FP8-Dynamic \
  --url http://0.0.0.0:8000 \
  --output_dir ~/tmp \
  --eval_name <vision_task_name>
```
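
The five vision tasks can then be run in sequence against the same server. A minimal sketch, assuming the commands above; the readiness check against the server's OpenAI-compatible `/v1/models` endpoint is an illustrative convenience, not part of the published evaluation setup:

```
# Wait until the vLLM server is ready, then run every vision eval in turn.
until curl -sf http://0.0.0.0:8000/v1/models > /dev/null; do
  sleep 10  # the server answers only after the weights finish loading
done

for task in vqav2 docvqa mathvista mmmu chartqa; do
  python -m eval.run eval_vllm \
    --model_name neuralmagic/Qwen2-VL-72B-Instruct-FP8-Dynamic \
    --url http://0.0.0.0:8000 \
    --output_dir ~/tmp \
    --eval_name "$task"
done
```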

### Text-based Tasks
#### MMLU

```
lm_eval \
  --model vllm \
  --model_args pretrained="<model_name>",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=<n>,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path output_dir
```

#### MGSM

```
lm_eval \
  --model vllm \
  --model_args pretrained="<model_name>",dtype=auto,max_model_len=4096,max_gen_toks=2048,max_num_seqs=128,tensor_parallel_size=<n>,gpu_memory_utilization=0.9 \
  --tasks mgsm_cot_native \
  --num_fewshot 0 \
  --batch_size auto \
  --output_path output_dir
```
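
Once a run finishes, `lm_eval` writes a JSON summary under `output_dir`. A minimal sketch for pulling the headline MMLU accuracy out of it; the `results_*.json` file layout and the `acc,none` metric key reflect recent lm-evaluation-harness versions and may differ in yours:

```
# Locate the newest results file and print the aggregate MMLU accuracy.
results_file=$(find output_dir -name 'results_*.json' | sort | tail -n 1)
jq -r '.results.mmlu["acc,none"]' "$results_file"
```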
</details>

### Accuracy

<table>
  <thead>
    <tr>
      <th>Category</th>
      <th>Metric</th>
      <th>Qwen/Qwen2-VL-72B-Instruct</th>
      <th>neuralmagic/Qwen2-VL-72B-Instruct-FP8-Dynamic</th>
      <th>Recovery (%)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="6"><b>Vision</b></td>
      <td>MMMU (val, CoT)<br><i>explicit_prompt_relaxed_correctness</i></td>
      <td>62.11</td>
      <td>60.67</td>
      <td>97.68%</td>
    </tr>
    <tr>
      <td>VQAv2 (val)<br><i>vqa_match</i></td>
      <td>82.51</td>
      <td>82.44</td>
      <td>99.91%</td>
    </tr>
    <tr>
      <td>DocVQA (val)<br><i>anls</i></td>
      <td>95.01</td>
      <td>95.10</td>
      <td>100.09%</td>
    </tr>
    <tr>
      <td>ChartQA (test, CoT)<br><i>anywhere_in_answer_relaxed_correctness</i></td>
      <td>83.40</td>
      <td>83.68</td>
      <td>100.34%</td>
    </tr>
    <tr>
      <td>MathVista (testmini, CoT)<br><i>explicit_prompt_relaxed_correctness</i></td>
      <td>66.57</td>
      <td>67.07</td>
      <td>100.75%</td>
    </tr>
    <tr>
      <td><b>Average Score</b></td>
      <td><b>77.12</b></td>
      <td><b>77.39</b></td>
      <td><b>100.35%</b></td>
    </tr>
    <tr>
      <td rowspan="2"><b>Text</b></td>
      <td>MGSM (CoT)</td>
      <td>68.60</td>
      <td>67.78</td>
      <td>98.80%</td>
    </tr>
    <tr>
      <td>MMLU (5-shot)</td>
      <td>82.70</td>
      <td>82.60</td>
      <td>99.88%</td>
    </tr>
  </tbody>
</table>
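
The recovery column is the quantized score expressed as a percentage of the baseline score. A minimal sketch that recomputes it per benchmark from the table above (the last digit can differ from the table because the published scores are already rounded):

```
# Recovery (%) = 100 * quantized / baseline for each benchmark.
awk 'BEGIN {
  split("MMMU VQAv2 DocVQA ChartQA MathVista MGSM MMLU", name);
  split("62.11 82.51 95.01 83.40 66.57 68.60 82.70", base);
  split("60.67 82.44 95.10 83.68 67.07 67.78 82.60", quant);
  for (i = 1; i <= 7; i++)
    printf "%-10s %6.2f%%\n", name[i], 100 * quant[i] / base[i];
}'
```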

## Inference Performance
