shubhrapandit committed · Commit 7dca866 (verified) · Parent: 82819f8

Update README.md

Files changed (1): README.md (+118, -2)
README.md CHANGED
@@ -118,17 +118,133 @@ oneshot(
 
 ## Evaluation
 
- The model was evaluated on OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard), OpenLLM Leaderboard [V2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/) and on [HumanEval](https://github.com/neuralmagic/evalplus), using the following commands:
+ The model was evaluated using [mistral-evals](https://github.com/neuralmagic/mistral-evals) for vision-related tasks and using [lm_evaluation_harness](https://github.com/neuralmagic/lm-evaluation-harness) for select text-based benchmarks. The evaluations were conducted using the following commands:
 
 <details>
 <summary>Evaluation Commands</summary>
+
+ ### Vision Tasks
+ - vqav2
+ - docvqa
+ - mathvista
+ - mmmu
+ - chartqa
+
+ ```
+ vllm serve neuralmagic/pixtral-12b-quantized.w4a16 --tensor_parallel_size 1 --max_model_len 25000 --trust_remote_code --max_num_seqs 8 --gpu_memory_utilization 0.9 --dtype float16 --limit_mm_per_prompt image=7
+
+ python -m eval.run eval_vllm \
+   --model_name neuralmagic/pixtral-12b-quantized.w4a16 \
+   --url http://0.0.0.0:8000 \
+   --output_dir ~/tmp \
+   --eval_name <vision_task_name>
+ ```
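
For convenience, the eval command above can be looped over the five listed vision tasks. A minimal sketch (editorial, not part of the model card), assuming the `vllm serve` instance above is still running on port 8000 and that the bullet names are valid `--eval_name` values for mistral-evals:

```
# Run mistral-evals once per vision benchmark against the running vLLM server.
# Task names are taken verbatim from the bullet list above.
for task in vqav2 docvqa mathvista mmmu chartqa; do
  python -m eval.run eval_vllm \
    --model_name neuralmagic/pixtral-12b-quantized.w4a16 \
    --url http://0.0.0.0:8000 \
    --output_dir ~/tmp \
    --eval_name "${task}"
done
```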
 
+ ### Text-based Tasks
+ #### MMLU
+
 ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic/pixtral-12b-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=<n>,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
+   --tasks mmlu \
+   --num_fewshot 5 \
+   --batch_size auto \
+   --output_path output_dir
+
 ```
 
+ #### HumanEval
+
+ ##### Generation
+ ```
+ python3 codegen/generate.py \
+   --model neuralmagic/pixtral-12b-quantized.w4a16 \
+   --bs 16 \
+   --temperature 0.2 \
+   --n_samples 50 \
+   --root "." \
+   --dataset humaneval
+ ```
+ ##### Sanitization
+ ```
+ python3 evalplus/sanitize.py \
+   humaneval/neuralmagic/pixtral-12b-quantized.w4a16_vllm_temp_0.2
+ ```
+ ##### Evaluation
+ ```
+ evalplus.evaluate \
+   --dataset humaneval \
+   --samples humaneval/neuralmagic/pixtral-12b-quantized.w4a16_vllm_temp_0.2-sanitized
+ ```
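
The three HumanEval steps are connected through the sample directory that generation writes and the later steps read. A sketch of the pipeline (editorial, not part of the diff) with the model and temperature factored into variables, assuming the directory layout implied by the paths above:

```
# HumanEval pipeline: generate -> sanitize -> evaluate.
MODEL="neuralmagic/pixtral-12b-quantized.w4a16"
TEMP="0.2"

# 1) Generate 50 samples per problem at temperature 0.2.
python3 codegen/generate.py \
  --model "${MODEL}" \
  --bs 16 \
  --temperature "${TEMP}" \
  --n_samples 50 \
  --root "." \
  --dataset humaneval

# 2) Sanitize the raw completions written under humaneval/<model>_vllm_temp_<temp>.
python3 evalplus/sanitize.py \
  "humaneval/${MODEL}_vllm_temp_${TEMP}"

# 3) Score the sanitized samples (the table below reports pass@1).
evalplus.evaluate \
  --dataset humaneval \
  --samples "humaneval/${MODEL}_vllm_temp_${TEMP}-sanitized"
```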
  </details>
 
- ### Accuracy
+ ## Accuracy
+
+ <table border="1">
+   <thead>
+     <tr>
+       <th>Category</th>
+       <th>Metric</th>
+       <th>mgoin/pixtral-12b</th>
+       <th>neuralmagic/pixtral-12b-FP8-Dynamic</th>
+       <th>Recovery (%)</th>
+     </tr>
+   </thead>
+   <tbody>
+     <tr>
+       <td rowspan="6"><b>Vision</b></td>
+       <td>MMMU (val, CoT)<br><i>explicit_prompt_relaxed_correctness</i></td>
+       <td>48.00</td>
+       <td>50.11</td>
+       <td>104.40%</td>
+     </tr>
+     <tr>
+       <td>VQAv2 (val)<br><i>vqa_match</i></td>
+       <td>78.71</td>
+       <td>78.44</td>
+       <td>99.66%</td>
+     </tr>
+     <tr>
+       <td>DocVQA (val)<br><i>anls</i></td>
+       <td>89.47</td>
+       <td>89.20</td>
+       <td>99.70%</td>
+     </tr>
+     <tr>
+       <td>ChartQA (test, CoT)<br><i>anywhere_in_answer_relaxed_correctness</i></td>
+       <td>81.68</td>
+       <td>81.76</td>
+       <td>100.10%</td>
+     </tr>
+     <tr>
+       <td>Mathvista (testmini, CoT)<br><i>explicit_prompt_relaxed_correctness</i></td>
+       <td>56.50</td>
+       <td>58.70</td>
+       <td>103.89%</td>
+     </tr>
+     <tr>
+       <td><b>Average Score</b></td>
+       <td><b>70.07</b></td>
+       <td><b>71.24</b></td>
+       <td><b>101.67%</b></td>
+     </tr>
+     <tr>
+       <td rowspan="2"><b>Text</b></td>
+       <td>HumanEval<br><i>pass@1</i></td>
+       <td>68.40</td>
+       <td>69.50</td>
+       <td>101.61%</td>
+     </tr>
+     <tr>
+       <td>MMLU (5-shot)</td>
+       <td>71.40</td>
+       <td>69.50</td>
+       <td>97.34%</td>
+     </tr>
+   </tbody>
+ </table>
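
The Recovery column reads as the quantized score expressed as a percentage of the unquantized baseline. A quick arithmetic check of the MMMU row under that assumption:

```
# Recovery (%) = 100 * quantized / baseline; MMMU row: 50.11 vs 48.00.
awk 'BEGIN { printf "%.2f%%\n", 100 * 50.11 / 48.00 }'   # prints 104.40%
```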
+
 
 ## Inference Performance