nm-research committed
Commit 9662a76 · verified · 1 Parent(s): 02e39c0

Update README.md

Files changed (1)
  1. README.md +88 -24
README.md CHANGED
@@ -43,7 +43,7 @@ from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 4096, 1
- model_name = "neuralmagic-ent/granite-3.1-8b-base-quantized.w4a16"
+ model_name = "neuralmagic/granite-3.1-8b-base-quantized.w4a16"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
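The hunk above cuts off after `sampling_params`; the rest of the README's usage snippet is not shown in this diff. Purely as a rough sketch (the prompt string and print loop are illustrative assumptions, not part of the commit), generation with the objects defined above would continue along these lines:

```python
# Illustrative continuation of the snippet above (not part of the commit).
# Assumes `llm` and `sampling_params` from the preceding lines.
prompt = "The capital of France is"  # hypothetical prompt for a base model

outputs = llm.generate([prompt], sampling_params)

for output in outputs:
    # Each RequestOutput carries the prompt and its generated completions.
    print(output.outputs[0].text)
```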
@@ -66,6 +66,8 @@ vLLM also supports OpenAI-compatible serving. See the [documentation](https://do

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

+ <details>
+ <summary>Model Creation Code</summary>

```bash
python quantize.py --model_path ibm-granite/granite-3.1-8b-base --quant_path "output_dir/granite-3.1-8b-base-quantized.w4a16" --calib_size 3072 --dampening_frac 0.1 --observer mse --actorder static
@@ -146,16 +148,20 @@ oneshot(
model.save_pretrained(quant_path, save_compressed=True)
tokenizer.save_pretrained(quant_path)
```
+ </details>

## Evaluation

- The model was evaluated on OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard) and on [HumanEval](https://github.com/neuralmagic/evalplus), using the following commands:
+ The model was evaluated on OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard), OpenLLM Leaderboard [V2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/) and on [HumanEval](https://github.com/neuralmagic/evalplus), using the following commands:
+
+ <details>
+ <summary>Evaluation Commands</summary>

OpenLLM Leaderboard V1:
```
lm_eval \
--model vllm \
- --model_args pretrained="neuralmagic-ent/granite-3.1-8b-base-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
+ --model_args pretrained="neuralmagic/granite-3.1-8b-base-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
--tasks openllm \
--write_out \
--batch_size auto \
@@ -163,11 +169,23 @@ lm_eval \
--show_config
```

+ OpenLLM Leaderboard V2:
+ ```
+ lm_eval \
+ --model vllm \
+ --model_args pretrained="neuralmagic/granite-3.1-8b-base-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
+ --tasks leaderboard \
+ --write_out \
+ --batch_size auto \
+ --output_path output_dir \
+ --show_config
+ ```
+
#### HumanEval
##### Generation
```
python3 codegen/generate.py \
- --model neuralmagic-ent/granite-3.1-8b-base-quantized.w4a16 \
+ --model neuralmagic/granite-3.1-8b-base-quantized.w4a16 \
--bs 16 \
--temperature 0.2 \
--n_samples 50 \
@@ -177,36 +195,82 @@ python3 codegen/generate.py \
##### Sanitization
```
python3 evalplus/sanitize.py \
- humaneval/neuralmagic-ent--granite-3.1-8b-base-quantized.w4a16_vllm_temp_0.2
+ humaneval/neuralmagic--granite-3.1-8b-base-quantized.w4a16_vllm_temp_0.2
```
##### Evaluation
```
evalplus.evaluate \
--dataset humaneval \
- --samples humaneval/neuralmagic-ent--granite-3.1-8b-base-quantized.w4a16_vllm_temp_0.2-sanitized
+ --samples humaneval/neuralmagic--granite-3.1-8b-base-quantized.w4a16_vllm_temp_0.2-sanitized
```
+ </details>

### Accuracy

- #### OpenLLM Leaderboard V1 evaluation scores
-
-
- | Metric | ibm-granite/granite-3.1-8b-base | neuralmagic-ent/granite-3.1-8b-base-quantized.w4a16 |
- |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
- | ARC-Challenge (Acc-Norm, 25-shot) | 64.68 | 62.37 |
- | GSM8K (Strict-Match, 5-shot) | 60.88 | 54.89 |
- | HellaSwag (Acc-Norm, 10-shot) | 83.52 | 82.53 |
- | MMLU (Acc, 5-shot) | 63.33 | 62.78 |
- | TruthfulQA (MC2, 0-shot) | 51.33 | 51.30 |
- | Winogrande (Acc, 5-shot) | 80.90 | 79.24 |
- | **Average Score** | **67.44** | **65.52** |
- | **Recovery** | **100.00** | **97.15** |
-
- #### HumanEval pass@1 scores
+ <table>
+ <thead>
+ <tr>
+ <th>Category</th>
+ <th>Metric</th>
+ <th>ibm-granite/granite-3.1-8b-base</th>
+ <th>neuralmagic/granite-3.1-8b-base-quantized.w4a16</th>
+ <th>Recovery (%)</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td rowspan="7"><b>OpenLLM Leaderboard V1</b></td>
+ <td>ARC-Challenge (Acc-Norm, 25-shot)</td>
+ <td>64.68</td>
+ <td>62.37</td>
+ <td>96.43</td>
+ </tr>
+ <tr>
+ <td>GSM8K (Strict-Match, 5-shot)</td>
+ <td>60.88</td>
+ <td>54.89</td>
+ <td>90.16</td>
+ </tr>
+ <tr>
+ <td>HellaSwag (Acc-Norm, 10-shot)</td>
+ <td>83.52</td>
+ <td>82.53</td>
+ <td>98.81</td>
+ </tr>
+ <tr>
+ <td>MMLU (Acc, 5-shot)</td>
+ <td>63.33</td>
+ <td>62.78</td>
+ <td>99.13</td>
+ </tr>
+ <tr>
+ <td>TruthfulQA (MC2, 0-shot)</td>
+ <td>51.33</td>
+ <td>51.30</td>
+ <td>99.94</td>
+ </tr>
+ <tr>
+ <td>Winogrande (Acc, 5-shot)</td>
+ <td>80.90</td>
+ <td>79.24</td>
+ <td>97.95</td>
+ </tr>
+ <tr>
+ <td><b>Average Score</b></td>
+ <td><b>67.44</b></td>
+ <td><b>65.52</b></td>
+ <td><b>97.15</b></td>
+ </tr>
+ <tr>
+ <td rowspan="2"><b>HumanEval</b></td>
+ <td>HumanEval Pass@1</td>
+ <td>44.10</td>
+ <td>40.70</td>
+ <td><b>92.28</b></td>
+ </tr>
+ </tbody>
+ </table>

- | Metric | ibm-granite/granite-3.1-8b-base | neuralmagic-ent/granite-3.1-8b-base-quantized.w4a16 |
- |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
- | HumanEval Pass@1 | 44.10 | 40.70 |

---

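For reference, the Recovery column added in this hunk is the quantized score expressed as a percentage of the baseline score. A small sketch that reproduces the Average Score and Recovery figures, with the values copied verbatim from the table above:

```python
# Reproduce the Average Score and Recovery figures from the accuracy table.
baseline = {
    "ARC-Challenge": 64.68,
    "GSM8K": 60.88,
    "HellaSwag": 83.52,
    "MMLU": 63.33,
    "TruthfulQA": 51.33,
    "Winogrande": 80.90,
}
quantized = {
    "ARC-Challenge": 62.37,
    "GSM8K": 54.89,
    "HellaSwag": 82.53,
    "MMLU": 62.78,
    "TruthfulQA": 51.30,
    "Winogrande": 79.24,
}

base_avg = sum(baseline.values()) / len(baseline)
quant_avg = sum(quantized.values()) / len(quantized)

print(f"Average (baseline):  {base_avg:.2f}")                     # ~67.44
print(f"Average (quantized): {quant_avg:.2f}")                    # ~65.52
print(f"Recovery:            {100 * quant_avg / base_avg:.2f}%")  # ~97.15
```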
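The `quantize.py` invoked in the creation hunk is not included in this diff, so the exact recipe is not visible here. Purely as a hypothetical sketch of how a W4A16 GPTQ run with llm-compressor is typically wired up (the dataset, sequence length, and most arguments below are assumptions, not taken from the commit; only `--dampening_frac 0.1` and `--calib_size 3072` are mirrored from the command line shown):

```python
# Hypothetical sketch of a W4A16 GPTQ flow with llm-compressor.
# This is NOT the quantize.py from the commit; names and arguments are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

model_id = "ibm-granite/granite-3.1-8b-base"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit weights, 16-bit activations, lm_head left unquantized.
# dampening_frac mirrors the --dampening_frac 0.1 flag shown in the diff;
# the --observer mse and --actorder static flags are not modeled here.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.1,
)

# Calibration pass; 3072 samples mirrors --calib_size 3072 from the diff.
# The dataset and sequence length are placeholders.
oneshot(
    model=model,
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=3072,
)

model.save_pretrained("granite-3.1-8b-base-quantized.w4a16", save_compressed=True)
tokenizer.save_pretrained("granite-3.1-8b-base-quantized.w4a16")
```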