Update README.md
README.md CHANGED
@@ -125,7 +125,43 @@ tokenizer.push_to_hub(save_to)
```

# Model Quality

We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.

| Benchmark                  | Qwen3-8B | Qwen3-8B-int4wo |
|----------------------------|----------|-----------------|
| **General**                |          |                 |
| mmlu                       | 73.04    | 70.4            |
| mmlu_pro                   | 53.81    | 52.79           |
| bbh                        | 79.33    | 74.92           |
| **Multilingual**           |          |                 |
| mgsm_en_cot_en             | 39.6     | 33.2            |
| m_mmlu (avg)               | 57.17    | 54.06           |
| **Math**                   |          |                 |
| gpqa_main_zeroshot         | 35.71    | 32.14           |
| gsm8k                      | 87.79    | 86.28           |
| leaderboard_math_hard (v3) | 53.7     | 46.83           |
| **Overall**                | 60.02    | 56.33           |

<details>
<summary> Reproduce Model Quality Results </summary>

Install lm-eval from source:
https://github.com/EleutherAI/lm-evaluation-harness#install

## baseline
```Shell
lm_eval --model hf --model_args pretrained=Qwen/Qwen3-8B --tasks mmlu --device cuda:0 --batch_size 8
```

## int4 weight-only quantization (int4wo)
```Shell
# the quantized checkpoint pushed above (the `save_to` repo), e.g.:
export MODEL=pytorch/Qwen3-8B-int4wo
lm_eval --model hf --model_args pretrained=$MODEL --tasks mmlu --device cuda:0 --batch_size 8
```
</details>
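
The commands above drive lm-eval through its CLI. The same runs can also be scripted with lm-eval's `simple_evaluate` Python API, which makes it easy to loop over all eight benchmarks and average them (the **Overall** row is the unweighted mean of the eight benchmark scores). The sketch below is illustrative rather than part of this card: the checkpoint name is the baseline, and the task list follows the table rows and may need to be adjusted to the exact task/group names in your lm-eval version.

```Py
# Sketch: scripting the evaluation via lm-eval's Python API instead of the CLI.
# Assumes lm-eval is installed from source as described above.
import lm_eval

# Baseline model; swap in the int4wo checkpoint to reproduce the second column.
MODEL = "Qwen/Qwen3-8B"

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=f"pretrained={MODEL}",
    tasks=["mmlu", "mmlu_pro", "bbh", "gsm8k", "gpqa_main_zeroshot"],
    batch_size=8,
    device="cuda:0",
)

# Print the per-task metrics reported by lm-eval.
for task, metrics in results["results"].items():
    print(task, metrics)
```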

# Memory Usage

@@ -135,7 +171,7 @@ TODO
| Peak Memory | 65.72 GB | 34.54 GB (-47.44%) |

<details>
<summary> Reproduce Peak Memory Usage Results </summary>

Code
```Py
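# Sketch (an assumption, not necessarily the card's exact script): measuring
# peak GPU memory for the comparison above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Baseline model; swap in the int4wo checkpoint to reproduce the second column.
model_id = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda:0"
)

# Reset the peak-memory counter, run a generation, then report the peak.
torch.cuda.reset_peak_memory_stats()
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
model.generate(**inputs, max_new_tokens=128)
print(f"Peak Memory: {torch.cuda.max_memory_reserved() / 1e9:.2f} GB")
```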