Update README.md
README.md
CHANGED
```diff
@@ -152,6 +152,7 @@ quantize_(
     model,
     quant_config,
 )
+tasks = ["mmlu_pro"]
 TransformerEvalWrapper(
     model=model,
     tokenizer=tokenizer,
```
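The hunk above wires the new `tasks = ["mmlu_pro"]` list into the eval call in the README's quantization example. For context, a sketch of how that call typically reads in full; the import path, the `max_seq_length` value, and the `run_eval(tasks=..., limit=...)` method are assumptions about torchao's eval helper (they may differ by torchao version), and `model` / `tokenizer` come from the earlier part of the README snippet, so treat this as illustrative rather than the exact API:

```python
# Sketch only: assumed torchao eval-helper usage around the diff above.
# The import path and run_eval signature are assumptions; check your torchao version.
from torchao._models._eval import TransformerEvalWrapper  # assumed location

tasks = ["mmlu_pro"]   # the list added in this commit
max_seq_length = 2048  # illustrative value, not from the README

TransformerEvalWrapper(
    model=model,       # the model quantized by quantize_(...) above
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
).run_eval(            # assumed method name
    tasks=tasks,
    limit=None,        # evaluate the full task
)
```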
```diff
@@ -212,10 +213,12 @@ and use a token with write access, from https://huggingface.co/settings/tokens
 # Model Quality
 We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model. Here we only run mmlu as a sanity check.
 
-
-
-
-
+Since the checkpoint is tuned on `mmlu_pro`, we check the accuracy on `mmlu_pro`:
+
+| Benchmark |                               |                                  |                                      |
+|-----------|-------------------------------|----------------------------------|--------------------------------------|
+|           | microsoft/Phi-4-mini-instruct | pytorch/Phi-4-mini-instruct-INT4 | pytorch/Phi-4-mini-instruct-AWQ-INT4 |
+| mmlu_pro  | 46.43                         | 36.74                            |                                      |
 
 
 <details>
```
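The new `mmlu_pro` row can also be spot-checked independently of the script above by driving lm-evaluation-harness from Python. A minimal sketch, assuming the `lm_eval` package (0.4+) with its `HFLM` wrapper and `simple_evaluate` entry point and a single CUDA GPU; the batch size and dtype are illustrative:

```python
# Hypothetical sketch: score the already-quantized checkpoint on mmlu_pro
# with lm-evaluation-harness (`pip install lm-eval`).
import lm_eval
from lm_eval.models.huggingface import HFLM

# Wrap the HF checkpoint discussed in this README diff.
lm = HFLM(
    pretrained="pytorch/Phi-4-mini-instruct-AWQ-INT4",
    dtype="bfloat16",
    batch_size=8,
)

# Run only mmlu_pro, mirroring `tasks = ["mmlu_pro"]` in the diff above.
results = lm_eval.simple_evaluate(model=lm, tasks=["mmlu_pro"])
print(results["results"]["mmlu_pro"])
```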
```diff
@@ -245,8 +248,8 @@ lm_eval --model hf --model_args pretrained=$MODEL --tasks mmlu --device cuda:0 -
 
 | Benchmark        |                |                                |
 |------------------|----------------|--------------------------------|
-|                  | microsoft/Phi-4-mini-instruct |
-| Peak Memory (GB) |
+|                  | microsoft/Phi-4-mini-instruct | pytorch/Phi-4-mini-instruct-AWQ-INT4 |
+| Peak Memory (GB) | 8.91                          | 3.95 (55.67% reduction)              |
 
 
 
```
```diff
@@ -259,7 +262,7 @@ We can use the following code to get a sense of peak memory usage during inference
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
 
-# use "microsoft/Phi-4-mini-instruct" or "
+# use "microsoft/Phi-4-mini-instruct" or "pytorch/Phi-4-mini-instruct-AWQ-INT4"
 model_id = "jerryzh168/Phi-4-mini-instruct-AWQ-INT4"
 quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
 tokenizer = AutoTokenizer.from_pretrained(model_id)
```
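The `Peak Memory (GB)` column two hunks up is obtained by running one generation and reading back CUDA's peak-memory counters, which is presumably what the `print(f"Peak Memory Usage: {mem:.02f} GB")` line in the next hunk's context prints. A minimal, self-contained sketch of that kind of measurement; the prompt text and `max_new_tokens` are placeholders:

```python
# Hypothetical sketch of a peak-memory measurement around generate().
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/Phi-4-mini-instruct-AWQ-INT4"  # or "microsoft/Phi-4-mini-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

torch.cuda.reset_peak_memory_stats()  # start the high-water mark from zero

inputs = tokenizer("What are we having for dinner?", return_tensors="pt").to(model.device)
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=128)

# max_memory_reserved() reports the peak of the CUDA caching allocator.
mem = torch.cuda.max_memory_reserved() / 1e9
print(f"Peak Memory Usage: {mem:.02f} GB")
```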
```diff
@@ -305,8 +308,13 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
 ## Results (A100 machine)
 | Benchmark (Latency)      |                |                          |
 |--------------------------|----------------|--------------------------|
-|                          | microsoft/Phi-4-mini-instruct |
-| latency (batch_size=1)   |
+|                          | microsoft/Phi-4-mini-instruct | pytorch/Phi-4-mini-instruct-AWQ-INT4 |
+| latency (batch_size=1)   | 1.60s                         | 1.37s (1.17x speedup)                |
+| latency (batch_size=256) | 5.47s                         | 5.55s (0.98x speedup)                |
+
+
+Note: the AWQ-INT4 checkpoint is expected to be slower at batch size 256, since at larger batch sizes the workload becomes compute bound rather than memory bound,
+and an INT4 weight-only checkpoint is only expected to give a speedup in memory-bound situations.
 
 <details>
 <summary> Reproduce Model Performance Results </summary>
```
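The latency rows and the batch-size note above come from the reproduction steps kept in the README's collapsed `<details>` section; the snippet below is only a rough, self-contained way to observe the batch-size-1 vs batch-size-256 behaviour yourself (the prompt, token count, and batch sizes are illustrative, and it will not reproduce the table's exact numbers):

```python
# Hypothetical sketch: rough latency check at batch size 1 vs 256.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/Phi-4-mini-instruct-AWQ-INT4"  # or "microsoft/Phi-4-mini-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for batched padding

def timed_generate(batch_size: int, max_new_tokens: int = 128) -> float:
    prompts = ["What are we having for dinner?"] * batch_size
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    return time.perf_counter() - start

timed_generate(1)  # warmup (kernel compilation, cache allocation)
for bs in (1, 256):
    print(f"batch_size={bs}: {timed_generate(bs):.2f}s")
```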