---
library_name: transformers
tags: []
---

# Quantization Recipe

We used the following code to get the quantized model:

```
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization.quant_api import (
    Int8DynamicActivationIntxWeightConfig,
    MappingType,
    quantize_,
)
from torchao.quantization.granularity import PerGroup

model_id = "microsoft/Phi-4-mini-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# 8dq4w: int8 dynamic activation quantization with int4 weight quantization,
# symmetric weight mapping and a group size of 32, applied to the linear layers
linear_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int4,
    weight_granularity=PerGroup(32),
    weight_mapping_type=MappingType.SYMMETRIC,
)
quantize_(model, linear_config)

# save the quantized weights for conversion and export below
state_dict = model.state_dict()
torch.save(state_dict, "phi4-mini-8dq4w.pt")
```
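
To sanity-check the quantized model, you can run a short generation before or after saving. This is a minimal sketch using the standard `transformers` generate API; the prompt is just illustrative:

```
from transformers import AutoTokenizer

# quick smoke test of the quantized model from the recipe above
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("What is in a california roll?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)  # greedy decoding
print(tokenizer.decode(output[0], skip_special_tokens=True))
```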

# Model Quality

We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.
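
If the harness is not installed yet, it is available from PyPI (or can be installed from source per its README):

```
pip install lm-eval
```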

## Baseline

```
lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
```

## 8dq4w

```
import lm_eval
from lm_eval import evaluator
from lm_eval.utils import make_table

# `model` is the quantized model, i.e. after calling
# quantize_(model, linear_config) as in the recipe above
lm_eval_model = lm_eval.models.huggingface.HFLM(pretrained=model, batch_size=8)
results = evaluator.simple_evaluate(
    lm_eval_model, tasks=["hellaswag"], device="cuda:0", batch_size="auto"
)
print(make_table(results))
```

| Benchmark | Phi-4 mini-Ins | phi4-mini-8dq4w |
|-----------|----------------|-----------------|
| HellaSwag | 54.57 | 53.19 |

# Exporting to ExecuTorch

Exporting to ExecuTorch requires you to clone and install [ExecuTorch](https://github.com/pytorch/executorch).
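
A typical setup looks like the following; the install script name is an assumption that varies across releases, so defer to the instructions in the ExecuTorch repository:

```
git clone https://github.com/pytorch/executorch.git
cd executorch
# assumption: recent releases provide a root-level install script;
# older releases used ./install_requirements.sh instead
./install_executorch.sh
```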

## Convert the quantized checkpoint to ExecuTorch's format

```
python -m executorch.examples.models.phi_4_mini.convert_weights phi4-mini-8dq4w.pt phi4-mini-8dq4w-converted.pt
```

## Export to an ExecuTorch *.pte with XNNPACK

```
PARAMS="executorch/examples/models/phi_4_mini/config.json"
python -m executorch.examples.models.llama.export_llama \
  --model "phi_4_mini" \
  --checkpoint "phi4-mini-8dq4w-converted.pt" \
  --params "$PARAMS" \
  -kv \
  --use_sdpa_with_kv_cache \
  -X \
  --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
  --output_name="phi4-mini-8dq4w.pte"
```

## Run model with pybindings

```
export TOKENIZER="/path/to/tokenizer.json"
export TOKENIZER_CONFIG="/path/to/tokenizer_config.json"
export PROMPT="<|system|><|end|><|user|>What is in a california roll?<|end|><|assistant|>"
python -m executorch.examples.models.llama.runner.native \
  --model phi_4_mini \
  --pte phi4-mini-8dq4w.pte \
  -kv \
  --tokenizer ${TOKENIZER} \
  --tokenizer_config ${TOKENIZER_CONFIG} \
  --prompt "${PROMPT}" \
  --params "${PARAMS}" \
  --max_len 128 \
  --temperature 0
```