---
library_name: transformers
tags: []
---

# Quantization Recipe

We used the following code to get the quantized model:

```
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization.quant_api import (
    Int8DynamicActivationIntxWeightConfig,
    MappingType,
    quantize_,
)
from torchao.quantization.granularity import PerGroup

model_id = "microsoft/Phi-4-mini-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# 8dq4w: int8 dynamic activation quantization with int4 weight quantization,
# symmetric weight mapping and a group size of 32, applied to the linear layers
linear_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int4,
    weight_granularity=PerGroup(32),
    weight_mapping_type=MappingType.SYMMETRIC,
)
quantize_(model, linear_config)

# save the quantized weights for conversion and export below
state_dict = model.state_dict()
torch.save(state_dict, "phi4-mini-8dq4w.pt")
```
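
To sanity-check the quantized model, you can run a short generation before or after saving. This is a minimal sketch using the standard `transformers` generate API; the prompt is just illustrative:

```
from transformers import AutoTokenizer

# quick smoke test of the quantized model from the recipe above
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("What is in a california roll?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)  # greedy decoding
print(tokenizer.decode(output[0], skip_special_tokens=True))
```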

# Model Quality

We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.
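
If the harness is not installed yet, it is available from PyPI (or can be installed from source per its README):

```
pip install lm-eval
```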

## Baseline

```
lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
```

## 8dq4w

```
import lm_eval
from lm_eval import evaluator
from lm_eval.utils import make_table

# `model` is the quantized model, i.e. after calling
# quantize_(model, linear_config) as in the recipe above
lm_eval_model = lm_eval.models.huggingface.HFLM(pretrained=model, batch_size=8)
results = evaluator.simple_evaluate(
    lm_eval_model, tasks=["hellaswag"], device="cuda:0", batch_size="auto"
)
print(make_table(results))
```

| Benchmark | Phi-4 mini-Ins | phi4-mini-8dq4w |
|-----------|----------------|-----------------|
| HellaSwag | 54.57 | 53.19 |

# Exporting to ExecuTorch

Exporting to ExecuTorch requires you to clone and install [ExecuTorch](https://github.com/pytorch/executorch).
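
A typical setup looks like the following; the install script name is an assumption that varies across releases, so defer to the instructions in the ExecuTorch repository:

```
git clone https://github.com/pytorch/executorch.git
cd executorch
# assumption: recent releases provide a root-level install script;
# older releases used ./install_requirements.sh instead
./install_executorch.sh
```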

## Convert the quantized checkpoint to ExecuTorch's format

```
python -m executorch.examples.models.phi_4_mini.convert_weights phi4-mini-8dq4w.pt phi4-mini-8dq4w-converted.pt
```

## Export to an ExecuTorch *.pte with XNNPACK

```
PARAMS="executorch/examples/models/phi_4_mini/config.json"
python -m executorch.examples.models.llama.export_llama \
  --model "phi_4_mini" \
  --checkpoint "phi4-mini-8dq4w-converted.pt" \
  --params "$PARAMS" \
  -kv \
  --use_sdpa_with_kv_cache \
  -X \
  --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
  --output_name="phi4-mini-8dq4w.pte"
```

## Run model with pybindings

```
export TOKENIZER="/path/to/tokenizer.json"
export TOKENIZER_CONFIG="/path/to/tokenizer_config.json"
export PROMPT="<|system|><|end|><|user|>What is in a california roll?<|end|><|assistant|>"
python -m executorch.examples.models.llama.runner.native \
  --model phi_4_mini \
  --pte phi4-mini-8dq4w.pte \
  -kv \
  --tokenizer ${TOKENIZER} \
  --tokenizer_config ${TOKENIZER_CONFIG} \
  --prompt "${PROMPT}" \
  --params "${PARAMS}" \
  --max_len 128 \
  --temperature 0
```