---
base_model: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
inference: false
model_type: llama
prompt_template: |
  Question {prompt}\n
  Answer:
quantized_by: mwitiderrick
tags:
- deepsparse
---
## TinyLlama-1.1B-intermediate-step-1431k-3T-gsms8k-pruned50-quant-ds

This repo contains model files for [TinyLlama-1.1B-intermediate-step-1431k-3T](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T) optimized for [DeepSparse](https://github.com/neuralmagic/deepsparse), a CPU inference runtime for sparse models.

This model was pruned and quantized with [SparseGPT](https://arxiv.org/abs/2301.00774), using [SparseML](https://github.com/neuralmagic/sparseml).

## Inference

Install [DeepSparse LLM](https://github.com/neuralmagic/deepsparse) for fast inference on CPUs:

```bash
pip install deepsparse-nightly[llm]
```

Run in a [Python pipeline](https://github.com/neuralmagic/deepsparse/blob/main/docs/llms/text-generation-pipeline.md):

```python
from deepsparse import TextGeneration

prompt = "James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week?"
formatted_prompt = f"Question:{prompt}\nAnswer:"

model = TextGeneration(model_path="hf:nm-testing/TinyLlama-1.1B-intermediate-step-1431k-3T-gsms8k-pruned50-quant-ds")

print(model(formatted_prompt, max_new_tokens=200).generations[0].text)
"""
He runs 3*60=<<3*60=180>>180 meters a week
So he runs 180/3=<<180/3=60>>60 times a week
#### 60
"""
```

To obtain the final model, the following process was followed:

- Fine-tune the base model on GSM8K for 2 epochs using the Hugging Face SFTTrainer
- Sparsify the model to 50% using SparseML
- Fine-tune the sparse model on the GSM8K dataset
- Perform one-shot quantization of the resulting model
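GSM8K-style completions like the one above place the final numeric answer after a `####` marker. If you want to score the model's outputs programmatically, a minimal helper along these lines can extract that answer (this is a hypothetical utility, not part of this repo or DeepSparse):

```python
import re
from typing import Optional


def extract_gsm8k_answer(text: str) -> Optional[str]:
    """Extract the final numeric answer after '####' in a GSM8K-style completion.

    Returns the answer with thousands separators stripped, or None if no
    '#### <number>' marker is found.
    """
    match = re.search(r"####\s*([-+]?[\d,]*\.?\d+)", text)
    if match is None:
        return None
    return match.group(1).replace(",", "")


# Example on the sample completion shown earlier in this card:
completion = "He runs 3*60=<<3*60=180>>180 meters a week\nSo he runs 180/3=<<180/3=60>>60 times a week\n#### 60"
print(extract_gsm8k_answer(completion))  # -> 60
```

Comparing this extracted string against the reference answer in the GSM8K test split gives a simple exact-match accuracy metric.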