mwitiderrick committed
Commit 01b06a0 · 1 Parent(s): 58bd8dc

Create README.md

Files changed (1)
  1. README.md +107 -0
README.md ADDED
@@ -0,0 +1,107 @@
---
base_model: TinyLlama/TinyLlama-1.1B-Chat-v0.4
inference: false
model_type: llama
prompt_template: |
  <|im_start|>user\n
  {prompt}<|im_end|>\n
  <|im_start|>assistant\n
quantized_by: mwitiderrick
tags:
- deepsparse
---
## TinyLlama 1.1B Chat 0.4 - DeepSparse
This repo contains model files for [TinyLlama 1.1B Chat](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.4) optimized for [DeepSparse](https://github.com/neuralmagic/deepsparse), a CPU inference runtime for sparse models.

This model was quantized and pruned with [SparseGPT](https://arxiv.org/abs/2301.00774), using [SparseML](https://github.com/neuralmagic/sparseml).

## Inference
Install [DeepSparse LLM](https://github.com/neuralmagic/deepsparse) for fast inference on CPUs:
```bash
pip install deepsparse-nightly[llm]
```
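To confirm the nightly wheel installed with the LLM extra, a quick sanity check (assuming, as with most packages, that `deepsparse` exposes a `__version__` attribute) is:
```python
# Sanity check: the [llm] extra should make the TextGeneration pipeline importable.
import deepsparse
from deepsparse import TextGeneration  # noqa: F401  (import check only)

print(deepsparse.__version__)
```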
Run in a [Python pipeline](https://github.com/neuralmagic/deepsparse/blob/main/docs/llms/text-generation-pipeline.md):
```python
from deepsparse import TextGeneration

prompt = "How to make banana bread?"
formatted_prompt = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"

model = TextGeneration(model="hf:neuralmagic/TinyLlama-1.1B-Chat-v0.4-pruned50-quant")
print(model(formatted_prompt, max_new_tokens=500).generations[0].text)

"""
Banana bread is a delicious and easy-to-make recipe that is sure to please. Here is a recipe for making banana bread:

Ingredients:

For the Banana Bread:

- 1 cup of sugar
- 1 cup of flour
- 1/2 cup of mashed bananas
- 1/4 cup of milk
- 1/2 cup of melted butter
- 1/4 cup of baking powder
- 1/4 cup of baking soda
- 1/4 cup of eggs
- 1/4 cup of milk
- 1/4 cup of sugar


Instructions:

1. Preheat the oven to 325°F (160°C).
2. In a large bowl, combine the sugar and flour.
3. In a separate bow, combine the mashed bananas, milk, butter, baking powder, baking soda, milk, sugar.
4. Add the bananas and milk into the flour-sugar mixture.
5. Pour the milk into the bowl of the flour-sugar mixture.
6. Pour the baking powder into the bowl of the flour-sugar mixture.
7. Pour the mashed bananas into the bowl of the flour-sugar mixture.
8. Add the eggs into the bowl of the flour-sugar mixture.
9. Stir the mixture until it becomes a dough.
10. Grease a 9-inch (23 cm) square pan.
11. Pour the mixture into the pan.
12. Bake the banana bread in the oven for 40 minutes.
13. Remove the banana bread from the oven and cool it.
14. Cut the bread into 16 pieces.
15. Make the glaze:
16. Sprinkle the sugar over the bread.
17. Bake the bread in the oven for 30 minutes.
"""
```
## Prompt template

```
<|im_start|>user\n
{prompt}<|im_end|>\n
<|im_start|>assistant\n
```
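For repeated calls, the template can be applied in a small helper before invoking the pipeline. This is a minimal sketch that reuses the `TextGeneration` pipeline from the inference example above; the `chat` helper name is purely illustrative:
```python
from deepsparse import TextGeneration

model = TextGeneration(model="hf:neuralmagic/TinyLlama-1.1B-Chat-v0.4-pruned50-quant")

def chat(prompt: str, max_new_tokens: int = 256) -> str:
    # Wrap the raw prompt in the chat template shown above before generation.
    formatted = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
    return model(formatted, max_new_tokens=max_new_tokens).generations[0].text

print(chat("What is model pruning?"))
```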
## Sparsification
For details on how this model was sparsified, see the `recipe.yaml` in this repo and follow the instructions below.

```bash
git clone https://github.com/neuralmagic/sparseml
pip install -e "sparseml[transformers]"
wget https://huggingface.co/neuralmagic/TinyLlama-1.1B-Chat-v0.4-pruned50-quant/raw/main/recipe.yaml # download recipe
python sparseml/src/sparseml/transformers/sparsification/obcq/obcq.py TinyLlama/TinyLlama-1.1B-Chat-v0.4 open_platypus --recipe recipe.yaml --save True
python sparseml/src/sparseml/transformers/sparsification/obcq/export.py --task text-generation --model_path obcq_deployment
cp deployment/model.onnx deployment/model-orig.onnx
```
Run this KV-cache injection script to speed up the model at inference by caching the key and value attention states:
```python
import os
import onnx
from sparseml.exporters.kv_cache_injector import KeyValueCacheInjector

input_file = "deployment/model-orig.onnx"
output_file = "deployment/model.onnx"
model = onnx.load(input_file, load_external_data=False)
model = KeyValueCacheInjector(model_path=os.path.dirname(input_file)).apply(model)
onnx.save(model, output_file)
print(f"Modified model saved to: {output_file}")
```
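Once the KV cache has been injected, the exported model can be sanity-checked locally. The snippet below is a sketch assuming the `TextGeneration` pipeline accepts a local deployment directory (containing `model.onnx`, the config, and tokenizer files) in place of the `hf:` stub used earlier:
```python
from deepsparse import TextGeneration

# Point the pipeline at the locally exported deployment directory.
model = TextGeneration(model="deployment")

prompt = "How to make banana bread?"
formatted_prompt = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
print(model(formatted_prompt, max_new_tokens=200).generations[0].text)
```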
Follow the instructions on our [One Shot With SparseML](https://github.com/neuralmagic/sparseml/tree/main/src/sparseml/transformers/sparsification/obcq) page for a step-by-step guide to performing one-shot quantization of large language models.
## Slack

For further support, and to discuss these models and AI in general, join [Neural Magic's Slack Community](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ).