---
language:
- en
license: llama3
tags:
- Llama-3
- instruct
- finetune
- chatml
- gpt4
- synthetic data
- distillation
- function calling
- json mode
- axolotl
- roleplaying
- chat
- quantization
- AWQ
base_model: meta-llama/Llama-3.2-3B
widget:
- example_title: Hermes 3 AWQ 4-bit
  messages:
  - role: system
    content: >-
      You are a sentient, superintelligent artificial general intelligence, here
      to teach and assist me.
  - role: user
    content: >-
      Write a short story about Goku discovering Kirby has teamed up with Majin
      Buu to destroy the world.
model-index:
- name: Hermes-3-Llama-3.2-3B-AWQ-4bit
  results: []
library_name: transformers
---

# Hermes 3 - Llama-3.2 3B (AWQ 4-bit)

![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/-kj_KflXsdpcZoTQsvx7W.jpeg)

## Model Description

This is a 4-bit AWQ (Activation-aware Weight Quantization) quantized version of **Hermes 3 - Llama-3.2 3B**, a fine-tuned LLM developed by Nous Research. The quantization was performed to improve efficiency while maintaining strong performance, making the model suitable for low-memory devices and accelerated inference.

For details on the original model, please see the [**Hermes 3 Technical Report**](https://arxiv.org/abs/2408.11857).

### What is AWQ 4-bit Quantization?

AWQ (Activation-aware Weight Quantization) is a post-training quantization technique designed to optimize large language models for efficient inference while minimizing performance loss. The **4-bit AWQ** version of this model:

- **Reduces memory footprint**, enabling deployment on lower-end hardware such as consumer GPUs and edge devices (a back-of-the-envelope sketch follows this list).
- **Speeds up inference**, since 4-bit weights cut the memory bandwidth needed during generation.
- **Preserves performance**, as AWQ selectively protects the weights most sensitive to activations, keeping the loss in capability minimal.

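As a rough illustration of the memory claim, the arithmetic below compares weight storage for a ~3B-parameter model at fp16 versus 4-bit. The parameter count and the group-size overhead are illustrative assumptions, not measured numbers for this checkpoint:

```python
# Back-of-the-envelope weight-memory estimate (illustrative, not measured).
params = 3.2e9                 # assumed parameter count for a Llama-3.2-3B model

fp16_gb = params * 2 / 1e9     # fp16 stores 2 bytes per weight
int4_gb = params * 0.5 / 1e9   # 4-bit stores 0.5 bytes per weight

# AWQ additionally keeps per-group fp16 scales/zero-points; with a typical
# group size of 128 this adds only a few percent on top of the 4-bit weights.
print(f"fp16 weights:  ~{fp16_gb:.1f} GB")   # ~6.4 GB
print(f"4-bit weights: ~{int4_gb:.1f} GB")   # ~1.6 GB (plus small overhead)
```
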
## Base Model Information

Hermes 3 3B is a generalist language model fine-tuned from **Llama-3.2 3B**, with improvements in:
- Reasoning
- Roleplaying
- Function calling & structured outputs
- Multi-turn conversation
- Long-context coherence

This quantized version retains these enhancements while offering better efficiency.

## Performance Benchmarks

The original **Hermes 3 3B** model achieved strong results on a range of benchmarks. The AWQ-quantized version maintains most of that accuracy, though small regressions can occur due to the quantization process. For benchmark numbers, refer to the original model card.

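To spot-check the quantized weights yourself, one common option is EleutherAI's lm-evaluation-harness. The snippet below is a minimal sketch assuming `lm-eval` is installed; the task names and model path are placeholders, not the benchmark suite from the report:

```python
# Minimal sketch: spot-check the quantized model with lm-evaluation-harness
# (pip install lm-eval). Tasks and model path are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your_model_path",
    tasks=["arc_challenge", "hellaswag"],
    num_fewshot=0,
)
print(results["results"])
```
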
## Prompt Format

This model follows **ChatML formatting**, similar to OpenAI's API prompt structure. Example:

```python
messages = [
    {"role": "system", "content": "You are Hermes 3."},
    {"role": "user", "content": "Hello, who are you?"}
]
# apply_chat_template returns a tensor of input ids here (not a dict),
# so pass it to generate() positionally rather than unpacking it.
gen_input = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(gen_input, max_new_tokens=256)
```

For more details, see the [Hermes 3 documentation](https://huggingface.co/NousResearch/Hermes-3-Llama-3.2-3B).

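For reference, `apply_chat_template` renders the messages above into ChatML special tokens. The string below shows roughly what that rendered prompt looks like; it illustrates the format rather than being dumped from the tokenizer config:

```python
# Roughly what apply_chat_template renders for the messages above (ChatML):
prompt = (
    "<|im_start|>system\n"
    "You are Hermes 3.<|im_end|>\n"
    "<|im_start|>user\n"
    "Hello, who are you?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
```
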
## Inference with AWQ 4-bit Model

To use this quantized model efficiently, load it with **AutoAWQ** or **transformers**:

```python
from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM  # installed via `pip install autoawq`

model_path = "your_model_path"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoAWQForCausalLM.from_quantized(model_path, fuse_layers=True)

# ChatML-formatted prompt; note the newline after the assistant tag.
prompt = "<|im_start|>user\nHello! How are you?<|im_end|>\n<|im_start|>assistant\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=256)
response = tokenizer.decode(output[0], skip_special_tokens=True)
print(response)
```

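Alternatively, recent transformers versions can load AWQ checkpoints directly, with `autoawq` installed as the backend. A minimal sketch, assuming the quantization config ships inside the checkpoint:

```python
# Alternative: load the AWQ checkpoint directly through transformers.
# Requires `pip install autoawq`; the quantization config is read from
# the checkpoint itself, so no extra quantization arguments are needed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "your_model_path"  # placeholder, as in the example above
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
```
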
## Quantized Model Use Cases

- Running LLMs on lower-end **consumer GPUs** (e.g., RTX 3060, 4060, etc.)
- **Faster inference** with minimal degradation in quality
- **Edge computing & on-device AI** with constrained resources
- **Cloud inference** with a better performance/cost ratio

## Limitations & Considerations

- **Minor accuracy loss** due to 4-bit quantization (slightly less precise responses in rare cases)
- **Reduced numerical precision**: the lower computational overhead comes at the expense of some fine-grained detail
- **Best suited for inference** rather than fine-tuning or continued training

## Citation

If you use this model, please cite the original **Hermes 3 Technical Report**:

```bibtex
@misc{teknium2024hermes3technicalreport,
      title={Hermes 3 Technical Report},
      author={Ryan Teknium and Jeffrey Quesnelle and Chen Guang},
      year={2024},
      eprint={2408.11857},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.11857},
}
```

## Acknowledgments

This quantization was performed using the **AWQ** method for LLM optimization. The base model was developed by Nous Research, and quantization was applied to enhance deployment efficiency while preserving model quality.

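For readers who want to reproduce a similar quantization, the sketch below follows AutoAWQ's standard quantize-and-save flow. The config values shown (4-bit, group size 128, GEMM kernel) are common defaults and an assumption, not necessarily the exact settings used for this checkpoint:

```python
# Sketch: producing a 4-bit AWQ checkpoint with AutoAWQ.
# quant_config values are common defaults, assumed rather than confirmed.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

base_path = "NousResearch/Hermes-3-Llama-3.2-3B"
quant_path = "Hermes-3-Llama-3.2-3B-AWQ-4bit"

model = AutoAWQForCausalLM.from_pretrained(base_path)
tokenizer = AutoTokenizer.from_pretrained(base_path)

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)  # runs activation-aware calibration

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```
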
For further details, refer to [Nous Research](https://huggingface.co/NousResearch) and the [Hermes 3 models](https://huggingface.co/NousResearch/Hermes-3-Llama-3.2-3B).