We’ve solved the trade-off by quantizing the DeepSeek R1 Distilled model to one-fourth of its original size without losing accuracy. Tests on an **HP Omnibook AIPC** with an **AMD Ryzen™ AI 9 HX 370 processor** showed a decoding speed of **66.40 tokens per second** and peak RAM usage of just **1228 MB** for the NexaQuant version, versus only **25.28 tokens per second** and **3788 MB RAM** for the unquantized version, while **maintaining full-precision model accuracy**.
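
If you want to sanity-check decode speed on your own machine, the snippet below is a minimal sketch using `llama-cpp-python` (a llama.cpp binding). The GGUF file name is a placeholder for whichever NexaQuant build you downloaded, and the number it prints will vary with hardware and build options; it is not the exact benchmark setup behind the figures above.

```python
# Rough decode-speed check with llama-cpp-python (pip install llama-cpp-python).
# The model path is a placeholder; point it at the NexaQuant GGUF you downloaded.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="deepseek-r1-distill-nexaquant-q4.gguf",  # placeholder file name
    n_ctx=2048,
    verbose=False,
)

prompt = "Explain, step by step, why breaking a 6x8 chocolate bar into 48 pieces takes 47 breaks."
start = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - start

completion_tokens = out["usage"]["completion_tokens"]
# End-to-end timing includes prompt processing, so this slightly understates pure decode speed.
print(f"{completion_tokens} tokens in {elapsed:.2f} s "
      f"({completion_tokens / elapsed:.2f} tok/s)")
```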
## NexaQuant Use Case Example

Here’s a comparison of how a standard Q4_K_M build and NexaQuant-4Bit handle a common investment banking brain teaser. NexaQuant keeps the answer accurate while shrinking the model file to one-fourth of its original size.

Prompt: A Common Investment Banking Brain Teaser Question

There is a 6x8 rectangular chocolate bar made up of small 1x1 bits. We want to break it into the 48 bits. We can break one piece of chocolate horizontally or vertically, but cannot break two pieces together! What is the minimum number of breaks required?

Right Answer: 47

<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66abfd6f65beb23afa427d8a/ZS9e66t7OhBIno4eQ3OaX.png" width="80%" alt="Example" />
</div>
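
A quick way to see why 47 is the only possible answer: every break splits exactly one piece into two, so each break increases the piece count by one, and going from 1 piece to 48 pieces therefore always takes 47 breaks, no matter the order. The short sketch below (plain Python, no model involved) simulates random break orders to illustrate the invariant.

```python
# Illustration of the chocolate-bar invariant: each break turns one piece into two,
# so reaching 48 unit pieces from a single 6x8 bar always takes exactly 47 breaks.
import random

def breaks_needed(width: int, height: int) -> int:
    pieces = [(width, height)]  # start with the whole bar
    breaks = 0
    while any(w * h > 1 for w, h in pieces):
        # pick any piece that can still be broken
        i = random.choice([k for k, (w, h) in enumerate(pieces) if w * h > 1])
        w, h = pieces.pop(i)
        if w > 1 and (h == 1 or random.random() < 0.5):
            cut = random.randint(1, w - 1)      # break along the width
            pieces += [(cut, h), (w - cut, h)]
        else:
            cut = random.randint(1, h - 1)      # break along the height
            pieces += [(w, cut), (w, h - cut)]
        breaks += 1
    return breaks

print({breaks_needed(6, 8) for _ in range(100)})  # prints {47}
```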
## Benchmarks

NexaQuant on reasoning benchmarks, compared to BF16 and LM Studio's Q4_K_M:

**Reasoning Capability:**

<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66abfd6f65beb23afa427d8a/Cyh1zVvDHNBT598IkLHkd.png" width="80%" alt="Reasoning benchmark results" />
</div>

General capability has also improved considerably:

**General Capability:**

| Benchmark                  | Full 16-bit | llama.cpp (4-bit) | NexaQuant (4-bit) |
|----------------------------|-------------|-------------------|-------------------|
| **HellaSwag**              | 35.81       | 34.31             | 34.60             |
| **MMLU**                   | 37.31       | 35.49             | 37.41             |
| **Humanities**             | 31.86       | 34.87             | 30.97             |
| **Social Sciences**        | 41.50       | 38.17             | 42.09             |
| **STEM**                   | 38.60       | 35.74             | 39.26             |
| **ARC Easy**               | 67.55       | 54.20             | 65.53             |
| **MathQA**                 | 41.04       | 28.51             | 39.87             |
| **PIQA**                   | 65.56       | 61.70             | 65.07             |
| **IFEval - Inst - Loose**  | 25.06       | 24.77             | 28.54             |
| **IFEval - Inst - Strict** | 23.62       | 22.94             | 27.94             |
| **IFEval - Prom - Loose**  | 13.86       | 10.29             | 15.71             |
| **IFEval - Prom - Strict** | 12.57       | 8.09              | 15.16             |
## How to run locally
NexaQuant is compatible with **Nexa-SDK**, **Ollama**, **LM Studio**, **Llama.cpp**, and any llama.cpp-based project. Below, we outline multiple ways to run the model locally; a minimal llama.cpp-style sketch is also included at the end of this section.

3. Once loaded, go to the chat window and start a conversation.
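
As a quick illustration of the llama.cpp compatibility noted above, here is a minimal sketch that loads a GGUF build with `llama-cpp-python` and runs one chat turn. The `repo_id` and `filename` pattern are placeholders, not the exact identifiers of this repository; substitute the NexaQuant repo and GGUF file you actually want, and leave generous `n_ctx`/`max_tokens` headroom since reasoning distills emit a long chain of thought before the final answer.

```python
# Minimal sketch: chat with a NexaQuant GGUF via llama-cpp-python.
# Requires: pip install llama-cpp-python huggingface_hub
# repo_id and filename are placeholders; substitute the actual NexaQuant
# repository and GGUF file name you want to run.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="NexaAIDev/DeepSeek-R1-Distill-NexaQuant",  # placeholder repo id
    filename="*q4_0.gguf",                              # placeholder file pattern
    n_ctx=4096,
    verbose=False,
)

messages = [
    {
        "role": "user",
        "content": "A 6x8 chocolate bar must be broken into 48 unit pieces. "
                   "What is the minimum number of breaks?",
    },
]
reply = llm.create_chat_completion(messages=messages, max_tokens=512)
print(reply["choices"][0]["message"]["content"])
```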
---
## What's next
1. Run inference with the NexaQuant DeepSeek-R1 distilled model on NPU.