---
language:
- en
pipeline_tag: text-generation
---

# Meta-Llama-3-70B-Instruct-quantized.w8a16

## Model Overview
- **Model Architecture:** Meta-Llama-3
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Quantized:** INT8 weights
- **Release Date:** 7/2/2024
- **Version:** 1.0
- **Model Developers:** Neural Magic

Quantized version of [Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct).
It achieves an average score of 77.90% on the OpenLLM benchmark (version 1), whereas the unquantized model achieves 79.18%.

## Model Optimizations

This model was obtained by quantizing the weights of [Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) to the INT8 data type.
Only the weights of the linear operators within transformer blocks are quantized. Symmetric per-channel quantization is applied: a single linear scale per output dimension maps between the INT8 and floating-point representations of the quantized weights.
[AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) is used for quantization.
This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.

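The model itself was quantized with AutoGPTQ; purely as an illustration of the symmetric per-channel scheme described above (one scale per output row, largest-magnitude weight mapped to ±127), a minimal NumPy sketch:

```python
import numpy as np

def quantize_per_channel_int8(w: np.ndarray):
    """Symmetric per-channel INT8 weight quantization (illustrative only).

    Each output channel (row) gets its own scale, chosen so that the
    row's largest-magnitude weight maps to +/-127.
    """
    # One scale per output channel, shape (out_features, 1).
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # The linear mapping back to floating point: w_hat = q * scale.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q, scale = quantize_per_channel_int8(w)
w_hat = dequantize(q, scale)
```

The round-trip error per weight is bounded by half the channel's scale, which is why per-channel scales recover accuracy better than a single per-tensor scale.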
## Evaluation

The model was evaluated with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) using the [vLLM](https://docs.vllm.ai/en/stable/) engine.

## Accuracy

### Open LLM Leaderboard evaluation scores
| Benchmark | [Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) | Meta-Llama-3-70B-Instruct-quantized.w8a16<br>(this model) |
| :------------------: | :----------------------: | :------------------------------------------------: |
| arc-c<br>25-shot | 72.44% | 71.59% |
| hellaswag<br>10-shot | 85.54% | 85.65% |
| mmlu<br>5-shot | 80.18% | 78.69% |
| truthfulqa<br>0-shot | 62.92% | 61.94% |
| winogrande<br>5-shot | 83.19% | 83.11% |
| gsm8k<br>5-shot | 90.83% | 86.43% |
| **Average<br>Accuracy** | **79.18%** | **77.90%** |
| **Recovery** | **100%** | **98.38%** |
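
The average and recovery rows follow directly from the per-task scores in the table; a quick arithmetic check:

```python
# Per-task scores from the table above, in percent.
baseline = [72.44, 85.54, 80.18, 62.92, 83.19, 90.83]   # Meta-Llama-3-70B-Instruct
quantized = [71.59, 85.65, 78.69, 61.94, 83.11, 86.43]  # this model

avg_baseline = round(sum(baseline) / len(baseline), 2)   # average accuracy, baseline
avg_quantized = round(sum(quantized) / len(quantized), 2)  # average accuracy, quantized
recovery = round(100 * avg_quantized / avg_baseline, 2)  # % of baseline accuracy retained
```

This reproduces the 79.18% / 77.90% averages and the 98.38% recovery reported above.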