Upload README.md with huggingface_hub
README.md
ADDED
---
library_name: pytorch
license: llama2
pipeline_tag: text-generation
tags:
- llm
- generative_ai
- quantized
- android

---

# Llama-v2-7B-Chat: Optimized for Mobile Deployment

## State-of-the-art large language model useful on a variety of language understanding and generation tasks

Llama 2 is a family of LLMs. The "Chat" at the end indicates that the model is optimized for chatbot-like dialogue. The model is quantized to 4-bit weights and 16-bit activations, making it suitable for on-device deployment. For the prompt and output lengths specified below, the time to first token is the latency of Llama-PromptProcessor-Quantized, and the average time per additional token is the latency of Llama-TokenGenerator-KVCache-Quantized.

This is based on the implementation of Llama-v2-7B-Chat found
[here](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). More details on model performance
across various devices can be found [here](https://aihub.qualcomm.com/models/llama_v2_7b_chat_quantized).
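
For reference, the source float checkpoint can be run with the `transformers` library. This is a minimal sketch of the reference implementation linked above, not of the quantized on-device artifacts benchmarked below; the `meta-llama` repository is gated, so access must be requested on Hugging Face first.

```python
# Minimal sketch: run the *source* float checkpoint with transformers.
# This loads the reference implementation linked above, not the quantized
# QNN artifacts described in this card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

inputs = tokenizer("What is on-device AI?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```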

### Model Details

- **Model Type:** Text generation
- **Model Stats:**
  - Number of parameters: 7B
  - Model size: 3.6GB
  - Model-1 (Prompt Processor): Llama-PromptProcessor-Quantized
    - Max context length: 1024
    - Prompt processor input: 1024 tokens
    - Prompt processor output: 1 output token + KVCache for token generator
  - Model-2 (Token Generator): Llama-TokenGenerator-KVCache-Quantized
    - Token generator input: 1 input token + past KVCache
    - Token generator output: 1 output token + KVCache for next iteration
  - Decoding length: 1024 (1 output token + 1023 from KVCache)
  - Use: Initiate the conversation with the prompt processor, then run the token generator for each subsequent token (see the sketch after this list).
  - QNN-SDK: 2.19
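
The two-stage flow above can be summarized as a plain decode loop. This is an illustrative sketch only; `prompt_processor` and `token_generator` stand in for hypothetical wrappers around the two QNN model libraries and are not an actual published API.

```python
# Illustrative sketch of the two-stage decode flow described above.
# `prompt_processor` and `token_generator` are hypothetical wrappers
# around the two QNN model libraries, not a real published API.

EOS_TOKEN_ID = 2  # Llama 2's </s> token id

def generate(prompt_tokens, max_new_tokens, prompt_processor, token_generator):
    # Stage 1: the prompt processor consumes the (padded) prompt in one pass
    # and emits the first output token plus the KVCache it built.
    token, kv_cache = prompt_processor.run(prompt_tokens)  # up to 1024 tokens
    output = [token]

    # Stage 2: each token-generator pass consumes one token plus the past
    # KVCache and emits the next token plus the updated KVCache.
    for _ in range(max_new_tokens - 1):
        if token == EOS_TOKEN_ID:
            break
        token, kv_cache = token_generator.run(token, kv_cache)
        output.append(token)
    return output
```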

| Device | Chipset | Target Runtime | Inference Time (ms) | Peak Memory Range (MB) | Precision | Primary Compute Unit | Target Model |
|---|---|---|---|---|---|---|---|
| Samsung Galaxy S23 Ultra (Android 13) | Snapdragon® 8 Gen 2 | QNN Model Library | 117.812 ms | 66 - 238 MB | UINT16 | NPU | Llama-TokenGenerator-KVCache-Quantized |
| Samsung Galaxy S23 Ultra (Android 13) | Snapdragon® 8 Gen 2 | QNN Model Library | 2578.521 ms | 12 - 17 MB | UINT16 | NPU | Llama-PromptProcessor-Quantized |
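
As a worked example of how the two numbers above combine (illustrative arithmetic only, not an additional benchmark): the time to first token is one prompt-processor pass, and every additional token costs one token-generator pass.

```python
# End-to-end latency estimate from the table above.
prompt_latency_ms = 2578.521    # Llama-PromptProcessor-Quantized
per_token_latency_ms = 117.812  # Llama-TokenGenerator-KVCache-Quantized

new_tokens = 128  # example response length
total_ms = prompt_latency_ms + (new_tokens - 1) * per_token_latency_ms
print(f"{total_ms / 1000:.1f} s")  # ~17.5 s for a 128-token response
```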

## License
- The license for the original implementation of Llama-v2-7B-Chat can be found
[here](https://github.com/facebookresearch/llama/blob/main/LICENSE).

## References
* [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
* [Source Model Implementation](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)

## Community
* Join [our AI Hub Slack community](https://join.slack.com/t/qualcomm-ai-hub/shared_invite/zt-2dgf95loi-CXHTDRR1rvPgQWPO~ZZZJg) to collaborate, post questions, and learn more about on-device AI.
* For questions or feedback, please [reach out to us](mailto:[email protected]).

## Usage and Limitations

The model may not be used for or in connection with any of the following applications:

- Accessing essential private and public services and benefits;
- Administration of justice and democratic processes;
- Assessing or recognizing the emotional state of a person;
- Biometric and biometrics-based systems, including categorization of persons based on sensitive characteristics;
- Education and vocational training;
- Employment and workers management;
- Exploitation of the vulnerabilities of persons resulting in harmful behavior;
- General purpose social scoring;
- Law enforcement;
- Management and operation of critical infrastructure;
- Migration, asylum and border control management;
- Predictive policing;
- Real-time remote biometric identification in public spaces;
- Recommender systems of social media platforms;
- Scraping of facial images (from the internet or otherwise); and/or
- Subliminal manipulation