Update README.md
README.md (CHANGED)
@@ -11,8 +11,9 @@ tags:
This model provides a few variants of
[deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) that are ready for
deployment on Android using the
- [LiteRT (fka TFLite) stack](https://ai.google.dev/edge/litert)
- [MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference)
+ [LiteRT (fka TFLite) stack](https://ai.google.dev/edge/litert),
+ [MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference) and
+ [LiteRT-LM](https://github.com/google-ai-edge/LiteRT-LM).

## Use the models

@@ -28,6 +29,16 @@ on Colab could be much worse than on a local device.*

### Android

+ #### Edge Gallery App
+
+ * Download or build the [app](https://github.com/google-ai-edge/gallery?tab=readme-ov-file#-get-started-in-minutes) from GitHub.
+
+ * Install the [app](https://play.google.com/store/apps/details?id=com.google.ai.edge.gallery&pli=1) from Google Play.
+
+ * Follow the instructions in the app.
+
+ #### LLM Inference API
+
* Download and install
[the apk](https://github.com/google-ai-edge/mediapipe-samples/releases/latest/download/llm_inference-debug.apk).
* Follow the instructions in the app.
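
For integrating the model into your own app rather than the prebuilt apk, the sketch below shows roughly how the MediaPipe LLM Inference API is wired up in Kotlin. This is a minimal sketch and not part of the model card itself: the model path and file name are assumptions (push the downloaded bundle wherever your app can read it), and option names can vary slightly across MediaPipe Tasks GenAI releases.

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Minimal sketch: load the converted LiteRT bundle and run a single prompt.
// The path and file name below are assumptions -- point them at wherever you
// placed the downloaded model (e.g. pushed via `adb push` to /data/local/tmp/llm/).
fun runDeepSeekPrompt(context: Context, prompt: String): String {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/deepseek_r1_distill_qwen_1_5b.task") // hypothetical file name
        .setMaxTokens(1024) // combined prompt + response token budget
        .build()

    // Engine creation is expensive; a real app should create it once and reuse it.
    val llm = LlmInference.createFromOptions(context, options)
    return llm.generateResponse(prompt)
}
```

The sample apk linked above wraps essentially this flow in a chat UI; for streaming output, `generateResponseAsync` with a result listener can be used instead of the blocking call.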
@@ -45,31 +56,37 @@ Note that all benchmark stats are from a Samsung S24 Ultra with

<table border="1">
<tr>
- <th></th>
  <th>Backend</th>
+ <th>Quantization</th>
+ <th>Context Length</th>
  <th>Prefill (tokens/sec)</th>
  <th>Decode (tokens/sec)</th>
  <th>Time-to-first-token (sec)</th>
- <th>Memory (RSS in MB)</th>
  <th>Model size (MB)</th>
+ <th>Peak RSS Memory (MB)</th>
+ <th>GPU Memory (MB)</th>
</tr>
<tr>
- <td>
- <td>
- <td><p style="text-align: right">
- <td><p style="text-align: right">
- <td><p style="text-align: right">
- <td><p style="text-align: right">
- <td><p style="text-align: right">
+ <td><p style="text-align: right">CPU</p></td>
+ <td><p style="text-align: right">dynamic_int8</p></td>
+ <td><p style="text-align: right">4096</p></td>
+ <td><p style="text-align: right">166.50 tk/s</p></td>
+ <td><p style="text-align: right">26.35 tk/s</p></td>
+ <td><p style="text-align: right">6.41 s</p></td>
+ <td><p style="text-align: right">1831.43 MB</p></td>
+ <td><p style="text-align: right">2221 MB</p></td>
+ <td><p style="text-align: right">N/A</p></td>
</tr>
<tr>
- <td>
- <td>
- <td><p style="text-align: right">
- <td><p style="text-align: right">
- <td><p style="text-align: right">
- <td><p style="text-align: right">
- <td><p style="text-align: right">
+ <td><p style="text-align: right">GPU</p></td>
+ <td><p style="text-align: right">dynamic_int8</p></td>
+ <td><p style="text-align: right">4096</p></td>
+ <td><p style="text-align: right">927.54 tk/s</p></td>
+ <td><p style="text-align: right">26.98 tk/s</p></td>
+ <td><p style="text-align: right">5.46 s</p></td>
+ <td><p style="text-align: right">1831.43 MB</p></td>
+ <td><p style="text-align: right">2096 MB</p></td>
+ <td><p style="text-align: right">1659 MB</p></td>
</tr>

</table>
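
A rough way to read the new table: time-to-first-token is dominated by prefilling the prompt, and everything after the first token is paced by the decode rate. The helper below is only a back-of-the-envelope sketch; the output length is made up, and the prompt length behind the benchmark's TTFT numbers is not stated in this card.

```kotlin
// Back-of-the-envelope latency estimate from the benchmark columns above.
// ttftSec already covers prefill of the benchmark prompt plus fixed overhead;
// the remaining output tokens arrive at roughly the decode rate.
fun estimateTotalSeconds(
    ttftSec: Double,            // "Time-to-first-token (sec)" column
    decodeTokensPerSec: Double, // "Decode (tokens/sec)" column
    outputTokens: Int,          // expected response length (illustrative)
): Double = ttftSec + (outputTokens - 1).coerceAtLeast(0) / decodeTokensPerSec

fun main() {
    println(estimateTotalSeconds(6.41, 26.35, 256)) // CPU row: ~16.1 s for a 256-token reply
    println(estimateTotalSeconds(5.46, 26.98, 256)) // GPU row: ~14.9 s for a 256-token reply
}
```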
@@ -80,4 +97,5 @@ Note that all benchmark stats are from a Samsung S24 Ultra with

* The inference on CPU is accelerated via the LiteRT
[XNNPACK](https://github.com/google/XNNPACK) delegate with 4 threads
* Benchmark is done assuming XNNPACK cache is enabled
+ * Benchmark is run with cache enabled and initialized. During the first run, the time to first token may differ.
* dynamic_int8: quantized model with int8 weights and float activations.
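
To make the dynamic_int8 bullet concrete: with dynamic-range quantization only the weights are stored as int8 (plus a float scale chosen at conversion time), while activations stay in floating point. The toy sketch below illustrates the arithmetic on a single dot product; it is not how the LiteRT/XNNPACK kernels are actually implemented.

```kotlin
import kotlin.math.abs
import kotlin.math.roundToInt

// Toy illustration of dynamic-range int8: weights become int8 plus one float
// scale, activations stay float32. Real kernels operate on whole tensors and
// typically use per-channel scales; this only shows the arithmetic.
fun quantizeWeights(weights: FloatArray): Pair<ByteArray, Float> {
    val scale = (weights.maxOf { abs(it) } / 127f).coerceAtLeast(1e-8f)
    val quantized = ByteArray(weights.size) { i ->
        (weights[i] / scale).roundToInt().coerceIn(-127, 127).toByte()
    }
    return quantized to scale
}

fun dotProduct(qWeights: ByteArray, scale: Float, activations: FloatArray): Float {
    var acc = 0f
    for (i in qWeights.indices) {
        acc += qWeights[i] * scale * activations[i] // dequantize the weight on the fly
    }
    return acc
}
```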