fengwuyao committed
Commit 421f7fe · verified · 1 Parent(s): 8a9f456

Update README.md

Files changed (1)
  1. README.md +36 -18
README.md CHANGED
@@ -11,8 +11,9 @@ tags:
  This model provides a few variants of
  [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) that are ready for
  deployment on Android using the
- [LiteRT (fka TFLite) stack](https://ai.google.dev/edge/litert) and
- [MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference).
+ [LiteRT (fka TFLite) stack](https://ai.google.dev/edge/litert),
+ [MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference) and
+ [LiteRT-LM](https://github.com/google-ai-edge/LiteRT-LM).

  ## Use the models

@@ -28,6 +29,16 @@ on Colab could be much worse than on a local device.*

  ### Android

+ #### Edge Gallery App
+
+ * Download or build the [app](https://github.com/google-ai-edge/gallery?tab=readme-ov-file#-get-started-in-minutes) from GitHub.
+
+ * Install the [app](https://play.google.com/store/apps/details?id=com.google.ai.edge.gallery&pli=1) from Google Play.
+
+ * Follow the instructions in the app.
+
+ #### LLM Inference API
+
  * Download and install
    [the apk](https://github.com/google-ai-edge/mediapipe-samples/releases/latest/download/llm_inference-debug.apk).
  * Follow the instructions in the app.
@@ -45,31 +56,37 @@ Note that all benchmark stats are from a Samsung S24 Ultra with

  <table border="1">
  <tr>
- <th></th>
  <th>Backend</th>
+ <th>Quantization</th>
+ <th>Context Length</th>
  <th>Prefill (tokens/sec)</th>
  <th>Decode (tokens/sec)</th>
  <th>Time-to-first-token (sec)</th>
- <th>Memory (RSS in MB)</th>
  <th>Model size (MB)</th>
+ <th>Peak RSS Memory (MB)</th>
+ <th>GPU Memory (MB)</th>
  </tr>
  <tr>
- <td>fp32 (baseline)</td>
- <td>cpu</td>
- <td><p style="text-align: right">39.56 tk/s</p></td>
- <td><p style="text-align: right">1.43 tk/s</p></td>
- <td><p style="text-align: right">19.24 s</p></td>
- <td><p style="text-align: right">5,997 MB</p></td>
- <td><p style="text-align: right">6,794 MB</p></td>
+ <td><p style="text-align: right">CPU</p></td>
+ <td><p style="text-align: right">dynamic_int8</p></td>
+ <td><p style="text-align: right">4096</p></td>
+ <td><p style="text-align: right">166.50 tk/s</p></td>
+ <td><p style="text-align: right">26.35 tk/s</p></td>
+ <td><p style="text-align: right">6.41 s</p></td>
+ <td><p style="text-align: right">1831.43 MB</p></td>
+ <td><p style="text-align: right">2221 MB</p></td>
+ <td><p style="text-align: right">N/A</p></td>
  </tr>
  <tr>
- <td>dynamic_int8</td>
- <td>cpu</td>
- <td><p style="text-align: right">110.58 tk/s</p></td>
- <td><p style="text-align: right">12.96 tk/s</p></td>
- <td><p style="text-align: right">6.81 s</p></td>
- <td><p style="text-align: right">3,598 MB</p></td>
- <td><p style="text-align: right">1,774 MB</p></td>
+ <td><p style="text-align: right">GPU</p></td>
+ <td><p style="text-align: right">dynamic_int8</p></td>
+ <td><p style="text-align: right">4096</p></td>
+ <td><p style="text-align: right">927.54 tk/s</p></td>
+ <td><p style="text-align: right">26.98 tk/s</p></td>
+ <td><p style="text-align: right">5.46 s</p></td>
+ <td><p style="text-align: right">1831.43 MB</p></td>
+ <td><p style="text-align: right">2096 MB</p></td>
+ <td><p style="text-align: right">1659 MB</p></td>
  </tr>

  </table>
@@ -80,4 +97,5 @@ Note that all benchmark stats are from a Samsung S24 Ultra with
  * The inference on CPU is accelerated via the LiteRT
    [XNNPACK](https://github.com/google/XNNPACK) delegate with 4 threads
  * Benchmark is done assuming XNNPACK cache is enabled
+ * Benchmark is run with cache enabled and initialized. During the first run, the time to first token may differ.
  * dynamic_int8: quantized model with int8 weights and float activations.
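
For orientation, here is a minimal Kotlin sketch of how a bundle like this is typically loaded with the MediaPipe LLM Inference API referenced in the README above. It is not part of this commit: the on-device path, file name, and `maxTokens` value are illustrative assumptions, and only the publicly documented `LlmInference` surface is used.

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Minimal sketch (not from this commit): load a LiteRT .task bundle that was
// pushed to the device, then run a single synchronous generation.
fun runDeepSeekSample(context: Context): String {
    val options = LlmInference.LlmInferenceOptions.builder()
        // Hypothetical on-device path; place the downloaded .task file here.
        .setModelPath("/data/local/tmp/llm/deepseek_q8_ekv4096.task")
        // Roughly matches the 4096 context length benchmarked above.
        .setMaxTokens(4096)
        .build()

    // createFromOptions loads the model; in real code keep the instance
    // around, reuse it across prompts, and close() it when done.
    val llm = LlmInference.createFromOptions(context, options)
    return llm.generateResponse("Explain what distillation means for LLMs.")
}
```

This sketch assumes the `com.google.mediapipe:tasks-genai` Gradle artifact; the sample apk and Edge Gallery app linked in the diff remain the reference integrations.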