fengwuyao committed
Commit 421f7fe · verified · 1 Parent(s): 8a9f456

Update README.md

Files changed (1)
  1. README.md +36 -18
README.md CHANGED
@@ -11,8 +11,9 @@ tags:
  This model provides a few variants of
  [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) that are ready for
  deployment on Android using the
- [LiteRT (fka TFLite) stack](https://ai.google.dev/edge/litert) and
- [MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference).
+ [LiteRT (fka TFLite) stack](https://ai.google.dev/edge/litert),
+ [MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference) and
+ [LiteRT-LM](https://github.com/google-ai-edge/LiteRT-LM).

  ## Use the models

@@ -28,6 +29,16 @@ on Colab could be much worse than on a local device.*

  ### Android

+ #### Edge Gallery App
+
+ * Download or build the [app](https://github.com/google-ai-edge/gallery?tab=readme-ov-file#-get-started-in-minutes) from GitHub.
+
+ * Install the [app](https://play.google.com/store/apps/details?id=com.google.ai.edge.gallery&pli=1) from Google Play.
+
+ * Follow the instructions in the app.
+
+ #### LLM Inference API
+
  * Download and install
    [the apk](https://github.com/google-ai-edge/mediapipe-samples/releases/latest/download/llm_inference-debug.apk).
  * Follow the instructions in the app.
@@ -45,31 +56,37 @@ Note that all benchmark stats are from a Samsung S24 Ultra with

  <table border="1">
  <tr>
- <th></th>
  <th>Backend</th>
+ <th>Quantization</th>
+ <th>Context Length</th>
  <th>Prefill (tokens/sec)</th>
  <th>Decode (tokens/sec)</th>
  <th>Time-to-first-token (sec)</th>
- <th>Memory (RSS in MB)</th>
  <th>Model size (MB)</th>
+ <th>Peak RSS Memory (MB)</th>
+ <th>GPU Memory (MB)</th>
  </tr>
  <tr>
- <td>fp32 (baseline)</td>
- <td>cpu</td>
- <td><p style="text-align: right">39.56 tk/s</p></td>
- <td><p style="text-align: right">1.43 tk/s</p></td>
- <td><p style="text-align: right">19.24 s</p></td>
- <td><p style="text-align: right">5,997 MB</p></td>
- <td><p style="text-align: right">6,794 MB</p></td>
+ <td><p style="text-align: right">CPU</p></td>
+ <td><p style="text-align: right">dynamic_int8</p></td>
+ <td><p style="text-align: right">4096</p></td>
+ <td><p style="text-align: right">166.50 tk/s</p></td>
+ <td><p style="text-align: right">26.35 tk/s</p></td>
+ <td><p style="text-align: right">6.41 s</p></td>
+ <td><p style="text-align: right">1831.43 MB</p></td>
+ <td><p style="text-align: right">2221 MB</p></td>
+ <td><p style="text-align: right">N/A</p></td>
  </tr>
  <tr>
- <td>dynamic_int8</td>
- <td>cpu</td>
- <td><p style="text-align: right">110.58 tk/s</p></td>
- <td><p style="text-align: right">12.96 tk/s</p></td>
- <td><p style="text-align: right">6.81 s</p></td>
- <td><p style="text-align: right">3,598 MB</p></td>
- <td><p style="text-align: right">1,774 MB</p></td>
+ <td><p style="text-align: right">GPU</p></td>
+ <td><p style="text-align: right">dynamic_int8</p></td>
+ <td><p style="text-align: right">4096</p></td>
+ <td><p style="text-align: right">927.54 tk/s</p></td>
+ <td><p style="text-align: right">26.98 tk/s</p></td>
+ <td><p style="text-align: right">5.46 s</p></td>
+ <td><p style="text-align: right">1831.43 MB</p></td>
+ <td><p style="text-align: right">2096 MB</p></td>
+ <td><p style="text-align: right">1659 MB</p></td>
  </tr>

  </table>
@@ -80,4 +97,5 @@ Note that all benchmark stats are from a Samsung S24 Ultra with
  * The inference on CPU is accelerated via the LiteRT
    [XNNPACK](https://github.com/google/XNNPACK) delegate with 4 threads
  * Benchmark is done assuming XNNPACK cache is enabled
+ * Benchmark is run with cache enabled and initialized. During the first run, the time to first token may differ.
  * dynamic_int8: quantized model with int8 weights and float activations.
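
For orientation, here is a minimal Kotlin sketch of how a bundle like this is typically loaded with the MediaPipe LLM Inference API referenced in the README above. It is not part of this commit: the on-device path, file name, and `maxTokens` value are illustrative assumptions, and only the publicly documented `LlmInference` surface is used.

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Minimal sketch (not from this commit): load a LiteRT .task bundle that was
// pushed to the device, then run a single synchronous generation.
fun runDeepSeekSample(context: Context): String {
    val options = LlmInference.LlmInferenceOptions.builder()
        // Hypothetical on-device path; place the downloaded .task file here.
        .setModelPath("/data/local/tmp/llm/deepseek_q8_ekv4096.task")
        // Roughly matches the 4096 context length benchmarked above.
        .setMaxTokens(4096)
        .build()

    // createFromOptions loads the model; in real code keep the instance
    // around, reuse it across prompts, and close() it when done.
    val llm = LlmInference.createFromOptions(context, options)
    return llm.generateResponse("Explain what distillation means for LLMs.")
}
```

This sketch assumes the `com.google.mediapipe:tasks-genai` Gradle artifact; the sample apk and Edge Gallery app linked in the diff remain the reference integrations.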