pytorch/Qwen3-4B-8da4w


metascroy committed on May 13

Commit 3be6f2e (verified) · 1 Parent(s): 8ad2519

Update README.md

Files changed (1)

README.md +3 -3


@@ -21,11 +21,11 @@ pipeline_tag: text-generation
 [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) is quantized by the PyTorch team using [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) with 8-bit embeddings and 8-bit dynamic activations with 4-bit weight linears (8da4w).
 The model is suitable for mobile deployment with [ExecuTorch](https://github.com/pytorch/executorch).
-We provide the [quantized pte](https://huggingface.co/pytorch/Qwen3-4B-8da4w/blob/main/qwen3-4b-1024-ctx.pte) for direct use in ExecuTorch.
 (The provided pte file is exported with a max_seq_length/max_context_length of 1024; if you wish to change this, re-export the quantized model following the instructions in [Exporting to ExecuTorch](#exporting-to-executorch).)
 # Running in a mobile app
-The [pte file](https://huggingface.co/pytorch/Qwen3-4B-8da4w/blob/main/qwen3-4b-1024-ctx.pte) can be run with ExecuTorch on a mobile phone.  See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
 On iPhone 15 Pro, the model runs at [TODO: ADD] tokens/sec and uses [TODO: ADD] Mb of memory.
 [TODO: ADD SCREENSHOT]
@@ -227,7 +227,7 @@ python -m executorch.examples.models.llama.export_llama \
   --metadata '{"get_bos_id":199999, "get_eos_ids":[200020,199999]}' \
   --max_seq_length 1024 \
   --max_context_length 1024 \
-  --output_name="qwen3-4B-8da4w-1024-cxt.pte"
 ```
 After that you can run the model in a mobile app (see [Running in a mobile app](#running-in-a-mobile-app)).

 [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) is quantized by the PyTorch team using [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) with 8-bit embeddings and 8-bit dynamic activations with 4-bit weight linears (8da4w).
 The model is suitable for mobile deployment with [ExecuTorch](https://github.com/pytorch/executorch).
+We provide the [quantized pte](https://huggingface.co/pytorch/Qwen3-4B-8da4w/blob/main/qwen3-4B-8da4w-1024-cxt.pte) for direct use in ExecuTorch.
 (The provided pte file is exported with a max_seq_length/max_context_length of 1024; if you wish to change this, re-export the quantized model following the instructions in [Exporting to ExecuTorch](#exporting-to-executorch).)
 # Running in a mobile app
+The [pte file](https://huggingface.co/pytorch/Qwen3-4B-8da4w/blob/main/qwen3-4B-8da4w-1024-cxt.pte) can be run with ExecuTorch on a mobile phone.  See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
 On iPhone 15 Pro, the model runs at [TODO: ADD] tokens/sec and uses [TODO: ADD] Mb of memory.
 [TODO: ADD SCREENSHOT]
   --metadata '{"get_bos_id":199999, "get_eos_ids":[200020,199999]}' \
   --max_seq_length 1024 \
   --max_context_length 1024 \
+  --output_name="qwen3-4b-8da4w-1024-cxt.pte"
 ```
 After that you can run the model in a mobile app (see [Running in a mobile app](#running-in-a-mobile-app)).
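
The links updated in this commit point at the Hub's `blob` (web view) page for the pte file. As a minimal sketch, assuming the file is stored under the name used in the updated links (`qwen3-4B-8da4w-1024-cxt.pte`, which differs in letter case from the new `--output_name` value), the corresponding direct-download URL uses `resolve` in place of `blob`. The helper name `pte_download_url` and the constants below are illustrative and are not part of the README.

```python
from urllib.parse import quote

# Assumptions for illustration: repo id and filename are taken from the
# updated links in this diff; the helper name is hypothetical.
REPO_ID = "pytorch/Qwen3-4B-8da4w"
PTE_FILENAME = "qwen3-4B-8da4w-1024-cxt.pte"


def pte_download_url(
    repo_id: str = REPO_ID,
    filename: str = PTE_FILENAME,
    revision: str = "main",
) -> str:
    """Build the direct-download ("resolve") URL for a file in a Hub repo."""
    # "blob" URLs render an HTML page; "resolve" URLs serve the file itself.
    return (
        f"https://huggingface.co/{repo_id}/resolve/"
        f"{quote(revision, safe='')}/{quote(filename)}"
    )


if __name__ == "__main__":
    print(pte_download_url())
```

If the `huggingface_hub` package is installed, `hf_hub_download(repo_id=..., filename=...)` should fetch the same file into the local cache; verify the filename against the repository's file listing before relying on either approach.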