See our collection for versions of DeepSeek-R1 including GGUF & 4-bit formats.
Unsloth's r1-1776 Dynamic 2-bit Quants are selectively quantized, greatly improving accuracy over standard 1-bit/2-bit quantization.
Instructions to run this model in llama.cpp are below. You can also view more detailed instructions here: unsloth.ai/blog/deepseekr1-dynamic
- Do not forget about the `<|User|>` and `<|Assistant|>` tokens! Alternatively, use a chat template formatter (a sketch follows below). Also do not forget about `<think>\n`! Prompt format: `"<|User|>Create a Flappy Bird game in Python.<|Assistant|><think>\n"`
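One way to avoid hand-writing the special tokens is to let a tokenizer's chat template build the prompt. A minimal sketch, assuming the tokenizer hosted on the `unsloth/r1-1776` repo ships DeepSeek-R1's chat template (an assumption, not verified here):

```python
# Sketch: build the prompt with a chat template instead of hand-written tokens.
# Assumes unsloth/r1-1776 ships DeepSeek-R1's chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/r1-1776")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Create a Flappy Bird game in Python."}],
    tokenize=False,
    add_generation_prompt=True,  # appends <|Assistant|> so the model answers next
)
print(prompt)  # expected to match the "<|User|>...<|Assistant|>" format above
```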
- Obtain the latest `llama.cpp` at https://github.com/ggerganov/llama.cpp. You can follow the build instructions below as well:
```bash
apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```
- It's best to use `--min-p 0.05` to counteract very rare token predictions - I found this to work well especially for the 1.58bit model (see the example after the sample output below).
- Download the model via:
```python
# pip install huggingface_hub hf_transfer
# import os  # Optional for faster downloading
# os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id = "unsloth/r1-1776-GGUF",
    local_dir = "r1-1776-GGUF",
    allow_patterns = ["*UD-Q2_K_XL*"],  # Select quant type Q2_K_XL for dynamic 2-bit
)
```
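If you prefer the shell, the same selective download should work through `huggingface_hub`'s CLI (a sketch; the `--include` and `--local-dir` flags are assumed to be available in your installed version):

```bash
# Sketch: same selective download as the Python snippet above.
huggingface-cli download unsloth/r1-1776-GGUF \
    --include "*UD-Q2_K_XL*" \
    --local-dir r1-1776-GGUF
```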
- Example with Q4_0 K-quantized cache. Note that `-no-cnv` disables auto conversation mode:
```bash
./llama.cpp/llama-cli \
    --model r1-1776-GGUF/UD-Q2_K_XL/r1-1776-UD-Q2_K_XL-00001-of-00005.gguf \
    --cache-type-k q4_0 \
    --threads 12 -no-cnv --prio 2 \
    --temp 0.6 \
    --ctx-size 8192 \
    --seed 3407 \
    --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|><think>\n"
```
Example output:
Okay, so I need to figure out what 1 plus 1 is. Hmm, where do I even start? I remember from school that adding numbers is pretty basic, but I want to make sure I understand it properly.
Let me think, 1 plus 1. So, I have one item and I add another one. Maybe like an apple plus another apple. If I have one apple and someone gives me another, I now have two apples. So, 1 plus 1 should be 2. That makes sense.
Wait, but sometimes math can be tricky. Could it be something else? Like, in a different number system maybe? But I think the question is straightforward, using regular numbers, not like binary or hexadecimal or anything.
I also recall that in arithmetic, addition is combining quantities. So, if you have two quantities of 1, combining them gives you a total of 2. Yeah, that seems right.
Is there a scenario where 1 plus 1 wouldn't be 2? I can't think of any...
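Following the `--min-p` recommendation above, the flag can simply be appended to the same invocation (a sketch; all other arguments are unchanged from the example):

```bash
./llama.cpp/llama-cli \
    --model r1-1776-GGUF/UD-Q2_K_XL/r1-1776-UD-Q2_K_XL-00001-of-00005.gguf \
    --cache-type-k q4_0 \
    --min-p 0.05 \
    --threads 12 -no-cnv --prio 2 \
    --temp 0.6 \
    --ctx-size 8192 \
    --seed 3407 \
    --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|><think>\n"
```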
- If you have a GPU (an RTX 4090, for example) with 24GB of VRAM, you can offload multiple layers to the GPU for faster processing. If you have multiple GPUs, you can probably offload more layers. A rough sizing sketch follows after the command.
```bash
./llama.cpp/llama-cli \
    --model r1-1776-GGUF/UD-Q2_K_XL/r1-1776-UD-Q2_K_XL-00001-of-00005.gguf \
    --cache-type-k q4_0 \
    --threads 12 -no-cnv --prio 2 \
    --n-gpu-layers 7 \
    --temp 0.6 \
    --ctx-size 8192 \
    --seed 3407 \
    --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|><think>\n"
```
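A rough way to pick `--n-gpu-layers` is to divide free VRAM by the approximate per-layer size of the quantized weights. The numbers below are assumptions for illustration (61 transformer layers for a DeepSeek-R1-style model, and the 211GB disk size from the table further down):

```python
# Back-of-the-envelope sketch for choosing --n-gpu-layers.
# Assumed values, not measured: 61 layers, 211GB of total weights.
vram_gb = 24      # free VRAM, e.g. an RTX 4090
model_gb = 211    # disk size of UD-Q2_K_XL (see table below)
n_layers = 61     # assumed DeepSeek-R1 layer count

gb_per_layer = model_gb / n_layers
offload = int(vram_gb / gb_per_layer)
print(f"~{gb_per_layer:.1f} GB/layer -> try --n-gpu-layers {offload}")
# ~3.5 GB/layer -> try --n-gpu-layers 6 (leave headroom for the KV cache)
```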
- If you want to merge the weights together, use this script:
```bash
./llama.cpp/llama-gguf-split --merge \
    r1-1776-GGUF/UD-Q2_K_XL/r1-1776-UD-Q2_K_XL-00001-of-00005.gguf \
    merged_file.gguf
```
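After merging, point `--model` at the merged file, for example:

```bash
./llama.cpp/llama-cli --model merged_file.gguf \
    --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|><think>\n"
```

Note that recent llama.cpp builds can also load multi-part GGUFs directly when given the first shard, so merging is optional.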
| Dynamic Bits | Type | Disk Size | Accuracy | Link | Details |
| --- | --- | --- | --- | --- | --- |
| 2-bit | UD-Q2_K_XL | 211GB | Better | Link | MoE all 2.5-bit. `down_proj` in MoE mixture of 3.5/2.5-bit |
| 3-bit | UD-Q3_K_XL | 298GB | Best | Link | MoE Q3_K_M. Attention parts are upcasted |
| 4-bit | UD-Q4_K_XL | 377GB | Best | Link | MoE Q4_K_M. Attention parts are upcasted |
Finetune your own Reasoning model like R1 with Unsloth!
We have a free Google Colab notebook for turning Llama 3.1 (8B) into a reasoning model: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb
✨ Finetune for Free
All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster fine-tuned model that can be exported to GGUF or vLLM, or uploaded to Hugging Face.
| Unsloth supports | Free Notebooks | Performance | Memory use |
| --- | --- | --- | --- |
| GRPO with Phi-4 (14B) | ▶️ Start on Colab | 2x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2 VL (7B) | ▶️ Start on Colab | 1.8x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |
- This Llama 3.2 conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.
R1 1776
Blog link: https://perplexity.ai/hub/blog/open-sourcing-r1-1776
R1 1776 is a DeepSeek-R1 reasoning model that has been post-trained by Perplexity AI to remove Chinese Communist Party censorship. The model provides unbiased, accurate, and factual information while maintaining high reasoning capabilities.
Evals
To ensure our model remains fully “uncensored” and capable of engaging with a broad spectrum of sensitive topics, we curated a diverse, multilingual evaluation set of over 1,000 examples that comprehensively cover such subjects. We then used human annotators as well as carefully designed LLM judges to measure the likelihood that a model will evade or provide overly sanitized responses to the queries.
We also ensured that the model’s math and reasoning abilities remained intact after the decensoring process. Evaluations on multiple benchmarks showed that our post-trained model performed on par with the base R1 model, indicating that the decensoring had no impact on its core reasoning capabilities.