See our collection for versions of DeepSeek-R1 including GGUF & 4-bit formats.

Unsloth's r1-1776 2-bit Dynamic Quants are selectively quantized, greatly improving accuracy over standard 1-bit/2-bit quantization.

Instructions to run this model in llama.cpp:

Or you can view more detailed instructions here: unsloth.ai/blog/deepseekr1-dynamic

  1. Do not forget the <|User|> and <|Assistant|> tokens (or use a chat template formatter), and do not forget <think>\n! Prompt format: "<|User|>Create a Flappy Bird game in Python.<|Assistant|><think>\n" (a small Python helper for building this string is sketched after the example output below).
  2. Obtain the latest llama.cpp at https://github.com/ggerganov/llama.cpp. You can follow the build instructions below as well:
apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
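After the build finishes, you can optionally confirm that the three binaries were built and copied into the llama.cpp folder. This is only a minimal sanity-check sketch; it assumes you ran the commands above from the same working directory:
from pathlib import Path

# Binaries named in the --target list above and copied by the final cp command
required = ["llama-cli", "llama-quantize", "llama-gguf-split"]
missing = [name for name in required if not (Path("llama.cpp") / name).exists()]

if missing:
    raise SystemExit(f"Build incomplete, missing: {missing}")
print("llama.cpp binaries are ready.")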
  3. It's best to use --min-p 0.05 to counteract very rare token predictions - I found this to work well especially for the 1.58bit model.
  4. Download the model via:
# pip install huggingface_hub hf_transfer
# import os # Optional for faster downloading
# os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download
snapshot_download(
  repo_id = "unsloth/r1-1776-GGUF",
  local_dir = "r1-1776-GGUF",
  allow_patterns = ["*UD-Q2_K_XL*"], # Select quant type Q2_K_XL for dynamic 2bit
)
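Once the download finishes, a quick check that all five shards of the 2-bit quant are present can save a failed run later. This is a small sketch only; the directory and shard filenames follow the pattern used in the run command below:
from pathlib import Path

# The UD-Q2_K_XL quant is split into 5 GGUF shards (see the run command below)
shard_dir = Path("r1-1776-GGUF/UD-Q2_K_XL")
shards = sorted(shard_dir.glob("r1-1776-UD-Q2_K_XL-*-of-00005.gguf"))

assert len(shards) == 5, f"Expected 5 shards, found {len(shards)}"
print("All shards present:", [s.name for s in shards])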
  5. Run the model. The example below uses a Q4_0-quantized K cache; note that -no-cnv disables auto conversation mode:
   ./llama.cpp/llama-cli \
      --model r1-1776-GGUF/UD-Q2_K_XL/r1-1776-UD-Q2_K_XL-00001-of-00005.gguf \
      --cache-type-k q4_0 \
      --threads 12 -no-cnv --prio 2 \
      --temp 0.6 \
      --ctx-size 8192 \
      --seed 3407 \
      --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|><think>\n"

Example output:

 Okay, so I need to figure out what 1 plus 1 is. Hmm, where do I even start? I remember from school that adding numbers is pretty basic, but I want to make sure I understand it properly.
 Let me think, 1 plus 1. So, I have one item and I add another one. Maybe like a apple plus another apple. If I have one apple and someone gives me another, I now have two apples. So, 1 plus 1 should be 2. That makes sense.
 Wait, but sometimes math can be tricky. Could it be something else? Like, in a different number system maybe? But I think the question is straightforward, using regular numbers, not like binary or hexadecimal or anything.
 I also recall that in arithmetic, addition is combining quantities. So, if you have two quantities of 1, combining them gives you a total of 2. Yeah, that seems right.
 Is there a scenario where 1 plus 1 wouldn't be 2? I can't think of any...
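If you prefer to drive llama-cli from Python, the sketch below builds the prompt in the format from step 1 and launches the same command as the example above. It is an illustrative wrapper only; the flags mirror the example (with --min-p 0.05 from step 3 added) and are not the only valid settings:
import subprocess

def r1_prompt(user_message: str) -> str:
    # Format from step 1: <|User|> ... <|Assistant|> followed by <think>\n
    return f"<|User|>{user_message}<|Assistant|><think>\n"

subprocess.run([
    "./llama.cpp/llama-cli",
    "--model", "r1-1776-GGUF/UD-Q2_K_XL/r1-1776-UD-Q2_K_XL-00001-of-00005.gguf",
    "--cache-type-k", "q4_0",
    "--threads", "12", "-no-cnv", "--prio", "2",
    "--temp", "0.6",
    "--min-p", "0.05",  # see step 3
    "--ctx-size", "8192",
    "--seed", "3407",
    "--prompt", r1_prompt("Create a Flappy Bird game in Python."),
], check=True)
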
  6. If you have a GPU (an RTX 4090 with 24GB of VRAM, for example), you can offload multiple layers to the GPU for faster processing. If you have multiple GPUs, you can probably offload more layers.
  ./llama.cpp/llama-cli \
    --model r1-1776-GGUF/UD-Q2_K_XL/r1-1776-UD-Q2_K_XL-00001-of-00005.gguf \
    --cache-type-k q4_0 \
    --threads 12 -no-cnv --prio 2 \
    --n-gpu-layers 7 \
    --temp 0.6 \
    --ctx-size 8192 \
    --seed 3407 \
    --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|><think>\n"
  7. If you want to merge the weights together, use this script:
./llama.cpp/llama-gguf-split --merge \
    r1-1776-GGUF/UD-Q2_K_XL/r1-1776-UD-Q2_K_XL-00001-of-00005.gguf \
    merged_file.gguf
| Dynamic Bits | Type | Disk Size | Accuracy | Link | Details |
|---|---|---|---|---|---|
| 2-bit | UD-Q2_K_XL | 211GB | Better | Link | MoE all 2.5-bit. down_proj in MoE mixture of 3.5/2.5-bit |
| 3-bit | UD-Q3_K_XL | 298GB | Best | Link | MoE Q3_K_M. Attention parts are upcasted |
| 4-bit | UD-Q4_K_XL | 377GB | Best | Link | MoE Q4_K_M. Attention parts are upcasted |
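
As a rough guide for choosing between these, the sketch below picks the largest dynamic quant that fits in your free disk space. The sizes are taken from the table above; the selection logic is illustrative, not an official recommendation:
import shutil

# Disk sizes from the table above (GB); larger quants are more accurate
QUANT_SIZES_GB = {
    "UD-Q2_K_XL": 211,
    "UD-Q3_K_XL": 298,
    "UD-Q4_K_XL": 377,
}

free_gb = shutil.disk_usage(".").free / 1e9
candidates = [q for q, size in QUANT_SIZES_GB.items() if size < free_gb]
if candidates:
    # Pick the largest quant that fits
    best = max(candidates, key=QUANT_SIZES_GB.get)
    print(f"{free_gb:.0f}GB free -> download {best}")
else:
    print(f"{free_gb:.0f}GB free -> not enough space for any dynamic quant")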

Finetune your own Reasoning model like R1 with Unsloth!

We have a free Google Colab notebook for turning Llama 3.1 (8B) into a reasoning model: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb

✨ Finetune for Free

All notebooks are beginner-friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF or vLLM, or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| GRPO with Phi-4 (14B) | ▶️ Start on Colab | 2x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2 VL (7B) | ▶️ Start on Colab | 1.8x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |

R1 1776

Blog link: https://perplexity.ai/hub/blog/open-sourcing-r1-1776

R1 1776 is a DeepSeek-R1 reasoning model that has been post-trained by Perplexity AI to remove Chinese Communist Party censorship. The model provides unbiased, accurate, and factual information while maintaining high reasoning capabilities.

Evals

To ensure our model remains fully “uncensored” and capable of engaging with a broad spectrum of sensitive topics, we curated a diverse, multilingual evaluation set of over 1,000 examples that comprehensively cover such subjects. We then used human annotators as well as carefully designed LLM judges to measure the likelihood that a model will evade or provide overly sanitized responses to the queries.

We also ensured that the model’s math and reasoning abilities remained intact after the decensoring process. Evaluations on multiple benchmarks showed that our post-trained model performed on par with the base R1 model, indicating that the decensoring had no impact on its core reasoning capabilities.
