See our collection for versions of DeepSeek-R1 including GGUF & 4-bit formats.

Unsloth's r1-1776 2-bit Dynamic Quants are selectively quantized, greatly improving accuracy over standard 1-bit/2-bit quantization.

Instructions to run this model in llama.cpp:

Or you can view more detailed instructions here: unsloth.ai/blog/deepseekr1-dynamic

  1. Do not forget the <|User|> and <|Assistant|> tokens (or use a chat template formatter), and do not forget <think>\n! Prompt format: "<|User|>Create a Flappy Bird game in Python.<|Assistant|><think>\n" (a small Python helper for building this string is sketched after the example output below).
  2. Obtain the latest llama.cpp at https://github.com/ggerganov/llama.cpp. You can follow the build instructions below as well:
apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
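After the build finishes, you can optionally confirm that the three binaries were built and copied into the llama.cpp folder. This is only a minimal sanity-check sketch; it assumes you ran the commands above from the same working directory:
from pathlib import Path

# Binaries named in the --target list above and copied by the final cp command
required = ["llama-cli", "llama-quantize", "llama-gguf-split"]
missing = [name for name in required if not (Path("llama.cpp") / name).exists()]

if missing:
    raise SystemExit(f"Build incomplete, missing: {missing}")
print("llama.cpp binaries are ready.")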
  3. It's best to use --min-p 0.05 to counteract very rare token predictions - I found this to work well especially for the 1.58bit model.
  4. Download the model via:
# pip install huggingface_hub hf_transfer
# import os # Optional for faster downloading
# os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download
snapshot_download(
  repo_id = "unsloth/r1-1776-GGUF",
  local_dir = "r1-1776-GGUF",
  allow_patterns = ["*UD-Q2_K_XL*"], # Select quant type Q2_K_XL for dynamic 2bit
)
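Once the download finishes, a quick check that all five shards of the 2-bit quant are present can save a failed run later. This is a small sketch only; the directory and shard filenames follow the pattern used in the run command below:
from pathlib import Path

# The UD-Q2_K_XL quant is split into 5 GGUF shards (see the run command below)
shard_dir = Path("r1-1776-GGUF/UD-Q2_K_XL")
shards = sorted(shard_dir.glob("r1-1776-UD-Q2_K_XL-*-of-00005.gguf"))

assert len(shards) == 5, f"Expected 5 shards, found {len(shards)}"
print("All shards present:", [s.name for s in shards])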
  5. Run the model. The example below uses a Q4_0-quantized K cache; note that -no-cnv disables auto conversation mode:
   ./llama.cpp/llama-cli \
      --model r1-1776-GGUF/UD-Q2_K_XL/r1-1776-UD-Q2_K_XL-00001-of-00005.gguf \
      --cache-type-k q4_0 \
      --threads 12 -no-cnv --prio 2 \
      --temp 0.6 \
      --ctx-size 8192 \
      --seed 3407 \
      --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|><think>\n"

Example output:

 Okay, so I need to figure out what 1 plus 1 is. Hmm, where do I even start? I remember from school that adding numbers is pretty basic, but I want to make sure I understand it properly.
 Let me think, 1 plus 1. So, I have one item and I add another one. Maybe like a apple plus another apple. If I have one apple and someone gives me another, I now have two apples. So, 1 plus 1 should be 2. That makes sense.
 Wait, but sometimes math can be tricky. Could it be something else? Like, in a different number system maybe? But I think the question is straightforward, using regular numbers, not like binary or hexadecimal or anything.
 I also recall that in arithmetic, addition is combining quantities. So, if you have two quantities of 1, combining them gives you a total of 2. Yeah, that seems right.
 Is there a scenario where 1 plus 1 wouldn't be 2? I can't think of any...
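If you prefer to drive llama-cli from Python, the sketch below builds the prompt in the format from step 1 and launches the same command as the example above. It is an illustrative wrapper only; the flags mirror the example (with --min-p 0.05 from step 3 added) and are not the only valid settings:
import subprocess

def r1_prompt(user_message: str) -> str:
    # Format from step 1: <|User|> ... <|Assistant|> followed by <think>\n
    return f"<|User|>{user_message}<|Assistant|><think>\n"

subprocess.run([
    "./llama.cpp/llama-cli",
    "--model", "r1-1776-GGUF/UD-Q2_K_XL/r1-1776-UD-Q2_K_XL-00001-of-00005.gguf",
    "--cache-type-k", "q4_0",
    "--threads", "12", "-no-cnv", "--prio", "2",
    "--temp", "0.6",
    "--min-p", "0.05",  # see step 3
    "--ctx-size", "8192",
    "--seed", "3407",
    "--prompt", r1_prompt("Create a Flappy Bird game in Python."),
], check=True)
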
  6. If you have a GPU (an RTX 4090 with 24GB of VRAM, for example), you can offload multiple layers to the GPU for faster processing. If you have multiple GPUs, you can probably offload more layers.
  ./llama.cpp/llama-cli \
    --model r1-1776-GGUF/UD-Q2_K_XL/r1-1776-UD-Q2_K_XL-00001-of-00005.gguf \
    --cache-type-k q4_0 \
    --threads 12 -no-cnv --prio 2 \
    --n-gpu-layers 7 \
    --temp 0.6 \
    --ctx-size 8192 \
    --seed 3407 \
    --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|><think>\n"
  7. If you want to merge the weights together, use this script:
./llama.cpp/llama-gguf-split --merge \
    r1-1776-GGUF/UD-Q2_K_XL/r1-1776-UD-Q2_K_XL-00001-of-00005.gguf \
    merged_file.gguf
| Dynamic Bits | Type | Disk Size | Accuracy | Link | Details |
|---|---|---|---|---|---|
| 2-bit | UD-Q2_K_XL | 211GB | Better | Link | MoE all 2.5-bit. down_proj in MoE mixture of 3.5/2.5-bit |
| 3-bit | UD-Q3_K_XL | 298GB | Best | Link | MoE Q3_K_M. Attention parts are upcasted |
| 4-bit | UD-Q4_K_XL | 377GB | Best | Link | MoE Q4_K_M. Attention parts are upcasted |
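
As a rough guide for choosing between these, the sketch below picks the largest dynamic quant that fits in your free disk space. The sizes are taken from the table above; the selection logic is illustrative, not an official recommendation:
import shutil

# Disk sizes from the table above (GB); larger quants are more accurate
QUANT_SIZES_GB = {
    "UD-Q2_K_XL": 211,
    "UD-Q3_K_XL": 298,
    "UD-Q4_K_XL": 377,
}

free_gb = shutil.disk_usage(".").free / 1e9
candidates = [q for q, size in QUANT_SIZES_GB.items() if size < free_gb]
if candidates:
    # Pick the largest quant that fits
    best = max(candidates, key=QUANT_SIZES_GB.get)
    print(f"{free_gb:.0f}GB free -> download {best}")
else:
    print(f"{free_gb:.0f}GB free -> not enough space for any dynamic quant")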

Finetune your own Reasoning model like R1 with Unsloth!

We have a free Google Colab notebook for turning Llama 3.1 (8B) into a reasoning model: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb

✨ Finetune for Free

All notebooks are beginner-friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF or vLLM, or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| GRPO with Phi-4 (14B) | ▶️ Start on Colab | 2x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2 VL (7B) | ▶️ Start on Colab | 1.8x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |

R1 1776

Blog link: https://perplexity.ai/hub/blog/open-sourcing-r1-1776

R1 1776 is a DeepSeek-R1 reasoning model that has been post-trained by Perplexity AI to remove Chinese Communist Party censorship. The model provides unbiased, accurate, and factual information while maintaining high reasoning capabilities.

Evals

To ensure our model remains fully “uncensored” and capable of engaging with a broad spectrum of sensitive topics, we curated a diverse, multilingual evaluation set of over 1,000 examples that comprehensively cover such subjects. We then used human annotators as well as carefully designed LLM judges to measure the likelihood that a model will evade or provide overly sanitized responses to the queries.

We also ensured that the model’s math and reasoning abilities remained intact after the decensoring process. Evaluations on multiple benchmarks showed that our post-trained model performed on par with the base R1 model, indicating that the decensoring had no impact on its core reasoning capabilities.
