AlejandroOlmedo committed (verified)
Commit 891b1d8 · 1 parent: d8b88c2

Update README.md

Files changed (1): README.md (+5 −0)
README.md CHANGED
@@ -17,11 +17,16 @@ licence: license
 **This GRPO-trained model is a fine-tuned version of [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) on the [DigitalLearningGmbH/MATH-lighteval](https://huggingface.co/datasets/DigitalLearningGmbH/MATH-lighteval) dataset.**

+ GRPO is applied after the distilled R1 model is created, to further refine its reasoning capabilities. Unlike the initial distillation step, which transfers capabilities from a larger model, GRPO uses reinforcement learning to optimize the policy model by maximizing a reward signal. This fine-tuning stage is distinct from distillation and aims to boost performance on chain-of-thought and reasoning tasks.
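The group-relative idea behind GRPO can be sketched in a few lines: sample a group of completions per prompt, score each with a reward function, and standardize the rewards within the group to obtain advantages. This is a minimal illustration of the published GRPO formulation; the function name and example numbers are hypothetical, not taken from this repository.

```python
# Sketch of GRPO's group-relative advantage computation (illustrative,
# not code from this repo). For each prompt, the policy samples a group
# of G completions; each completion's advantage is its reward
# standardized within that group.

def grpo_advantages(rewards, eps=1e-8):
    """Return group-relative advantages for one group of sampled completions."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Completions scoring above the group mean get positive advantages
# (reinforced); those below the mean get negative advantages (discouraged).
print(grpo_advantages([1.0, 0.0, 0.5, 0.5]))
```

Because the baseline is the group mean rather than a learned value model, GRPO avoids training a separate critic, which is part of its appeal for reasoning-focused fine-tuning.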
+
  *Special thanks to Dongwei for fine-tuning this version of DeepSeek-R1-Distill-Qwen-7B. More information about it can be found here:*
  [https://huggingface.co/Dongwei/DeepSeek-R1-Distill-Qwen-7B-GRPO_Math](https://huggingface.co/Dongwei/DeepSeek-R1-Distill-Qwen-7B-GRPO_Math)
 I simply converted it to MLX format with 8-bit quantization for better performance on Apple Silicon Macs (M1, M2, M3, and M4 chips).
+ # Notes
+ - Tends to skim over the "thinking" process and start answering immediately, producing extremely quick but still correct answers.
+
  # Alejandroolmedo/DeepSeek-R1-Distill-Qwen-7B-GRPO_Math-8bit-mlx
  The Model [Alejandroolmedo/DeepSeek-R1-Distill-Qwen-7B-GRPO_Math-8bit-mlx](https://huggingface.co/Alejandroolmedo/DeepSeek-R1-Distill-Qwen-7B-GRPO_Math-8bit-mlx) was converted to MLX format from [Dongwei/DeepSeek-R1-Distill-Qwen-7B-GRPO_Math](https://huggingface.co/Dongwei/DeepSeek-R1-Distill-Qwen-7B-GRPO_Math) using mlx-lm version **0.20.5**.
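For reference, a conversion like this one is typically produced with mlx-lm's `convert` command. This is a sketch assuming mlx-lm ~0.20; the output path is illustrative and not taken from this commit.

```shell
pip install mlx-lm

# Convert the Hugging Face checkpoint to MLX format with 8-bit quantization
mlx_lm.convert \
    --hf-path Dongwei/DeepSeek-R1-Distill-Qwen-7B-GRPO_Math \
    --mlx-path DeepSeek-R1-Distill-Qwen-7B-GRPO_Math-8bit-mlx \
    -q --q-bits 8
```

The resulting folder can then be run on an Apple Silicon Mac with mlx-lm's `load` and `generate` helpers (`from mlx_lm import load, generate`).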