Abstract
We provide a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings reduce the risks associated with using distillation at scale; compute allocation for both the teacher and student models can now be done to maximize student performance. We provide compute-optimal distillation recipes for when 1) a teacher exists, or 2) a teacher needs training. If many students are to be distilled, or a teacher already exists, distillation outperforms supervised pretraining until a compute level which grows predictably with student size. If only one student is to be distilled and a teacher also needs training, supervised learning should be done instead. Additionally, we provide insights across our large-scale study of distillation, which increase our understanding of distillation and inform experimental design.
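To make the allocation problem in the abstract concrete, here is a minimal sketch of how a law of this kind could be used: given a fixed FLOP budget, choose how much compute goes to training the teacher versus distilling the student so that the predicted student loss is minimized. The functional forms, coefficients, the 6*N*D compute approximation, and all function names below are illustrative assumptions, not the fitted law or code from the paper.

```python
# Illustrative sketch only: the functional forms, coefficients, and the
# 6*N*D compute approximation below are placeholder assumptions in the spirit
# of Chinchilla-style laws, NOT the fitted distillation scaling law from the paper.

import numpy as np

# Placeholder coefficients for a Chinchilla-style supervised pretraining loss.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28
K_DISTILL = 3.0  # placeholder: soft targets treated as this many times more effective data


def supervised_loss(n_params: float, n_tokens: float) -> float:
    """Pretraining cross-entropy as a function of parameters and tokens (placeholder)."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def student_distill_loss(n_student: float, d_distill: float, teacher_loss: float) -> float:
    """Toy distillation law: distillation acts like supervised training on
    K_DISTILL-times-as-many tokens, but the student is floored at the teacher's loss."""
    return max(teacher_loss, supervised_loss(n_student, K_DISTILL * d_distill))


def best_allocation(total_flops: float, n_student: float, n_teacher: float):
    """Grid-search the teacher/student compute split that minimizes the predicted
    student loss (the 'teacher needs training' setting); compute is taken as 6*N*D."""
    best_loss, best_frac = float("inf"), None
    for frac_teacher in np.linspace(0.05, 0.95, 91):
        d_teacher = frac_teacher * total_flops / (6.0 * n_teacher)        # teacher pretraining tokens
        d_student = (1 - frac_teacher) * total_flops / (6.0 * n_student)  # distillation tokens
        l_teacher = supervised_loss(n_teacher, d_teacher)
        l_student = student_distill_loss(n_student, d_student, l_teacher)
        if l_student < best_loss:
            best_loss, best_frac = l_student, frac_teacher
    return best_loss, best_frac


if __name__ == "__main__":
    # Example: 1e21 FLOPs total, a 1B-parameter student, a 7B-parameter teacher
    # (all numbers are illustrative).
    total, n_student, n_teacher = 1e21, 1e9, 7e9
    loss, frac = best_allocation(total, n_student, n_teacher)
    # Baseline: spend the whole budget on supervised pretraining of the student.
    baseline = supervised_loss(n_student, total / (6.0 * n_student))
    print(f"Best split: {frac:.0%} of compute on the teacher -> "
          f"predicted student loss {loss:.3f} (supervised baseline {baseline:.3f})")
```

Swapping the placeholder forms for the paper's fitted law turns this into the recipe the abstract promises; the structure stays the same: predict student loss as a function of the budget split, optimize the split, and compare against the supervised baseline.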
Community
Scaling laws provide a valuable framework for optimizing teacher-student compute allocation, but real-world AI deployment rarely follows a fixed equation. In areas like chatbots, search engines, and recommendation systems, models don't just train once and remain static; they evolve alongside shifting user behaviors, changing datasets, and emerging tasks. This study offers strong insights into when distillation is more efficient than supervised learning, but should we be thinking beyond one-time optimization? Could an adaptive scaling law, one that dynamically adjusts teacher-student relationships based on real-world performance and resource constraints, be a more effective approach for AI systems that operate in constantly changing environments?

Additionally, the study assumes access to large-scale training data, but many low-resource languages and specialized technical fields lack the volume necessary for traditional scaling laws to apply. Could distillation strategies be refined not just for compute efficiency but also to prioritize knowledge transfer in data-scarce settings, making AI more adaptable and inclusive across different domains?
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- On Teacher Hacking in Language Model Distillation (2025)
- Self-Evolution Knowledge Distillation for LLM-based Machine Translation (2024)
- AgentPose: Progressive Distribution Alignment via Feature Agent for Human Pose Distillation (2025)
- Revisiting Intermediate-Layer Matching in Knowledge Distillation: Layer-Selection Strategy Doesn't Matter (Much) (2025)
- Distillation and Pruning for Scalable Self-Supervised Representation-Based Speech Quality Assessment (2025)
- Scaling Laws for Upcycling Mixture-of-Experts Language Models (2025)
- Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient (2025)
Great paper! We made a deep dive video for this paper: https://www.youtube.com/watch?v=yIl-JM6SJm8. Distill or Drill?