In my case I asked both models to write code. The model is good if the code passes tests. What are your prompts?
https://huggingface.co/datasets/onekq-ai/WebApp1K-Duo-React
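In case it helps, here is a minimal sketch of that pass/fail setup. The helper name and structure are my own, and you would swap in your own model call to produce `generated_code`:

```python
# Minimal sketch of a test-based code eval: run the model's code together
# with its unit tests in a subprocess; the model is judged good iff the
# tests pass (exit code 0). Helper names here are illustrative, not from
# any particular benchmark harness.
import os
import subprocess
import sys
import tempfile

def passes_tests(generated_code: str, test_code: str, timeout: int = 30) -> bool:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    finally:
        os.remove(path)

# Trivial illustration: a "generated" solution and a test for it.
ok = passes_tests("def add(a, b):\n    return a + b\n",
                  "assert add(2, 3) == 5\n")
bad = passes_tests("def add(a, b):\n    return a - b\n",
                   "assert add(2, 3) == 5\n")
```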
I know, though, that Anthropic weighs in on safety.
And their Python package too
Having AI do the refactor is a great idea though. It will be a breaking change if you switch your model from non-reasoning to reasoning.
Adding Qwen2.5-Max
My conclusion is the same. The R1 paper already reported a lower success rate for the distilled models. This is not surprising, since we cannot expect the same outcomes from a much smaller model.
Here is the problem. The small models released by frontier labs are always generic, i.e. decent but with lower performance than the flagship model on every benchmark. But we GPU deplorables often want a specialized model that is excellent at only one thing, hence the disappointment.
I guess we will have to help ourselves on this one. Distill an opinionated dataset from the flagship model to a small model of your choice, then hill climb the benchmark you care about.
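Sketching what that could look like (the `teacher` callable and file name are placeholders for your flagship model call and your own paths):

```python
# Hypothetical distillation data pipeline: query the flagship (teacher)
# model on prompts from the one domain you care about, and save
# (prompt, completion) pairs as JSONL for fine-tuning a small model.
import json

def distill_dataset(prompts, teacher, out_path):
    rows = []
    for p in prompts:
        completion = teacher(p)  # teacher = your flagship model API call
        rows.append({"prompt": p, "completion": completion})
    with open(out_path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return rows

# Stub teacher for illustration only; replace with a real API client.
stub_teacher = lambda p: f"solution for: {p}"
data = distill_dataset(["write a React test for a login form"],
                       stub_teacher, "distill.jsonl")
```

Then fine-tune the small model on the JSONL and iterate against the one benchmark you care about.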
1000% agree.
Also, reasoning models sure spit out lots of tokens. The same benchmark costs 4x or 5x the money and time to run compared with regular LLMs. Exciting times for inference players.
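A back-of-the-envelope illustration of why output-token volume dominates the bill (all numbers below are assumptions, not measurements):

```python
# Assumed per-problem output token counts and a flat output price,
# purely to show how a 5x token ratio becomes a 5x cost ratio.
tokens_per_problem = {"regular": 500, "reasoning": 2500}  # assumed
price_per_1k_output = 0.01   # assumed $/1K output tokens, same model price
n_problems = 1000            # size of the benchmark run

cost = {m: t * n_problems / 1000 * price_per_1k_output
        for m, t in tokens_per_problem.items()}
ratio = cost["reasoning"] / cost["regular"]
print(cost, ratio)  # the reasoning run costs 5x the regular run here
```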
Have you tried the distilled models of R1 (Qwen and Llama)?
+1
Also the velocity of progress. I have wanted to learn Monte Carlo Tree Search, process rewards, etc., but haven't had the time. I guess now I can skip them