---
license: mit
---

> a simple yet effective post-SFT method that enhances long CoT reasoning without requiring new long CoT responses

# Introduction

Here, we show the results of open-source reasoning LLMs before and after ThinkPO.

## Accuracy

| Models | Dataset | SFT | Ours (+ThinkPO) | Improv. (%) |
|:--------:|:--------:|:--------:|:--------:|:--------:|
| DeepSeek-R1-Distill-Qwen-7B (DeepSeek) | MATH500 | 87.4 | 91.2 | 4.3% |
| | AIME | 56.7 | 43.3 | -23.6% |
| | GPQA | 47.0 | 49.5 | 5.3% |
| | GSM8K | 87.2 | 87.6 | 0.5% |
| | Olympiad | 58.6 | 58.6 | 0.0% |
| Bespoke-Stratos-7B (Bespoke) | MATH500 | 84.0 | 82.8 | -1.4% |
| | AIME | 20.0 | 23.3 | 16.5% |
| | GPQA | 37.9 | 43.4 | 14.5% |
| | GSM8K | 92.9 | 93.3 | 0.4% |
| | Olympiad | 44.1 | 48.5 | 10.0% |

## Average Response Length

| Model | Dataset | SFT | Ours (+ThinkPO) | Improv. (%) |
|:--------:|:--------:|:--------:|:--------:|:--------:|
| DeepSeek-R1-Distill-Qwen-7B (DeepSeek) | MATH500 | 2577 | 3021 | 17.2% |
| | AIME | 11419 | 12875 | 12.8% |
| | GPQA | 4895 | 5604 | 14.5% |
| | GSM8K | 619 | 668 | 7.9% |
| | Olympiad | 7196 | 7383 | 2.6% |
| Bespoke-Stratos-7B (Bespoke) | MATH500 | 5696 | 6404 | 12.4% |
| | AIME | 19858 | 20079 | 1.1% |
| | GPQA | 5968 | 7301 | 22.3% |
| | GSM8K | 1404 | 1755 | 25.0% |
| | Olympiad | 11140 | 12204 | 9.6% |

---
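The Improv. (%) column in both tables is the relative change of the ThinkPO result over the SFT baseline, i.e. (ThinkPO − SFT) / SFT. A minimal sketch of that calculation (the function name is illustrative, not part of the released code):

```python
def relative_improvement(sft: float, thinkpo: float) -> float:
    """Relative change (%) of the ThinkPO score over the SFT baseline."""
    return (thinkpo - sft) / sft * 100

# DeepSeek-R1-Distill-Qwen-7B accuracy on MATH500: 87.4 -> 91.2
print(round(relative_improvement(87.4, 91.2), 1))   # 4.3

# Accuracy can also drop, e.g. AIME: 56.7 -> 43.3
print(round(relative_improvement(56.7, 43.3), 1))   # -23.6
```

The same formula reproduces the response-length table, e.g. (3021 − 2577) / 2577 ≈ 17.2% on MATH500.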