Long Reasoning
Datasets with reasoning traces for math and code (Train + Eval)
Note: The high school math contest consists of questions covering several branches of mathematics, such as algebra, geometry, probability, and number theory. Train: 7500. Test: 5000. https://github.com/hendrycks/math
HuggingFaceH4/MATH-500
Note: 500 questions selected from the MATH benchmark. Test: 500. https://github.com/openai/prm800k
microsoft/orca-math-word-problems-200k
Note: A variety of elementary (grade school) math word problem sets. Train: 200K. https://arxiv.org/pdf/2402.14830
openai/gsm8k
Note: Grade school math word problems (easy). Train: 7473. Test: 1319. https://arxiv.org/abs/2110.14168
AI-MO/aimo-validation-aime
Note: AIME 22, AIME 23, and AIME 24. Train: 90. https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions
HuggingFaceH4/aime_2024
Note: 30 problems from the 2024 AIME I and AIME II tests.
AI-MO/aimo-validation-amc
Note: AMC12 2022 and AMC12 2023 (an elementary math competition for students in grade 12 and below that tests more basic high school math knowledge). Train: 83. https://artofproblemsolving.com/wiki/index.php/AMC_12_Problems_and_Solutions
opencompass/AIME2025
Note: AIME 25. Train: 15
NovaSky-AI/Sky-T1_preference_data_10k
Note: Aims to decrease generation length while preserving accuracy across domains such as mathematics, coding, science, and general knowledge. Sky-T1-32B-Preview + PRM800K (12K questions). Train: 10K. https://novasky-ai.github.io/posts/reduce-overthinking/
TIGER-Lab/MMLU-Pro
Note: Each question has ten multiple-choice options. Train: 12032
tasksource/PRM800K
Note: Train: 12000. Test: 500. https://github.com/openai/prm800k/tree/main
Idavidrein/gpqa
Note: Graduate-level multiple-choice questions written by domain experts in biology, physics, and chemistry. GPQA Diamond (Rein et al., 2023) consists of 198 PhD-level science questions from biology, chemistry, and physics. Test: 448. https://github.com/idavidrein/gpqa
livecodebench/code_generation_lite
Note: A continuously updated code benchmark built from contests on LeetCode, AtCoder, and CodeForces. Test: release_v1: 400; release_v2: 511; release_v3: 612; release_v4: 713; release_v5: 880. https://github.com/LiveCodeBench/LiveCodeBench
AI-MO/NuminaMath-1.5
Note: Competition-level math problems with chain-of-thought (CoT) solutions. Train: 900K
KbsdJames/Omni-MATH
Note: 4428 competition-level problems. Test: 4428. https://github.com/KbsdJames/Omni-MATH
GAIR/OlympicArena
Note: 11,163 bilingual problems across text-only and interleaved text-image modalities, drawn from 62 distinct Olympic competitions, with 13 answer types. Test: 11163. https://github.com/GAIR-NLP/OlympicArena
codeparrot/apps
Note: Code generation benchmark. Train: 10000
heya5/math_oai
Note: Math eval benchmark. Test: 500
svc-huggingface/minerva-math
Note: Math eval benchmark. Test: 272
Hothan/OlympiadBench
Note: Olympiad-level bilingual multimodal scientific benchmark (math + physics). Test: 8,476 problems from Olympiad-level mathematics and physics competitions.
BAAI/TACO
Note: TACO is a benchmark for code generation with 26,443 problems. It can be used to evaluate the ability of language models to generate code from natural language specifications. Train: 26443
GAIR/o1-journey
GAIR/LIMO
Note: Curated mathematical reasoning data from NuminaMath-CoT, AIME, and MATH. Train: 817. https://github.com/GAIR-NLP/LIMO
simplescaling/s1K-1.1
Note: The same 1,000 questions as in s1K, but with traces generated by DeepSeek r1 instead. Train: 1000. https://github.com/simplescaling/s1
simplescaling/data_ablation_full59K
Note: The full 59K questions of S1, drawn from NuminaMATH, MATH, OlympicArena, OmniMath, AGIEval, xword, OlympiadBench, AIME (1983-2023), TheoremQA, USACO, JEEBench, GPQA, SciEval, s1-prob (128 Stanford statistics qualifying exams), LiveCodeBench, and s1-teasers (23 interview questions for quantitative trading positions; each sample consists of a problem and solution taken from PuzzledQuant (https://www.puzzledquant.com/), and only examples with the highest difficulty level, "Hard", are included). Train: 59029
RUC-AIBOX/long_form_thought_data_5k
Note: STILL-2. Train: 5K. https://github.com/RUCAIBox/Slow_Thinking_with_LLMs
RUC-AIBOX/STILL-3-Preview-RL-Data
Note: STILL-3: MATH, NuminaMathCoT, and AIME 1983-2023. Train: 30K. https://github.com/RUCAIBox/Slow_Thinking_with_LLMs
bespokelabs/Bespoke-Stratos-17k
Note: Bespoke-Stratos-17k was created by improving the Berkeley Sky-T1 data pipeline with SFT distillation data from DeepSeek-R1.
NovaSky-AI/Sky-T1_data_17k
Note: 5k coding data from APPs and TACO, and 10k math data from the AIME, MATH, and Olympiads subsets of the NuminaMATH dataset, plus 1k science and puzzle data from STILL-2. Train: 17K. https://novasky-ai.github.io/posts/sky-t1/
open-thoughts/OpenThoughts-114k
Note: 114k high-quality examples covering math, science, code, and puzzles, distilled from DeepSeek-R1. Code: BAAI/TACO, codeparrot/apps, deepmind/code_contests, MatrixStudio/Codeforces-Python-Submissions. Math: AI-MO/NuminaMath-CoT. Science: camel-ai/chemistry, camel-ai/biology, camel-ai/physics. Puzzle: INK-USC/riddle_sense. Train: 113957. https://github.com/open-thoughts/open-thoughts
open-r1/OpenR1-Math-220k
Note: Reasoning traces distilled from DeepSeek R1 for 400k problems from NuminaMath 1.5. Train: 220K. default: 94k problems; this split achieves the best performance after SFT. extended: 131k samples that add data sources like cn_k12; SFT performance is lower.
FreedomIntelligence/medical-o1-reasoning-SFT
Note: Advanced medical CoT reasoning distilled from GPT-4o. Train: 25.4K. https://github.com/FreedomIntelligence/HuatuoGPT-o1
open-r1/OpenThoughts-114k-math
Note: The math subset of OpenThoughts-114k with extra metadata. Train: 89120. Of those, 56730/89120 (63%) have correct answers, as checked by Math-Verify.
EricLu/SCP-116K
Note: High-quality undergraduate- to doctoral-level content (physics, chemistry, and biology) filtered from 6.69 million web-crawled academic documents, with solutions distilled from o1-mini and QwQ-32B-preview, along with validation flags. Train: 116,756. https://github.com/AQA6666/SCP-116K-open/tree/main
agentica-org/DeepScaleR-Preview-Dataset
Note: Unique mathematics problem-answer pairs from: AIME (American Invitational Mathematics Examination) problems (1984-2023); AMC (American Mathematics Competition) problems (prior to 2023); the Omni-MATH dataset; the Still dataset. Train: 40,000
math-eval/TAL-SCQ5K
Note: English and Chinese multiple-choice mathematical competition questions, from primary and junior high to high school level. Train: 3K. Test: 2K
TIGER-Lab/WebInstructSub
Note: 10M math- and science-related instruction data from the web (this release is partial data, coming mostly from forums like StackExchange). Train: 2.34M
TIGER-Lab/TheoremQA
Note: STEM theorem-based reasoning benchmark covering 350+ theorems across Math, EE&CS, Physics, and Finance. Test: 800
yentinglin/aime_2025
Note: This dataset contains 30 problems from the 2025 AIME tests: 15 from AIME I and 15 from AIME II.
GAIR/LIMR
Note: 1,389 selected questions from MATH (levels 3-5). Train: 1389
lmms-lab/multimodal-open-r1-8k-verified
Note: Multimodal reasoning data generated by GPT4o with reasoning paths and verifiable answers, based on Math360K and Geo170K. Train: 8K
SynthLabsAI/Big-Math-RL-Verified
Note: A heavily filtered collection of open-source datasets of high-quality mathematical problems with uniquely verifiable solutions, open-ended problem formulations, and closed-form solutions. An extra 47,000 problems, Big-Math-Reformulated, are open-ended questions reformulated from multiple-choice formats. Train: 251K
open-r1/codeforces-cots
Note: 10k CodeForces problems. Train: 10K
open-r1/ioi
Note: International Olympiad in Informatics (IOI) 2020-2024. Train: 229. Test: 41
KodCode/KodCode-V1
Note: Fully synthetic open-source dataset providing verifiable solutions and tests for coding tasks. Train: 444K
GeneralReasoning/GeneralThought-323K
Note: Covers natural sciences, humanities, social sciences, and general conversations. Train: 323K
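
The Train/Test counts in the notes above can drift as datasets are updated on the Hub, so it may help to check them when loading. The following is a minimal sketch, not part of the original list: `EXPECTED_SPLITS`, `split_mismatches`, and `load_and_check` are hypothetical names introduced here, the two entries are copied from the notes for openai/gsm8k and HuggingFaceH4/MATH-500, and the loader assumes the Hugging Face `datasets` library is installed.

```python
from typing import Dict, Mapping, Optional, Sized, Tuple

# Split sizes copied from the notes in this list (illustrative subset only).
EXPECTED_SPLITS: Dict[str, Dict[str, int]] = {
    "openai/gsm8k": {"train": 7473, "test": 1319},
    "HuggingFaceH4/MATH-500": {"test": 500},
}


def split_mismatches(
    splits: Mapping[str, Sized], expected: Mapping[str, int]
) -> Dict[str, Tuple[int, Optional[int]]]:
    """Return {split: (expected, actual)} for every split whose size differs.

    `actual` is None when the split is missing entirely.
    """
    mismatches: Dict[str, Tuple[int, Optional[int]]] = {}
    for name, want in expected.items():
        got = len(splits[name]) if name in splits else None
        if got != want:
            mismatches[name] = (want, got)
    return mismatches


def load_and_check(repo_id: str, config: Optional[str] = None):
    """Download a dataset from the Hub and compare its split sizes to the notes.

    Hypothetical helper: requires network access and the `datasets` package,
    which is imported lazily so `split_mismatches` works without it.
    """
    from datasets import load_dataset

    splits = load_dataset(repo_id, config)  # a DatasetDict keyed by split name
    return split_mismatches(splits, EXPECTED_SPLITS[repo_id])
```

For example, `load_and_check("openai/gsm8k", "main")` would return an empty dict if the Hub copy still matches the counts listed above ("main" is one of that dataset's configurations). A non-empty result only signals that the listed counts and the current Hub revision disagree, not which one is right.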