Long Reasoning - an OrangeEye Collection

HuggingFaceH4/MATH

Viewer • Updated Jan 28 • 13.8k • 576 • 4

Note High school math competition problems covering several branches of mathematics, such as algebra, geometry, probability, and number theory. Train: 7500 Test: 5000 https://github.com/hendrycks/math

HuggingFaceH4/MATH-500

Viewer • Updated Nov 15, 2024 • 500 • 49.3k • 124

Note 500 questions selected from the MATH benchmark Test: 500 https://github.com/openai/prm800k

microsoft/orca-math-word-problems-200k

Viewer • Updated Mar 4, 2024 • 200k • 2.17k • 443

Note A variety of elementary math word problem sets (grade school) Train: 200K https://arxiv.org/pdf/2402.14830

openai/gsm8k

Viewer • Updated Jan 4, 2024 • 17.6k • 362k • 633

Note Grade school math word problems (easy) Train: 7473 Test: 1319 https://arxiv.org/abs/2110.14168

AI-MO/aimo-validation-aime

Viewer • Updated Jul 10, 2024 • 90 • 10.5k • 40

Note AIME 22, AIME 23, and AIME 24 Train: 90 https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions

HuggingFaceH4/aime_2024

Viewer • Updated Jan 26 • 30 • 17.9k • 21

Note 30 problems from the 2024 AIME I and AIME II tests

AI-MO/aimo-validation-amc

Viewer • Updated Jul 10, 2024 • 83 • 1.88k • 14

Note AMC12 2022 and AMC12 2023 (a math competition for students in grade 12 and below that tests the more basic high school math knowledge). Train: 83 https://artofproblemsolving.com/wiki/index.php/AMC_12_Problems_and_Solutions

opencompass/AIME2025

Viewer • Updated 16 days ago • 30 • 2.61k • 10

Note AIME 25 Train: 15

NovaSky-AI/Sky-T1_preference_data_10k

Viewer • Updated Jan 23 • 9.43k • 257 • 13

Note Preference data for decreasing generation length while preserving accuracy across domains such as mathematics, coding, science, and general knowledge. Sources: Sky-T1-32B-Preview + PRM800K (12K questions). Train: 10K https://novasky-ai.github.io/posts/reduce-overthinking/

TIGER-Lab/MMLU-Pro

Viewer • Updated Nov 27, 2024 • 12.1k • 43.8k • 330

Note Each question has ten multiple-choice options. Train: 12032

tasksource/PRM800K

Preview • Updated May 31, 2023 • 127 • 32

Note Train: 12000 Test: 500 https://github.com/openai/prm800k/tree/main

Idavidrein/gpqa

Viewer • Updated Mar 28, 2024 • 1.25k • 52.9k • 143

Note Graduate-level multiple-choice questions written by domain experts in biology, physics, and chemistry. GPQA Diamond (Rein et al., 2023) consists of 198 PhD-level science questions from biology, chemistry, and physics. Test: 448 https://github.com/idavidrein/gpqa

livecodebench/code_generation_lite

Updated Jan 14 • 32.1k • 23

Note A continuously updated code benchmark built from contests on LeetCode, AtCoder, and CodeForces.
Test:
1. release_v1: 400
2. release_v2: 511
3. release_v3: 612
4. release_v4: 713
5. release_v5: 880
https://github.com/LiveCodeBench/LiveCodeBench
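The release sizes in the note above can be turned into per-release growth. This is a minimal sketch that assumes the releases are cumulative (each release contains all problems from the earlier ones); the note does not state that explicitly, and the helper name `problems_added` is made up for illustration.

```python
# Release sizes exactly as quoted in the note above.
release_sizes = {
    "release_v1": 400,
    "release_v2": 511,
    "release_v3": 612,
    "release_v4": 713,
    "release_v5": 880,
}


def problems_added(sizes):
    """Return how many problems each release adds over the previous one.

    Assumption: releases are cumulative, so the difference between two
    consecutive totals is the number of newly added problems.
    """
    added = {}
    previous = 0
    for name, total in sizes.items():  # dicts keep insertion order
        added[name] = total - previous
        previous = total
    return added


increments = problems_added(release_sizes)
print(increments)
```

Under that assumption, evaluating on only the most recent increment is one way to limit overlap with older contest problems.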

AI-MO/NuminaMath-1.5

Viewer • Updated Feb 10 • 896k • 4.4k • 119

Note Competition-level math problems with chain-of-thought (CoT) solutions. Train: 900K

KbsdJames/Omni-MATH

Viewer • Updated Oct 12, 2024 • 4.43k • 5.28k • 87

Note 4428 competition-level problems Test: 4428 https://github.com/KbsdJames/Omni-MATH

GAIR/OlympicArena

Viewer • Updated Jul 20, 2024 • 10.6k • 1.49k • 18

Note 11,163 bilingual problems across both text-only and interleaved text-image modalities from 62 distinct Olympic competitions with 13 answer types Test: 11163 https://github.com/GAIR-NLP/OlympicArena

codeparrot/apps

Viewer • Updated Oct 20, 2022 • 20k • 8.51k • 167

Note Code generation benchmark Train: 10000

heya5/math_oai

Viewer • Updated Aug 7, 2024 • 500 • 51

Note Math eval benchmark Test: 500

svc-huggingface/minerva-math

Viewer • Updated Jan 22 • 272 • 134

Note Math eval benchmark Test: 272

Hothan/OlympiadBench

Viewer • Updated Jul 17, 2024 • 8.48k • 1.7k • 21

Note Olympiad-level bilingual multimodal scientific benchmark (math + physics) Test: 8,476 problems from Olympiad-level mathematics and physics competitions

BAAI/TACO

Updated Jun 19, 2024 • 3.62k • 102

Note TACO is a benchmark for code generation with 26,443 problems. It can be used to evaluate the ability of language models to generate code from natural language specifications. Train: 26443

GAIR/o1-journey

Viewer • Updated Oct 16, 2024 • 327 • 290 • 134

Note https://github.com/GAIR-NLP/O1-Journey

GAIR/LIMO

Viewer • Updated Feb 10 • 817 • 7.17k • 133

Note Curated mathematical reasoning data from NuminaMath-CoT, AIME, MATH Train: 817 https://github.com/GAIR-NLP/LIMO

simplescaling/s1K-1.1

Viewer • Updated 14 days ago • 1k • 7.07k • 86

Note The same 1,000 questions as s1K, but with reasoning traces generated by DeepSeek-R1 instead. Train: 1000 https://github.com/simplescaling/s1

simplescaling/data_ablation_full59K

Viewer • Updated Feb 3 • 60.4k • 2.08k • 18

Note Full 59K questions of S1: NuminaMATH, MATH, OlympicArena, OmniMath, AGIEval, xword, OlympiadBench, AIME (1983-2023), TheoremQA, USACO, JEEBench, GPQA, SciEval, s1-prob (128 Stanford statistics qualifying exams), LiveCodeBench, s1-teasers (23 interview questions for quantitative trading positions; each sample consists of a problem and solution taken from PuzzledQuant (https://www.puzzledquant.com/), and only examples with the highest difficulty level ("Hard") are used). Train: 59029

RUC-AIBOX/long_form_thought_data_5k

Viewer • Updated Dec 30, 2024 • 4.92k • 284 • 26

Note STILL-2 Train: 5K https://github.com/RUCAIBox/Slow_Thinking_with_LLMs

RUC-AIBOX/STILL-3-Preview-RL-Data

Viewer • Updated Jan 26 • 29.9k • 1.86k • 10

Note STILL-3: MATH, NuminaMathCoT, and AIME 1983-2023 Train: 30K https://github.com/RUCAIBox/Slow_Thinking_with_LLMs

bespokelabs/Bespoke-Stratos-17k

Viewer • Updated Jan 31 • 16.7k • 60.7k • 293

Note Improved the Berkeley Sky-T1 data pipeline using SFT distillation data from DeepSeek-R1 to create Bespoke-Stratos-17k

NovaSky-AI/Sky-T1_data_17k

Viewer • Updated Jan 14 • 16.4k • 1.28k • 178

Note 5k coding examples from APPS and TACO, and 10k math examples from the AIME, MATH, and Olympiads subsets of the NuminaMATH dataset. In addition, 1k science and puzzle examples are kept from STILL-2. Train: 17K https://novasky-ai.github.io/posts/sky-t1/

open-thoughts/OpenThoughts-114k

Viewer • Updated 22 days ago • 228k • 86.6k • 652

Note 114k high-quality examples covering math, science, code, and puzzles, distilled from DeepSeek-R1.
Code: 1. BAAI/TACO 2. codeparrot/apps 3. deepmind/code_contests 4. MatrixStudio/Codeforces-Python-Submissions
Math: 1. AI-MO/NuminaMath-CoT
Science: 1. camel-ai/chemistry 2. camel-ai/biology 3. camel-ai/physics
Puzzle: 1. INK-USC/riddle_sense
Train: 113957 https://github.com/open-thoughts/open-thoughts

open-r1/OpenR1-Math-220k

Viewer • Updated 23 days ago • 450k • 53k • 491

Note Distilled from DeepSeek R1 on 400k problems from NuminaMath 1.5. Train: 220K. default: 94k problems; this subset achieves the best performance after SFT. extended: 131k samples that add data sources like cn_k12; SFT performance is lower.

FreedomIntelligence/medical-o1-reasoning-SFT

Viewer • Updated 20 days ago • 50.1k • 27.9k • 445

Note Advanced medical CoT reasoning distilled from GPT-4o. Train: 25.4K https://github.com/FreedomIntelligence/HuatuoGPT-o1

open-r1/OpenThoughts-114k-math

Viewer • Updated Jan 30 • 89.1k • 1.99k • 72

Note The math subset of OpenThoughts-114k with extra metadata. Train: 89120. Of those, 56730/89120 (63.7%) have correct answers, as checked by Math-Verify.
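The share of verified answers can be recomputed from the two counts in the note. The snippet below is only an arithmetic check; the variable names are illustrative and not fields of the dataset.

```python
# Counts as quoted in the note above.
correct = 56730   # answers marked correct by Math-Verify
total = 89120     # rows in the math subset

share = correct / total
not_verified = total - correct  # rows without a verified-correct answer

print(f"{share:.1%} verified correct, {not_verified} rows not verified")
```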

EricLu/SCP-116K

Viewer • Updated Feb 7 • 117k • 916 • 76

Note High-quality undergraduate- to doctoral-level content (physics, chemistry, and biology) filtered from 6.69 million web-crawled academic documents, with solutions distilled from o1-mini and QwQ-32B-preview, along with validation flags. Train: 116,756 https://github.com/AQA6666/SCP-116K-open/tree/main

agentica-org/DeepScaleR-Preview-Dataset

Viewer • Updated Feb 10 • 40.3k • 3.66k • 78

Note Unique mathematics problem-answer pairs from:
1. AIME (American Invitational Mathematics Examination) problems (1984-2023)
2. AMC (American Mathematics Competition) problems (prior to 2023)
3. Omni-MATH dataset
4. Still dataset
Train: 40,000

math-eval/TAL-SCQ5K

Viewer • Updated Sep 15, 2023 • 10k • 395 • 54

Note English and Chinese multiple-choice math competition questions, from primary and junior high to high school level. Train: 3K Test: 2K

TIGER-Lab/WebInstructSub

Viewer • Updated Oct 27, 2024 • 2.34M • 2.3k • 146

Note 10M math- and science-related instruction data from the web (this release is partial data, coming mostly from forums like StackExchange). Train: 2.34M

TIGER-Lab/TheoremQA

Viewer • Updated May 15, 2024 • 800 • 644 • 17

Note STEM theorem-based reasoning benchmark, covering 350+ theorems spanning across Math, EE&CS, Physics and Finance. Test: 800

yentinglin/aime_2025

Viewer • Updated 26 days ago • 60 • 3.18k

Note This dataset contains 30 problems from the 2025 AIME tests, including: AIME I: 15 problems AIME II: 15 problems

GAIR/LIMR

Viewer • Updated 24 days ago • 1.39k • 494 • 23

Note 1,389 selected questions from MATH (level 3-5) Train: 1389

lmms-lab/multimodal-open-r1-8k-verified

Viewer • Updated Jan 27 • 7.69k • 3.32k • 48

Note Multimodal reasoning data generated by GPT-4o, with reasoning paths and verifiable answers, based on Math360K and Geo170K. Train: 8K

SynthLabsAI/Big-Math-RL-Verified

Viewer • Updated 7 days ago • 251k • 5.36k • 149

Note A heavily filtered collection of open-source datasets of high-quality mathematical problems. Problems have uniquely verifiable solutions, open-ended problem formulations, and closed-form solutions. An extra 47,000 problems, Big-Math-Reformulated, are open-ended questions reformulated from multiple-choice formats. Train: 251K

open-r1/codeforces-cots

Viewer • Updated about 5 hours ago • 195k • 783 • 27

Note 10k CodeForces problems Train: 10K

open-r1/ioi

Viewer • Updated about 20 hours ago • 270 • 249 • 4

Note International Olympiad in Informatics (IOI) 2020-2024 Train: 229 Test: 41

KodCode/KodCode-V1

Viewer • Updated 4 days ago • 447k • 3.02k • 63

Note A fully synthetic open-source dataset providing verifiable solutions and tests for coding tasks. Train: 444K

GeneralReasoning/GeneralThought-323K

Viewer • Updated 5 days ago • 323k • 385 • 21

Note Covers natural sciences, humanities, social sciences, and general conversations. Train: 323K
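Most notes in this collection record split sizes in a loose `Train: N Test: M` form, with optional thousands separators and `K`/`M` suffixes. As a rough sketch of how those notes could be made machine-readable, the hypothetical helper `parse_split_sizes` below extracts the sizes with a regular expression. It is an illustration only: it is not part of any Hugging Face tooling, and it does not handle notes that list several releases under one split.

```python
import re

# Multipliers for the size suffixes used in the notes (e.g. "200K", "2.34M").
_SUFFIX = {"": 1, "k": 1_000, "m": 1_000_000}

# Matches "Train: 7473", "Test: 8,476", "Train: 200K", "Train: 2.34M".
_SPLIT_RE = re.compile(r"\b(Train|Test):\s*([\d,]*\d(?:\.\d+)?)([KkMm]?)\b")


def parse_split_sizes(note):
    """Extract split sizes from a note string as {"train": int, "test": int}.

    Splits that are not mentioned are simply absent from the result.
    """
    sizes = {}
    for split, number, suffix in _SPLIT_RE.findall(note):
        value = float(number.replace(",", "")) * _SUFFIX[suffix.lower()]
        sizes[split.lower()] = int(round(value))
    return sizes


gsm8k = parse_split_sizes(
    "Grade school math word problems (easy) Train: 7473 Test: 1319 "
    "https://arxiv.org/abs/2110.14168"
)
print(gsm8k)
```

Sizes written with a suffix are approximate by nature, so `Train: 200K` parses to 200000 even when the actual row count differs slightly.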