Great Models Think Alike and this Undermines AI Oversight Paper • 2502.04313 • Published 5 days ago • 24
ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities Paper • 2412.06745 • Published Dec 9, 2024 • 6
MALT: Improving Reasoning with Multi-Agent LLM Training Paper • 2412.01928 • Published Dec 2, 2024 • 40
Data Contamination Report from the 2024 CONDA Shared Task Paper • 2407.21530 • Published Jul 31, 2024 • 10
Wu's Method can Boost Symbolic AI to Rival Silver Medalists and AlphaGeometry to Outperform Gold Medalists at IMO Geometry Paper • 2404.06405 • Published Apr 9, 2024 • 2
No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance Paper • 2404.04125 • Published Apr 4, 2024 • 28
Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress Paper • 2402.19472 • Published Feb 29, 2024 • 2
Rapid Adaptation in Online Continual Learning: Are We Evaluating It Right? Paper • 2305.09275 • Published May 16, 2023 • 1