Great Models Think Alike and this Undermines AI Oversight • Paper • 2502.04313 • Published 5 days ago • 24
ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities • Paper • 2412.06745 • Published Dec 9, 2024 • 6