yunx-z committed
Commit 8d50b1e · verified · 1 Parent(s): 50e8ae6

Update src/components/tasks.py

Files changed (1)
  1. src/components/tasks.py +60 -17
src/components/tasks.py CHANGED
@@ -14,43 +14,86 @@ def render_task_descriptions():
  # Display the MLRC-BENCH information
  st.markdown("""
- ## MLRC-BENCH: Can Language Agents Solve ML Research Challenges?

- Recent advances in large language models (LLMs) have motivated a critical question in the machine learning community: can AI agents not only propose novel research ideas but also translate them into effective implementations? **MLRC-BENCH** is introduced as a new benchmark to investigate this question by rigorously evaluating the capacity of LLM-based research agents to address contemporary ML competition tasks.

  ---
- ### Benchmark Overview

- MLRC-BENCH seeks to assess AI-driven research workflows in two primary dimensions:

- - **Idea Proposal**: Generating plausible and potentially innovative methods for addressing current ML research problems.
- - **Code Implementation**: Translating these ideas into executable solutions that measurably improve performance over a baseline.

- This design contrasts with prior benchmarks that emphasize either (1) full end-to-end paper generation assessed by subjective human or LLM reviews, or (2) isolated code-generation tasks that focus on engineering challenges. By dividing the problem into idea proposal and implementation, MLRC-BENCH provides a clearer measure of how well agents can form and operationalize research insights.

  ---
- ### Evaluation Criteria

- For each agent on a given task, MLRC-BENCH measures performance relative to a **baseline** method and a **top human** benchmark. We report two primary metrics, each taken from the maximum result across all experimental runs for a task-model pair:

- - **Relative Improvement to Human**
- How effectively the agent closes the gap between the baseline and the best human solution.

- - **Absolute Improvement to Baseline**
- How much better the agent performs compared to the baseline, expressed as a percentage gain.
  ---
- ### Significance

- MLRC-BENCH emphasizes rigorous and reproducible evaluations, focusing on tasks drawn from recent machine learning conferences and workshops to ensure that tested methods are both **meaningful** and **nontrivial**. This dynamic approach allows the benchmark to grow as new competition tasks arise, enabling continuous monitoring of progress in agent-driven research. Through its emphasis on objective success criteria, MLRC-BENCH fosters the development of AI agents that more effectively balance conceptual innovation with practical impact.

  ---
- ### Future Directions

- While current results suggest that LLM-based research agents still fall short of human capabilities in creativity and code implementation, MLRC-BENCH provides a **scalable mechanism** to track and accelerate progress. As AI methods advance—and potentially branch into high-stakes domains such as healthcare and climate modeling—this benchmark could serve as a critical resource for aligning agent innovation with **reliability** and **safety**.
  """)
56
 
 
14
 
15
  # Display the MLRC-BENCH information
16
  st.markdown("""
 
17
 
18
+ # Can Language Agents Solve Machine Learning Research Challenges?
+
+ 🚀 Introducing [MLRC-BENCH](https://huggingface.co/spaces/launch/MLRC_Bench), a new benchmark suite designed to test the scientific chops of LLM-based agents on real-world machine learning (ML) research problems.

  ---

+ ## 🤖 What's the Problem?

+ While recent large language model (LLM) agents have made impressive strides in reasoning, coding, and even paper writing, current benchmarks fall short in evaluating their ability to generate **novel and effective research ideas**.

+ Most existing efforts either:
+ - Ask agents to write entire research papers, but use **subjective evaluation** (e.g., LLMs or humans judging ideas).
+ - Or evaluate agents on **Kaggle-style tasks**, which rarely require real innovation.

+ Both setups miss the mark when it comes to assessing whether LLM agents can truly **advance the ML research frontier**.

  ---
+ ## 🧪 Enter MLRC-BENCH
+
+ **MLRC-BENCH** fills this gap by evaluating agents on **real ML research competitions** hosted at NeurIPS, ECCV, and other top venues. These tasks represent cutting-edge challenges in:
+ - LLM safety
+ - Multimodal perception
+ - Few-shot learning
+ - Machine unlearning
+ - Meta learning
+ - And more!
+
+ Each task demands novel method design—not just re-implementing existing solutions.

+ ### What Makes MLRC-BENCH Unique?

+ - **Objective Evaluation**: Agents are scored on real metrics (accuracy, ROUGE, MRR, etc.)—no LLM-as-a-judge handwaving.
+ - **Compute-Constrained**: Tasks come with GPU and runtime limits, simulating real-world resource constraints.
+ - **Tamper-Proof Setup**: Agents can only modify specific parts of the starter code; test data remains hidden.
+ - **Continually Updated**: New competition tasks will be added as ML research progresses.

  ---
+ ## 📉 What Did We Find?
+
+ Despite access to top-tier LLMs like GPT-4o, Claude 3.5, and Gemini, **agents struggle**:
+
+ - The best-performing agent (Gemini under MLAB scaffolding) closes only **9.3% of the performance gap** between the baseline and the top human solution.
+ - Providing additional ideas from humans or other agents doesn't consistently help.
+ - LLMs often rate their own ideas as “innovative,” but objective metrics show they underperform.

+ 📊 **Key Insight**: There’s a clear **misalignment between subjective novelty and actual effectiveness**.

  ---
+ ## 🔬 Under the Hood
+
+ MLRC-BENCH comes with:
+ - **7 fully prepared tasks** with unified code structure.
+ - **Development & test splits** for fair comparison.
+ - **Metrics for effectiveness, efficiency (runtime), and simplicity (lines of code)**.
+ - A leaderboard showcasing normalized improvements over baselines.
+
+ > Normalized scores range from 0 (baseline) to 100 (top human performance). Scores < 0 mean agents underperform the baseline!
+
+ ---
+ ## 🧠 Why This Matters
+
+ MLRC-BENCH is a **stress test for research agents**. It doesn’t just ask “Can LLMs code?”—it asks:
+
+ > Can LLMs **propose and implement** solutions that outperform known baselines on hard problems?
+
+ If we want to build autonomous research agents that assist or even collaborate with human scientists, **benchmarks like MLRC-BENCH are essential**.
+
+ ---
+ ## 📍 Try It Yourself
+
+ Check out the tasks and get ready to submit your own agent:
+
+ 👉 We will open the link for submission in the near future. Stay tuned!

+ Let’s see if your agent can beat the benchmark!

  """)
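The scoring described above (the removed "Relative/Absolute Improvement" metrics and the new "normalized scores" note) is stated in words only; the commit does not include the formulas. Below is a minimal sketch of one plausible reading, with hypothetical helper names, not the benchmark's official implementation.

```python
def relative_improvement_to_human(agent: float, baseline: float, human: float) -> float:
    """Normalized score: 0 at the baseline, 100 at the top human solution.

    Values below 0 mean the agent underperforms the baseline. This is a
    plausible reconstruction of the metric described in the commit text,
    not the benchmark's official code.
    """
    return 100.0 * (agent - baseline) / (human - baseline)


def absolute_improvement_to_baseline(agent: float, baseline: float) -> float:
    """Percentage gain of the agent's score over the baseline score."""
    return 100.0 * (agent - baseline) / baseline


# Example with made-up numbers: baseline metric 0.60, top human 0.80, agent 0.62.
print(relative_improvement_to_human(0.62, 0.60, 0.80))   # 10.0 -> closes 10% of the gap
print(absolute_improvement_to_baseline(0.62, 0.60))      # ~3.33% gain over the baseline
```

Under this reading, the best agent result quoted above (closing 9.3% of the baseline-to-human gap) corresponds to a normalized score of about 9.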