yunx-z committed
Commit 8d50b1e · verified · 1 Parent(s): 50e8ae6

Update src/components/tasks.py

Files changed (1)
  1. src/components/tasks.py +60 -17
src/components/tasks.py CHANGED
@@ -14,43 +14,86 @@ def render_task_descriptions():
  # Display the MLRC-BENCH information
  st.markdown("""
- ## MLRC-BENCH: Can Language Agents Solve ML Research Challenges?

- Recent advances in large language models (LLMs) have motivated a critical question in the machine learning community: can AI agents not only propose novel research ideas but also translate them into effective implementations? **MLRC-BENCH** is introduced as a new benchmark to investigate this question by rigorously evaluating the capacity of LLM-based research agents to address contemporary ML competition tasks.

  ---
- ### Benchmark Overview

- MLRC-BENCH seeks to assess AI-driven research workflows in two primary dimensions:

- - **Idea Proposal**: Generating plausible and potentially innovative methods for addressing current ML research problems.
- - **Code Implementation**: Translating these ideas into executable solutions that measurably improve performance over a baseline.

- This design contrasts with prior benchmarks that emphasize either (1) full end-to-end paper generation assessed by subjective human or LLM reviews, or (2) isolated code-generation tasks that focus on engineering challenges. By dividing the problem into idea proposal and implementation, MLRC-BENCH provides a clearer measure of how well agents can form and operationalize research insights.

  ---
- ### Evaluation Criteria

- For each agent on a given task, MLRC-BENCH measures performance relative to a **baseline** method and a **top human** benchmark. We report two primary metrics, each taken from the maximum result across all experimental runs for a task-model pair:

- - **Relative Improvement to Human**
- How effectively the agent closes the gap between the baseline and the best human solution.

- - **Absolute Improvement to Baseline**
- How much better the agent performs compared to the baseline, expressed as a percentage gain.
  ---
- ### Significance

- MLRC-BENCH emphasizes rigorous and reproducible evaluations, focusing on tasks drawn from recent machine learning conferences and workshops to ensure that tested methods are both **meaningful** and **nontrivial**. This dynamic approach allows the benchmark to grow as new competition tasks arise, enabling continuous monitoring of progress in agent-driven research. Through its emphasis on objective success criteria, MLRC-BENCH fosters the development of AI agents that more effectively balance conceptual innovation with practical impact.

  ---
- ### Future Directions

- While current results suggest that LLM-based research agents still fall short of human capabilities in creativity and code implementation, MLRC-BENCH provides a **scalable mechanism** to track and accelerate progress. As AI methods advance—and potentially branch into high-stakes domains such as healthcare and climate modeling—this benchmark could serve as a critical resource for aligning agent innovation with **reliability** and **safety**.
  """)
56
 
 
14
 
15
  # Display the MLRC-BENCH information
16
  st.markdown("""
 
17
 
18
+ # Can Language Agents Solve Machine Learning Research Challenges?
+
+ 🚀 Introducing [MLRC-BENCH](https://huggingface.co/spaces/launch/MLRC_Bench), a new benchmark suite designed to test the scientific chops of LLM-based agents on real-world machine learning (ML) research problems.

  ---

+ ## 🤖 What's the Problem?

+ While recent large language model (LLM) agents have made impressive strides in reasoning, coding, and even paper writing, current benchmarks fall short in evaluating their ability to generate **novel and effective research ideas**.

+ Most existing efforts either:
+ - Ask agents to write entire research papers, but use **subjective evaluation** (e.g., LLMs or humans judging ideas).
+ - Or evaluate agents on **Kaggle-style tasks**, which rarely require real innovation.

+ Both setups miss the mark when it comes to assessing whether LLM agents can truly **advance the ML research frontier**.

  ---
+ ## 🧪 Enter MLRC-BENCH
+
+ **MLRC-BENCH** fills this gap by evaluating agents on **real ML research competitions** hosted at NeurIPS, ECCV, and other top venues. These tasks represent cutting-edge challenges in:
+ - LLM safety
+ - Multimodal perception
+ - Few-shot learning
+ - Machine unlearning
+ - Meta learning
+ - And more!
+
+ Each task demands novel method design—not just re-implementing existing solutions.

+ ### What Makes MLRC-BENCH Unique?

+ - **Objective Evaluation**: Agents are scored on real metrics (accuracy, ROUGE, MRR, etc.)—no LLM-as-a-judge handwaving.
+ - **Compute-Constrained**: Tasks come with GPU and runtime limits, simulating real-world resource constraints.
+ - **Tamper-Proof Setup**: Agents can only modify specific parts of the starter code; test data remains hidden.
+ - **Continually Updated**: New competition tasks will be added as ML research progresses.

  ---
+ ## 📉 What Did We Find?
+
+ Despite access to top-tier LLMs like GPT-4o, Claude 3.5, and Gemini, **agents struggle**:
+
+ - The best-performing agent (Gemini under MLAB scaffolding) closes only **9.3% of the performance gap** between the baseline and the top human solution.
+ - Providing additional ideas from humans or other agents doesn't consistently help.
+ - LLMs often rate their own ideas as “innovative,” but objective metrics show they underperform.

+ 📊 **Key Insight**: There’s a clear **misalignment between subjective novelty and actual effectiveness**.

  ---
+ ## 🔬 Under the Hood
+
+ MLRC-BENCH comes with:
+ - **7 fully prepared tasks** with unified code structure.
+ - **Development & test splits** for fair comparison.
+ - **Metrics for effectiveness, efficiency (runtime), and simplicity (lines of code)**.
+ - A leaderboard showcasing normalized improvements over baselines.
+
+ > Normalized scores range from 0 (baseline) to 100 (top human performance). Scores < 0 mean agents underperform the baseline!
+
+ ---
+ ## 🧠 Why This Matters
+
+ MLRC-BENCH is a **stress test for research agents**. It doesn’t just ask “Can LLMs code?”—it asks:
+
+ > Can LLMs **propose and implement** solutions that outperform known baselines on hard problems?
+
+ If we want to build autonomous research agents that assist or even collaborate with human scientists, **benchmarks like MLRC-BENCH are essential**.
+
+ ---
+ ## 📍 Try It Yourself
+
+ Check out the tasks and get ready to submit your own agent:
+
+ 👉 We will open the link for submission in the near future. Stay tuned!

+ Let’s see if your agent can beat the benchmark!

  """)
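The scoring described above (the removed "Relative/Absolute Improvement" metrics and the new "normalized scores" note) is stated in words only; the commit does not include the formulas. Below is a minimal sketch of one plausible reading, with hypothetical helper names, not the benchmark's official implementation.

```python
def relative_improvement_to_human(agent: float, baseline: float, human: float) -> float:
    """Normalized score: 0 at the baseline, 100 at the top human solution.

    Values below 0 mean the agent underperforms the baseline. This is a
    plausible reconstruction of the metric described in the commit text,
    not the benchmark's official code.
    """
    return 100.0 * (agent - baseline) / (human - baseline)


def absolute_improvement_to_baseline(agent: float, baseline: float) -> float:
    """Percentage gain of the agent's score over the baseline score."""
    return 100.0 * (agent - baseline) / baseline


# Example with made-up numbers: baseline metric 0.60, top human 0.80, agent 0.62.
print(relative_improvement_to_human(0.62, 0.60, 0.80))   # 10.0 -> closes 10% of the gap
print(absolute_improvement_to_baseline(0.62, 0.60))      # ~3.33% gain over the baseline
```

Under this reading, the best agent result quoted above (closing 9.3% of the baseline-to-human gap) corresponds to a normalized score of about 9.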