Armeddinosaur committed
Commit a10c4ff · 1 Parent(s): 06d4ee9

Updating details

src/components/header.py CHANGED
@@ -17,6 +17,19 @@ def render_page_header():
         """,
         unsafe_allow_html=True
     )
+
+    # Add the links line separately, outside the header box
+    st.markdown(
+        f"""
+        <div class="links-bar">
+            <span class="info-item">📑 Paper</span> |
+            <a href="https://github.com/yunx-z/MLRC-Bench" target="_blank" class="link-item">💻 GitHub</a> |
+            <a href="https://huggingface.co/spaces/launch/MLRC_Bench" target="_blank" class="link-item">🤗 HuggingFace</a> |
+            <span class="info-item">Updated: March 2025</span>
+        </div>
+        """,
+        unsafe_allow_html=True
+    )
 
 def render_section_header(title):
     """
src/components/tasks.py CHANGED
@@ -10,44 +10,55 @@ def render_task_descriptions():
     Render the benchmark details section
     """
     # Display the MLRC-BENCH image
-    st.image("Assests/MLRC_Bench_overview.png", use_column_width=True)
+    st.image("Assests/MLRC_Bench_overview.png", use_container_width=True)
 
     # Display the MLRC-BENCH information
     st.markdown("""
-    # MLRC-BENCH: Can Language Agents Crack ML Research Challenges?
-
-    Recent advances in large language models (LLMs) have raised an intriguing question for the machine learning community: Can AI agents not only generate novel research ideas but also implement them effectively? A new benchmark, **MLRC-BENCH**, steps into the spotlight to answer this very question.
-
-    ## What Is MLRC-BENCH?
-
-    MLRC-BENCH is a dynamic benchmark designed to objectively evaluate whether LLM-based research agents can tackle cutting-edge ML competition tasks. Unlike previous evaluations that either focused on end-to-end paper generation or narrow engineering challenges, this benchmark splits the research workflow into two core steps:
-    - **Idea Proposal:** Generating innovative research ideas.
-    - **Code Implementation:** Translating those ideas into working, performance-improving code.
-
-    The benchmark uses tasks sourced from recent ML conferences and workshops, ensuring the problems are both impactful and non-trivial.
-
-    ## How Does It Work?
-
-    MLRC-BENCH emphasizes **objective metrics**:
-    - **Success Rate:** An agent is deemed successful if its solution improves upon a baseline by at least 5% of the margin by which the top human solution surpasses that baseline.
-    - **Performance, Efficiency & Simplicity:** Each solution is measured not only by how well it performs but also by how efficient and simple the code is. For example, an ideal solution should achieve higher performance with minimal runtime and code complexity.
-
-    Additionally, the benchmark integrates **LLM-as-a-judge evaluations** to compare subjective assessments of idea novelty with the objective performance gains. Interestingly, the study reveals a weak correlation between perceived novelty and actual performance improvements.
-
-    ## Why It Matters
-
-    The ability for AI agents to contribute to scientific discovery is both exciting and cautionary. While MLRC-BENCH demonstrates that current agents are not yet ready to match human ingenuity, it also provides a scalable framework to track progress and encourage future innovations. The insights gained from this benchmark could guide the development of safer, more effective AI research tools, particularly in high-stakes fields like healthcare, climate science, and AI safety.
-
-    ## Looking Ahead
-
-    MLRC-BENCH is built to evolve: as new ML competitions emerge, the benchmark can be updated to reflect the latest challenges. This dynamic nature ensures that it remains a relevant tool for pushing the boundaries of AI-assisted scientific research.
+    ## MLRC-BENCH: Can Language Agents Solve ML Research Challenges?
+
+    Recent advances in large language models (LLMs) have motivated a critical question in the machine learning community: can AI agents not only propose novel research ideas but also translate them into effective implementations? **MLRC-BENCH** is introduced as a new benchmark to investigate this question by rigorously evaluating the capacity of LLM-based research agents to address contemporary ML competition tasks.
+
+    ---
+
+    ### Benchmark Overview
+
+    MLRC-BENCH seeks to assess AI-driven research workflows in two primary dimensions:
+
+    - **Idea Proposal**: Generating plausible and potentially innovative methods for addressing current ML research problems.
+    - **Code Implementation**: Translating these ideas into executable solutions that measurably improve performance over a baseline.
+
+    This design contrasts with prior benchmarks that emphasize either (1) full end-to-end paper generation assessed by subjective human or LLM reviews, or (2) isolated code-generation tasks that focus on engineering challenges. By dividing the problem into idea proposal and implementation, MLRC-BENCH provides a clearer measure of how well agents can form and operationalize research insights.
+
+    ---
+
+    ### Evaluation Criteria
+
+    For each agent on a given task, MLRC-BENCH measures performance relative to a **baseline** method and a **top human** benchmark. We report two primary metrics, each taken from the maximum result across all experimental runs for a task-model pair:
+
+    - **Relative Improvement to Human**:
+      How effectively the agent closes the gap between the baseline and the best human solution.
+
+    - **Absolute Improvement to Baseline**:
+      How much better the agent performs compared to the baseline, expressed as a percentage gain.
+
+    ---
+
+    ### Significance
+
+    MLRC-BENCH emphasizes rigorous and reproducible evaluations, focusing on tasks drawn from recent machine learning conferences and workshops to ensure that tested methods are both **meaningful** and **nontrivial**. This dynamic approach allows the benchmark to grow as new competition tasks arise, enabling continuous monitoring of progress in agent-driven research. Through its emphasis on objective success criteria, MLRC-BENCH fosters the development of AI agents that more effectively balance conceptual innovation with practical impact.
+
+    ---
+
+    ### Future Directions
+
+    While current results suggest that LLM-based research agents still fall short of human capabilities in creativity and code implementation, MLRC-BENCH provides a **scalable mechanism** to track and accelerate progress. As AI methods advance—and potentially branch into high-stakes domains such as healthcare and climate modeling—this benchmark could serve as a critical resource for aligning agent innovation with **reliability** and **safety**.
+
     """)
 
     st.markdown("""
     <div class="card">
         <div class="card-title"><span class="card-title-icon">🔍</span> Tasks in the Benchmark</div>
         <p style="margin-bottom: 20px;">
-            Click on any task to learn more about the original benchmark.
+            Click on any task to learn more.
         </p>
     </div>
     """, unsafe_allow_html=True)
src/styles/components.py CHANGED
@@ -35,6 +35,41 @@ def get_container_styles():
     }}
     """
 
+def get_links_bar_styles():
+    """
+    Get CSS styles for the links bar below the header
+
+    Returns:
+        str: CSS string for links bar
+    """
+    return f"""
+    .links-bar {{
+        text-align: center;
+        padding: 10px 0;
+        margin-bottom: 20px;
+        font-size: 14px;
+    }}
+
+    .link-item {{
+        color: {dark_theme['primary']};
+        text-decoration: none;
+        transition: opacity 0.2s ease;
+        margin: 0 3px;
+        font-weight: 500;
+    }}
+
+    .link-item:hover {{
+        opacity: 0.8;
+        text-decoration: underline;
+    }}
+
+    .info-item {{
+        color: {dark_theme['text_color']};
+        margin: 0 3px;
+        font-weight: 400;
+    }}
+    """
+
 def get_card_styles():
     """
     Get CSS styles for cards
@@ -226,6 +261,7 @@ def get_all_component_styles():
     """
     styles = [
         get_container_styles(),
+        get_links_bar_styles(),
         get_card_styles(),
         get_task_card_styles(),
         get_button_styles(),
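
How the new styles reach the page is not shown in this diff; presumably `get_all_component_styles()` concatenates the registered snippets (now including the links-bar rules) into one CSS string that the app injects once per render. A minimal sketch of that wiring, under that assumption (the import path and helper name are inferred, not part of this commit):

```python
import streamlit as st

# Assumed wiring, not part of this diff: get_all_component_styles() is taken to
# return the concatenated CSS string; the import path is inferred from the
# repo layout (src/styles/components.py).
from src.styles.components import get_all_component_styles


def inject_component_styles() -> None:
    """Hypothetical helper: apply all component CSS, including the links-bar rules."""
    st.markdown(f"<style>{get_all_component_styles()}</style>", unsafe_allow_html=True)
```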