MLRC_Bench/src/components/tasks.py


	"""
	Task description components for the leaderboard application.
	"""
	import streamlit as st
	from src.utils.config import tasks_info
	from src.utils.task_mapping import get_display_name, get_original_name

	def render_task_descriptions():
	    """
	    Render the benchmark details section.
	    """
	    # Display the MLRC-BENCH overview image
	    st.image("Assests/MLRC_Bench_overview.png", use_column_width=True)

	    # Display the MLRC-BENCH information
	    st.markdown("""
	    ## MLRC-BENCH: Can Language Agents Solve ML Research Challenges?

	    Recent advances in large language models (LLMs) have motivated a critical question in the machine learning community: can AI agents not only propose novel research ideas but also translate them into effective implementations? MLRC-BENCH is introduced as a new benchmark to investigate this question by rigorously evaluating the capacity of LLM-based research agents to address contemporary ML competition tasks.

	    ---

	    ### Benchmark Overview

	    MLRC-BENCH seeks to assess AI-driven research workflows in two primary dimensions:

	    - Idea Proposal: Generating plausible and potentially innovative methods for addressing current ML research problems.
	    - Code Implementation: Translating these ideas into executable solutions that measurably improve performance over a baseline.

	    This design contrasts with prior benchmarks that emphasize either (1) full end-to-end paper generation assessed by subjective human or LLM reviews, or (2) isolated code-generation tasks that focus on engineering challenges. By dividing the problem into idea proposal and implementation, MLRC-BENCH provides a clearer measure of how well agents can form and operationalize research insights.

	    ---

	    ### Evaluation Criteria

	    For each agent on a given task, MLRC-BENCH measures performance relative to a baseline method and a top human benchmark. We report two primary metrics, each taken from the maximum result across all experimental runs for a task-model pair:

	    - Relative Improvement to Human: How effectively the agent closes the gap between the baseline and the best human solution.
	    - Absolute Improvement to Baseline: How much better the agent performs compared to the baseline, expressed as a percentage gain.

	    ---

	    ### Significance

	    MLRC-BENCH emphasizes rigorous and reproducible evaluations, focusing on tasks drawn from recent machine learning conferences and workshops to ensure that tested methods are both meaningful and nontrivial. This dynamic approach allows the benchmark to grow as new competition tasks arise, enabling continuous monitoring of progress in agent-driven research. Through its emphasis on objective success criteria, MLRC-BENCH fosters the development of AI agents that more effectively balance conceptual innovation with practical impact.

	    ---

	    ### Future Directions

	    While current results suggest that LLM-based research agents still fall short of human capabilities in creativity and code implementation, MLRC-BENCH provides a scalable mechanism to track and accelerate progress. As AI methods advance, and potentially branch into high-stakes domains such as healthcare and climate modeling, this benchmark could serve as a critical resource for aligning agent innovation with reliability and safety.

	    """)

	    st.markdown("""
	    <div class="card">
	    <div class="card-title"><span class="card-title-icon">🔍</span> Tasks in the Benchmark</div>
	    <p style="margin-bottom: 20px;">
	    Click on any task to learn more.
	    </p>
	    </div>
	    """, unsafe_allow_html=True)

	    # Task links mapping - using original task names
	    original_task_links = {
	        "Backdoor Trigger Recovery": "https://www.llmagentsafetycomp24.com/tracks/#backdoor_model",
	        "Machine Unlearning": "https://unlearning-challenge.github.io/",
	        "Perception Temporal Action Loc": "https://ptchallenge-workshop.github.io",
	        "Product Recommendation": "https://www.aicrowd.com/challenges/amazon-kdd-cup-23-multilingual-recommendation-challenge",
	        "Meta Learning": "https://metalearning.chalearn.org/",
	        "Llm Merging": "https://llm-merging.github.io"
	    }

	    # Update links mapping to use display names as keys
	    task_links = {get_display_name(task): link for task, link in original_task_links.items()}

	    # Create two columns
	    col1, col2 = st.columns(2)

	    # Split tasks between the two columns; with an odd count the extra task goes to the second column
	    task_items = list(tasks_info.items())
	    mid_point = len(task_items) // 2

	    with col1:
	        for task, description in task_items[:mid_point]:
	            # Fall back to a no-op anchor when no link is known for the task
	            link = task_links.get(task, "#")
	            st.markdown(f"""
	            <a href="{link}" target="_blank" style="text-decoration: none; color: inherit;">
	            <div class="task-card" style="cursor: pointer; transition: transform 0.2s, box-shadow 0.2s; padding: 12px; margin-bottom: 15px; height: auto;" onmouseover="this.style.transform='translateY(-5px)'; this.style.boxShadow='0 8px 15px rgba(0, 0, 0, 0.2)';" onmouseout="this.style.transform='translateY(0)'; this.style.boxShadow='0 4px 6px rgba(0, 0, 0, 0.15)';">
	            <div class="task-title" style="text-align: center;">{task} <span style="font-size: 14px; opacity: 0.7;">🔗</span></div>
	            </div>
	            </a>
	            """, unsafe_allow_html=True)

	    with col2:
	        for task, description in task_items[mid_point:]:
	            # Fall back to a no-op anchor when no link is known for the task
	            link = task_links.get(task, "#")
	            st.markdown(f"""
	            <a href="{link}" target="_blank" style="text-decoration: none; color: inherit;">
	            <div class="task-card" style="cursor: pointer; transition: transform 0.2s, box-shadow 0.2s; padding: 12px; margin-bottom: 15px; height: auto;" onmouseover="this.style.transform='translateY(-5px)'; this.style.boxShadow='0 8px 15px rgba(0, 0, 0, 0.2)';" onmouseout="this.style.transform='translateY(0)'; this.style.boxShadow='0 4px 6px rgba(0, 0, 0, 0.15)';">
	            <div class="task-title" style="text-align: center;">{task} <span style="font-size: 14px; opacity: 0.7;">🔗</span></div>
	            </div>
	            </a>
	            """, unsafe_allow_html=True)