Armeddinosaur committed on
Commit ed2eb44 · 1 Parent(s): df20b20

Adding MLRC Bench

.gitignore ADDED
@@ -0,0 +1,16 @@
+ env/
+ sample_leaderboard/
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .pytest_cache/
+ .DS_Store
+ .env
+ .venv
+ htmlcov/
+ .mypy_cache/
+ .ruff_cache/
+ .vscode/
+ .idea/
+ .vscode/
Assests/MLRC_Bench_overview.png ADDED
Factbench_logo.png DELETED
Binary file (305 kB)
 
README.md CHANGED
@@ -1,8 +1,8 @@
  ---
- title: VeriFact
- emoji: 📈
- colorFrom: blue
- colorTo: gray
  sdk: streamlit
  sdk_version: 1.39.0
  app_file: app.py
@@ -10,4 +10,209 @@ pinned: false
  license: cc-by-4.0
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: Model Capability Leaderboard
+ emoji: 📊
+ colorFrom: green
+ colorTo: blue
  sdk: streamlit
  sdk_version: 1.39.0
  app_file: app.py
  pinned: false
  license: cc-by-4.0
  ---

+ # Model Capability Leaderboard
+
+ A modern, interactive dashboard for comparing the performance of different AI models across Machine Learning Research Challenges.
+
+ ![Model Capability Leaderboard](https://via.placeholder.com/800x400?text=Model+Capability+Leaderboard)
+
+ ## Overview
+
+ This application provides a visual leaderboard for comparing AI model performance on challenging Machine Learning Research Competition problems. It uses Streamlit to create an interactive web interface with filtering options, allowing users to select specific models and tasks for comparison.
+
+ The leaderboard uses the MLRC-BENCH benchmark, which measures what percentage of the gap between a baseline and the top human solution an agent can close. Success is defined as closing at least 5% of the margin by which the top human solution surpasses the baseline.
+
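Editorial note: the metric and success rule above reduce to a one-line computation. A minimal sketch (not part of this commit; the helper name is ours), using the worked numbers from the metric definition shipped in `src/components/leaderboard.py`:

```python
def margin_to_human(agent: float, baseline: float, top_human: float) -> float:
    """Percentage of the baseline-to-top-human gap closed by the agent."""
    return 100 * (agent - baseline) / (top_human - baseline)

# Worked example from this commit's metric definition:
# baseline 100, top human 200, agent 110 -> 10% of the gap closed.
gap_closed = margin_to_human(agent=110, baseline=100, top_human=200)
assert gap_closed == 10.0

# Success rule from the README above: close at least 5% of the gap.
is_success = gap_closed >= 5.0
assert is_success
```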
+ ### Key Features
+
+ - **Interactive Filtering**: Select specific model types and tasks to focus on
+ - **Customizable Metrics**: Compare models using "Margin to Human" performance scores
+ - **Hierarchical Table Display**: Fixed columns with a scrollable metrics section
+ - **Conditional Formatting**: Visual indicators for positive/negative values
+ - **Model Type Color Coding**: Different colors for Open Source, Open Weights, and Closed Source models
+ - **Medal Indicators**: Top-ranked models receive gold, silver, and bronze medals
+ - **Task Descriptions**: Detailed explanations of what each task measures
+
+ ## Project Structure
+
+ The codebase follows a modular architecture for improved maintainability and separation of concerns:
+
+ ```
+ app.py (main entry point)
+ ├── requirements.txt
+ └── src/
+     ├── app.py (main application logic)
+     ├── components/
+     │   ├── header.py (header and footer components)
+     │   ├── filters.py (filter selection components)
+     │   ├── leaderboard.py (leaderboard table component)
+     │   └── tasks.py (task descriptions component)
+     ├── data/
+     │   ├── processors.py (data processing utilities)
+     │   └── metrics/
+     │       └── margin_to_human.json (metric data file)
+     ├── styles/
+     │   ├── base.py (combined styles)
+     │   ├── components.py (component styling)
+     │   ├── tables.py (table-specific styling)
+     │   └── theme.py (theme definitions)
+     └── utils/
+         ├── config.py (configuration settings)
+         └── data_loader.py (data loading utilities)
+ ```
+
+ ### Module Descriptions
+
+ #### Core Files
+ - `app.py` (root): Simple entry point that imports and calls the main function
+ - `src/app.py`: Main application logic; coordinates the overall flow
+
+ #### Components
+ - `header.py`: Manages the page header, section headers, and footer components
+ - `filters.py`: Handles metric, task, and model type selection interfaces
+ - `leaderboard.py`: Renders the custom HTML leaderboard table
+ - `tasks.py`: Renders the task descriptions section
+
+ #### Data Processing
+ - `processors.py`: Utilities for data formatting and styling
+ - `data_loader.py`: Functions for loading and processing metric data
+
+ #### Styling
+ - `theme.py`: Base theme definitions and color schemes
+ - `components.py`: Styling for UI components (buttons, cards, etc.)
+ - `tables.py`: Styling for tables and data displays
+ - `base.py`: Combines all styles for application-wide use
+
+ #### Configuration
+ - `config.py`: All configuration settings, including themes, metrics, and model categorizations
+
+ ## Benefits of Modular Architecture
+
+ The modular structure provides several advantages:
+
+ 1. **Improved Code Organization**: Code is logically separated by functionality
+ 2. **Better Separation of Concerns**: Each module has a clear, single responsibility
+ 3. **Enhanced Maintainability**: Changes to one aspect don't require modifying the entire codebase
+ 4. **Simplified Testing**: Components can be tested independently
+ 5. **Easier Collaboration**: Multiple developers can work on different parts simultaneously
+ 6. **Cleaner Entry Point**: The main app file stays simple and focused
+
+ ## Installation & Setup
+
+ 1. Clone the repository
+ ```bash
+ git clone <repository-url>
+ cd model-capability-leaderboard
+ ```
+
+ 2. Install the required dependencies
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ 3. Run the application
+ ```bash
+ streamlit run app.py
+ ```
+
+ ## Extending the Application
+
+ ### Adding New Metrics
+
+ To add a new metric:
+
+ 1. Create a new JSON data file in the `src/data/metrics/` directory (e.g., `src/data/metrics/new_metric.json`)
+
+ 2. Update `metrics_config` in `src/utils/config.py`:
+ ```python
+ metrics_config = {
+     "Margin to Human": { ... },
+     "New Metric Name": {
+         "file": "src/data/metrics/new_metric.json",
+         "description": "Description of the new metric",
+         "min_value": 0,
+         "max_value": 100,
+         "color_map": "viridis"
+     }
+ }
+ ```
+
+ 3. Ensure your metric JSON file follows the same format as existing metrics (a validation sketch follows the example):
+ ```json
+ {
+     "task-name": {
+         "model-name-1": value,
+         "model-name-2": value
+     },
+     "another-task": {
+         "model-name-1": value,
+         "model-name-2": value
+     }
+ }
+ ```
+
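A small editorial sketch (the function name is ours, not part of this commit) that checks a new metric file against the format described in step 3:

```python
import json

def validate_metric_file(path: str) -> None:
    """Check that a metric JSON file maps task -> {model -> numeric value}."""
    with open(path) as f:
        data = json.load(f)
    assert isinstance(data, dict), "top level must be an object keyed by task"
    for task, models in data.items():
        assert isinstance(models, dict), f"{task}: expected a model->value mapping"
        for model, value in models.items():
            assert isinstance(value, (int, float)), f"{task}/{model}: value must be numeric"

validate_metric_file("src/data/metrics/margin_to_human.json")
```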
+ ### Adding New Model Types
+
+ To add new model types:
+
+ 1. Update `model_categories` in `src/utils/config.py`:
+ ```python
+ model_categories = {
+     "Existing Model": "Category",
+     "New Model Name": "New Category"
+ }
+ ```
+
+ ### Modifying the UI Theme
+
+ To change the theme colors:
+
+ 1. Update the `dark_theme` dictionary in `src/utils/config.py` (a hypothetical shape is sketched below)
+
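`src/utils/config.py` is not included in this commit view. For orientation, here is a hypothetical shape for `dark_theme`: the keys are the ones the style modules in this commit actually reference, while every color value below is a placeholder assumption, not the project's actual palette.

```python
# Illustrative only: keys taken from the style modules in this diff,
# values are placeholder assumptions.
dark_theme = {
    "bg_color": "#0a0a0a",
    "text_color": "#e2e8f0",
    "heading_color": "#f8fafc",
    "card_bg": "#111111",
    "border": "#333333",
    "hover": "#1e1e1e",
    "primary": "#3b82f6",
    "secondary": "#22c55e",
    "gradient": "linear-gradient(90deg, #1e3a8a, #3b82f6)",
    "title_color": "#ffffff",
    "subtitle_color": "#e2e8f0",
    "task_border": "#3b82f6",
    "task_title": "#93c5fd",
    "info_bg": "#0c4a6e",
    "info_border": "#38bdf8",
    "warning_bg": "#451a03",
    "warning_border": "#f59e0b",
    "footer_color": "#94a3b8",
    "footer_border": "#1e293b",
    "table_header": "#1e1e1e",
    "table_border": "#333333",
}
```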
+ ### Adding New Components
+
+ To add new visualization components:
+
+ 1. Create a new file in the `src/components/` directory
+ 2. Import and use the component in `src/app.py`
+
+ ## Data Format
+
+ The application uses JSON files for metric data. The expected format is:
+
+ ```json
+ {
+     "task-name": {
+         "model-name-1": value,
+         "model-name-2": value
+     },
+     "another-task": {
+         "model-name-1": value,
+         "model-name-2": value
+     }
+ }
+ ```
+
+ ## Testing
+
+ This modular structure makes it easier to write focused unit tests:
+
+ ```python
+ # Example test for data_loader.py
+ def test_process_data():
+     test_data = {"task": {"model": 0.5}}
+     df = process_data(test_data)
+     assert "Task" in df.columns
+     assert df.loc["model", "Task"] == 0.5
+ ```
+
+ ## License
+
+ [MIT License](LICENSE)
+
+ ## Contributing
+
+ Contributions are welcome! Please feel free to submit a Pull Request.
+
+ ## Contact
+
+ For any questions or feedback, please contact [[email protected]](mailto:[email protected]).
app.py CHANGED
@@ -1,347 +1,8 @@
- import streamlit as st
- import pandas as pd
- from PIL import Image
- import base64
- from io import BytesIO
-
- # Set up page config
- st.set_page_config(
-     page_title="VeriFact Leaderboard",
-     layout="wide"
- )
-
- # load header
- with open("_header.md", "r") as f:
-     HEADER_MD = f.read()
-
- # Load the image
- image = Image.open("verifact_steps.png")
- logo_image = Image.open("verifact_logo.png")
-
- # Custom CSS for the page
- st.markdown(
-     """
-     <style>
-     @import url('https://fonts.googleapis.com/css2?family=Courier+Prime:wght@400&display=swap');
-
-     html, body, [class*="css"] {
-         font-family: 'Arial', sans-serif; /* or use a similar sans-serif font */
-         background-color: #f9f9f9; /* Light grey background */
-     }
-
-     .title {
-         font-size: 42px;
-         font-weight: bold;
-         text-align: center;
-         color: #333;
-         margin-bottom: 5px;
-     }
-
-     .description {
-         font-size: 22px;
-         text-align: center;
-         margin-bottom: 30px;
-         color: #555;
-     }
-
-     .header, .metric {
-         align-items: left;
-         font-family: 'Arial', sans-serif; /* or use a similar sans-serif font */
-         margin-bottom: 20px;
-     }
-
-     .container {
-         max-width: 1000px;
-         margin: 0 auto;
-         padding: 5px;
-     }
-
-     table {
-         width: 100%;
-         border-collapse: collapse;
-         border-radius: 10px;
-         overflow: hidden;
-     }
-
-     th, td {
-         padding: 8px;
-         text-align: center;
-         border: 1px solid #ddd;
-         font-family: 'Arial', sans-serif; /* or use a similar sans-serif font */
-         font-size: 16px;
-         transition: background-color 0.3s;
-     }
-
-     th {
-         background-color: #f2f2f2;
-         font-weight: bold;
-     }
-
-     td:hover {
-         background-color: #eaeaea;
-     }
-     </style>
-     """,
-     unsafe_allow_html=True
- )
-
- # Display title and description
- st.markdown('<div class="container">', unsafe_allow_html=True)
- # st.image(logo_image, output_format="PNG", width=200)
-
- # Convert the image to base64
- buffered = BytesIO()
- logo_image.save(buffered, format="PNG")
- img_data = base64.b64encode(buffered.getvalue()).decode("utf-8")
- st.markdown(
-     f"""
-     <style>
-     .logo-container {{
-         display: flex;
-         justify-content: flex-start; /* Aligns to the left */
-     }}
-     .logo-container img {{
-         width: 50%; /* Adjust this to control the width, e.g., 50% of container width */
-         margin: 0 auto;
-         max-width: 700px; /* Set a maximum width */
-         background-color: transparent;
-     }}
-     </style>
-     <div class="logo-container">
-         <img src="data:image/png;base64,{img_data}" alt="VeriFact Leaderboard Logo">
-     </div>
-     """,
-     unsafe_allow_html=True
- )
-
- # header_md_text = HEADER_MD  # make some parameters later
- # gr.Markdown(header_md_text, elem_classes="markdown-text")
-
- st.markdown(
-     '''
-     <div class="header">
-         <br/>
-         <p style="font-size:22px;">
-             VERIFACT: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts
-         </p>
-         <p style="font-size:20px;">
-             # 📑 <a href="">Paper</a> | 💻 <a href="">GitHub</a> | 🤗 <a href="">HuggingFace</a>
-             ⚙️ <strong>Version</strong>: <strong>V1</strong> | <strong># Models</strong>: 8 | Updated: <strong>???</strong>
-         </p>
-     </div>
-     ''',
-     unsafe_allow_html=True
- )
-
-
- # st.markdown('<div class="title">VeriFact Leaderboard</div>',
- #             unsafe_allow_html=True)
- # st.markdown('<div class="description">Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts</div>', unsafe_allow_html=True)
- st.markdown('</div>', unsafe_allow_html=True)
-
- # Load the data
- data_path = "verifact_data.csv"
- df = pd.read_csv(data_path)
-
- # Assign ranks within each tier based on factuality_score
- df['rank'] = df.groupby('tier')['Overall'].rank(
-     ascending=False, method='min').astype(int)
-
- # Replace NaN values with '-'
- df.fillna('-', inplace=True)
-
- df['original_order'] = df.groupby('tier').cumcount()
-
- # Create tabs
- st.markdown("""
-     <style>
-     .stTabs [data-baseweb="tab-list"] button [data-testid="stMarkdownContainer"] p {
-         font-size: 20px;
-     }
-     </style>
- """, unsafe_allow_html=True)
-
- tab1, tab2 = st.tabs(["Leaderboard", "Benchmark Details"])
-
- # Tab 1: Leaderboard
- with tab1:
-     # df['original_order'] = df.groupby('tier').cumcount()
-     # print(df['original_order'])
-
-     # st.markdown('<div class="title">Leaderboard</div>', unsafe_allow_html=True)
-     st.markdown('<div class="tab-content">', unsafe_allow_html=True)
-
-     st.markdown("""
-         <div class="metric" style="font-size:20px; font-weight: bold;">
-             Metrics Explanation
-         </div>
-     """, unsafe_allow_html=True)
-
-     st.markdown("""
-         <div class="metric" style="font-size:16px;">
-             <br/>
-             <p>
-                 <strong> 🎯 Factual Precision </strong> measures the ratio of supported units to all units, averaged over model responses. <strong> 🌀 Hallucination Score </strong> quantifies the incorrect or inconclusive contents within a model response, as described in the paper. We also provide statistics on the average length of the response in terms of the number of tokens, the average verifiable units existing in the model responses (<strong>Avg. # Units</strong>), the average number of units labelled as undecidable (<strong>Avg. # Undecidable</strong>), and the average number of units labelled as unsupported (<strong>Avg. # Unsupported</strong>).
-             </p>
-             <p>
-                 🔒 for closed LLMs; 🔑 for open-weights LLMs; 🚨 for newly added models
-             </p>
-         </div>
-     """,
-     unsafe_allow_html=True
-     )
-
-     st.markdown("""
-         <style>
-         /* Selectbox text */
-         div[data-baseweb="select"] > div {
-             font-size: 20px;
-         }
-
-         /* Dropdown options */
-         div[role="listbox"] ul li {
-             font-size: 20px !important;
-         }
-
-         /* Checkbox label */
-         .stCheckbox label p {
-             font-size: 20px !important;
-         }
-
-         /* Selectbox label */
-         .stSelectbox label p {
-             font-size: 20px !important;
-         }
-         </style>
-     """, unsafe_allow_html=True)
-
-     # Dropdown menu to filter tiers
-     tiers = ['All Metrics', 'Precision', 'Recall', 'F1']
-     selected_tier = st.selectbox('Select metric:', tiers)
-
-     # Filter the data based on the selected tier
-     if selected_tier != 'All Metrics':
-         filtered_df = df[df['tier'] == selected_tier]
-     else:
-         filtered_df = df
-
-     sort_by_factuality = st.checkbox('Sort by overall score')
-
-     # Sort the dataframe based on Factuality Score if the checkbox is selected
-     if sort_by_factuality:
-         updated_filtered_df = filtered_df.sort_values(
-             by=['tier', 'Overall'], ascending=[True, False]
-         )
-     else:
-         updated_filtered_df = filtered_df.sort_values(
-             by=['tier', 'original_order']
-         )
-
-     # Create HTML for the table
-     if selected_tier == 'All Metrics':
-         html = '''
-         <table>
-             <thead>
-                 <tr>
-                     <th>Metric</th>
-                     <th>Rank</th>
-                     <th>Model</th>
-                     <th>Factbench</th>
-                     <th>Reddit</th>
-                     <th>Overall</th>
-                 </tr>
-             </thead>
-             <tbody>
-         '''
-     else:
-         html = '''
-         <table>
-             <thead>
-                 <tr>
-                     <th>Rank</th>
-                     <th>Model</th>
-                     <th>Factbench</th>
-                     <th>Reddit</th>
-                     <th>Overall</th>
-                 </tr>
-             </thead>
-             <tbody>
-         '''
-
-     # Generate the rows of the table
-     current_tier = None
-     for i, row in updated_filtered_df.iterrows():
-         html += '<tr>'
-
-         # Only display the 'Metric' column if 'All Metrics' is selected
-         if selected_tier == 'All Metrics':
-             if row['tier'] != current_tier:
-                 current_tier = row['tier']
-                 html += f'<td rowspan="8" style="vertical-align: middle;">{current_tier}</td>'
-
-         # Fill in model and scores
-         html += f'''
-             <td>{row['rank']}</td>
-             <td>{row['model']}</td>
-             <td>{row['FactBench']}</td>
-             <td>{row['Reddit']}</td>
-             <td>{row['Overall']}</td>
-         </tr>
-         '''
-
-     # Close the table
-     html += '''
-         </tbody>
-     </table>
-     '''
-
-     # Display the table
-     st.markdown(html, unsafe_allow_html=True)
-
-     st.markdown('</div>', unsafe_allow_html=True)
-
- # Tab 2: Details
- with tab2:
-     st.markdown('<div class="tab-content">', unsafe_allow_html=True)
-
-     # st.markdown('<div class="title"></div>',
-     #             unsafe_allow_html=True)
-     st.image(image, use_column_width=True)
-
-     st.markdown('### VERIFY: A Pipeline for Factuality Evaluation')
-     st.write(
-         "Language models (LMs) are widely used by an increasing number of users, "
-         "underscoring the challenge of maintaining factual accuracy across a broad range of topics. "
-         "We present VERIFY (Verification and Evidence Retrieval for Factuality evaluation), "
-         "a pipeline to evaluate LMs' factual accuracy in real-world user interactions."
-     )
-
-     st.markdown('### Content Categorization')
-     st.write(
-         "VERIFY considers the verifiability of LM-generated content and categorizes content units as "
-         "`supported`, `unsupported`, or `undecidable` based on the retrieved web evidence. "
-         "Importantly, VERIFY's factuality judgments correlate better with human evaluations than existing methods."
-     )
-
-     st.markdown('### Hallucination Prompts & FactBench Dataset')
-     st.write(
-         "Using VERIFY, we identify 'hallucination prompts' across diverse topics, those eliciting the highest rates of "
-         "incorrect or unverifiable LM responses. These prompts form FactBench, a dataset of 985 prompts across 213 "
-         "fine-grained topics. Our dataset captures emerging factuality challenges in real-world LM interactions and is "
-         "regularly updated with new prompts."
-     )
-
-     st.markdown('</div>', unsafe_allow_html=True)
-
- # # Tab 3: Links
- # with tab3:
- #     st.markdown('<div class="tab-content">', unsafe_allow_html=True)
-
- #     st.markdown('<div class="title">Submit your model information on our Github</div>',
- #                 unsafe_allow_html=True)
-
- #     st.markdown(
- #         '[Test your model locally!](https://github.com/FarimaFatahi/FactEval)')
- #     st.markdown(
- #         '[Submit results or issues!](https://github.com/FarimaFatahi/FactEval/issues/new)')
-
- #     st.markdown('</div>', unsafe_allow_html=True)
 
+ """
+ Entry point for the Model Capability Leaderboard application.
+ This file serves as a simple wrapper for the main application code in src/app.py.
+ """
+ from src.app import main
+
+ if __name__ == "__main__":
+     main()
factEvalSteps.pdf DELETED
Binary file (983 kB)
 
factEvalSteps.png DELETED
Git LFS Details
  • SHA256: 6d8b3690f826eef47af41c8d80d141ccdb2c8045d76e39c6f2207c99e8e6e35d
  • Pointer size: 132 Bytes
  • Size of remote file: 2.16 MB
factbench_data.csv DELETED
@@ -1,13 +0,0 @@
- Tier,Model,FactScore,SAFE,Factcheck-GPT,VERIFY
- Tier 1: Easy,GPT4-o,53.19,63.31,86.4,71.58
- Tier 1: Easy,Gemini1.5-Pro,51.79,61.24,83.45,69.38
- Tier 1: Easy,Llama3.1-70B-Instruct,52.49,61.29,83.48,67.27
- Tier 1: Easy,Llama3.1-405B-Instruct,53.22,61.63,83.57,64.94
- Tier 2: Moderate,GPT4-o,54.76,65.01,89.39,76.02
- Tier 2: Moderate,Gemini1.5-Pro,52.62,62.68,87.44,74.24
- Tier 2: Moderate,Llama3.1-70B-Instruct,52.53,62.64,85.16,72.01
- Tier 2: Moderate,Llama3.1-405B-Instruct,53.48,63.29,86.37,70.25
- Tier 3: Hard,GPT4-o,69.44,76.17,94.25,90.58
- Tier 3: Hard,Gemini1.5-Pro,66.05,75.69,91.09,87.82
- Tier 3: Hard,Llama3.1-70B-Instruct,69.85,77.55,92.89,86.63
- Tier 3: Hard,Llama3.1-405B-Instruct,70.04,77.01,93.64,85.79
requirements.py → requirements.txt RENAMED
@@ -1,3 +1,3 @@
  pandas
  streamlit
- scikit-learn == 1.0.2

  pandas
  streamlit
+ scikit-learn
src/app.py ADDED
@@ -0,0 +1,95 @@
+ """
+ Main entry point for the Model Capability Leaderboard application.
+ """
+ import streamlit as st
+
+ # Import configuration
+ from src.utils.config import app_config, metrics_config
+
+ # Import data functions
+ from src.utils.data_loader import (
+     load_metric_data,
+     process_data,
+     filter_and_prepare_data,
+     format_display_dataframe
+ )
+
+ # Import styles
+ from src.styles.base import load_all_styles
+
+ # Import components
+ from src.components.header import render_page_header, render_footer
+ from src.components.filters import (
+     initialize_session_state,
+     render_metric_selection,
+     render_task_selection,
+     render_model_type_selection
+ )
+ from src.components.leaderboard import render_leaderboard_table, render_empty_state
+ from src.components.tasks import render_task_descriptions
+
+ def setup_page():
+     """
+     Set up the Streamlit page configuration
+     """
+     st.set_page_config(
+         page_title=app_config['title'],
+         layout=app_config['layout'],
+         initial_sidebar_state=app_config['initial_sidebar_state']
+     )
+
+     # Load all styles
+     load_all_styles()
+
+ def main():
+     """
+     Main application function
+     """
+     # Set up page
+     setup_page()
+
+     # Render header
+     render_page_header()
+
+     # Load data
+     current_metric = list(metrics_config.keys())[0]
+     metric_data = load_metric_data(metrics_config[current_metric]["file"])
+     df = process_data(metric_data)
+
+     # Initialize session state
+     initialize_session_state(df)
+
+     # Create tabs
+     tabs = st.tabs(["📊 Leaderboard", "📑 Benchmark Details"])
+
+     # Tab 1: Leaderboard
+     with tabs[0]:
+         # Render filter components
+         selected_metric = render_metric_selection()
+         selected_tasks = render_task_selection(df)
+         selected_model_types = render_model_type_selection(df)
+
+         # Render leaderboard if selections are valid
+         if selected_tasks and selected_model_types:
+             # Filter and prepare data
+             filtered_df = filter_and_prepare_data(df, selected_tasks, selected_model_types)
+
+             # Format data for display
+             display_df, metric_columns = format_display_dataframe(filtered_df, selected_tasks)
+
+             # Render the leaderboard table
+             render_leaderboard_table(display_df, metric_columns)
+         else:
+             # Show empty state
+             render_empty_state()
+
+     # Tab 2: Benchmark Details
+     with tabs[1]:
+         # Render task descriptions
+         render_task_descriptions()
+
+     # Render footer
+     render_footer()
+
+ if __name__ == "__main__":
+     main()
src/components/filters.py ADDED
@@ -0,0 +1,117 @@
+ """
+ Filter components for the leaderboard application.
+ """
+ import streamlit as st
+ from src.utils.config import metrics_config
+
+ def initialize_session_state(df):
+     """
+     Initialize the session state for filters
+
+     Args:
+         df (pandas.DataFrame): The DataFrame with model data
+     """
+     # Initialize session states
+     if 'selected_metric' not in st.session_state:
+         st.session_state.selected_metric = list(metrics_config.keys())[0]
+
+     if 'selected_tasks' not in st.session_state:
+         # Default to the first 3 tasks, excluding Model Type
+         st.session_state.selected_tasks = [col for col in df.columns if col not in ['Model Type']][:3]
+
+     if 'selected_model_types' not in st.session_state:
+         # Ensure all model types are selected by default
+         st.session_state.selected_model_types = list(df['Model Type'].unique())
+
+ def render_metric_selection():
+     """
+     Render the metric selection component
+
+     Returns:
+         str: Selected metric
+     """
+     st.markdown("### Select Metric")
+
+     # Create compact metric buttons with clear selection indicators
+     metric_cols = st.columns(len(metrics_config))
+     for i, metric in enumerate(metrics_config.keys()):
+         with metric_cols[i]:
+             is_selected = st.session_state.selected_metric == metric
+             button_label = f"✓ {metric}" if is_selected else metric
+             button_type = "primary" if is_selected else "secondary"
+
+             if st.button(button_label, key=f"metric_{metric}", type=button_type):
+                 st.session_state.selected_metric = metric
+                 st.rerun()  # Force UI update
+
+     return st.session_state.selected_metric
+
+ def render_task_selection(df):
+     """
+     Render the task selection component
+
+     Args:
+         df (pandas.DataFrame): The DataFrame with model data
+
+     Returns:
+         list: Selected tasks
+     """
+     st.markdown("### Select Tasks")
+
+     # Extract task columns (exclude Model Type)
+     all_tasks = [col for col in df.columns if col not in ['Model Type']]
+
+     # Create task buttons in rows of 3
+     num_cols = 3
+     task_rows = [all_tasks[i:i+num_cols] for i in range(0, len(all_tasks), num_cols)]
+
+     for row in task_rows:
+         cols = st.columns(num_cols)
+         for i, task in enumerate(row):
+             with cols[i]:
+                 is_selected = task in st.session_state.selected_tasks
+                 button_label = f"✓ {task}" if is_selected else task
+                 button_type = "primary" if is_selected else "secondary"
+
+                 if st.button(button_label, key=f"task_{task}", type=button_type):
+                     if is_selected:
+                         st.session_state.selected_tasks.remove(task)
+                     else:
+                         st.session_state.selected_tasks.append(task)
+                     st.rerun()  # Force UI update
+
+     return st.session_state.selected_tasks
+
+ def render_model_type_selection(df):
+     """
+     Render the model type selection component
+
+     Args:
+         df (pandas.DataFrame): The DataFrame with model data
+
+     Returns:
+         list: Selected model types
+     """
+     st.markdown("### Select Model Types")
+
+     # Create model type buttons
+     model_types = df['Model Type'].unique().tolist()
+     model_type_cols = st.columns(len(model_types))
+
+     for i, model_type in enumerate(model_types):
+         with model_type_cols[i]:
+             is_selected = model_type in st.session_state.selected_model_types
+             button_label = f"✓ {model_type}" if is_selected else model_type
+             button_type = "primary" if is_selected else "secondary"
+
+             if st.button(button_label, key=f"model_type_{model_type}", type=button_type):
+                 if is_selected:
+                     # Prevent deselecting all model types - ensure at least one remains selected
+                     if len(st.session_state.selected_model_types) > 1:
+                         st.session_state.selected_model_types.remove(model_type)
+                 else:
+                     st.session_state.selected_model_types.append(model_type)
+                 st.rerun()  # Force UI update
+
+     return st.session_state.selected_model_types
src/components/header.py ADDED
@@ -0,0 +1,41 @@
+ """
+ Header components for the leaderboard application.
+ """
+ import streamlit as st
+ from src.utils.config import app_config
+
+ def render_page_header():
+     """
+     Render the page header with title and description
+     """
+     st.markdown(
+         f"""
+         <div class="title-container">
+             <h1 class="title">{app_config['title']}</h1>
+             <p class="subtitle">{app_config['description']}</p>
+         </div>
+         """,
+         unsafe_allow_html=True
+     )
+
+ def render_section_header(title):
+     """
+     Render a section header
+
+     Args:
+         title (str): The section title
+     """
+     st.markdown(f"### {title}")
+
+ def render_footer():
+     """
+     Render the page footer
+     """
+     st.markdown(
+         """
+         <div class="footer">
+             <p>© 2023 Model Capability Leaderboard • Made with Streamlit • Contact: [email protected]</p>
+         </div>
+         """,
+         unsafe_allow_html=True
+     )
src/components/leaderboard.py ADDED
@@ -0,0 +1,128 @@
+ """
+ Leaderboard table components for the leaderboard application.
+ """
+ import streamlit as st
+ from src.data.processors import get_model_type_style, get_rank_style
+
+ def render_leaderboard_table(display_df, metric_columns):
+     """
+     Render the custom HTML leaderboard table
+
+     Args:
+         display_df (pandas.DataFrame): The DataFrame with the display data
+         metric_columns (list): List of metric column names
+     """
+     from src.components.header import render_section_header
+
+     # Display the model ranking header without the box
+     render_section_header("Model Rankings")
+
+     # Start building the HTML table structure
+     html_table = """
+     <div class="fixed-table-container">
+         <div class="scroll-container">
+             <table class="fixed-table">
+                 <thead>
+                     <tr class="header-row">
+                         <th class="fixed-column first-fixed-column" rowspan="2">Rank</th>
+                         <th class="fixed-column second-fixed-column" rowspan="2">Model + Scaffolding</th>
+                         <th class="model-type-cell" rowspan="2">Model Type</th>
+     """
+
+     # Add the metric header spanning all metric columns
+     html_table += f'<th colspan="{len(metric_columns)}" class="metric-header">Margin To Human</th>'
+
+     # Continue the table structure
+     html_table += """
+                     </tr>
+                     <tr class="sub-header">
+     """
+
+     # Add individual column headers for metrics
+     for col in metric_columns:
+         column_class = "overall-cell" if col == "Metric Average" else "metric-cell"
+         html_table += f'<th class="{column_class}">{col}</th>'
+
+     # Close the header and start the body
+     html_table += """
+                     </tr>
+                 </thead>
+                 <tbody>
+     """
+
+     # Add the data rows
+     for i, (idx, row) in enumerate(display_df.iterrows()):
+         # Alternate background colors to keep rows visually consistent
+         row_bg = "#0a0a0a" if i % 2 == 0 else "#111111"
+
+         # Start the row
+         html_table += '<tr class="table-row">'
+
+         # Add Rank with medal styling and consistent background
+         rank_style = f"background-color: {row_bg};"  # Add row background to fixed columns
+         rank_styles = get_rank_style(row["Rank"])
+         for style_key, style_value in rank_styles.items():
+             rank_style += f"{style_key}: {style_value};"
+
+         html_table += f'<td class="fixed-column first-fixed-column" style="{rank_style}">{row["Rank"]}</td>'
+
+         # Model name fixed column with consistent background
+         html_table += f'<td class="fixed-column second-fixed-column" title="{row["Model Name"]}" style="background-color: {row_bg}; font-weight: 500; overflow: hidden; text-overflow: ellipsis; white-space: nowrap; text-align: center;">{row["Model Name"]}</td>'
+
+         # Model type cell
+         model_type = row["Model Type"]
+         type_style = f"background-color: {row_bg};"
+         model_type_styles = get_model_type_style(model_type)
+         for style_key, style_value in model_type_styles.items():
+             if style_value:
+                 type_style += f"{style_key}: {style_value};"
+
+         html_table += f'<td class="table-cell model-type-cell" style="{type_style}">{model_type}</td>'
+
+         # Add metric values with minimal styling
+         for col in metric_columns:
+             cell_class = "table-cell overall-cell" if col == "Metric Average" else "table-cell metric-cell"
+             value_text = row[col]
+
+             # Simple styling based on positive/negative values
+             try:
+                 value = float(str(row[col]).replace(',', ''))
+                 if value > 0:
+                     cell_class += " positive-value"
+                 elif value < 0:
+                     cell_class += " negative-value"
+             except ValueError:
+                 # Non-numeric cells (e.g. '-') keep the default styling
+                 pass
+
+             html_table += f'<td class="{cell_class}" style="background-color: {row_bg};">{value_text}</td>'
+
+         html_table += "</tr>"
+
+     # Close the table
+     html_table += """
+                 </tbody>
+             </table>
+         </div>
+     </div>
+     """
+
+     # Add the metric definition below the table
+     metric_definition = """
+     <div class="metric-definition">
+         <h4>Margin to Human</h4>
+         <p>This metric measures what percentage of the top-1 human-to-baseline performance gap an agent can close on challenging Machine Learning Research Competition problems. For example, if the baseline is 100, top human performance is 200, and the agent scores 110, the agent has closed 10% of the gap between the baseline and top human performance. Higher percentages indicate models that more effectively approach top human-level research capabilities.</p>
+     </div>
+     """
+
+     # Display the custom HTML table and metric definition
+     st.markdown(html_table + metric_definition, unsafe_allow_html=True)
+
+ def render_empty_state():
+     """
+     Render an empty state when no data is available
+     """
+     st.markdown("""
+     <div class="warning-box">
+         <strong>No data to display.</strong> Please select at least one task and one model type to view the data.
+     </div>
+     """, unsafe_allow_html=True)
src/components/tasks.py ADDED
@@ -0,0 +1,93 @@
+ """
+ Task description components for the leaderboard application.
+ """
+ import streamlit as st
+ from src.utils.config import tasks_info
+
+ def render_task_descriptions():
+     """
+     Render the benchmark details section
+     """
+     # Display the MLRC-BENCH overview image
+     st.image("Assests/MLRC_Bench_overview.png", use_column_width=True)
+
+     # Display the MLRC-BENCH information
+     st.markdown("""
+     # MLRC-BENCH: Can Language Agents Crack ML Research Challenges?
+
+     Recent advances in large language models (LLMs) have raised an intriguing question for the machine learning community: can AI agents not only generate novel research ideas but also implement them effectively? A new benchmark, **MLRC-BENCH**, steps into the spotlight to answer this very question.
+
+     ## What Is MLRC-BENCH?
+
+     MLRC-BENCH is a dynamic benchmark designed to objectively evaluate whether LLM-based research agents can tackle cutting-edge ML competition tasks. Unlike previous evaluations that either focused on end-to-end paper generation or narrow engineering challenges, this benchmark splits the research workflow into two core steps:
+     - **Idea Proposal:** Generating innovative research ideas.
+     - **Code Implementation:** Translating those ideas into working, performance-improving code.
+
+     The benchmark uses tasks sourced from recent ML conferences and workshops, ensuring the problems are both impactful and non-trivial.
+
+     ## How Does It Work?
+
+     MLRC-BENCH emphasizes **objective metrics**:
+     - **Success Rate:** An agent is deemed successful if its solution improves upon a baseline by at least 5% of the margin by which the top human solution surpasses that baseline.
+     - **Performance, Efficiency & Simplicity:** Each solution is measured not only by how well it performs but also by how efficient and simple the code is. For example, an ideal solution should achieve higher performance with minimal runtime and code complexity.
+
+     Additionally, the benchmark integrates **LLM-as-a-judge evaluations** to compare subjective assessments of idea novelty with the objective performance gains. Interestingly, the study reveals a weak correlation between perceived novelty and actual performance improvements.
+
+     ## Why It Matters
+
+     The ability of AI agents to contribute to scientific discovery is both exciting and cautionary. While MLRC-BENCH demonstrates that current agents are not yet ready to match human ingenuity, it also provides a scalable framework to track progress and encourage future innovations. The insights gained from this benchmark could guide the development of safer, more effective AI research tools, particularly in high-stakes fields like healthcare, climate science, and AI safety.
+
+     ## Looking Ahead
+
+     MLRC-BENCH is built to evolve: as new ML competitions emerge, the benchmark can be updated to reflect the latest challenges. This dynamic nature ensures that it remains a relevant tool for pushing the boundaries of AI-assisted scientific research.
+     """)
+
+     st.markdown("""
+     <div class="card">
+         <div class="card-title"><span class="card-title-icon">🔍</span> Tasks in the Benchmark</div>
+         <p style="margin-bottom: 20px;">
+             Click on any task to learn more about the original benchmark.
+         </p>
+     </div>
+     """, unsafe_allow_html=True)
+
+     # Task links mapping
+     task_links = {
+         "Backdoor Trigger Recovery": "https://www.llmagentsafetycomp24.com/tracks/#backdoor_model",
+         "Machine Unlearning": "https://unlearning-challenge.github.io/",
+         "Perception Temporal Action Loc": "https://ptchallenge-workshop.github.io",
+         "Product Recommendation": "https://www.aicrowd.com/challenges/amazon-kdd-cup-23-multilingual-recommendation-challenge",
+         "Meta Learning": "https://metalearning.chalearn.org/",
+         "Llm Merging": "https://llm-merging.github.io"
+     }
+
+     # Create two columns
+     col1, col2 = st.columns(2)
+
+     # Split tasks between the two columns with better styling
+     task_items = list(tasks_info.items())
+     mid_point = len(task_items) // 2
+
+     with col1:
+         for task, description in task_items[:mid_point]:
+             link = task_links.get(task, "#")
+             st.markdown(f"""
+             <a href="{link}" target="_blank" style="text-decoration: none; color: inherit;">
+                 <div class="task-card" style="cursor: pointer; transition: transform 0.2s, box-shadow 0.2s;" onmouseover="this.style.transform='translateY(-5px)'; this.style.boxShadow='0 8px 15px rgba(0, 0, 0, 0.2)';" onmouseout="this.style.transform='translateY(0)'; this.style.boxShadow='0 4px 6px rgba(0, 0, 0, 0.15)';">
+                     <div class="task-title">{task} <span style="font-size: 14px; opacity: 0.7;">🔗</span></div>
+                     <div class="task-description">{description}</div>
+                 </div>
+             </a>
+             """, unsafe_allow_html=True)
+
+     with col2:
+         for task, description in task_items[mid_point:]:
+             link = task_links.get(task, "#")
+             st.markdown(f"""
+             <a href="{link}" target="_blank" style="text-decoration: none; color: inherit;">
+                 <div class="task-card" style="cursor: pointer; transition: transform 0.2s, box-shadow 0.2s;" onmouseover="this.style.transform='translateY(-5px)'; this.style.boxShadow='0 8px 15px rgba(0, 0, 0, 0.2)';" onmouseout="this.style.transform='translateY(0)'; this.style.boxShadow='0 4px 6px rgba(0, 0, 0, 0.15)';">
+                     <div class="task-title">{task} <span style="font-size: 14px; opacity: 0.7;">🔗</span></div>
+                     <div class="task-description">{description}</div>
+                 </div>
+             </a>
+             """, unsafe_allow_html=True)
src/data/metrics/margin_to_human.json ADDED
@@ -0,0 +1,50 @@
+ {
+     "perception_temporal_action_loc": {
+         "MLAB (claude-3-5-sonnet-v2)": 0.7810185077440877,
+         "MLAB (gemini-exp-1206)": -0.4731328246392113,
+         "MLAB (o3-mini)": 0.3066106841553126,
+         "MLAB (gpt-4o)": 0.3298075630252947,
+         "MLAB (llama3-1-405b-instruct)": 0.5183240203504569,
+         "CoI-Agent (o1) + MLAB (gpt-4o)": 0.3475212791527979
+     },
+     "llm-merging": {
+         "CoI-Agent (o1) + MLAB (gpt-4o)": -0.9900989999019761,
+         "MLAB (claude-3-5-sonnet-v2)": 4.950495058915793,
+         "MLAB (gemini-exp-1206)": 4.950495058915793,
+         "MLAB (o3-mini)": -0.9900989999019761,
+         "MLAB (gpt-4o)": 1.9801980295069084,
+         "MLAB (llama3-1-405b-instruct)": -0.9900989999019761
+     },
+     "meta-learning": {
+         "CoI-Agent (o1) + MLAB (gpt-4o)": 1.781401026144938,
+         "MLAB (claude-3-5-sonnet-v2)": 1.781401026144938,
+         "MLAB (gemini-exp-1206)": 1.781401026144938,
+         "MLAB (o3-mini)": -4.900331256476853,
+         "MLAB (gpt-4o)": 1.781401026144938,
+         "MLAB (llama3-1-405b-instruct)": 1.781401026144938
+     },
+     "product-recommendation": {
+         "CoI-Agent (o1) + MLAB (gpt-4o)": 0.1459345029718814,
+         "MLAB (claude-3-5-sonnet-v2)": 2.9771372473170388,
+         "MLAB (gemini-exp-1206)": 0.1459345029718814,
+         "MLAB (o3-mini)": 0.1462759705510577,
+         "MLAB (gpt-4o)": 0.6398666846799662,
+         "MLAB (llama3-1-405b-instruct)": -7.044800459739471e-10
+     },
+     "machine_unlearning": {
+         "CoI-Agent (o1) + MLAB (gpt-4o)": 11.832138969791846,
+         "MLAB (claude-3-5-sonnet-v2)": -94.71778374121965,
+         "MLAB (gemini-exp-1206)": 5.632371576335568,
+         "MLAB (o3-mini)": 3.623856546073656,
+         "MLAB (gpt-4o)": -17.996962489965668,
+         "MLAB (llama3-1-405b-instruct)": 6.2098517833311
+     },
+     "backdoor-trigger-recovery": {
+         "CoI-Agent (o1) + MLAB (gpt-4o)": 6.1572772457753295,
+         "MLAB (claude-3-5-sonnet-v2)": 39.903815022493674,
+         "MLAB (gemini-exp-1206)": 12.94287662739089,
+         "MLAB (o3-mini)": 6.238823700218141,
+         "MLAB (gpt-4o)": 10.386627431983776,
+         "MLAB (llama3-1-405b-instruct)": 11.542228789066877
+     }
+ }
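`src/utils/data_loader.py` appears in the README's project tree and is imported by `src/app.py`, but its diff is not part of this commit view. As a rough editorial sketch only, `process_data` plausibly pivots the task-to-model-to-value mapping above into a DataFrame with models as rows and tasks as columns, since that is the shape the filter components consume:

```python
# Editorial sketch, not the actual implementation from this commit.
import json
import pandas as pd

def load_metric_data(path):
    with open(path) as f:
        return json.load(f)

def process_data(metric_data):
    # Constructing a DataFrame from a task -> {model: value} mapping
    # yields models as the index and tasks as columns.
    df = pd.DataFrame(metric_data)
    # The real implementation appears to do more: the README's example test
    # expects a "Task" column for input key "task" (prettified names), and
    # the components expect a 'Model Type' column, presumably attached from
    # model_categories in src/utils/config.py.
    return df

df = process_data(load_metric_data("src/data/metrics/margin_to_human.json"))
print(df.head())
```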
src/data/processors.py ADDED
@@ -0,0 +1,87 @@
+ """
+ Data processing utilities for the leaderboard application.
+ """
+ import pandas as pd
+ import numpy as np
+
+ def apply_value_formatting(value, is_numeric=True):
+     """
+     Apply formatting to a value based on its properties
+
+     Args:
+         value: The value to format
+         is_numeric (bool): Whether the value is numeric
+
+     Returns:
+         dict: Dictionary with formatting information
+     """
+     if not is_numeric or value == '-':
+         return {'value': value, 'class': ''}
+
+     numeric_value = float(value)
+     if numeric_value > 0:
+         return {'value': value, 'class': 'positive-value'}
+     elif numeric_value < 0:
+         return {'value': value, 'class': 'negative-value'}
+     else:
+         return {'value': value, 'class': ''}
+
+ def get_model_type_style(model_type):
+     """
+     Get styling for different model types
+
+     Args:
+         model_type (str): The model type
+
+     Returns:
+         dict: Dictionary with styling information
+     """
+     if model_type == "Open Source":
+         return {'color': '#4ade80'}  # Brighter green
+     elif model_type == "Open Weights":
+         return {'color': '#93c5fd'}  # Brighter blue
+     elif model_type == "Closed Source":
+         return {'color': '#cbd5e1'}  # Lighter gray
+     else:
+         return {'color': ''}
+
+ def get_rank_style(rank):
+     """
+     Get styling for different ranks
+
+     Args:
+         rank (str): The rank
+
+     Returns:
+         dict: Dictionary with styling information
+     """
+     if "🥇" in str(rank):
+         return {'color': 'gold', 'font-weight': '700', 'font-size': '16px'}
+     elif "🥈" in str(rank):
+         return {'color': 'silver', 'font-weight': '700', 'font-size': '16px'}
+     elif "🥉" in str(rank):
+         return {'color': '#cd7f32', 'font-weight': '700', 'font-size': '16px'}
+     else:
+         return {}
+
+ def calculate_task_statistics(metric_data):
+     """
+     Calculate statistics for each task
+
+     Args:
+         metric_data (dict): Dictionary containing the metric data
+
+     Returns:
+         dict: Dictionary with task statistics
+     """
+     stats = {}
+     for task, models in metric_data.items():
+         values = list(models.values())
+         stats[task] = {
+             'mean': np.mean(values),
+             'median': np.median(values),
+             'min': min(values),
+             'max': max(values),
+             'std': np.std(values)
+         }
+     return stats
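A hypothetical usage example for `calculate_task_statistics`, run against the metric file added in this commit (the snippet is editorial; the task key comes from the JSON above):

```python
import json
from src.data.processors import calculate_task_statistics

with open("src/data/metrics/margin_to_human.json") as f:
    metric_data = json.load(f)

stats = calculate_task_statistics(metric_data)
# e.g. the mean margin-to-human across agents on the LLM-merging task
print(stats["llm-merging"]["mean"])
```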
src/styles/base.py ADDED
@@ -0,0 +1,25 @@
+ """
+ Base styles for the leaderboard application.
+ """
+ import streamlit as st
+ from src.styles.theme import get_theme_css
+ from src.styles.components import get_all_component_styles
+ from src.styles.tables import get_all_table_styles
+
+ def load_all_styles():
+     """
+     Load and apply all CSS styles for the application
+     """
+     styles = [
+         get_theme_css(),
+         get_all_component_styles(),
+         get_all_table_styles()
+     ]
+
+     combined_styles = "\n".join(styles)
+
+     # Apply all styles to the page
+     st.markdown(
+         f"<style>{combined_styles}</style>",
+         unsafe_allow_html=True
+     )
src/styles/components.py ADDED
@@ -0,0 +1,272 @@
+ """
+ CSS styles for UI components in the leaderboard application.
+ """
+ from src.utils.config import dark_theme
+
+ def get_container_styles():
+     """
+     Get CSS styles for page containers
+
+     Returns:
+         str: CSS string for containers
+     """
+     return f"""
+     .title-container {{
+         padding: 2rem 0;
+         text-align: center;
+         background: {dark_theme['gradient']};
+         border-radius: 12px;
+         color: white;
+         margin-bottom: 2rem;
+     }}
+
+     .title {{
+         font-size: 42px;
+         font-weight: 700;
+         margin-bottom: 10px;
+         color: {dark_theme['title_color']};
+     }}
+
+     .subtitle {{
+         font-size: 20px;
+         font-weight: 400;
+         opacity: 0.9;
+         color: {dark_theme['subtitle_color']};
+     }}
+     """
+
+ def get_card_styles():
+     """
+     Get CSS styles for cards
+
+     Returns:
+         str: CSS string for cards
+     """
+     return f"""
+     .card {{
+         background-color: {dark_theme['card_bg']};
+         border-radius: 12px;
+         padding: 24px;
+         box-shadow: 0 4px 6px rgba(0, 0, 0, 0.15);
+         margin-bottom: 24px;
+         transition: transform 0.2s, box-shadow 0.2s;
+     }}
+
+     .card:hover {{
+         transform: translateY(-2px);
+         box-shadow: 0 8px 15px rgba(0, 0, 0, 0.2);
+     }}
+
+     .card-title {{
+         font-size: 20px;
+         font-weight: 600;
+         margin-bottom: 16px;
+         color: {dark_theme['text_color']};
+         display: flex;
+         align-items: center;
+     }}
+
+     .card-title-icon {{
+         margin-right: 10px;
+         font-size: 22px;
+     }}
+     """
+
+ def get_task_card_styles():
+     """
+     Get CSS styles for task cards
+
+     Returns:
+         str: CSS string for task cards
+     """
+     return f"""
+     .task-card {{
+         background-color: {dark_theme['card_bg']};
+         border-radius: 12px;
+         padding: 20px;
+         box-shadow: 0 4px 6px rgba(0, 0, 0, 0.15);
+         margin-bottom: 16px;
+         border-left: 4px solid {dark_theme['task_border']};
+     }}
+
+     .task-title {{
+         font-size: 18px;
+         font-weight: 600;
+         color: {dark_theme['task_title']};
+         margin-bottom: 8px;
+     }}
+
+     .task-description {{
+         font-size: 15px;
+         color: {dark_theme['text_color']};
+         line-height: 1.5;
+     }}
+     """
+
+ def get_button_styles():
+     """
+     Get CSS styles for buttons
+
+     Returns:
+         str: CSS string for buttons
+     """
+     return f"""
+     /* Button styling - completely new and modern */
+     div.stButton > button {{
+         background-color: {dark_theme['card_bg']};
+         color: {dark_theme['text_color']};
+         border: 1px solid {dark_theme['border']};
+         border-radius: 8px;
+         padding: 8px 16px;
+         font-size: 14px;
+         font-weight: 500;
+         margin: 4px;
+         transition: all 0.2s ease;
+     }}
+
+     div.stButton > button:hover {{
+         background-color: {dark_theme['hover']};
+         border-color: {dark_theme['border']};
+         transform: translateY(-1px);
+     }}
+
+     /* Active button styling */
+     div.stButton > button.selected {{
+         background-color: {dark_theme['primary']} !important;
+         color: white !important;
+         border-color: {dark_theme['primary']} !important;
+     }}
+     """
+
+ def get_tabs_styles():
+     """
+     Get CSS styles for tabs
+
+     Returns:
+         str: CSS string for tabs
+     """
+     return f"""
+     /* Tabs styling */
+     .stTabs [data-baseweb="tab-list"] {{
+         gap: 8px;
+         margin-bottom: 20px;
+     }}
+
+     .stTabs [data-baseweb="tab"] {{
+         border-radius: 8px 8px 0 0;
+         padding: 12px 24px;
+         font-weight: 500;
+         background-color: {dark_theme['hover']};
+         color: {dark_theme['text_color']};
+     }}
+
+     .stTabs [data-baseweb="tab"][aria-selected="true"] {{
+         background-color: {dark_theme['primary']};
+         color: white;
+     }}
+
+     .stTabs [data-baseweb="tab-highlight"] {{
+         background-color: transparent;
+     }}
+     """
+
+ def get_alert_styles():
+     """
+     Get CSS styles for alerts and information boxes
+
+     Returns:
+         str: CSS string for alerts
+     """
+     return f"""
+     /* Alert/info box styling */
+     .info-box {{
+         background-color: {dark_theme['info_bg']};
+         border-left: 4px solid {dark_theme['info_border']};
+         border-radius: 8px;
+         padding: 16px;
+         margin-bottom: 16px;
+         color: {dark_theme['text_color']};
+     }}
+
+     .warning-box {{
+         background-color: {dark_theme['warning_bg']};
+         border-left: 4px solid {dark_theme['warning_border']};
+         border-radius: 8px;
+         padding: 16px;
+         margin-bottom: 16px;
+         color: {dark_theme['text_color']};
+     }}
+     """
+
+ def get_footer_styles():
+     """
+     Get CSS styles for the footer
+
+     Returns:
+         str: CSS string for the footer
+     """
+     return f"""
+     /* Footer styling */
+     .footer {{
+         text-align: center;
+         padding: 24px;
+         margin-top: 40px;
+         color: {dark_theme['footer_color']};
+         font-size: 14px;
+         border-top: 1px solid {dark_theme['footer_border']};
+     }}
+     """
+
+ def get_all_component_styles():
+     """
+     Get all component styles combined
+
+     Returns:
+         str: Combined CSS string for all components
+     """
+     styles = [
+         get_container_styles(),
+         get_card_styles(),
+         get_task_card_styles(),
+         get_button_styles(),
+         get_tabs_styles(),
+         get_alert_styles(),
+         get_footer_styles(),
+         get_metric_definition_styles()
+     ]
+
+     return '\n'.join(styles)
+
+ def get_metric_definition_styles():
+     """
+     Get CSS styles for the metric definition component
+
+     Returns:
+         str: CSS string for the metric definition
+     """
+     return f"""
+     /* Metric definition styling */
+     .metric-definition {{
+         margin-top: 20px;
+         background-color: {dark_theme['card_bg']};
+         border-radius: 8px;
+         padding: 16px 20px;
+         box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1);
+         border-left: 4px solid {dark_theme['secondary']};
+     }}
+
+     .metric-definition h4 {{
+         color: {dark_theme['secondary']};
+         margin-top: 0;
+         margin-bottom: 8px;
+         font-size: 18px;
+         font-weight: 600;
+     }}
+
+     .metric-definition p {{
+         color: {dark_theme['text_color']};
+         font-size: 14px;
+         line-height: 1.6;
+         margin-bottom: 0;
+     }}
+     """
src/styles/main_styles.py ADDED
@@ -0,0 +1,346 @@
+ from src.utils.config import dark_theme
+
+ def get_base_styles():
+     """Returns the base CSS styles for the page"""
+     return f"""
+     <style>
+     @import url('https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700&display=swap');
+
+     html, body, [class*="css"] {{
+         font-family: 'Inter', sans-serif;
+         background-color: {dark_theme['bg_color']};
+         color: {dark_theme['text_color']};
+     }}
+
+     h1, h2, h3, h4, h5, h6 {{
+         font-family: 'Inter', sans-serif;
+         font-weight: 600;
+         color: {dark_theme['heading_color']};
+     }}
+
+     .main {{
+         background-color: {dark_theme['bg_color']};
+     }}
+
+     .title-container {{
+         padding: 2rem 0;
+         text-align: center;
+         background: {dark_theme['gradient']};
+         border-radius: 12px;
+         color: white;
+         margin-bottom: 2rem;
+     }}
+
+     .title {{
+         font-size: 42px;
+         font-weight: 700;
+         margin-bottom: 10px;
+         color: {dark_theme['title_color']};
+     }}
+
+     .subtitle {{
+         font-size: 20px;
+         font-weight: 400;
+         opacity: 0.9;
+         color: {dark_theme['subtitle_color']};
+     }}
+
+     .card {{
+         background-color: {dark_theme['card_bg']};
+         border-radius: 12px;
+         padding: 24px;
+         box-shadow: 0 4px 6px rgba(0, 0, 0, 0.15);
+         margin-bottom: 24px;
+         transition: transform 0.2s, box-shadow 0.2s;
+     }}
+
+     .card:hover {{
+         transform: translateY(-2px);
+         box-shadow: 0 8px 15px rgba(0, 0, 0, 0.2);
+     }}
+
+     .card-title {{
+         font-size: 20px;
+         font-weight: 600;
+         margin-bottom: 16px;
+         color: {dark_theme['text_color']};
+         display: flex;
+         align-items: center;
+     }}
+
+     .card-title-icon {{
+         margin-right: 10px;
+         font-size: 22px;
+     }}
+
+     .task-card {{
+         background-color: {dark_theme['card_bg']};
+         border-radius: 12px;
+         padding: 20px;
+         box-shadow: 0 4px 6px rgba(0, 0, 0, 0.15);
+         margin-bottom: 16px;
+         border-left: 4px solid {dark_theme['task_border']};
+     }}
+
+     .task-title {{
+         font-size: 18px;
+         font-weight: 600;
+         color: {dark_theme['task_title']};
+         margin-bottom: 8px;
+     }}
+
+     .task-description {{
+         font-size: 15px;
+         color: {dark_theme['text_color']};
+         line-height: 1.5;
+     }}
+
+     /* Button styling */
+     div.stButton > button {{
+         background-color: {dark_theme['card_bg']};
+         color: {dark_theme['text_color']};
+         border: 1px solid {dark_theme['border']};
+         border-radius: 8px;
+         padding: 8px 16px;
+         font-size: 14px;
+         font-weight: 500;
+         margin: 4px;
+         transition: all 0.2s ease;
+     }}
+
+     div.stButton > button:hover {{
+         background-color: {dark_theme['hover']};
+         border-color: {dark_theme['border']};
+         transform: translateY(-1px);
+     }}
+
+     /* Active button styling */
+     div.stButton > button.selected {{
+         background-color: {dark_theme['primary']} !important;
+         color: white !important;
+         border-color: {dark_theme['primary']} !important;
+     }}
+
+     /* Table styling */
+     [data-testid="stDataFrame"] {{
+         background-color: {dark_theme['card_bg']};
+         border-radius: 12px;
+         padding: 1px;
+         box-shadow: 0 4px 6px rgba(0, 0, 0, 0.15);
+     }}
+
+     [data-testid="stDataFrame"] table {{
+         border-collapse: separate !important;
+         border-spacing: 0 !important;
+         border-radius: 8px !important;
+         overflow: hidden !important;
+     }}
+
+     [data-testid="stDataFrame"] th {{
+         background-color: {dark_theme['table_header']} !important;
+         color: {dark_theme['text_color']} !important;
+         font-weight: 600 !important;
+         text-transform: uppercase !important;
+         font-size: 13px !important;
+         padding: 16px 10px !important;
+     }}
+
+     [data-testid="stDataFrame"] td {{
+         padding: 12px 10px !important;
+         border-bottom: 1px solid {dark_theme['table_border']} !important;
+         font-size: 14px !important;
+         color: {dark_theme['text_color']} !important;
+     }}
+
+     /* Hide row numbers */
+     [data-testid="stDataFrame"] [data-testid="stDataFrameRowNumber"] {{
+         display: none !important;
+     }}
+
+     /* Metric header styling */
+     .metric-header {{
+         background-color: {dark_theme['card_bg']};
+         border-radius: 8px;
+         padding: 16px;
+         margin-bottom: 16px;
+         border-left: 4px solid {dark_theme['primary']};
+     }}
+
+     .metric-header h3 {{
+         margin: 0;
+         color: {dark_theme['primary']};
+     }}
+
+     .metric-header p {{
+         margin: 8px 0 0 0;
+         font-size: 14px;
+         opacity: 0.8;
+     }}
+
+     /* Rank column styling */
+     .rank-cell {{
+         font-weight: 700 !important;
+         background-color: {dark_theme['primary'] + '22'};
+         border-radius: 50%;
+         width: 28px;
+         height: 28px;
+         display: flex;
+         align-items: center;
+         justify-content: center;
+         margin: 0 auto;
+     }}
+
+     .rank-1 {{
+         background-color: gold !important;
+         color: #333 !important;
+     }}
+
+     .rank-2 {{
+         background-color: silver !important;
+         color: #333 !important;
+     }}
+
+     .rank-3 {{
+         background-color: #cd7f32 !important; /* bronze */
+         color: #fff !important;
+     }}
+
+     /* Tabs styling */
+     .stTabs [data-baseweb="tab-list"] {{
+         gap: 8px;
+         margin-bottom: 20px;
+     }}
+
+     .stTabs [data-baseweb="tab"] {{
+         border-radius: 8px 8px 0 0;
+         padding: 12px 24px;
+         font-weight: 500;
+         background-color: {dark_theme['hover']};
+         color: {dark_theme['text_color']};
+     }}
+
+     .stTabs [data-baseweb="tab"][aria-selected="true"] {{
+         background-color: {dark_theme['primary']};
+         color: white;
+     }}
+
+     .stTabs [data-baseweb="tab-highlight"] {{
+         background-color: transparent;
+     }}
+
+     /* Alert/info box styling */
+     .info-box {{
+         background-color: {dark_theme['info_bg']};
+         border-left: 4px solid {dark_theme['info_border']};
+         border-radius: 8px;
+         padding: 16px;
+         margin-bottom: 16px;
+         color: {dark_theme['text_color']};
+     }}
+
+     .warning-box {{
+         background-color: {dark_theme['warning_bg']};
+         border-left: 4px solid {dark_theme['warning_border']};
+         border-radius: 8px;
+         padding: 16px;
+         margin-bottom: 16px;
+         color: {dark_theme['text_color']};
+     }}
+
+     /* Download buttons styling */
+     .download-button {{
+         background-color: {dark_theme['bg_color']};
+         border: 1px solid {dark_theme['border']};
+         border-radius: 8px;
+         padding: 12px 20px;
+         display: inline-flex;
+         align-items: center;
+         justify-content: center;
+         color: {dark_theme['text_color']};
+         font-weight: 500;
+         margin-top: 16px;
+         cursor: pointer;
+         transition: all 0.2s;
+     }}
+
+     .download-button:hover {{
+         background-color: {dark_theme['hover']};
+         border-color: {dark_theme['border']};
+         transform: translateY(-1px);
+     }}
+
+     .download-button .icon {{
+         margin-right: 8px;
+     }}
+
+     /* Footer styling */
+     .footer {{
+         text-align: center;
+         padding: 24px;
+         margin-top: 40px;
+         color: {dark_theme['footer_color']};
+         font-size: 14px;
+         border-top: 1px solid {dark_theme['footer_border']};
+     }}
+
+     /* Badge styling for model types */
+     .badge {{
+         display: inline-block;
+         padding: 4px 8px;
+         font-size: 12px;
+         font-weight: 500;
+         border-radius: 6px;
+         margin-right: 6px;
+     }}
+
+     .badge-purple {{
+         background-color: {dark_theme['primary'] + '33'};
+         color: {dark_theme['primary']};
+     }}
+
+     .badge-blue {{
+         background-color: {dark_theme['info_border'] + '33'};
+         color: {dark_theme['info_border']};
+     }}
+
+     .selected-items {{
+         background-color: {dark_theme['hover']};
+         border-radius: 8px;
+         padding: 12px 16px;
+         margin-top: 16px;
+         font-size: 14px;
+     }}
+
+     /* Selectbox styling */
+     div[data-baseweb="select"] {{
+         border-radius: 8px !important;
+     }}
+
+     div[data-baseweb="select"] > div {{
+         background-color: {dark_theme['card_bg']} !important;
+         border-radius: 8px !important;
+         border-color: {dark_theme['border']} !important;
+         color: {dark_theme['text_color']} !important;
+     }}
+
+     div[data-baseweb="select"] > div:hover {{
+         border-color: {dark_theme['border']} !important;
+     }}
+
+     /* Table hover and value styling */
+     .table-row:hover td {{
+         background-color: #1a1a1a !important;
+     }}
+     .table-row:hover td.fixed-column {{
+         background-color: #1a1a1a !important;
+     }}
+     .positive-value {{
+         color: #4ade80 !important; /* Bright green for positive values */
+         font-weight: 500;
+     }}
+     .negative-value {{
+         color: #f87171 !important; /* Bright red for negative values */
+         font-weight: 500;
+     }}
+     </style>
+     """
src/styles/tables.py ADDED
@@ -0,0 +1,263 @@
+ """
+ CSS styles for tables in the leaderboard application.
+ """
+ from src.utils.config import dark_theme
+
+ def get_streamlit_table_styles():
+     """
+     Get CSS styles for standard Streamlit tables
+
+     Returns:
+         str: CSS string for Streamlit tables
+     """
+     return f"""
+     /* Standard Streamlit table styling */
+     [data-testid="stDataFrame"] {{
+         background-color: {dark_theme['card_bg']};
+         border-radius: 12px;
+         padding: 1px;
+         box-shadow: 0 4px 6px rgba(0, 0, 0, 0.15);
+     }}
+
+     [data-testid="stDataFrame"] table {{
+         border-collapse: separate !important;
+         border-spacing: 0 !important;
+         border-radius: 8px !important;
+         overflow: hidden !important;
+     }}
+
+     [data-testid="stDataFrame"] th {{
+         background-color: {dark_theme['table_header']} !important;
+         color: {dark_theme['text_color']} !important;
+         font-weight: 600 !important;
+         text-transform: uppercase !important;
+         font-size: 13px !important;
+         padding: 16px 10px !important;
+     }}
+
+     [data-testid="stDataFrame"] td {{
+         padding: 12px 10px !important;
+         border-bottom: 1px solid {dark_theme['table_border']} !important;
+         font-size: 14px !important;
+         color: {dark_theme['text_color']} !important;
+     }}
+
+     /* Hide row numbers */
+     [data-testid="stDataFrame"] [data-testid="stDataFrameRowNumber"] {{
+         display: none !important;
+     }}
+     """
+
+ def get_custom_leaderboard_table_styles():
+     """
+     Get CSS styles for the custom leaderboard table
+
+     Returns:
+         str: CSS string for the custom leaderboard table
+     """
+     return f"""
+     /* Custom leaderboard table styling */
+     .fixed-table-container {{
+         position: relative;
+         max-width: 100%;
+         margin-top: 20px;
+         border-radius: 8px;
+         box-shadow: 0 4px 12px rgba(0,0,0,0.5);
+         background: {dark_theme['table_bg']};
+         border: 1px solid {dark_theme['table_border_color']};
+     }}
+
+     .fixed-table {{
+         width: 100%;
+         border-collapse: collapse;
+         font-family: 'Inter', sans-serif;
+     }}
+
+     .fixed-column {{
+         position: sticky;
+         left: 0;
+         z-index: 2;
+         background-color: {dark_theme['table_bg']};
+     }}
+
+     .first-fixed-column {{
+         width: 60px;
+         text-align: center;
+         left: 0;
+         z-index: 3;
+         border-right: 1px solid {dark_theme['table_border_color']};
+         box-shadow: 2px 0 4px rgba(0,0,0,0.3);
+     }}
+
+     .second-fixed-column {{
+         width: 280px;
+         text-align: center;
+         left: 60px;
+         z-index: 2;
+         border-right: 1px solid {dark_theme['table_border_color']};
+         box-shadow: 2px 0 4px rgba(0,0,0,0.3);
+     }}
+
+     /* Fix for the gap between fixed columns */
+     .first-fixed-column::after {{
+         content: "";
+         position: absolute;
+         top: 0;
+         right: -1px;
+         height: 100%;
+         width: 1px;
+         background-color: {dark_theme['table_border_color']};
+     }}
+
+     .model-type-cell {{
+         width: 120px;
+         text-align: center;
+     }}
+
+     .scroll-container {{
+         overflow-x: auto;
+         border-radius: 8px;
+     }}
+
+     .header-row th {{
+         padding: 14px 8px;
+         background-color: {dark_theme['table_bg']};
+         color: {dark_theme['text_color']};
+         font-weight: 600;
+         border-bottom: 1px solid {dark_theme['table_border_color']};
+     }}
+
+     .metric-header {{
+         background-color: {dark_theme['table_header_bg']} !important;
+         color: #ffffff;
+         padding: 14px 0px !important;
+         text-align: center;
+         font-weight: 600;
+         letter-spacing: 0.5px;
+     }}
+
+     .sub-header th {{
+         padding: 12px 8px;
+         background-color: {dark_theme['table_subheader_bg']};
+         color: {dark_theme['text_color']};
+         font-weight: 500;
+         text-align: center;
+         border-bottom: 1px solid {dark_theme['table_border_color']};
+     }}
+
+     .sub-header th.overall-cell {{
+         background-color: {dark_theme['table_average_column_bg']}; /* Slightly lighter black for average column */
+         font-weight: 600; /* Make it bolder */
+         border-right: 1px solid #444; /* Add a subtle border to separate it */
+     }}
+
+     .table-row:nth-child(odd) {{
+         background-color: {dark_theme['table_row_odd']};
+     }}
+
+     .table-row:nth-child(even) {{
+         background-color: {dark_theme['table_row_even']};
+     }}
+
+     .table-row:hover td {{
+         background-color: {dark_theme['table_hover_bg']} !important;
+     }}
+
+     .table-row:hover td.fixed-column {{
+         background-color: {dark_theme['table_hover_bg']} !important;
+     }}
+
+     .table-cell {{
+         padding: 12px 8px;
+         text-align: center;
+         border-bottom: 1px solid #222;
+     }}
+
+     .table-cell.overall-cell {{
+         background-color: rgba(80, 80, 80, 0.1); /* Subtle highlight */
+         font-weight: 600; /* Make average values bolder */
+         border-right: 1px solid #333; /* Add a border to separate it from task metrics */
+     }}
+
+     .positive-value {{
+         color: {dark_theme['positive_value_color']} !important;
+         font-weight: 500;
+     }}
+
+     .negative-value {{
+         color: {dark_theme['negative_value_color']} !important;
+         font-weight: 500;
+     }}
+     """
+
+ def get_metric_styles():
+     """
+     Get CSS styles for metric displays
+
+     Returns:
+         str: CSS string for metric displays
+     """
+     return f"""
+     /* Metric styling */
+     .metric-header {{
+         background-color: {dark_theme['card_bg']};
+         border-radius: 8px;
+         padding: 16px;
+         margin-bottom: 16px;
+         border-left: 4px solid {dark_theme['primary']};
+     }}
+
+     .metric-header h3 {{
+         margin: 0;
+         color: {dark_theme['primary']};
+     }}
+
+     .metric-header p {{
+         margin: 8px 0 0 0;
+         font-size: 14px;
+         opacity: 0.8;
+     }}
+
+     /* Rank column styling */
+     .rank-cell {{
+         font-weight: 700 !important;
+         background-color: {dark_theme['primary'] + '22'};
+         border-radius: 50%;
+         width: 28px;
+         height: 28px;
+         display: flex;
+         align-items: center;
+         justify-content: center;
+         margin: 0 auto;
+     }}
+
+     .rank-1 {{
+         background-color: gold !important;
+         color: #333 !important;
+     }}
+
+     .rank-2 {{
+         background-color: silver !important;
+         color: #333 !important;
+     }}
+
+     .rank-3 {{
+         background-color: #cd7f32 !important; /* bronze */
+         color: #fff !important;
+     }}
+     """
+
+ def get_all_table_styles():
+     """
+     Get all table styles combined
+
+     Returns:
+         str: Combined CSS string for all tables
+     """
+     styles = [
+         get_streamlit_table_styles(),
+         get_custom_leaderboard_table_styles(),
+         get_metric_styles()
+     ]
+
+     return '\n'.join(styles)
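Unlike `get_base_styles()`, these helpers return bare CSS rules without a `<style>` wrapper, so the caller supplies it. A sketch of a hypothetical call site:

```python
import streamlit as st

from src.styles.tables import get_all_table_styles

# The table helpers return raw CSS; wrap it in a <style> tag before injecting.
st.markdown(f"<style>{get_all_table_styles()}</style>", unsafe_allow_html=True)
```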
src/styles/theme.py ADDED
@@ -0,0 +1,31 @@
+ """
+ Theme definitions and color scheme for the leaderboard application.
+ """
+ from src.utils.config import dark_theme
+
+ def get_theme_css():
+     """
+     Get the base theme CSS for the application
+
+     Returns:
+         str: CSS string for the theme
+     """
+     return f"""
+     @import url('https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700&display=swap');
+
+     html, body, [class*="css"] {{
+         font-family: 'Inter', sans-serif;
+         background-color: {dark_theme['bg_color']};
+         color: {dark_theme['text_color']};
+     }}
+
+     h1, h2, h3, h4, h5, h6 {{
+         font-family: 'Inter', sans-serif;
+         font-weight: 600;
+         color: {dark_theme['heading_color']};
+     }}
+
+     .main {{
+         background-color: {dark_theme['bg_color']};
+     }}
+     """
src/utils/config.py ADDED
@@ -0,0 +1,85 @@
+ # Theme and configuration settings for the Model Capability Leaderboard application
+
+ # Theme colors - using dark mode by default
+ dark_theme = {
+     'bg_color': '#1a202c',
+     'text_color': '#e2e8f0',
+     'card_bg': '#2d3748',
+     'primary': '#818cf8',
+     'secondary': '#a78bfa',
+     'border': '#4a5568',
+     'hover': '#4a5568',
+     'table_header': '#2d3748',
+     'table_border': '#4a5568',
+     'heading_color': '#e2e8f0',
+     'gradient': 'linear-gradient(135deg, #818cf8 0%, #a78bfa 100%)',
+     'warning_bg': '#7c2d12',
+     'warning_border': '#f97316',
+     'info_bg': '#1e3a8a',
+     'info_border': '#3b82f6',
+     'footer_color': '#a0aec0',
+     'title_color': 'white',
+     'subtitle_color': 'rgba(255, 255, 255, 0.9)',
+     'footer_border': '#4a5568',
+     'task_title': '#a5b4fc',
+     'task_border': '#818cf8',
+     # Table-specific colors for the custom table
+     'table_bg': '#0a0a0a',
+     'table_border_color': '#333',
+     'table_header_bg': '#191919',
+     'table_subheader_bg': '#141414',
+     'table_average_column_bg': '#202020',
+     'table_row_odd': '#0a0a0a',
+     'table_row_even': '#111111',
+     'table_hover_bg': '#1a1a1a',
+     'positive_value_color': '#4ade80',
+     'negative_value_color': '#f87171'
+ }
+
+ # Application settings
+ app_config = {
+     'title': 'MLRC-Bench Leaderboard',
+     'description': 'Machine Learning Research Challenges Benchmark for AI Agents',
+     'layout': 'wide',
+     'initial_sidebar_state': 'collapsed'
+ }
+
+ # Metrics configuration
+ metrics_config = {
+     "Margin to Human": {
+         "file": "src/data/metrics/margin_to_human.json",
+         "description": "Performance on Machine Learning Research Challenges. Higher values indicate better research capabilities.",
+         "min_value": -100,  # Approximate, adjust as needed
+         "max_value": 50,  # Approximate, adjust as needed
+         "color_map": "RdYlGn"
+     }
+     # Future metrics can be added here
+     # "Another Metric": {
+     #     "file": "src/data/metrics/another_metric.json",
+     #     "description": "Description of another metric",
+     #     "min_value": 0,
+     #     "max_value": 100,
+     #     "color_map": "viridis"
+     # }
+ }
+
+ # Model type categories
+ model_categories = {
+     "MLAB (claude-3-5-sonnet-v2)": "Closed Source",
+     "MLAB (gemini-exp-1206)": "Closed Source",
+     "MLAB (o3-mini)": "Closed Source",
+     "MLAB (gpt-4o)": "Closed Source",
+     "MLAB (llama3-1-405b-instruct)": "Open Weights",
+     "CoI-Agent (o1) + MLAB (gpt-4o)": "Closed Source"
+     # More models would be added here as needed
+ }
+
+ # Task descriptions
+ tasks_info = {
+     "Perception Temporal Action Loc": "Testing the model's ability to understand and localize actions within temporal sequences of events.",
+     "Llm Merging": "Assessing the capability to effectively merge knowledge from multiple language models.",
+     "Meta Learning": "Evaluating the model's ability to learn how to learn - adapting quickly to new tasks.",
+     "Product Recommendation": "Testing the model's ability to recommend relevant products based on user preferences and behavior.",
+     "Machine Unlearning": "Evaluating how well models can 'unlearn' specific information when required.",
+     "Backdoor Trigger Recovery": "Testing resilience against backdoor attacks and ability to recover from triggered behaviors."
+ }
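`process_data()` in `src/utils/data_loader.py` (below) expects the file referenced in `metrics_config` to map raw task keys to per-model scores; the keys are later title-cased into the names used in `tasks_info`. A hypothetical excerpt of `src/data/metrics/margin_to_human.json`, written as the Python structure it loads into (model names follow `model_categories`; the scores are invented):

```python
# Hypothetical shape of src/data/metrics/margin_to_human.json after json.load():
metric_data = {
    "llm-merging": {                      # title-cased to "Llm Merging"
        "MLAB (gpt-4o)": -12.4,
        "MLAB (o3-mini)": 3.1,
    },
    "machine-unlearning": {               # title-cased to "Machine Unlearning"
        "MLAB (gpt-4o)": -5.0,
        "MLAB (o3-mini)": 7.8,
    },
}
```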
src/utils/data_loader.py ADDED
@@ -0,0 +1,156 @@
+ """
+ Data loading and processing utilities for the leaderboard application.
+ """
+ import pandas as pd
+ import json
+ from src.utils.config import model_categories
+
+ def load_metric_data(file_path):
+     """
+     Load metric data from a JSON file
+
+     Args:
+         file_path (str): Path to the JSON file containing metric data
+
+     Returns:
+         dict: Dictionary containing the loaded metric data
+     """
+     try:
+         with open(file_path, "r") as f:
+             return json.load(f)
+     except FileNotFoundError:
+         print(f"Error: File {file_path} not found.")
+         return {}
+     except json.JSONDecodeError:
+         print(f"Error: File {file_path} is not a valid JSON file.")
+         return {}
+
+ def process_data(metric_data):
+     """
+     Process the metric data into a pandas DataFrame
+
+     Args:
+         metric_data (dict): Dictionary containing the metric data
+
+     Returns:
+         pandas.DataFrame: DataFrame containing the processed data
+     """
+     # Create a DataFrame to store the model metric data
+     tasks = list(metric_data.keys())
+     models = []
+     model_data = {}
+
+     # Extract model names and their metric values for each task
+     for task in tasks:
+         for model in metric_data[task]:
+             if model not in models:
+                 models.append(model)
+                 model_data[model] = {}
+
+             # Store the metric value for this task
+             model_data[model][task] = metric_data[task][model]
+
+     # Create DataFrame from the model_data dictionary
+     df = pd.DataFrame.from_dict(model_data, orient='index')
+
+     # Replace NaN values with '-'
+     df.fillna('-', inplace=True)
+
+     # Rename the columns to more readable format
+     df.columns = [task.replace("-", " ").replace("_", " ").title() for task in df.columns]
+
+     # Add a model type column to the dataframe
+     df['Model Type'] = df.index.map(lambda x: model_categories.get(x, "Unknown"))
+
+     return df
+
+ def calculate_selected_overall(row, selected_tasks):
+     """
+     Calculate overall average for selected tasks
+
+     Args:
+         row (pandas.Series): Row of data
+         selected_tasks (list): List of task names to include in the average
+
+     Returns:
+         float or str: The calculated average or '-' if no numeric values
+     """
+     numeric_values = []
+
+     for task in selected_tasks:
+         value = row[task]
+         # Keep any value that parses as a float; '-' placeholders are skipped
+         try:
+             numeric_values.append(float(value))
+         except (TypeError, ValueError):
+             pass
+
+     # Calculate average if there are numeric values
+     if numeric_values:
+         return sum(numeric_values) / len(numeric_values)
+     else:
+         return '-'
+
+ def filter_and_prepare_data(df, selected_tasks, selected_model_types):
+     """
+     Filter and prepare data based on selections
+
+     Args:
+         df (pandas.DataFrame): The original DataFrame
+         selected_tasks (list): List of selected task names
+         selected_model_types (list): List of selected model types
+
+     Returns:
+         pandas.DataFrame: Filtered and prepared DataFrame
+     """
+     # Filter the dataframe based on selected model types (copy to avoid SettingWithCopyWarning)
+     filtered_df = df[df['Model Type'].isin(selected_model_types)].copy()
+
+     # Calculate the average for selected tasks only; '-' placeholders coerce to NaN and drop out
+     selected_tasks_df = filtered_df[selected_tasks].apply(pd.to_numeric, errors='coerce')
+     filtered_df['Selected Overall'] = selected_tasks_df.mean(axis=1)
+
+     # Sort by Selected Overall and add rank
+     filtered_df = filtered_df.sort_values('Selected Overall', ascending=False)
+     filtered_df.insert(0, 'Rank', range(1, len(filtered_df) + 1))
+
+     # Add a Model Name column that shows the index (actual model name)
+     filtered_df['Model Name'] = filtered_df.index
+
+     return filtered_df
+
+ def format_display_dataframe(filtered_df, selected_tasks):
+     """
+     Create and format the display DataFrame for the leaderboard table
+
+     Args:
+         filtered_df (pandas.DataFrame): The filtered DataFrame
+         selected_tasks (list): List of selected task names
+
+     Returns:
+         tuple: (pandas.DataFrame, list) - The display DataFrame and the metric columns
+     """
+     # Create a fixed display DataFrame with only the model info
+     display_df = filtered_df[['Rank', 'Model Name', 'Model Type']].copy()
+
+     # Format the rank column with medals
+     medal_ranks = {1: "πŸ₯‡ 1", 2: "πŸ₯ˆ 2", 3: "πŸ₯‰ 3"}
+     display_df['Rank'] = display_df['Rank'].apply(lambda x: medal_ranks.get(x, str(x)))
+
+     # Add metrics columns (Selected Overall and individual tasks)
+     metric_columns = ['Selected Overall'] + selected_tasks
+     for col in metric_columns:
+         if col in filtered_df.columns:
+             # Format numeric columns to 3 decimal places
+             if filtered_df[col].dtype in ['float64', 'float32']:
+                 display_df[col] = filtered_df[col].apply(lambda x: f"{x:.3f}" if isinstance(x, (int, float)) else x)
+             else:
+                 display_df[col] = filtered_df[col]
+
+     # Rename "Selected Overall" to "Metric Average" in display_df
+     if "Selected Overall" in display_df.columns:
+         display_df = display_df.rename(columns={"Selected Overall": "Metric Average"})
+         # Also update the metric_columns list to reflect the rename
+         metric_columns = ['Metric Average'] + selected_tasks
+
+     return display_df, metric_columns
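End to end, the helpers above take a metric file to a render-ready table. A sketch of the full flow (the task and model-type selections are illustrative):

```python
from src.utils.config import metrics_config
from src.utils.data_loader import (
    load_metric_data,
    process_data,
    filter_and_prepare_data,
    format_display_dataframe,
)

selected_tasks = ["Llm Merging", "Machine Unlearning"]

# Load the raw scores and reshape them into a models-by-tasks DataFrame.
df = process_data(load_metric_data(metrics_config["Margin to Human"]["file"]))

# Filter to the user's selections, rank by the average, and format for display.
filtered = filter_and_prepare_data(df, selected_tasks, ["Closed Source", "Open Weights"])
display_df, metric_columns = format_display_dataframe(filtered, selected_tasks)
```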
tiered_models_data.csv DELETED
@@ -1,23 +0,0 @@
- tier,model,factuality_score,hallucination_score,avg_tokens,avg_factual_units,avg_undecidable_units,avg_unsupported_units
- Tier 1: Hard,πŸ”’ GPT4-o,75.65,0.64,563.15,24.01,4.62,1.01
- Tier 1: Hard,πŸ”’ Gemini1.5-Pro,73.78,0.68,517.31,22.25,4.48,1.13
- Tier 1: Hard,πŸ”‘ Llama3.1-70B-Instruct,70.07,0.89,532.41,27.17,5.67,2.13
- Tier 1: Hard,πŸ”‘ Llama3.1-405B-Instruct,68.59,0.93,551.28,26.71,6.19,2.2
- Tier 1: Hard,πŸ”’ Claude-3.5-Sonnet 🚨,74.95,0.65,395.77,22.64,4.03,1.19
- Tier 1: Hard,πŸ”’ CommandR+ 🚨,73.15,0.71,440.93,23.55,4.51,1.4
- Tier 1: Hard,πŸ”‘ Mistral-Large-2 🚨,75.19,0.67,485.58,23.21,4.09,1.36
- Tier 2: Moderate,πŸ”’ GPT4-o,80.72,0.5,624.67,24.42,3.59,0.89
- Tier 2: Moderate,πŸ”’ Gemini1.5-Pro,78.02,0.57,565.97,22.16,3.71,0.97
- Tier 2: Moderate,πŸ”‘ Llama3.1-70B-Instruct,75.76,0.71,607.44,25.35,4.33,1.76
- Tier 2: Moderate,πŸ”‘ Llama3.1-405B-Instruct,75.05,0.7,599.3,25.24,4.74,1.41
- Tier 2: Moderate,πŸ”’ Claude-3.5-Sonnet 🚨,79.92,0.54,414.32,22.15,3.32,1.09
- Tier 2: Moderate,πŸ”’ CommandR+ 🚨,80.71,0.52,483.32,24.1,3.17,1.09
- Tier 2: Moderate,πŸ”‘ Mistral-Large-2 🚨,79.97,0.52,528.44,22.65,3.21,1.02
- Tier 3: Easy,πŸ”’ GPT4-o,91.63,0.26,640.84,29.29,2.01,0.53
- Tier 3: Easy,πŸ”’ Gemini1.5-Pro,89.86,0.31,551.81,25.6,1.88,0.71
- Tier 3: Easy,πŸ”‘ Llama3.1-70B-Instruct,89.3,0.33,607.75,31.38,2.08,0.83
- Tier 3: Easy,πŸ”‘ Llama3.1-405B-Instruct,86.57,0.4,599.87,30.12,2.88,0.85
- Tier 3: Easy,πŸ”’ Claude-3.5-Sonnet 🚨,89.61,0.3,411.2,26.72,1.49,0.81
- Tier 3: Easy,πŸ”’ CommandR+ 🚨,91.65,0.25,499.06,27.95,1.57,0.54
- Tier 3: Easy,πŸ”‘ Mistral-Large-2 🚨,92.0,0.25,523.57,27.8,1.8,0.55
-
verifact_data.csv DELETED
@@ -1,25 +0,0 @@
- tier,model,FactBench,Reddit,Overall
- F1,GPT4o,80.93,42.76,67.41
- F1,Claude 3.5-Sonnet,75.68,42.90,63.65
- F1,Gemini 1.5-Flash,77.38,40.26,64.10
- F1,Llama3.1-8b,60.71,28.86,48.62
- F1,Llama3.1-70b,65.83,38.61,55.12
- F1,Llama3.1-405B,73.23,38.98,60.61
- F1,Qwen2.5-8b,69.23,37.25,55.78
- F1,Qwen2.5-32b,71.31,37.34,60.00
- Recall,GPT4o,77.13,30.06,57.93
- Recall,Claude 3.5-Sonnet,69.35,30.69,53.58
- Recall,Gemini 1.5-Flash,70.71,27.67,53.16
- Recall,Llama3.1-8b,54.28,20.39,40.46
- Recall,Llama3.1-70b,58.00,29.31,46.30
- Recall,Llama3.1-405B,68.40,28.00,51.92
- Recall,Qwen2.5-8b,58.66,26.01,45.34
- Recall,Qwen2.5-32b,62.77,25.38,47.52
- Precision,GPT4o,85.11,74.04,80.59
- Precision,Claude 3.5-Sonnet,83.28,71.25,78.37
- Precision,Gemini 1.5-Flash,85.45,73.87,80.72
- Precision,Llama3.1-8b,68.87,49.36,60.91
- Precision,Llama3.1-70b,76.05,56.54,68.09
- Precision,Llama3.1-405B,78.80,64.10,72.80
- Precision,Qwen2.5-8b,77.18,65.58,72.45
- Precision,Qwen2.5-32b,82.74,70.60,77.79
verifact_logo.png DELETED
Binary file (63.1 kB)
 
verifact_steps.png DELETED

Git LFS Details

- SHA256: f5574607f26cff315614cea74bdca918b4d3661f3b13025cd9920603d173b58f
- Pointer size: 131 Bytes
- Size of remote file: 533 kB