---
title: MLRC-BENCH
emoji: π
colorFrom: green
colorTo: blue
sdk: streamlit
sdk_version: 1.39.0
app_file: app.py
pinned: false
license: cc-by-4.0
---
## Overview
This application provides a visual leaderboard for comparing AI model performance on challenging Machine Learning Research Competition problems. It uses Streamlit to create an interactive web interface with filtering options, allowing users to select specific models and tasks for comparison.
The leaderboard uses the MLRC-BENCH benchmark, which measures what percentage of the top human-to-baseline performance gap an agent can close. Success is defined as achieving at least 5% of the margin by which the top human solution surpasses the baseline.
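The relative score behind this definition can be computed directly from raw task scores. Here is a minimal sketch of that calculation, assuming per-task scores where higher is better; the function names (`margin_to_human`, `is_success`) are illustrative and not taken from the app's code:

```python
def margin_to_human(agent_score: float, baseline_score: float, top_human_score: float) -> float:
    """Fraction of the human-to-baseline gap that the agent closes (illustrative)."""
    gap = top_human_score - baseline_score
    if gap <= 0:
        raise ValueError("Top human score must exceed the baseline score")
    return (agent_score - baseline_score) / gap


def is_success(agent_score: float, baseline_score: float, top_human_score: float) -> bool:
    """An agent 'succeeds' if it closes at least 5% of the human-to-baseline gap."""
    return margin_to_human(agent_score, baseline_score, top_human_score) >= 0.05
```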
## Key Features
- Interactive Filtering: Select specific model types and tasks to focus on (a minimal sketch follows this list)
- Customizable Metrics: Compare models using "Margin to Human" performance scores
- Hierarchical Table Display: Fixed columns with scrollable metrics section
- Conditional Formatting: Visual indicators for positive/negative values
- Model Type Color Coding: Different colors for Open Source, Open Weights, and Closed Source models
- Medal Indicators: Top-ranked models receive gold, silver, and bronze medals
- Task Descriptions: Detailed explanations of what each task measures
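As a rough illustration of how the interactive filtering could be wired up in Streamlit, the snippet below uses `st.multiselect`; the option lists and variable names are placeholders, not the app's actual code:

```python
import streamlit as st

# Placeholder option lists; in the app these come from the metric JSON and config.py.
ALL_TASKS = ["task-name", "another-task"]
ALL_MODEL_TYPES = ["Open Source", "Open Weights", "Closed Source"]

selected_tasks = st.multiselect("Tasks", options=ALL_TASKS, default=ALL_TASKS)
selected_types = st.multiselect("Model types", options=ALL_MODEL_TYPES, default=ALL_MODEL_TYPES)

st.caption(f"Showing {len(selected_tasks)} task(s) for {len(selected_types)} model type(s)")
```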
## Project Structure
The codebase follows a modular architecture for improved maintainability and separation of concerns:
```
.
├── app.py (main entry point)
├── requirements.txt
└── src/
    ├── app.py (main application logic)
    ├── components/
    │   ├── header.py (header and footer components)
    │   ├── filters.py (filter selection components)
    │   ├── leaderboard.py (leaderboard table component)
    │   └── tasks.py (task descriptions component)
    ├── data/
    │   ├── processors.py (data processing utilities)
    │   └── metrics/
    │       └── margin_to_human.json (metric data file)
    ├── styles/
    │   ├── base.py (combined styles)
    │   ├── components.py (component styling)
    │   ├── tables.py (table-specific styling)
    │   └── theme.py (theme definitions)
    └── utils/
        ├── config.py (configuration settings)
        └── data_loader.py (data loading utilities)
```
## Module Descriptions

### Core Files

- `app.py` (root): Simple entry point that imports and calls the main function
- `src/app.py`: Main application logic, coordinates the overall flow
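A root-level entry point like this usually just delegates to the package. The sketch below assumes `src/app.py` exposes a `main()` function, which is an assumption about its internal API:

```python
# app.py (repository root) - delegation sketch; assumes src/app.py defines main()
from src.app import main

if __name__ == "__main__":
    main()
```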
### Components

- `header.py`: Manages the page header, section headers, and footer components
- `filters.py`: Handles metric, task, and model type selection interfaces
- `leaderboard.py`: Renders the custom HTML leaderboard table
- `tasks.py`: Renders the task descriptions section
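To give a feel for how the custom HTML table might be produced, here is a hedged sketch of a leaderboard renderer; the function name `render_leaderboard` and the formatting choices are illustrative, not the actual contents of `leaderboard.py`:

```python
import pandas as pd
import streamlit as st


def render_leaderboard(df: pd.DataFrame) -> None:
    """Render a models-by-tasks score DataFrame as a custom HTML table (sketch)."""
    html = df.to_html(classes="leaderboard", float_format="{:.1f}".format, border=0)
    st.markdown(html, unsafe_allow_html=True)
```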
### Data Processing

- `processors.py`: Contains utilities for data formatting and styling
- `data_loader.py`: Functions for loading and processing metric data
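The loading step plausibly turns the metric JSON (see the Data Format section below) into a pandas DataFrame. The sketch below, including the names `load_metric_data` and `process_data`, is an assumption about how such utilities could look, chosen to be consistent with the example test in the Testing section:

```python
import json

import pandas as pd


def load_metric_data(path: str) -> dict:
    """Read a metric JSON file of the form {task: {model: value}}."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def process_data(raw: dict) -> pd.DataFrame:
    """Pivot the nested dict into a models-by-tasks DataFrame with title-cased task names."""
    df = pd.DataFrame(raw)  # columns are tasks, index is model names
    df.columns = [c.replace("-", " ").title() for c in df.columns]
    return df
```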
### Styling

- `theme.py`: Base theme definitions and color schemes
- `components.py`: Styling for UI components (buttons, cards, etc.)
- `tables.py`: Styling for tables and data displays
- `base.py`: Combines all styles for application-wide use
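In Streamlit apps, CSS is typically injected once via `st.markdown(..., unsafe_allow_html=True)`. The following sketch of how `base.py` might combine the style modules is an assumption; `COMPONENT_CSS`, `TABLE_CSS`, and `apply_styles` are illustrative names:

```python
import streamlit as st

# Hypothetical CSS fragments standing in for what components.py and tables.py provide.
COMPONENT_CSS = ".metric-card { border-radius: 8px; padding: 0.5rem; }"
TABLE_CSS = ".leaderboard td { text-align: center; }"


def apply_styles() -> None:
    """Inject the combined stylesheet into the Streamlit page."""
    css = "\n".join([COMPONENT_CSS, TABLE_CSS])
    st.markdown(f"<style>{css}</style>", unsafe_allow_html=True)
```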
### Configuration

- `config.py`: Contains all configuration settings including themes, metrics, and model categorizations
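Based on the names referenced later in this README (`metrics_config` and `model_categories`), the configuration module plausibly has a structure like the following; the concrete entries are placeholders, not the real values:

```python
# src/utils/config.py - sketch of the expected structure; entries are placeholders
metrics_config = {
    "Margin to Human": {
        "file": "src/data/metrics/margin_to_human.json",
        "description": "Percentage of the top human-to-baseline gap closed by the agent",
        "min_value": 0,
        "max_value": 100,
        "color_map": "viridis",
    },
}

model_categories = {
    "Example Open Model": "Open Source",
    "Example Closed Model": "Closed Source",
}
```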
## Benefits of Modular Architecture
The modular structure provides several advantages:
- Improved Code Organization: Code is logically separated based on functionality
- Better Separation of Concerns: Each module has a clear, single responsibility
- Enhanced Maintainability: Changes to one aspect don't require modifying the entire codebase
- Simplified Testing: Components can be tested independently
- Easier Collaboration: Multiple developers can work on different parts simultaneously
- Cleaner Entry Point: Main app file is simple and focused
## Installation & Setup

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd model-capability-leaderboard
   ```

2. Install the required dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Run the application:

   ```bash
   streamlit run app.py
   ```
## Extending the Application
### Adding New Metrics

To add a new metric:

1. Create a new JSON data file in the `src/data/metrics/` directory (e.g., `src/data/metrics/new_metric.json`)

2. Update `metrics_config` in `src/utils/config.py`:

   ```python
   metrics_config = {
       "Margin to Human": { ... },
       "New Metric Name": {
           "file": "src/data/metrics/new_metric.json",
           "description": "Description of the new metric",
           "min_value": 0,
           "max_value": 100,
           "color_map": "viridis"
       }
   }
   ```

3. Ensure your metric JSON file follows the same format as existing metrics:

   ```json
   {
       "task-name": {
           "model-name-1": value,
           "model-name-2": value
       },
       "another-task": {
           "model-name-1": value,
           "model-name-2": value
       }
   }
   ```
### Adding New Model Types

To add new model types:

- Update `model_categories` in `src/utils/config.py`:

  ```python
  model_categories = {
      "Existing Model": "Category",
      "New Model Name": "New Category"
  }
  ```
### Modifying the UI Theme

To change the theme colors:

- Update the `dark_theme` dictionary in `src/utils/config.py` (an example of a possible structure is sketched below)
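The exact keys of `dark_theme` are not documented here, so the following is only a guess at its shape; adjust the key names to match the real dictionary in `src/utils/config.py`:

```python
# Hypothetical shape of the dark_theme dictionary; key names and colors are assumptions.
dark_theme = {
    "background_color": "#0e1117",
    "text_color": "#fafafa",
    "accent_color": "#4c8bf5",
    "positive_color": "#2ecc71",  # e.g. for positive "Margin to Human" values
    "negative_color": "#e74c3c",  # e.g. for negative values
}
```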
### Adding New Components

To add new visualization components:

- Create a new file in the `src/components/` directory
- Import and use the component in `src/app.py` (see the sketch below)
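As an illustration of these two steps, the sketch below adds a hypothetical `summary.py` component and shows how it could be called from `src/app.py`; the file name, function, and variables are made up for the example:

```python
# src/components/summary.py (hypothetical new component)
import streamlit as st


def render_summary(num_models: int, num_tasks: int) -> None:
    """Show a one-line summary above the leaderboard."""
    st.info(f"Comparing {num_models} models across {num_tasks} tasks")


# In src/app.py, import and call it alongside the existing components, e.g.:
#   from components.summary import render_summary
#   render_summary(num_models=len(selected_models), num_tasks=len(selected_tasks))
```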
## Data Format

The application uses JSON files for metric data. The expected format is shown below, where each `value` is a numeric score:
```json
{
    "task-name": {
        "model-name-1": value,
        "model-name-2": value
    },
    "another-task": {
        "model-name-1": value,
        "model-name-2": value
    }
}
```
## Testing

This modular structure makes it easier to write focused unit tests:
```python
# Example test for data_loader.py
# Assumes process_data is importable from src/utils/data_loader.py
from src.utils.data_loader import process_data


def test_process_data():
    test_data = {"task": {"model": 0.5}}
    df = process_data(test_data)
    assert "Task" in df.columns
    assert df.loc["model", "Task"] == 0.5
```
## License

This project is released under the CC BY 4.0 license, as declared in the Space metadata above.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Contact
For any questions or feedback, please contact [email protected].