Commit 17ad9a6 (parent: cf2253a): Updating readme

README.md CHANGED
@@ -16,90 +16,18 @@ This application provides a visual leaderboard for comparing AI model performanc
 
 The leaderboard uses the MLRC-BENCH benchmark, which measures what percentage of the top human-to-baseline performance gap an agent can close. Success is defined as achieving at least 5% of the margin by which the top human solution surpasses the baseline.
 
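One plausible reading of that success criterion, sketched below. This is an illustration only, not the benchmark's official scoring code; the function names and exact formula are assumptions.

```python
# Illustrative reading of the metric described above: the fraction of the
# human-to-baseline gap that an agent closes. Names and formula are
# assumptions, not the benchmark's official scoring code.
def margin_to_human(agent_score, baseline_score, top_human_score):
    # Fraction of the human-to-baseline gap closed by the agent.
    return (agent_score - baseline_score) / (top_human_score - baseline_score)

def is_success(agent_score, baseline_score, top_human_score, threshold=0.05):
    # "Success" = closing at least 5% of the gap.
    return margin_to_human(agent_score, baseline_score, top_human_score) >= threshold

# Example: baseline 0.50, top human 0.70; an agent at 0.52 closes 10% of the gap.
result = is_success(0.52, 0.50, 0.70)
```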
-### Key Features
-
-- **Interactive Filtering**: Select specific model types and tasks to focus on
-- **Customizable Metrics**: Compare models using "Margin to Human" performance scores
-- **Hierarchical Table Display**: Fixed columns with a scrollable metrics section
-- **Conditional Formatting**: Visual indicators for positive and negative values
-- **Model Type Color Coding**: Different colors for Open Source, Open Weights, and Closed Source models
-- **Medal Indicators**: Top-ranked models receive gold, silver, and bronze medals
-- **Task Descriptions**: Detailed explanations of what each task measures
-
-## Project Structure
-
-The codebase follows a modular architecture for improved maintainability and separation of concerns:
-
-```
-app.py (main entry point)
-├── requirements.txt
-└── src/
-    ├── app.py (main application logic)
-    ├── components/
-    │   ├── header.py (header and footer components)
-    │   ├── filters.py (filter selection components)
-    │   ├── leaderboard.py (leaderboard table component)
-    │   └── tasks.py (task descriptions component)
-    ├── data/
-    │   ├── processors.py (data processing utilities)
-    │   └── metrics/
-    │       └── margin_to_human.json (metric data file)
-    ├── styles/
-    │   ├── base.py (combined styles)
-    │   ├── components.py (component styling)
-    │   ├── tables.py (table-specific styling)
-    │   └── theme.py (theme definitions)
-    └── utils/
-        ├── config.py (configuration settings)
-        └── data_loader.py (data loading utilities)
-```
-
-### Module Descriptions
-
-#### Core Files
-- `app.py` (root): Simple entry point that imports and calls the main function
-- `src/app.py`: Main application logic; coordinates the overall flow
-
-#### Components
-- `header.py`: Manages the page header, section headers, and footer components
-- `filters.py`: Handles metric, task, and model type selection interfaces
-- `leaderboard.py`: Renders the custom HTML leaderboard table
-- `tasks.py`: Renders the task descriptions section
-
-#### Data Processing
-- `processors.py`: Contains utilities for data formatting and styling
-- `data_loader.py`: Functions for loading and processing metric data
-
-#### Styling
-- `theme.py`: Base theme definitions and color schemes
-- `components.py`: Styling for UI components (buttons, cards, etc.)
-- `tables.py`: Styling for tables and data displays
-- `base.py`: Combines all styles for application-wide use
-
-#### Configuration
-- `config.py`: Contains all configuration settings, including themes, metrics, and model categorizations
-
-## Benefits of Modular Architecture
-
-The modular structure provides several advantages:
-
-1. **Improved Code Organization**: Code is logically separated by functionality
-2. **Better Separation of Concerns**: Each module has a clear, single responsibility
-3. **Enhanced Maintainability**: Changes to one aspect don't require modifying the entire codebase
-4. **Simplified Testing**: Components can be tested independently
-5. **Easier Collaboration**: Multiple developers can work on different parts simultaneously
-6. **Cleaner Entry Point**: The main app file is simple and focused
-
 ## Installation & Setup
 
 1. Clone the repository
-
 
-2.
 ```bash
 pip install -r requirements.txt
 ```
 
@@ -108,7 +36,14 @@ The modular structure provides several advantages:
 streamlit run app.py
 ```
 
-
 
 ### Adding New Metrics
 
@@ -156,57 +91,6 @@ To add new model types:
 }
 ```
 
-### Modifying the UI Theme
-
-To change the theme colors:
-
-1. Update the `dark_theme` dictionary in `src/utils/config.py`
-
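For orientation, the `dark_theme` dictionary might look like the sketch below. The keys and color values here are assumptions for illustration; check `src/utils/config.py` for the real ones.

```python
# Hypothetical shape of dark_theme in src/utils/config.py; the actual
# keys and hex colors in the repo may differ.
dark_theme = {
    "background_color": "#0e1117",
    "text_color": "#fafafa",
    "positive_color": "#21c354",   # used by conditional formatting
    "negative_color": "#ff4b4b",
}
```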
-### Adding New Components
-
-To add new visualization components:
-
-1. Create a new file in the `src/components/` directory
-2. Import and use the component in `src/app.py`
-
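A minimal sketch of those two steps. The file name `summary.py` and the function are hypothetical, and the Streamlit rendering calls are elided so the sketch stays self-contained.

```python
# Step 1: hypothetical src/components/summary.py. In a real component the
# returned line would be rendered with Streamlit (e.g. st.subheader);
# those calls are elided here so the sketch runs standalone.
def render_summary(scores: dict) -> str:
    # Pick the best-scoring model and summarize the comparison.
    best = max(scores, key=scores.get)
    return f"{len(scores)} models; best: {best}"

# Step 2: src/app.py would then import and call it:
#   from components.summary import render_summary
#   render_summary(current_scores)
line = render_summary({"model-a": 0.2, "model-b": 0.4})
```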
-## Data Format
-
-The application uses JSON files for metric data. The expected format is:
-
-```json
-{
-  "task-name": {
-    "model-name-1": value,
-    "model-name-2": value
-  },
-  "another-task": {
-    "model-name-1": value,
-    "model-name-2": value
-  }
-}
-```
-
-## Testing
-
-This modular structure makes it easier to write focused unit tests:
-
-```python
-# Example test for data_loader.py
-def test_process_data():
-    test_data = {"task": {"model": 0.5}}
-    df = process_data(test_data)
-    assert "Task" in df.columns
-    assert df.loc["model", "Task"] == 0.5
-```
-
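The test above implies a `process_data` that pivots the metric JSON into a models-by-tasks DataFrame. A plausible sketch that satisfies it, not necessarily the repo's actual implementation:

```python
import pandas as pd

# Plausible sketch of process_data from src/utils/data_loader.py; the
# real implementation may differ. Pivots {task: {model: value}} into a
# DataFrame indexed by model, with title-cased task columns.
def process_data(metric_data: dict) -> pd.DataFrame:
    df = pd.DataFrame(metric_data)  # columns = tasks, index = models
    df.columns = [c.replace("-", " ").title() for c in df.columns]
    return df

df = process_data({"task": {"model": 0.5}})
```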
 ## License
 
 [MIT License](LICENSE)
-
-## Contributing
-
-Contributions are welcome! Please feel free to submit a Pull Request.
-
-## Contact
-
-For any questions or feedback, please contact [[email protected]](mailto:[email protected]).
 ## Installation & Setup
 
 1. Clone the repository
+```bash
+git clone https://huggingface.co/spaces/launch/MLRC_Bench
+cd MLRC_Bench
+```
 
+2. Set up a virtual environment and install the required dependencies
 ```bash
+python -m venv env
+source env/bin/activate
 pip install -r requirements.txt
 ```
 
 streamlit run app.py
 ```
 
+### Updating Metrics
+
+To update the table, update the corresponding metric file in the `src/data/metrics` directory.
+
+### Updating Text
+
+To update the Benchmark details tab, edit `src/components/tasks.py`.
+To update the metric definitions, edit `src/components/tasks.py`.
 
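The metric update described above amounts to editing one JSON entry. A sketch of that edit using a temporary copy of the file (the file name reuse, task, model, and scores are illustrative):

```python
import json
import os
import tempfile

# Illustrative update of a metric file shaped like those in
# src/data/metrics; the task, model, and scores here are made up.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "margin_to_human.json")
    with open(path, "w") as f:
        json.dump({"example-task": {"example-model": 0.10}}, f)

    # Load, change one score, write back.
    with open(path) as f:
        metrics = json.load(f)
    metrics["example-task"]["example-model"] = 0.15
    with open(path, "w") as f:
        json.dump(metrics, f, indent=2)

    with open(path) as f:
        updated = json.load(f)
```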
 ### Adding New Metrics
 
 }
 ```
 
 ## License
 
 [MIT License](LICENSE)