Armeddinosaur committed on
Commit ed2eb44 · 1 Parent(s): df20b20

Adding MLRC Bench

.gitignore ADDED
@@ -0,0 +1,16 @@
+ env/
+ sample_leaderboard/
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .pytest_cache/
+ .DS_Store
+ .env
+ .venv
+ htmlcov/
+ .mypy_cache/
+ .ruff_cache/
+ .vscode/
+ .idea/
+ .vscode/
Assests/MLRC_Bench_overview.png ADDED
Factbench_logo.png DELETED
Binary file (305 kB)
 
README.md CHANGED
@@ -1,8 +1,8 @@
  ---
- title: VeriFact
- emoji: 📈
- colorFrom: blue
- colorTo: gray
  sdk: streamlit
  sdk_version: 1.39.0
  app_file: app.py
@@ -10,4 +10,209 @@ pinned: false
  license: cc-by-4.0
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: Model Capability Leaderboard
+ emoji: 📊
+ colorFrom: green
+ colorTo: blue
  sdk: streamlit
  sdk_version: 1.39.0
  app_file: app.py
  pinned: false
  license: cc-by-4.0
  ---

+ # Model Capability Leaderboard
+
+ A modern, interactive dashboard for comparing the performance of different AI models across Machine Learning Research Challenges.
+
+ ![Model Capability Leaderboard](https://via.placeholder.com/800x400?text=Model+Capability+Leaderboard)
+
+ ## Overview
+
+ This application provides a visual leaderboard for comparing AI model performance on challenging Machine Learning Research Competition problems. It uses Streamlit to create an interactive web interface with filtering options, allowing users to select specific models and tasks for comparison.
+
+ The leaderboard uses the MLRC-BENCH benchmark, which measures what percentage of the gap between a baseline and the top human solution an agent can close. Success is defined as closing at least 5% of the margin by which the top human solution surpasses the baseline.
+
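Editorial note: the metric and success rule above reduce to a one-line computation. A minimal sketch (not part of this commit; the helper name is ours), using the worked numbers from the metric definition shipped in `src/components/leaderboard.py`:

```python
def margin_to_human(agent: float, baseline: float, top_human: float) -> float:
    """Percentage of the baseline-to-top-human gap closed by the agent."""
    return 100 * (agent - baseline) / (top_human - baseline)

# Worked example from this commit's metric definition:
# baseline 100, top human 200, agent 110 -> 10% of the gap closed.
gap_closed = margin_to_human(agent=110, baseline=100, top_human=200)
assert gap_closed == 10.0

# Success rule from the README above: close at least 5% of the gap.
is_success = gap_closed >= 5.0
assert is_success
```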
+ ### Key Features
+
+ - **Interactive Filtering**: Select specific model types and tasks to focus on
+ - **Customizable Metrics**: Compare models using "Margin to Human" performance scores
+ - **Hierarchical Table Display**: Fixed columns with a scrollable metrics section
+ - **Conditional Formatting**: Visual indicators for positive/negative values
+ - **Model Type Color Coding**: Different colors for Open Source, Open Weights, and Closed Source models
+ - **Medal Indicators**: Top-ranked models receive gold, silver, and bronze medals
+ - **Task Descriptions**: Detailed explanations of what each task measures
+
+ ## Project Structure
+
+ The codebase follows a modular architecture for improved maintainability and separation of concerns:
+
+ ```
+ app.py (main entry point)
+ ├── requirements.txt
+ └── src/
+     ├── app.py (main application logic)
+     ├── components/
+     │   ├── header.py (header and footer components)
+     │   ├── filters.py (filter selection components)
+     │   ├── leaderboard.py (leaderboard table component)
+     │   └── tasks.py (task descriptions component)
+     ├── data/
+     │   ├── processors.py (data processing utilities)
+     │   └── metrics/
+     │       └── margin_to_human.json (metric data file)
+     ├── styles/
+     │   ├── base.py (combined styles)
+     │   ├── components.py (component styling)
+     │   ├── tables.py (table-specific styling)
+     │   └── theme.py (theme definitions)
+     └── utils/
+         ├── config.py (configuration settings)
+         └── data_loader.py (data loading utilities)
+ ```
+
+ ### Module Descriptions
+
+ #### Core Files
+ - `app.py` (root): Simple entry point that imports and calls the main function
+ - `src/app.py`: Main application logic; coordinates the overall flow
+
+ #### Components
+ - `header.py`: Manages the page header, section headers, and footer components
+ - `filters.py`: Handles metric, task, and model type selection interfaces
+ - `leaderboard.py`: Renders the custom HTML leaderboard table
+ - `tasks.py`: Renders the task descriptions section
+
+ #### Data Processing
+ - `processors.py`: Utilities for data formatting and styling
+ - `data_loader.py`: Functions for loading and processing metric data
+
+ #### Styling
+ - `theme.py`: Base theme definitions and color schemes
+ - `components.py`: Styling for UI components (buttons, cards, etc.)
+ - `tables.py`: Styling for tables and data displays
+ - `base.py`: Combines all styles for application-wide use
+
+ #### Configuration
+ - `config.py`: All configuration settings, including themes, metrics, and model categorizations
+
+ ## Benefits of Modular Architecture
+
+ The modular structure provides several advantages:
+
+ 1. **Improved Code Organization**: Code is logically separated by functionality
+ 2. **Better Separation of Concerns**: Each module has a clear, single responsibility
+ 3. **Enhanced Maintainability**: Changes to one aspect don't require modifying the entire codebase
+ 4. **Simplified Testing**: Components can be tested independently
+ 5. **Easier Collaboration**: Multiple developers can work on different parts simultaneously
+ 6. **Cleaner Entry Point**: The main app file stays simple and focused
+
+ ## Installation & Setup
+
+ 1. Clone the repository
+ ```bash
+ git clone <repository-url>
+ cd model-capability-leaderboard
+ ```
+
+ 2. Install the required dependencies
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ 3. Run the application
+ ```bash
+ streamlit run app.py
+ ```
+
+ ## Extending the Application
+
+ ### Adding New Metrics
+
+ To add a new metric:
+
+ 1. Create a new JSON data file in the `src/data/metrics/` directory (e.g., `src/data/metrics/new_metric.json`)
+
+ 2. Update `metrics_config` in `src/utils/config.py`:
+ ```python
+ metrics_config = {
+     "Margin to Human": { ... },
+     "New Metric Name": {
+         "file": "src/data/metrics/new_metric.json",
+         "description": "Description of the new metric",
+         "min_value": 0,
+         "max_value": 100,
+         "color_map": "viridis"
+     }
+ }
+ ```
+
+ 3. Ensure your metric JSON file follows the same format as existing metrics (a validation sketch follows the example):
+ ```json
+ {
+     "task-name": {
+         "model-name-1": value,
+         "model-name-2": value
+     },
+     "another-task": {
+         "model-name-1": value,
+         "model-name-2": value
+     }
+ }
+ ```
+
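A small editorial sketch (the function name is ours, not part of this commit) that checks a new metric file against the format described in step 3:

```python
import json

def validate_metric_file(path: str) -> None:
    """Check that a metric JSON file maps task -> {model -> numeric value}."""
    with open(path) as f:
        data = json.load(f)
    assert isinstance(data, dict), "top level must be an object keyed by task"
    for task, models in data.items():
        assert isinstance(models, dict), f"{task}: expected a model->value mapping"
        for model, value in models.items():
            assert isinstance(value, (int, float)), f"{task}/{model}: value must be numeric"

validate_metric_file("src/data/metrics/margin_to_human.json")
```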
+ ### Adding New Model Types
+
+ To add new model types:
+
+ 1. Update `model_categories` in `src/utils/config.py`:
+ ```python
+ model_categories = {
+     "Existing Model": "Category",
+     "New Model Name": "New Category"
+ }
+ ```
+
+ ### Modifying the UI Theme
+
+ To change the theme colors:
+
+ 1. Update the `dark_theme` dictionary in `src/utils/config.py` (a hypothetical shape is sketched below)
+
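`src/utils/config.py` is not included in this commit view. For orientation, here is a hypothetical shape for `dark_theme`: the keys are the ones the style modules in this commit actually reference, while every color value below is a placeholder assumption, not the project's actual palette.

```python
# Illustrative only: keys taken from the style modules in this diff,
# values are placeholder assumptions.
dark_theme = {
    "bg_color": "#0a0a0a",
    "text_color": "#e2e8f0",
    "heading_color": "#f8fafc",
    "card_bg": "#111111",
    "border": "#333333",
    "hover": "#1e1e1e",
    "primary": "#3b82f6",
    "secondary": "#22c55e",
    "gradient": "linear-gradient(90deg, #1e3a8a, #3b82f6)",
    "title_color": "#ffffff",
    "subtitle_color": "#e2e8f0",
    "task_border": "#3b82f6",
    "task_title": "#93c5fd",
    "info_bg": "#0c4a6e",
    "info_border": "#38bdf8",
    "warning_bg": "#451a03",
    "warning_border": "#f59e0b",
    "footer_color": "#94a3b8",
    "footer_border": "#1e293b",
    "table_header": "#1e1e1e",
    "table_border": "#333333",
}
```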
+ ### Adding New Components
+
+ To add new visualization components:
+
+ 1. Create a new file in the `src/components/` directory
+ 2. Import and use the component in `src/app.py`
+
+ ## Data Format
+
+ The application uses JSON files for metric data. The expected format is:
+
+ ```json
+ {
+     "task-name": {
+         "model-name-1": value,
+         "model-name-2": value
+     },
+     "another-task": {
+         "model-name-1": value,
+         "model-name-2": value
+     }
+ }
+ ```
+
+ ## Testing
+
+ This modular structure makes it easier to write focused unit tests:
+
+ ```python
+ # Example test for data_loader.py
+ def test_process_data():
+     test_data = {"task": {"model": 0.5}}
+     df = process_data(test_data)
+     assert "Task" in df.columns
+     assert df.loc["model", "Task"] == 0.5
+ ```
+
+ ## License
+
+ [MIT License](LICENSE)
+
+ ## Contributing
+
+ Contributions are welcome! Please feel free to submit a Pull Request.
+
+ ## Contact
+
+ For any questions or feedback, please contact [[email protected]](mailto:[email protected]).
app.py CHANGED
@@ -1,347 +1,8 @@
- import streamlit as st
- import pandas as pd
- from PIL import Image
- import base64
- from io import BytesIO
-
- # Set up page config
- st.set_page_config(
-     page_title="VeriFact Leaderboard",
-     layout="wide"
- )
-
- # load header
- with open("_header.md", "r") as f:
-     HEADER_MD = f.read()
-
- # Load the image
- image = Image.open("verifact_steps.png")
- logo_image = Image.open("verifact_logo.png")
-
- # Custom CSS for the page
- st.markdown(
-     """
-     <style>
-     @import url('https://fonts.googleapis.com/css2?family=Courier+Prime:wght@400&display=swap');
-
-     html, body, [class*="css"] {
-         font-family: 'Arial', sans-serif; /* or use a similar sans-serif font */
-         background-color: #f9f9f9; /* Light grey background */
-     }
-
-     .title {
-         font-size: 42px;
-         font-weight: bold;
-         text-align: center;
-         color: #333;
-         margin-bottom: 5px;
-     }
-
-     .description {
-         font-size: 22px;
-         text-align: center;
-         margin-bottom: 30px;
-         color: #555;
-     }
-
-     .header, .metric {
-         align-items: left;
-         font-family: 'Arial', sans-serif; /* or use a similar sans-serif font */
-         margin-bottom: 20px;
-     }
-
-     .container {
-         max-width: 1000px;
-         margin: 0 auto;
-         padding: 5px;
-     }
-
-     table {
-         width: 100%;
-         border-collapse: collapse;
-         border-radius: 10px;
-         overflow: hidden;
-     }
-
-     th, td {
-         padding: 8px;
-         text-align: center;
-         border: 1px solid #ddd;
-         font-family: 'Arial', sans-serif; /* or use a similar sans-serif font */
-         font-size: 16px;
-         transition: background-color 0.3s;
-     }
-
-     th {
-         background-color: #f2f2f2;
-         font-weight: bold;
-     }
-
-     td:hover {
-         background-color: #eaeaea;
-     }
-     </style>
-     """,
-     unsafe_allow_html=True
- )
-
- # Display title and description
- st.markdown('<div class="container">', unsafe_allow_html=True)
- # st.image(logo_image, output_format="PNG", width=200)
-
- # Convert the image to base64
- buffered = BytesIO()
- logo_image.save(buffered, format="PNG")
- img_data = base64.b64encode(buffered.getvalue()).decode("utf-8")
- st.markdown(
-     f"""
-     <style>
-     .logo-container {{
-         display: flex;
-         justify-content: flex-start; /* Aligns to the left */
-     }}
-     .logo-container img {{
-         width: 50%; /* Adjust this to control the width, e.g., 50% of container width */
-         margin: 0 auto;
-         max-width: 700px; /* Set a maximum width */
-         background-color: transparent;
-     }}
-     </style>
-     <div class="logo-container">
-         <img src="data:image/png;base64,{img_data}" alt="VeriFact Leaderboard Logo">
-     </div>
-     """,
-     unsafe_allow_html=True
- )
-
- # header_md_text = HEADER_MD  # make some parameters later
- # gr.Markdown(header_md_text, elem_classes="markdown-text")
-
- st.markdown(
-     '''
-     <div class="header">
-         <br/>
-         <p style="font-size:22px;">
-             VERIFACT: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts
-         </p>
-         <p style="font-size:20px;">
-             # 📑 <a href="">Paper</a> | 💻 <a href="">GitHub</a> | 🤗 <a href="">HuggingFace</a>
-             ⚙️ <strong>Version</strong>: <strong>V1</strong> | <strong># Models</strong>: 8 | Updated: <strong>???</strong>
-         </p>
-     </div>
-     ''',
-     unsafe_allow_html=True
- )
-
-
- # st.markdown('<div class="title">VeriFact Leaderboard</div>',
- #             unsafe_allow_html=True)
- # st.markdown('<div class="description">Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts</div>', unsafe_allow_html=True)
- st.markdown('</div>', unsafe_allow_html=True)
-
- # Load the data
- data_path = "verifact_data.csv"
- df = pd.read_csv(data_path)
-
- # Assign ranks within each tier based on factuality_score
- df['rank'] = df.groupby('tier')['Overall'].rank(
-     ascending=False, method='min').astype(int)
-
- # Replace NaN values with '-'
- df.fillna('-', inplace=True)
-
- df['original_order'] = df.groupby('tier').cumcount()
-
- # Create tabs
- st.markdown("""
-     <style>
-     .stTabs [data-baseweb="tab-list"] button [data-testid="stMarkdownContainer"] p {
-         font-size: 20px;
-     }
-     </style>
- """, unsafe_allow_html=True)
-
- tab1, tab2 = st.tabs(["Leaderboard", "Benchmark Details"])
-
- # Tab 1: Leaderboard
- with tab1:
-     # df['original_order'] = df.groupby('tier').cumcount()
-     # print(df['original_order'])
-
-     # st.markdown('<div class="title">Leaderboard</div>', unsafe_allow_html=True)
-     st.markdown('<div class="tab-content">', unsafe_allow_html=True)
-
-     st.markdown("""
-         <div class="metric" style="font-size:20px; font-weight: bold;">
-             Metrics Explanation
-         </div>
-     """, unsafe_allow_html=True)
-
-     st.markdown("""
-         <div class="metric" style="font-size:16px;">
-             <br/>
-             <p>
-                 <strong> 🎯 Factual Precision </strong> measures the ratio of supported units to all units, averaged over model responses. <strong> 🌀 Hallucination Score </strong> quantifies the incorrect or inconclusive contents within a model response, as described in the paper. We also provide statistics on the average length of the response in terms of the number of tokens, the average verifiable units existing in the model responses (<strong>Avg. # Units</strong>), the average number of units labelled as undecidable (<strong>Avg. # Undecidable</strong>), and the average number of units labelled as unsupported (<strong>Avg. # Unsupported</strong>).
-             </p>
-             <p>
-                 🔒 for closed LLMs; 🔑 for open-weights LLMs; 🚨 for newly added models
-             </p>
-         </div>
-     """,
-     unsafe_allow_html=True
-     )
-
-     st.markdown("""
-         <style>
-         /* Selectbox text */
-         div[data-baseweb="select"] > div {
-             font-size: 20px;
-         }
-
-         /* Dropdown options */
-         div[role="listbox"] ul li {
-             font-size: 20px !important;
-         }
-
-         /* Checkbox label */
-         .stCheckbox label p {
-             font-size: 20px !important;
-         }
-
-         /* Selectbox label */
-         .stSelectbox label p {
-             font-size: 20px !important;
-         }
-         </style>
-     """, unsafe_allow_html=True)
-
-     # Dropdown menu to filter tiers
-     tiers = ['All Metrics', 'Precision', 'Recall', 'F1']
-     selected_tier = st.selectbox('Select metric:', tiers)
-
-     # Filter the data based on the selected tier
-     if selected_tier != 'All Metrics':
-         filtered_df = df[df['tier'] == selected_tier]
-     else:
-         filtered_df = df
-
-     sort_by_factuality = st.checkbox('Sort by overall score')
-
-     # Sort the dataframe based on Factuality Score if the checkbox is selected
-     if sort_by_factuality:
-         updated_filtered_df = filtered_df.sort_values(
-             by=['tier', 'Overall'], ascending=[True, False]
-         )
-     else:
-         updated_filtered_df = filtered_df.sort_values(
-             by=['tier', 'original_order']
-         )
-
-     # Create HTML for the table
-     if selected_tier == 'All Metrics':
-         html = '''
-         <table>
-             <thead>
-                 <tr>
-                     <th>Metric</th>
-                     <th>Rank</th>
-                     <th>Model</th>
-                     <th>Factbench</th>
-                     <th>Reddit</th>
-                     <th>Overall</th>
-                 </tr>
-             </thead>
-             <tbody>
-         '''
-     else:
-         html = '''
-         <table>
-             <thead>
-                 <tr>
-                     <th>Rank</th>
-                     <th>Model</th>
-                     <th>Factbench</th>
-                     <th>Reddit</th>
-                     <th>Overall</th>
-                 </tr>
-             </thead>
-             <tbody>
-         '''
-
-     # Generate the rows of the table
-     current_tier = None
-     for i, row in updated_filtered_df.iterrows():
-         html += '<tr>'
-
-         # Only display the 'Metric' column if 'All Metrics' is selected
-         if selected_tier == 'All Metrics':
-             if row['tier'] != current_tier:
-                 current_tier = row['tier']
-                 html += f'<td rowspan="8" style="vertical-align: middle;">{current_tier}</td>'
-
-         # Fill in model and scores
-         html += f'''
-             <td>{row['rank']}</td>
-             <td>{row['model']}</td>
-             <td>{row['FactBench']}</td>
-             <td>{row['Reddit']}</td>
-             <td>{row['Overall']}</td>
-         </tr>
-         '''
-
-     # Close the table
-     html += '''
-         </tbody>
-     </table>
-     '''
-
-     # Display the table
-     st.markdown(html, unsafe_allow_html=True)
-
-     st.markdown('</div>', unsafe_allow_html=True)
-
- # Tab 2: Details
- with tab2:
-     st.markdown('<div class="tab-content">', unsafe_allow_html=True)
-
-     # st.markdown('<div class="title"></div>',
-     #             unsafe_allow_html=True)
-     st.image(image, use_column_width=True)
-
-     st.markdown('### VERIFY: A Pipeline for Factuality Evaluation')
-     st.write(
-         "Language models (LMs) are widely used by an increasing number of users, "
-         "underscoring the challenge of maintaining factual accuracy across a broad range of topics. "
-         "We present VERIFY (Verification and Evidence Retrieval for Factuality evaluation), "
-         "a pipeline to evaluate LMs' factual accuracy in real-world user interactions."
-     )
-
-     st.markdown('### Content Categorization')
-     st.write(
-         "VERIFY considers the verifiability of LM-generated content and categorizes content units as "
-         "`supported`, `unsupported`, or `undecidable` based on the retrieved web evidence. "
-         "Importantly, VERIFY's factuality judgments correlate better with human evaluations than existing methods."
-     )
-
-     st.markdown('### Hallucination Prompts & FactBench Dataset')
-     st.write(
-         "Using VERIFY, we identify 'hallucination prompts' across diverse topics, those eliciting the highest rates of "
-         "incorrect or unverifiable LM responses. These prompts form FactBench, a dataset of 985 prompts across 213 "
-         "fine-grained topics. Our dataset captures emerging factuality challenges in real-world LM interactions and is "
-         "regularly updated with new prompts."
-     )
-
-     st.markdown('</div>', unsafe_allow_html=True)
-
- # # Tab 3: Links
- # with tab3:
- #     st.markdown('<div class="tab-content">', unsafe_allow_html=True)
-
- #     st.markdown('<div class="title">Submit your model information on our Github</div>',
- #                 unsafe_allow_html=True)
-
- #     st.markdown(
- #         '[Test your model locally!](https://github.com/FarimaFatahi/FactEval)')
- #     st.markdown(
- #         '[Submit results or issues!](https://github.com/FarimaFatahi/FactEval/issues/new)')
-
- #     st.markdown('</div>', unsafe_allow_html=True)
 
+ """
+ Entry point for the Model Capability Leaderboard application.
+ This file serves as a simple wrapper for the main application code in src/app.py.
+ """
+ from src.app import main
+
+ if __name__ == "__main__":
+     main()
factEvalSteps.pdf DELETED
Binary file (983 kB)
 
factEvalSteps.png DELETED
Git LFS Details
  • SHA256: 6d8b3690f826eef47af41c8d80d141ccdb2c8045d76e39c6f2207c99e8e6e35d
  • Pointer size: 132 Bytes
  • Size of remote file: 2.16 MB
factbench_data.csv DELETED
@@ -1,13 +0,0 @@
- Tier,Model,FactScore,SAFE,Factcheck-GPT,VERIFY
- Tier 1: Easy,GPT4-o,53.19,63.31,86.4,71.58
- Tier 1: Easy,Gemini1.5-Pro,51.79,61.24,83.45,69.38
- Tier 1: Easy,Llama3.1-70B-Instruct,52.49,61.29,83.48,67.27
- Tier 1: Easy,Llama3.1-405B-Instruct,53.22,61.63,83.57,64.94
- Tier 2: Moderate,GPT4-o,54.76,65.01,89.39,76.02
- Tier 2: Moderate,Gemini1.5-Pro,52.62,62.68,87.44,74.24
- Tier 2: Moderate,Llama3.1-70B-Instruct,52.53,62.64,85.16,72.01
- Tier 2: Moderate,Llama3.1-405B-Instruct,53.48,63.29,86.37,70.25
- Tier 3: Hard,GPT4-o,69.44,76.17,94.25,90.58
- Tier 3: Hard,Gemini1.5-Pro,66.05,75.69,91.09,87.82
- Tier 3: Hard,Llama3.1-70B-Instruct,69.85,77.55,92.89,86.63
- Tier 3: Hard,Llama3.1-405B-Instruct,70.04,77.01,93.64,85.79
requirements.py → requirements.txt RENAMED
@@ -1,3 +1,3 @@
  pandas
  streamlit
- scikit-learn == 1.0.2

  pandas
  streamlit
+ scikit-learn
src/app.py ADDED
@@ -0,0 +1,95 @@
+ """
+ Main entry point for the Model Capability Leaderboard application.
+ """
+ import streamlit as st
+
+ # Import configuration
+ from src.utils.config import app_config, metrics_config
+
+ # Import data functions
+ from src.utils.data_loader import (
+     load_metric_data,
+     process_data,
+     filter_and_prepare_data,
+     format_display_dataframe
+ )
+
+ # Import styles
+ from src.styles.base import load_all_styles
+
+ # Import components
+ from src.components.header import render_page_header, render_footer
+ from src.components.filters import (
+     initialize_session_state,
+     render_metric_selection,
+     render_task_selection,
+     render_model_type_selection
+ )
+ from src.components.leaderboard import render_leaderboard_table, render_empty_state
+ from src.components.tasks import render_task_descriptions
+
+ def setup_page():
+     """
+     Set up the Streamlit page configuration
+     """
+     st.set_page_config(
+         page_title=app_config['title'],
+         layout=app_config['layout'],
+         initial_sidebar_state=app_config['initial_sidebar_state']
+     )
+
+     # Load all styles
+     load_all_styles()
+
+ def main():
+     """
+     Main application function
+     """
+     # Set up page
+     setup_page()
+
+     # Render header
+     render_page_header()
+
+     # Load data
+     current_metric = list(metrics_config.keys())[0]
+     metric_data = load_metric_data(metrics_config[current_metric]["file"])
+     df = process_data(metric_data)
+
+     # Initialize session state
+     initialize_session_state(df)
+
+     # Create tabs
+     tabs = st.tabs(["📊 Leaderboard", "📑 Benchmark Details"])
+
+     # Tab 1: Leaderboard
+     with tabs[0]:
+         # Render filter components
+         selected_metric = render_metric_selection()
+         selected_tasks = render_task_selection(df)
+         selected_model_types = render_model_type_selection(df)
+
+         # Render leaderboard if selections are valid
+         if selected_tasks and selected_model_types:
+             # Filter and prepare data
+             filtered_df = filter_and_prepare_data(df, selected_tasks, selected_model_types)
+
+             # Format data for display
+             display_df, metric_columns = format_display_dataframe(filtered_df, selected_tasks)
+
+             # Render the leaderboard table
+             render_leaderboard_table(display_df, metric_columns)
+         else:
+             # Show empty state
+             render_empty_state()
+
+     # Tab 2: Benchmark Details
+     with tabs[1]:
+         # Render task descriptions
+         render_task_descriptions()
+
+     # Render footer
+     render_footer()
+
+ if __name__ == "__main__":
+     main()
src/components/filters.py ADDED
@@ -0,0 +1,117 @@
+ """
+ Filter components for the leaderboard application.
+ """
+ import streamlit as st
+ from src.utils.config import metrics_config
+
+ def initialize_session_state(df):
+     """
+     Initialize the session state for filters
+
+     Args:
+         df (pandas.DataFrame): The DataFrame with model data
+     """
+     # Initialize session states
+     if 'selected_metric' not in st.session_state:
+         st.session_state.selected_metric = list(metrics_config.keys())[0]
+
+     if 'selected_tasks' not in st.session_state:
+         # Default to the first 3 tasks, excluding Model Type
+         st.session_state.selected_tasks = [col for col in df.columns if col not in ['Model Type']][:3]
+
+     if 'selected_model_types' not in st.session_state:
+         # Ensure all model types are selected by default
+         st.session_state.selected_model_types = list(df['Model Type'].unique())
+
+ def render_metric_selection():
+     """
+     Render the metric selection component
+
+     Returns:
+         str: Selected metric
+     """
+     st.markdown("### Select Metric")
+
+     # Create compact metric buttons with clear selection indicators
+     metric_cols = st.columns(len(metrics_config))
+     for i, metric in enumerate(metrics_config.keys()):
+         with metric_cols[i]:
+             is_selected = st.session_state.selected_metric == metric
+             button_label = f"✓ {metric}" if is_selected else metric
+             button_type = "primary" if is_selected else "secondary"
+
+             if st.button(button_label, key=f"metric_{metric}", type=button_type):
+                 st.session_state.selected_metric = metric
+                 st.rerun()  # Force UI update
+
+     return st.session_state.selected_metric
+
+ def render_task_selection(df):
+     """
+     Render the task selection component
+
+     Args:
+         df (pandas.DataFrame): The DataFrame with model data
+
+     Returns:
+         list: Selected tasks
+     """
+     st.markdown("### Select Tasks")
+
+     # Extract task columns (exclude Model Type)
+     all_tasks = [col for col in df.columns if col not in ['Model Type']]
+
+     # Create task buttons in rows of 3
+     num_cols = 3
+     task_rows = [all_tasks[i:i+num_cols] for i in range(0, len(all_tasks), num_cols)]
+
+     for row in task_rows:
+         cols = st.columns(num_cols)
+         for i, task in enumerate(row):
+             with cols[i]:
+                 is_selected = task in st.session_state.selected_tasks
+                 button_label = f"✓ {task}" if is_selected else task
+                 button_type = "primary" if is_selected else "secondary"
+
+                 if st.button(button_label, key=f"task_{task}", type=button_type):
+                     if is_selected:
+                         st.session_state.selected_tasks.remove(task)
+                     else:
+                         st.session_state.selected_tasks.append(task)
+                     st.rerun()  # Force UI update
+
+     return st.session_state.selected_tasks
+
+ def render_model_type_selection(df):
+     """
+     Render the model type selection component
+
+     Args:
+         df (pandas.DataFrame): The DataFrame with model data
+
+     Returns:
+         list: Selected model types
+     """
+     st.markdown("### Select Model Types")
+
+     # Create model type buttons
+     model_types = df['Model Type'].unique().tolist()
+     model_type_cols = st.columns(len(model_types))
+
+     for i, model_type in enumerate(model_types):
+         with model_type_cols[i]:
+             is_selected = model_type in st.session_state.selected_model_types
+             button_label = f"✓ {model_type}" if is_selected else model_type
+             button_type = "primary" if is_selected else "secondary"
+
+             if st.button(button_label, key=f"model_type_{model_type}", type=button_type):
+                 if is_selected:
+                     # Prevent deselecting all model types - ensure at least one remains selected
+                     if len(st.session_state.selected_model_types) > 1:
+                         st.session_state.selected_model_types.remove(model_type)
+                 else:
+                     st.session_state.selected_model_types.append(model_type)
+                 st.rerun()  # Force UI update
+
+     return st.session_state.selected_model_types
src/components/header.py ADDED
@@ -0,0 +1,41 @@
+ """
+ Header components for the leaderboard application.
+ """
+ import streamlit as st
+ from src.utils.config import app_config
+
+ def render_page_header():
+     """
+     Render the page header with title and description
+     """
+     st.markdown(
+         f"""
+         <div class="title-container">
+             <h1 class="title">{app_config['title']}</h1>
+             <p class="subtitle">{app_config['description']}</p>
+         </div>
+         """,
+         unsafe_allow_html=True
+     )
+
+ def render_section_header(title):
+     """
+     Render a section header
+
+     Args:
+         title (str): The section title
+     """
+     st.markdown(f"### {title}")
+
+ def render_footer():
+     """
+     Render the page footer
+     """
+     st.markdown(
+         """
+         <div class="footer">
+             <p>© 2023 Model Capability Leaderboard • Made with Streamlit • Contact: [email protected]</p>
+         </div>
+         """,
+         unsafe_allow_html=True
+     )
src/components/leaderboard.py ADDED
@@ -0,0 +1,128 @@
+ """
+ Leaderboard table components for the leaderboard application.
+ """
+ import streamlit as st
+ from src.data.processors import get_model_type_style, get_rank_style
+
+ def render_leaderboard_table(display_df, metric_columns):
+     """
+     Render the custom HTML leaderboard table
+
+     Args:
+         display_df (pandas.DataFrame): The DataFrame with the display data
+         metric_columns (list): List of metric column names
+     """
+     from src.components.header import render_section_header
+
+     # Display the model ranking header without the box
+     render_section_header("Model Rankings")
+
+     # Start building the HTML table structure
+     html_table = """
+     <div class="fixed-table-container">
+         <div class="scroll-container">
+             <table class="fixed-table">
+                 <thead>
+                     <tr class="header-row">
+                         <th class="fixed-column first-fixed-column" rowspan="2">Rank</th>
+                         <th class="fixed-column second-fixed-column" rowspan="2">Model + Scaffolding</th>
+                         <th class="model-type-cell" rowspan="2">Model Type</th>
+     """
+
+     # Add the metric header spanning all metric columns
+     html_table += f'<th colspan="{len(metric_columns)}" class="metric-header">Margin To Human</th>'
+
+     # Continue the table structure
+     html_table += """
+                     </tr>
+                     <tr class="sub-header">
+     """
+
+     # Add individual column headers for metrics
+     for col in metric_columns:
+         column_class = "overall-cell" if col == "Metric Average" else "metric-cell"
+         html_table += f'<th class="{column_class}">{col}</th>'
+
+     # Close the header and start the body
+     html_table += """
+                     </tr>
+                 </thead>
+                 <tbody>
+     """
+
+     # Add the data rows
+     for i, (idx, row) in enumerate(display_df.iterrows()):
+         # Alternate background colors to keep rows visually consistent
+         row_bg = "#0a0a0a" if i % 2 == 0 else "#111111"
+
+         # Start the row
+         html_table += '<tr class="table-row">'
+
+         # Add Rank with medal styling and consistent background
+         rank_style = f"background-color: {row_bg};"  # Add row background to fixed columns
+         rank_styles = get_rank_style(row["Rank"])
+         for style_key, style_value in rank_styles.items():
+             rank_style += f"{style_key}: {style_value};"
+
+         html_table += f'<td class="fixed-column first-fixed-column" style="{rank_style}">{row["Rank"]}</td>'
+
+         # Model name fixed column with consistent background
+         html_table += f'<td class="fixed-column second-fixed-column" title="{row["Model Name"]}" style="background-color: {row_bg}; font-weight: 500; overflow: hidden; text-overflow: ellipsis; white-space: nowrap; text-align: center;">{row["Model Name"]}</td>'
+
+         # Model type cell
+         model_type = row["Model Type"]
+         type_style = f"background-color: {row_bg};"
+         model_type_styles = get_model_type_style(model_type)
+         for style_key, style_value in model_type_styles.items():
+             if style_value:
+                 type_style += f"{style_key}: {style_value};"
+
+         html_table += f'<td class="table-cell model-type-cell" style="{type_style}">{model_type}</td>'
+
+         # Add metric values with minimal styling
+         for col in metric_columns:
+             cell_class = "table-cell overall-cell" if col == "Metric Average" else "table-cell metric-cell"
+             value_text = row[col]
+
+             # Simple styling based on positive/negative values
+             try:
+                 value = float(str(row[col]).replace(',', ''))
+                 if value > 0:
+                     cell_class += " positive-value"
+                 elif value < 0:
+                     cell_class += " negative-value"
+             except ValueError:
+                 # Non-numeric cells (e.g. '-') keep the default styling
+                 pass
+
+             html_table += f'<td class="{cell_class}" style="background-color: {row_bg};">{value_text}</td>'
+
+         html_table += "</tr>"
+
+     # Close the table
+     html_table += """
+                 </tbody>
+             </table>
+         </div>
+     </div>
+     """
+
+     # Add the metric definition below the table
+     metric_definition = """
+     <div class="metric-definition">
+         <h4>Margin to Human</h4>
+         <p>This metric measures what percentage of the top-1 human-to-baseline performance gap an agent can close on challenging Machine Learning Research Competition problems. For example, if the baseline is 100, top human performance is 200, and the agent scores 110, the agent has closed 10% of the gap between the baseline and top human performance. Higher percentages indicate models that more effectively approach top human-level research capabilities.</p>
+     </div>
+     """
+
+     # Display the custom HTML table and metric definition
+     st.markdown(html_table + metric_definition, unsafe_allow_html=True)
+
+ def render_empty_state():
+     """
+     Render an empty state when no data is available
+     """
+     st.markdown("""
+     <div class="warning-box">
+         <strong>No data to display.</strong> Please select at least one task and one model type to view the data.
+     </div>
+     """, unsafe_allow_html=True)
src/components/tasks.py ADDED
@@ -0,0 +1,93 @@
+ """
+ Task description components for the leaderboard application.
+ """
+ import streamlit as st
+ from src.utils.config import tasks_info
+
+ def render_task_descriptions():
+     """
+     Render the benchmark details section
+     """
+     # Display the MLRC-BENCH overview image
+     st.image("Assests/MLRC_Bench_overview.png", use_column_width=True)
+
+     # Display the MLRC-BENCH information
+     st.markdown("""
+     # MLRC-BENCH: Can Language Agents Crack ML Research Challenges?
+
+     Recent advances in large language models (LLMs) have raised an intriguing question for the machine learning community: can AI agents not only generate novel research ideas but also implement them effectively? A new benchmark, **MLRC-BENCH**, steps into the spotlight to answer this very question.
+
+     ## What Is MLRC-BENCH?
+
+     MLRC-BENCH is a dynamic benchmark designed to objectively evaluate whether LLM-based research agents can tackle cutting-edge ML competition tasks. Unlike previous evaluations that either focused on end-to-end paper generation or narrow engineering challenges, this benchmark splits the research workflow into two core steps:
+     - **Idea Proposal:** Generating innovative research ideas.
+     - **Code Implementation:** Translating those ideas into working, performance-improving code.
+
+     The benchmark uses tasks sourced from recent ML conferences and workshops, ensuring the problems are both impactful and non-trivial.
+
+     ## How Does It Work?
+
+     MLRC-BENCH emphasizes **objective metrics**:
+     - **Success Rate:** An agent is deemed successful if its solution improves upon a baseline by at least 5% of the margin by which the top human solution surpasses that baseline.
+     - **Performance, Efficiency & Simplicity:** Each solution is measured not only by how well it performs but also by how efficient and simple the code is. For example, an ideal solution should achieve higher performance with minimal runtime and code complexity.
+
+     Additionally, the benchmark integrates **LLM-as-a-judge evaluations** to compare subjective assessments of idea novelty with the objective performance gains. Interestingly, the study reveals a weak correlation between perceived novelty and actual performance improvements.
+
+     ## Why It Matters
+
+     The ability of AI agents to contribute to scientific discovery is both exciting and cautionary. While MLRC-BENCH demonstrates that current agents are not yet ready to match human ingenuity, it also provides a scalable framework to track progress and encourage future innovations. The insights gained from this benchmark could guide the development of safer, more effective AI research tools, particularly in high-stakes fields like healthcare, climate science, and AI safety.
+
+     ## Looking Ahead
+
+     MLRC-BENCH is built to evolve: as new ML competitions emerge, the benchmark can be updated to reflect the latest challenges. This dynamic nature ensures that it remains a relevant tool for pushing the boundaries of AI-assisted scientific research.
+     """)
+
+     st.markdown("""
+     <div class="card">
+         <div class="card-title"><span class="card-title-icon">🔍</span> Tasks in the Benchmark</div>
+         <p style="margin-bottom: 20px;">
+             Click on any task to learn more about the original benchmark.
+         </p>
+     </div>
+     """, unsafe_allow_html=True)
+
+     # Task links mapping
+     task_links = {
+         "Backdoor Trigger Recovery": "https://www.llmagentsafetycomp24.com/tracks/#backdoor_model",
+         "Machine Unlearning": "https://unlearning-challenge.github.io/",
+         "Perception Temporal Action Loc": "https://ptchallenge-workshop.github.io",
+         "Product Recommendation": "https://www.aicrowd.com/challenges/amazon-kdd-cup-23-multilingual-recommendation-challenge",
+         "Meta Learning": "https://metalearning.chalearn.org/",
+         "Llm Merging": "https://llm-merging.github.io"
+     }
+
+     # Create two columns
+     col1, col2 = st.columns(2)
+
+     # Split tasks between the two columns with better styling
+     task_items = list(tasks_info.items())
+     mid_point = len(task_items) // 2
+
+     with col1:
+         for task, description in task_items[:mid_point]:
+             link = task_links.get(task, "#")
+             st.markdown(f"""
+             <a href="{link}" target="_blank" style="text-decoration: none; color: inherit;">
+                 <div class="task-card" style="cursor: pointer; transition: transform 0.2s, box-shadow 0.2s;" onmouseover="this.style.transform='translateY(-5px)'; this.style.boxShadow='0 8px 15px rgba(0, 0, 0, 0.2)';" onmouseout="this.style.transform='translateY(0)'; this.style.boxShadow='0 4px 6px rgba(0, 0, 0, 0.15)';">
+                     <div class="task-title">{task} <span style="font-size: 14px; opacity: 0.7;">🔗</span></div>
+                     <div class="task-description">{description}</div>
+                 </div>
+             </a>
+             """, unsafe_allow_html=True)
+
+     with col2:
+         for task, description in task_items[mid_point:]:
+             link = task_links.get(task, "#")
+             st.markdown(f"""
+             <a href="{link}" target="_blank" style="text-decoration: none; color: inherit;">
+                 <div class="task-card" style="cursor: pointer; transition: transform 0.2s, box-shadow 0.2s;" onmouseover="this.style.transform='translateY(-5px)'; this.style.boxShadow='0 8px 15px rgba(0, 0, 0, 0.2)';" onmouseout="this.style.transform='translateY(0)'; this.style.boxShadow='0 4px 6px rgba(0, 0, 0, 0.15)';">
+                     <div class="task-title">{task} <span style="font-size: 14px; opacity: 0.7;">🔗</span></div>
+                     <div class="task-description">{description}</div>
+                 </div>
+             </a>
+             """, unsafe_allow_html=True)
src/data/metrics/margin_to_human.json ADDED
@@ -0,0 +1,50 @@
+ {
+     "perception_temporal_action_loc": {
+         "MLAB (claude-3-5-sonnet-v2)": 0.7810185077440877,
+         "MLAB (gemini-exp-1206)": -0.4731328246392113,
+         "MLAB (o3-mini)": 0.3066106841553126,
+         "MLAB (gpt-4o)": 0.3298075630252947,
+         "MLAB (llama3-1-405b-instruct)": 0.5183240203504569,
+         "CoI-Agent (o1) + MLAB (gpt-4o)": 0.3475212791527979
+     },
+     "llm-merging": {
+         "CoI-Agent (o1) + MLAB (gpt-4o)": -0.9900989999019761,
+         "MLAB (claude-3-5-sonnet-v2)": 4.950495058915793,
+         "MLAB (gemini-exp-1206)": 4.950495058915793,
+         "MLAB (o3-mini)": -0.9900989999019761,
+         "MLAB (gpt-4o)": 1.9801980295069084,
+         "MLAB (llama3-1-405b-instruct)": -0.9900989999019761
+     },
+     "meta-learning": {
+         "CoI-Agent (o1) + MLAB (gpt-4o)": 1.781401026144938,
+         "MLAB (claude-3-5-sonnet-v2)": 1.781401026144938,
+         "MLAB (gemini-exp-1206)": 1.781401026144938,
+         "MLAB (o3-mini)": -4.900331256476853,
+         "MLAB (gpt-4o)": 1.781401026144938,
+         "MLAB (llama3-1-405b-instruct)": 1.781401026144938
+     },
+     "product-recommendation": {
+         "CoI-Agent (o1) + MLAB (gpt-4o)": 0.1459345029718814,
+         "MLAB (claude-3-5-sonnet-v2)": 2.9771372473170388,
+         "MLAB (gemini-exp-1206)": 0.1459345029718814,
+         "MLAB (o3-mini)": 0.1462759705510577,
+         "MLAB (gpt-4o)": 0.6398666846799662,
+         "MLAB (llama3-1-405b-instruct)": -7.044800459739471e-10
+     },
+     "machine_unlearning": {
+         "CoI-Agent (o1) + MLAB (gpt-4o)": 11.832138969791846,
+         "MLAB (claude-3-5-sonnet-v2)": -94.71778374121965,
+         "MLAB (gemini-exp-1206)": 5.632371576335568,
+         "MLAB (o3-mini)": 3.623856546073656,
+         "MLAB (gpt-4o)": -17.996962489965668,
+         "MLAB (llama3-1-405b-instruct)": 6.2098517833311
+     },
+     "backdoor-trigger-recovery": {
+         "CoI-Agent (o1) + MLAB (gpt-4o)": 6.1572772457753295,
+         "MLAB (claude-3-5-sonnet-v2)": 39.903815022493674,
+         "MLAB (gemini-exp-1206)": 12.94287662739089,
+         "MLAB (o3-mini)": 6.238823700218141,
+         "MLAB (gpt-4o)": 10.386627431983776,
+         "MLAB (llama3-1-405b-instruct)": 11.542228789066877
+     }
+ }
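`src/utils/data_loader.py` appears in the README's project tree and is imported by `src/app.py`, but its diff is not part of this commit view. As a rough editorial sketch only, `process_data` plausibly pivots the task-to-model-to-value mapping above into a DataFrame with models as rows and tasks as columns, since that is the shape the filter components consume:

```python
# Editorial sketch, not the actual implementation from this commit.
import json
import pandas as pd

def load_metric_data(path):
    with open(path) as f:
        return json.load(f)

def process_data(metric_data):
    # Constructing a DataFrame from a task -> {model: value} mapping
    # yields models as the index and tasks as columns.
    df = pd.DataFrame(metric_data)
    # The real implementation appears to do more: the README's example test
    # expects a "Task" column for input key "task" (prettified names), and
    # the components expect a 'Model Type' column, presumably attached from
    # model_categories in src/utils/config.py.
    return df

df = process_data(load_metric_data("src/data/metrics/margin_to_human.json"))
print(df.head())
```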
src/data/processors.py ADDED
@@ -0,0 +1,87 @@
+ """
+ Data processing utilities for the leaderboard application.
+ """
+ import pandas as pd
+ import numpy as np
+
+ def apply_value_formatting(value, is_numeric=True):
+     """
+     Apply formatting to a value based on its properties
+
+     Args:
+         value: The value to format
+         is_numeric (bool): Whether the value is numeric
+
+     Returns:
+         dict: Dictionary with formatting information
+     """
+     if not is_numeric or value == '-':
+         return {'value': value, 'class': ''}
+
+     numeric_value = float(value)
+     if numeric_value > 0:
+         return {'value': value, 'class': 'positive-value'}
+     elif numeric_value < 0:
+         return {'value': value, 'class': 'negative-value'}
+     else:
+         return {'value': value, 'class': ''}
+
+ def get_model_type_style(model_type):
+     """
+     Get styling for different model types
+
+     Args:
+         model_type (str): The model type
+
+     Returns:
+         dict: Dictionary with styling information
+     """
+     if model_type == "Open Source":
+         return {'color': '#4ade80'}  # Brighter green
+     elif model_type == "Open Weights":
+         return {'color': '#93c5fd'}  # Brighter blue
+     elif model_type == "Closed Source":
+         return {'color': '#cbd5e1'}  # Lighter gray
+     else:
+         return {'color': ''}
+
+ def get_rank_style(rank):
+     """
+     Get styling for different ranks
+
+     Args:
+         rank (str): The rank
+
+     Returns:
+         dict: Dictionary with styling information
+     """
+     if "🥇" in str(rank):
+         return {'color': 'gold', 'font-weight': '700', 'font-size': '16px'}
+     elif "🥈" in str(rank):
+         return {'color': 'silver', 'font-weight': '700', 'font-size': '16px'}
+     elif "🥉" in str(rank):
+         return {'color': '#cd7f32', 'font-weight': '700', 'font-size': '16px'}
+     else:
+         return {}
+
+ def calculate_task_statistics(metric_data):
+     """
+     Calculate statistics for each task
+
+     Args:
+         metric_data (dict): Dictionary containing the metric data
+
+     Returns:
+         dict: Dictionary with task statistics
+     """
+     stats = {}
+     for task, models in metric_data.items():
+         values = list(models.values())
+         stats[task] = {
+             'mean': np.mean(values),
+             'median': np.median(values),
+             'min': min(values),
+             'max': max(values),
+             'std': np.std(values)
+         }
+     return stats
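A hypothetical usage example for `calculate_task_statistics`, run against the metric file added in this commit (the snippet is editorial; the task key comes from the JSON above):

```python
import json
from src.data.processors import calculate_task_statistics

with open("src/data/metrics/margin_to_human.json") as f:
    metric_data = json.load(f)

stats = calculate_task_statistics(metric_data)
# e.g. the mean margin-to-human across agents on the LLM-merging task
print(stats["llm-merging"]["mean"])
```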
src/styles/base.py ADDED
@@ -0,0 +1,25 @@
+ """
+ Base styles for the leaderboard application.
+ """
+ import streamlit as st
+ from src.styles.theme import get_theme_css
+ from src.styles.components import get_all_component_styles
+ from src.styles.tables import get_all_table_styles
+
+ def load_all_styles():
+     """
+     Load and apply all CSS styles for the application
+     """
+     styles = [
+         get_theme_css(),
+         get_all_component_styles(),
+         get_all_table_styles()
+     ]
+
+     combined_styles = "\n".join(styles)
+
+     # Apply all styles to the page
+     st.markdown(
+         f"<style>{combined_styles}</style>",
+         unsafe_allow_html=True
+     )
src/styles/components.py ADDED
@@ -0,0 +1,272 @@
+ """
+ CSS styles for UI components in the leaderboard application.
+ """
+ from src.utils.config import dark_theme
+
+ def get_container_styles():
+     """
+     Get CSS styles for page containers
+
+     Returns:
+         str: CSS string for containers
+     """
+     return f"""
+     .title-container {{
+         padding: 2rem 0;
+         text-align: center;
+         background: {dark_theme['gradient']};
+         border-radius: 12px;
+         color: white;
+         margin-bottom: 2rem;
+     }}
+
+     .title {{
+         font-size: 42px;
+         font-weight: 700;
+         margin-bottom: 10px;
+         color: {dark_theme['title_color']};
+     }}
+
+     .subtitle {{
+         font-size: 20px;
+         font-weight: 400;
+         opacity: 0.9;
+         color: {dark_theme['subtitle_color']};
+     }}
+     """
+
+ def get_card_styles():
+     """
+     Get CSS styles for cards
+
+     Returns:
+         str: CSS string for cards
+     """
+     return f"""
+     .card {{
+         background-color: {dark_theme['card_bg']};
+         border-radius: 12px;
+         padding: 24px;
+         box-shadow: 0 4px 6px rgba(0, 0, 0, 0.15);
+         margin-bottom: 24px;
+         transition: transform 0.2s, box-shadow 0.2s;
+     }}
+
+     .card:hover {{
+         transform: translateY(-2px);
+         box-shadow: 0 8px 15px rgba(0, 0, 0, 0.2);
+     }}
+
+     .card-title {{
+         font-size: 20px;
+         font-weight: 600;
+         margin-bottom: 16px;
+         color: {dark_theme['text_color']};
+         display: flex;
+         align-items: center;
+     }}
+
+     .card-title-icon {{
+         margin-right: 10px;
+         font-size: 22px;
+     }}
+     """
+
+ def get_task_card_styles():
+     """
+     Get CSS styles for task cards
+
+     Returns:
+         str: CSS string for task cards
+     """
+     return f"""
+     .task-card {{
+         background-color: {dark_theme['card_bg']};
+         border-radius: 12px;
+         padding: 20px;
+         box-shadow: 0 4px 6px rgba(0, 0, 0, 0.15);
+         margin-bottom: 16px;
+         border-left: 4px solid {dark_theme['task_border']};
+     }}
+
+     .task-title {{
+         font-size: 18px;
+         font-weight: 600;
+         color: {dark_theme['task_title']};
+         margin-bottom: 8px;
+     }}
+
+     .task-description {{
+         font-size: 15px;
+         color: {dark_theme['text_color']};
+         line-height: 1.5;
+     }}
+     """
+
+ def get_button_styles():
+     """
+     Get CSS styles for buttons
+
+     Returns:
+         str: CSS string for buttons
+     """
+     return f"""
+     /* Button styling - completely new and modern */
+     div.stButton > button {{
+         background-color: {dark_theme['card_bg']};
+         color: {dark_theme['text_color']};
+         border: 1px solid {dark_theme['border']};
+         border-radius: 8px;
+         padding: 8px 16px;
+         font-size: 14px;
+         font-weight: 500;
+         margin: 4px;
+         transition: all 0.2s ease;
+     }}
+
+     div.stButton > button:hover {{
+         background-color: {dark_theme['hover']};
+         border-color: {dark_theme['border']};
+         transform: translateY(-1px);
+     }}
+
+     /* Active button styling */
+     div.stButton > button.selected {{
+         background-color: {dark_theme['primary']} !important;
+         color: white !important;
+         border-color: {dark_theme['primary']} !important;
+     }}
+     """
+
+ def get_tabs_styles():
+     """
+     Get CSS styles for tabs
+
+     Returns:
+         str: CSS string for tabs
+     """
+     return f"""
+     /* Tabs styling */
+     .stTabs [data-baseweb="tab-list"] {{
+         gap: 8px;
+         margin-bottom: 20px;
+     }}
+
+     .stTabs [data-baseweb="tab"] {{
+         border-radius: 8px 8px 0 0;
+         padding: 12px 24px;
+         font-weight: 500;
+         background-color: {dark_theme['hover']};
+         color: {dark_theme['text_color']};
+     }}
+
+     .stTabs [data-baseweb="tab"][aria-selected="true"] {{
+         background-color: {dark_theme['primary']};
+         color: white;
+     }}
+
+     .stTabs [data-baseweb="tab-highlight"] {{
+         background-color: transparent;
+     }}
+     """
+
+ def get_alert_styles():
+     """
+     Get CSS styles for alerts and information boxes
+
+     Returns:
+         str: CSS string for alerts
+     """
+     return f"""
+     /* Alert/info box styling */
+     .info-box {{
+         background-color: {dark_theme['info_bg']};
+         border-left: 4px solid {dark_theme['info_border']};
+         border-radius: 8px;
+         padding: 16px;
+         margin-bottom: 16px;
+         color: {dark_theme['text_color']};
+     }}
+
+     .warning-box {{
+         background-color: {dark_theme['warning_bg']};
+         border-left: 4px solid {dark_theme['warning_border']};
+         border-radius: 8px;
+         padding: 16px;
+         margin-bottom: 16px;
+         color: {dark_theme['text_color']};
+     }}
+     """
+
+ def get_footer_styles():
+     """
+     Get CSS styles for the footer
+
+     Returns:
+         str: CSS string for the footer
+     """
+     return f"""
+     /* Footer styling */
+     .footer {{
+         text-align: center;
+         padding: 24px;
+         margin-top: 40px;
+         color: {dark_theme['footer_color']};
+         font-size: 14px;
+         border-top: 1px solid {dark_theme['footer_border']};
+     }}
+     """
+
+ def get_all_component_styles():
+     """
+     Get all component styles combined
+
+     Returns:
+         str: Combined CSS string for all components
+     """
+     styles = [
+         get_container_styles(),
+         get_card_styles(),
+         get_task_card_styles(),
+         get_button_styles(),
+         get_tabs_styles(),
+         get_alert_styles(),
+         get_footer_styles(),
+         get_metric_definition_styles()
+     ]
+
+     return '\n'.join(styles)
+
+ def get_metric_definition_styles():
+     """
+     Get CSS styles for the metric definition component
+
+     Returns:
+         str: CSS string for the metric definition
+     """
+     return f"""
+     /* Metric definition styling */
+     .metric-definition {{
+         margin-top: 20px;
+         background-color: {dark_theme['card_bg']};
+         border-radius: 8px;
+         padding: 16px 20px;
+         box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1);
+         border-left: 4px solid {dark_theme['secondary']};
+     }}
+
+     .metric-definition h4 {{
+         color: {dark_theme['secondary']};
+         margin-top: 0;
+         margin-bottom: 8px;
+         font-size: 18px;
+         font-weight: 600;
+     }}
+
+     .metric-definition p {{
+         color: {dark_theme['text_color']};
+         font-size: 14px;
+         line-height: 1.6;
+         margin-bottom: 0;
+     }}
+     """
src/styles/main_styles.py ADDED
@@ -0,0 +1,346 @@
+ from src.utils.config import dark_theme
+
+ def get_base_styles():
+     """Returns the base CSS styles for the page"""
+     return f"""
+     <style>
+     @import url('https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700&display=swap');
+
+     html, body, [class*="css"] {{
+         font-family: 'Inter', sans-serif;
+         background-color: {dark_theme['bg_color']};
+         color: {dark_theme['text_color']};
+     }}
+
+     h1, h2, h3, h4, h5, h6 {{
+         font-family: 'Inter', sans-serif;
+         font-weight: 600;
+         color: {dark_theme['heading_color']};
+     }}
+
+     .main {{
+         background-color: {dark_theme['bg_color']};
+     }}
+
+     .title-container {{
+         padding: 2rem 0;
+         text-align: center;
+         background: {dark_theme['gradient']};
+         border-radius: 12px;
+         color: white;
+         margin-bottom: 2rem;
+     }}
+
+     .title {{
+         font-size: 42px;
+         font-weight: 700;
+         margin-bottom: 10px;
+         color: {dark_theme['title_color']};
+     }}
+
+     .subtitle {{
+         font-size: 20px;
+         font-weight: 400;
+         opacity: 0.9;
+         color: {dark_theme['subtitle_color']};
+     }}
+
+     .card {{
+         background-color: {dark_theme['card_bg']};
+         border-radius: 12px;
+         padding: 24px;
+         box-shadow: 0 4px 6px rgba(0, 0, 0, 0.15);
+         margin-bottom: 24px;
+         transition: transform 0.2s, box-shadow 0.2s;
+     }}
+
+     .card:hover {{
+         transform: translateY(-2px);
+         box-shadow: 0 8px 15px rgba(0, 0, 0, 0.2);
+     }}
+
+     .card-title {{
+         font-size: 20px;
+         font-weight: 600;
+         margin-bottom: 16px;
+         color: {dark_theme['text_color']};
+         display: flex;
+         align-items: center;
+     }}
+
+     .card-title-icon {{
+         margin-right: 10px;
+         font-size: 22px;
+     }}
+
+     .task-card {{
+         background-color: {dark_theme['card_bg']};
+         border-radius: 12px;
+         padding: 20px;
+         box-shadow: 0 4px 6px rgba(0, 0, 0, 0.15);
+         margin-bottom: 16px;
+         border-left: 4px solid {dark_theme['task_border']};
+     }}
+
+     .task-title {{
+         font-size: 18px;
+         font-weight: 600;
+         color: {dark_theme['task_title']};
+         margin-bottom: 8px;
+     }}
+
+     .task-description {{
+         font-size: 15px;
+         color: {dark_theme['text_color']};
+         line-height: 1.5;
+     }}
+
+     /* Button styling */
+     div.stButton > button {{
+         background-color: {dark_theme['card_bg']};
+         color: {dark_theme['text_color']};
+         border: 1px solid {dark_theme['border']};
+         border-radius: 8px;
+         padding: 8px 16px;
+         font-size: 14px;
+         font-weight: 500;
+         margin: 4px;
+         transition: all 0.2s ease;
+     }}
+
+     div.stButton > button:hover {{
+         background-color: {dark_theme['hover']};
+         border-color: {dark_theme['border']};
+         transform: translateY(-1px);
+     }}
+
+     /* Active button styling */
+     div.stButton > button.selected {{
+         background-color: {dark_theme['primary']} !important;
+         color: white !important;
+         border-color: {dark_theme['primary']} !important;
+     }}
+
+     /* Table styling */
+     [data-testid="stDataFrame"] {{
+         background-color: {dark_theme['card_bg']};
+         border-radius: 12px;
+         padding: 1px;
+         box-shadow: 0 4px 6px rgba(0, 0, 0, 0.15);
+     }}
+
+     [data-testid="stDataFrame"] table {{
+         border-collapse: separate !important;
+         border-spacing: 0 !important;
+         border-radius: 8px !important;
+         overflow: hidden !important;
+     }}
+
+     [data-testid="stDataFrame"] th {{
+         background-color: {dark_theme['table_header']} !important;
+         color: {dark_theme['text_color']} !important;
+         font-weight: 600 !important;
+         text-transform: uppercase !important;
+         font-size: 13px !important;
+         padding: 16px 10px !important;
+     }}
+
+     [data-testid="stDataFrame"] td {{
+         padding: 12px 10px !important;
+         border-bottom: 1px solid {dark_theme['table_border']} !important;
+         font-size: 14px !important;
+         color: {dark_theme['text_color']} !important;
+     }}
+
+     /* Hide row numbers */
+     [data-testid="stDataFrame"] [data-testid="stDataFrameRowNumber"] {{
+         display: none !important;
+     }}
+
+     /* Metric header styling */
+     .metric-header {{
+         background-color: {dark_theme['card_bg']};
+         border-radius: 8px;
+         padding: 16px;
+         margin-bottom: 16px;
+         border-left: 4px solid {dark_theme['primary']};
+     }}
+
+     .metric-header h3 {{
+         margin: 0;
+         color: {dark_theme['primary']};
+     }}
+
+     .metric-header p {{
+         margin: 8px 0 0 0;
+         font-size: 14px;
+         opacity: 0.8;
+     }}
+
+     /* Rank column styling */
+     .rank-cell {{
+         font-weight: 700 !important;
+         background-color: {dark_theme['primary'] + '22'};
+         border-radius: 50%;
+         width: 28px;
+         height: 28px;
+         display: flex;
+         align-items: center;
+         justify-content: center;
+         margin: 0 auto;
+     }}
+
+     .rank-1 {{
+         background-color: gold !important;
+         color: #333 !important;
+     }}
+
+     .rank-2 {{
+         background-color: silver !important;
+         color: #333 !important;
+     }}
+
+     .rank-3 {{
+         background-color: #cd7f32 !important; /* bronze */
+         color: #fff !important;
+     }}
+
+     /* Tabs styling */
+     .stTabs [data-baseweb="tab-list"] {{
+         gap: 8px;
+         margin-bottom: 20px;
+     }}
+
+     .stTabs [data-baseweb="tab"] {{
+         border-radius: 8px 8px 0 0;
+         padding: 12px 24px;
+         font-weight: 500;
+         background-color: {dark_theme['hover']};
+         color: {dark_theme['text_color']};
+     }}
+
+     .stTabs [data-baseweb="tab"][aria-selected="true"] {{
+         background-color: {dark_theme['primary']};
+         color: white;
+     }}
+
+     .stTabs [data-baseweb="tab-highlight"] {{
+         background-color: transparent;
+     }}
+
+     /* Alert/info box styling */
+     .info-box {{
+         background-color: {dark_theme['info_bg']};
+         border-left: 4px solid {dark_theme['info_border']};
+         border-radius: 8px;
+         padding: 16px;
+         margin-bottom: 16px;
+         color: {dark_theme['text_color']};
+     }}
+
+     .warning-box {{
+         background-color: {dark_theme['warning_bg']};
+         border-left: 4px solid {dark_theme['warning_border']};
+         border-radius: 8px;
+         padding: 16px;
+         margin-bottom: 16px;
+         color: {dark_theme['text_color']};
+     }}
+
+     /* Download buttons styling */
+     .download-button {{
+         background-color: {dark_theme['bg_color']};
+         border: 1px solid {dark_theme['border']};
+         border-radius: 8px;
+         padding: 12px 20px;
+         display: inline-flex;
+         align-items: center;
+         justify-content: center;
+         color: {dark_theme['text_color']};
+         font-weight: 500;
+         margin-top: 16px;
+         cursor: pointer;
+         transition: all 0.2s;
+     }}
+
+     .download-button:hover {{
+         background-color: {dark_theme['hover']};
+         border-color: {dark_theme['border']};
+         transform: translateY(-1px);
+     }}
+
+     .download-button .icon {{
+         margin-right: 8px;
+     }}
+
+     /* Footer styling */
+     .footer {{
+         text-align: center;
+         padding: 24px;
+         margin-top: 40px;
+         color: {dark_theme['footer_color']};
+         font-size: 14px;
+         border-top: 1px solid {dark_theme['footer_border']};
+     }}
+
+     /* Badge styling for model types */
+     .badge {{
+         display: inline-block;
+         padding: 4px 8px;
+         font-size: 12px;
+         font-weight: 500;
+         border-radius: 6px;
+         margin-right: 6px;
+     }}
+
+     .badge-purple {{
+         background-color: {dark_theme['primary'] + '33'};
+         color: {dark_theme['primary']};
+     }}
+
+     .badge-blue {{
+         background-color: {dark_theme['info_border'] + '33'};
+         color: {dark_theme['info_border']};
+     }}
+
+     .selected-items {{
+         background-color: {dark_theme['hover']};
+         border-radius: 8px;
+         padding: 12px 16px;
+         margin-top: 16px;
+         font-size: 14px;
+     }}
+
+     /* Selectbox styling */
+     div[data-baseweb="select"] {{
+         border-radius: 8px !important;
+     }}
+
+     div[data-baseweb="select"] > div {{
+         background-color: {dark_theme['card_bg']} !important;
+         border-radius: 8px !important;
+         border-color: {dark_theme['border']} !important;
+         color: {dark_theme['text_color']} !important;
+     }}
+
+     div[data-baseweb="select"] > div:hover {{
+         border-color: {dark_theme['border']} !important;
+     }}
+
+     /* Table hover and value styling */
+     .table-row:hover td {{
+         background-color: #1a1a1a !important;
+     }}
+     .table-row:hover td.fixed-column {{
+         background-color: #1a1a1a !important;
+     }}
+     .positive-value {{
+         color: #4ade80 !important; /* Bright green for positive values */
+         font-weight: 500;
+     }}
+     .negative-value {{
+         color: #f87171 !important; /* Bright red for negative values */
+         font-weight: 500;
+     }}
+     </style>
+     """
src/styles/tables.py ADDED
@@ -0,0 +1,263 @@
+ """
+ CSS styles for tables in the leaderboard application.
+ """
+ from src.utils.config import dark_theme
+
+ def get_streamlit_table_styles():
+     """
+     Get CSS styles for standard Streamlit tables
+
+     Returns:
+         str: CSS string for Streamlit tables
+     """
+     return f"""
+     /* Standard Streamlit table styling */
+     [data-testid="stDataFrame"] {{
+         background-color: {dark_theme['card_bg']};
+         border-radius: 12px;
+         padding: 1px;
+         box-shadow: 0 4px 6px rgba(0, 0, 0, 0.15);
+     }}
+
+     [data-testid="stDataFrame"] table {{
+         border-collapse: separate !important;
+         border-spacing: 0 !important;
+         border-radius: 8px !important;
+         overflow: hidden !important;
+     }}
+
+     [data-testid="stDataFrame"] th {{
+         background-color: {dark_theme['table_header']} !important;
+         color: {dark_theme['text_color']} !important;
+         font-weight: 600 !important;
+         text-transform: uppercase !important;
+         font-size: 13px !important;
+         padding: 16px 10px !important;
+     }}
+
+     [data-testid="stDataFrame"] td {{
+         padding: 12px 10px !important;
+         border-bottom: 1px solid {dark_theme['table_border']} !important;
+         font-size: 14px !important;
+         color: {dark_theme['text_color']} !important;
+     }}
+
+     /* Hide row numbers */
+     [data-testid="stDataFrame"] [data-testid="stDataFrameRowNumber"] {{
+         display: none !important;
+     }}
+     """
+
+ def get_custom_leaderboard_table_styles():
+     """
+     Get CSS styles for the custom leaderboard table
+
+     Returns:
+         str: CSS string for the custom leaderboard table
+     """
+     return f"""
+     /* Custom leaderboard table styling */
+     .fixed-table-container {{
+         position: relative;
+         max-width: 100%;
+         margin-top: 20px;
+         border-radius: 8px;
+         box-shadow: 0 4px 12px rgba(0,0,0,0.5);
+         background: {dark_theme['table_bg']};
+         border: 1px solid {dark_theme['table_border_color']};
+     }}
+
+     .fixed-table {{
+         width: 100%;
+         border-collapse: collapse;
+         font-family: 'Inter', sans-serif;
+     }}
+
+     .fixed-column {{
+         position: sticky;
+         left: 0;
+         z-index: 2;
+         background-color: {dark_theme['table_bg']};
+     }}
+
+     .first-fixed-column {{
+         width: 60px;
+         text-align: center;
+         left: 0;
+         z-index: 3;
+         border-right: 1px solid {dark_theme['table_border_color']};
+         box-shadow: 2px 0 4px rgba(0,0,0,0.3);
+     }}
+
+     .second-fixed-column {{
+         width: 280px;
+         text-align: center;
+         left: 60px;
+         z-index: 2;
+         border-right: 1px solid {dark_theme['table_border_color']};
+         box-shadow: 2px 0 4px rgba(0,0,0,0.3);
+     }}
+
+     /* Fix for the gap between fixed columns */
+     .first-fixed-column::after {{
+         content: "";
+         position: absolute;
+         top: 0;
+         right: -1px;
+         height: 100%;
+         width: 1px;
+         background-color: {dark_theme['table_border_color']};
+     }}
+
+     .model-type-cell {{
+         width: 120px;
+         text-align: center;
+     }}
+
+     .scroll-container {{
+         overflow-x: auto;
+         border-radius: 8px;
+     }}
+
+     .header-row th {{
+         padding: 14px 8px;
+         background-color: {dark_theme['table_bg']};
+         color: {dark_theme['text_color']};
+         font-weight: 600;
+         border-bottom: 1px solid {dark_theme['table_border_color']};
+     }}
+
+     .metric-header {{
+         background-color: {dark_theme['table_header_bg']} !important;
+         color: #ffffff;
+         padding: 14px 0px !important;
+         text-align: center;
+         font-weight: 600;
+         letter-spacing: 0.5px;
+     }}
+
+     .sub-header th {{
+         padding: 12px 8px;
+         background-color: {dark_theme['table_subheader_bg']};
+         color: {dark_theme['text_color']};
+         font-weight: 500;
+         text-align: center;
+         border-bottom: 1px solid {dark_theme['table_border_color']};
+     }}
+
+     .sub-header th.overall-cell {{
+         background-color: {dark_theme['table_average_column_bg']}; /* Slightly lighter black for average column */
+         font-weight: 600; /* Make it bolder */
+         border-right: 1px solid #444; /* Add a subtle border to separate it */
+     }}
+
+     .table-row:nth-child(odd) {{
+         background-color: {dark_theme['table_row_odd']};
+     }}
+
+     .table-row:nth-child(even) {{
+         background-color: {dark_theme['table_row_even']};
+     }}
+
+     .table-row:hover td {{
+         background-color: {dark_theme['table_hover_bg']} !important;
+     }}
+
+     .table-row:hover td.fixed-column {{
+         background-color: {dark_theme['table_hover_bg']} !important;
+     }}
+
+     .table-cell {{
+         padding: 12px 8px;
+         text-align: center;
+         border-bottom: 1px solid #222;
+     }}
+
+     .table-cell.overall-cell {{
+         background-color: rgba(80, 80, 80, 0.1); /* Subtle highlight */
+         font-weight: 600; /* Make average values bolder */
+         border-right: 1px solid #333; /* Add a border to separate it from task metrics */
+     }}
+
+     .positive-value {{
+         color: {dark_theme['positive_value_color']} !important;
+         font-weight: 500;
+     }}
+
+     .negative-value {{
+         color: {dark_theme['negative_value_color']} !important;
+         font-weight: 500;
+     }}
+     """
+
+ def get_metric_styles():
+     """
+     Get CSS styles for metric displays
+
+     Returns:
+         str: CSS string for metric displays
+     """
+     return f"""
+     /* Metric styling */
+     .metric-header {{
+         background-color: {dark_theme['card_bg']};
+         border-radius: 8px;
+         padding: 16px;
+         margin-bottom: 16px;
+         border-left: 4px solid {dark_theme['primary']};
+     }}
+
+     .metric-header h3 {{
+         margin: 0;
+         color: {dark_theme['primary']};
+     }}
+
+     .metric-header p {{
+         margin: 8px 0 0 0;
+         font-size: 14px;
+         opacity: 0.8;
+     }}
+
+     /* Rank column styling */
+     .rank-cell {{
+         font-weight: 700 !important;
+         background-color: {dark_theme['primary'] + '22'};
+         border-radius: 50%;
+         width: 28px;
+         height: 28px;
+         display: flex;
+         align-items: center;
+         justify-content: center;
+         margin: 0 auto;
+     }}
+
+     .rank-1 {{
+         background-color: gold !important;
+         color: #333 !important;
+     }}
+
+     .rank-2 {{
+         background-color: silver !important;
+         color: #333 !important;
+     }}
+
+     .rank-3 {{
+         background-color: #cd7f32 !important; /* bronze */
+         color: #fff !important;
+     }}
+     """
+
+ def get_all_table_styles():
+     """
+     Get all table styles combined
+
+     Returns:
+         str: Combined CSS string for all tables
+     """
+     styles = [
+         get_streamlit_table_styles(),
+         get_custom_leaderboard_table_styles(),
+         get_metric_styles()
+     ]
+
+     return '\n'.join(styles)
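Unlike `get_base_styles()`, these helpers return bare CSS rules without a `<style>` wrapper, so the caller supplies it. A sketch of a hypothetical call site:

```python
import streamlit as st

from src.styles.tables import get_all_table_styles

# The table helpers return raw CSS; wrap it in a <style> tag before injecting.
st.markdown(f"<style>{get_all_table_styles()}</style>", unsafe_allow_html=True)
```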
src/styles/theme.py ADDED
@@ -0,0 +1,31 @@
+ """
+ Theme definitions and color scheme for the leaderboard application.
+ """
+ from src.utils.config import dark_theme
+
+ def get_theme_css():
+     """
+     Get the base theme CSS for the application
+
+     Returns:
+         str: CSS string for the theme
+     """
+     return f"""
+     @import url('https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700&display=swap');
+
+     html, body, [class*="css"] {{
+         font-family: 'Inter', sans-serif;
+         background-color: {dark_theme['bg_color']};
+         color: {dark_theme['text_color']};
+     }}
+
+     h1, h2, h3, h4, h5, h6 {{
+         font-family: 'Inter', sans-serif;
+         font-weight: 600;
+         color: {dark_theme['heading_color']};
+     }}
+
+     .main {{
+         background-color: {dark_theme['bg_color']};
+     }}
+     """
src/utils/config.py ADDED
@@ -0,0 +1,85 @@
+ # Theme and configuration settings for the Model Capability Leaderboard application
+
+ # Theme colors - using dark mode by default
+ dark_theme = {
+     'bg_color': '#1a202c',
+     'text_color': '#e2e8f0',
+     'card_bg': '#2d3748',
+     'primary': '#818cf8',
+     'secondary': '#a78bfa',
+     'border': '#4a5568',
+     'hover': '#4a5568',
+     'table_header': '#2d3748',
+     'table_border': '#4a5568',
+     'heading_color': '#e2e8f0',
+     'gradient': 'linear-gradient(135deg, #818cf8 0%, #a78bfa 100%)',
+     'warning_bg': '#7c2d12',
+     'warning_border': '#f97316',
+     'info_bg': '#1e3a8a',
+     'info_border': '#3b82f6',
+     'footer_color': '#a0aec0',
+     'title_color': 'white',
+     'subtitle_color': 'rgba(255, 255, 255, 0.9)',
+     'footer_border': '#4a5568',
+     'task_title': '#a5b4fc',
+     'task_border': '#818cf8',
+     # Table-specific colors for the custom table
+     'table_bg': '#0a0a0a',
+     'table_border_color': '#333',
+     'table_header_bg': '#191919',
+     'table_subheader_bg': '#141414',
+     'table_average_column_bg': '#202020',
+     'table_row_odd': '#0a0a0a',
+     'table_row_even': '#111111',
+     'table_hover_bg': '#1a1a1a',
+     'positive_value_color': '#4ade80',
+     'negative_value_color': '#f87171'
+ }
+
+ # Application settings
+ app_config = {
+     'title': 'MLRC-Bench Leaderboard',
+     'description': 'Machine Learning Research Challenges Benchmark for AI Agents',
+     'layout': 'wide',
+     'initial_sidebar_state': 'collapsed'
+ }
+
+ # Metrics configuration
+ metrics_config = {
+     "Margin to Human": {
+         "file": "src/data/metrics/margin_to_human.json",
+         "description": "Performance on Machine Learning Research Challenges. Higher values indicate better research capabilities.",
+         "min_value": -100,  # Approximate, adjust as needed
+         "max_value": 50,  # Approximate, adjust as needed
+         "color_map": "RdYlGn"
+     }
+     # Future metrics can be added here
+     # "Another Metric": {
+     #     "file": "src/data/metrics/another_metric.json",
+     #     "description": "Description of another metric",
+     #     "min_value": 0,
+     #     "max_value": 100,
+     #     "color_map": "viridis"
+     # }
+ }
+
+ # Model type categories
+ model_categories = {
+     "MLAB (claude-3-5-sonnet-v2)": "Closed Source",
+     "MLAB (gemini-exp-1206)": "Closed Source",
+     "MLAB (o3-mini)": "Closed Source",
+     "MLAB (gpt-4o)": "Closed Source",
+     "MLAB (llama3-1-405b-instruct)": "Open Weights",
+     "CoI-Agent (o1) + MLAB (gpt-4o)": "Closed Source"
+     # More models would be added here as needed
+ }
+
+ # Task descriptions
+ tasks_info = {
+     "Perception Temporal Action Loc": "Testing the model's ability to understand and localize actions within temporal sequences of events.",
+     "Llm Merging": "Assessing the capability to effectively merge knowledge from multiple language models.",
+     "Meta Learning": "Evaluating the model's ability to learn how to learn - adapting quickly to new tasks.",
+     "Product Recommendation": "Testing the model's ability to recommend relevant products based on user preferences and behavior.",
+     "Machine Unlearning": "Evaluating how well models can 'unlearn' specific information when required.",
+     "Backdoor Trigger Recovery": "Testing resilience against backdoor attacks and ability to recover from triggered behaviors."
+ }
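`process_data()` in `src/utils/data_loader.py` (below) expects the file referenced in `metrics_config` to map raw task keys to per-model scores; the keys are later title-cased into the names used in `tasks_info`. A hypothetical excerpt of `src/data/metrics/margin_to_human.json`, written as the Python structure it loads into (model names follow `model_categories`; the scores are invented):

```python
# Hypothetical shape of src/data/metrics/margin_to_human.json after json.load():
metric_data = {
    "llm-merging": {                      # title-cased to "Llm Merging"
        "MLAB (gpt-4o)": -12.4,
        "MLAB (o3-mini)": 3.1,
    },
    "machine-unlearning": {               # title-cased to "Machine Unlearning"
        "MLAB (gpt-4o)": -5.0,
        "MLAB (o3-mini)": 7.8,
    },
}
```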
src/utils/data_loader.py ADDED
@@ -0,0 +1,156 @@
+ """
+ Data loading and processing utilities for the leaderboard application.
+ """
+ import pandas as pd
+ import json
+ from src.utils.config import model_categories
+
+ def load_metric_data(file_path):
+     """
+     Load metric data from a JSON file
+
+     Args:
+         file_path (str): Path to the JSON file containing metric data
+
+     Returns:
+         dict: Dictionary containing the loaded metric data
+     """
+     try:
+         with open(file_path, "r") as f:
+             return json.load(f)
+     except FileNotFoundError:
+         print(f"Error: File {file_path} not found.")
+         return {}
+     except json.JSONDecodeError:
+         print(f"Error: File {file_path} is not a valid JSON file.")
+         return {}
+
+ def process_data(metric_data):
+     """
+     Process the metric data into a pandas DataFrame
+
+     Args:
+         metric_data (dict): Dictionary containing the metric data
+
+     Returns:
+         pandas.DataFrame: DataFrame containing the processed data
+     """
+     # Create a DataFrame to store the model metric data
+     tasks = list(metric_data.keys())
+     models = []
+     model_data = {}
+
+     # Extract model names and their metric values for each task
+     for task in tasks:
+         for model in metric_data[task]:
+             if model not in models:
+                 models.append(model)
+                 model_data[model] = {}
+
+             # Store the metric value for this task
+             model_data[model][task] = metric_data[task][model]
+
+     # Create DataFrame from the model_data dictionary
+     df = pd.DataFrame.from_dict(model_data, orient='index')
+
+     # Replace NaN values with '-'
+     df.fillna('-', inplace=True)
+
+     # Rename the columns to more readable format
+     df.columns = [task.replace("-", " ").replace("_", " ").title() for task in df.columns]
+
+     # Add a model type column to the dataframe
+     df['Model Type'] = df.index.map(lambda x: model_categories.get(x, "Unknown"))
+
+     return df
+
+ def calculate_selected_overall(row, selected_tasks):
+     """
+     Calculate overall average for selected tasks
+
+     Args:
+         row (pandas.Series): Row of data
+         selected_tasks (list): List of task names to include in the average
+
+     Returns:
+         float or str: The calculated average or '-' if no numeric values
+     """
+     numeric_values = []
+
+     for task in selected_tasks:
+         value = row[task]
+         # Keep any value that parses as a float; '-' placeholders are skipped
+         try:
+             numeric_values.append(float(value))
+         except (TypeError, ValueError):
+             pass
+
+     # Calculate average if there are numeric values
+     if numeric_values:
+         return sum(numeric_values) / len(numeric_values)
+     else:
+         return '-'
+
+ def filter_and_prepare_data(df, selected_tasks, selected_model_types):
+     """
+     Filter and prepare data based on selections
+
+     Args:
+         df (pandas.DataFrame): The original DataFrame
+         selected_tasks (list): List of selected task names
+         selected_model_types (list): List of selected model types
+
+     Returns:
+         pandas.DataFrame: Filtered and prepared DataFrame
+     """
+     # Filter the dataframe based on selected model types (copy to avoid SettingWithCopyWarning)
+     filtered_df = df[df['Model Type'].isin(selected_model_types)].copy()
+
+     # Calculate the average for selected tasks only; '-' placeholders coerce to NaN and drop out
+     selected_tasks_df = filtered_df[selected_tasks].apply(pd.to_numeric, errors='coerce')
+     filtered_df['Selected Overall'] = selected_tasks_df.mean(axis=1)
+
+     # Sort by Selected Overall and add rank
+     filtered_df = filtered_df.sort_values('Selected Overall', ascending=False)
+     filtered_df.insert(0, 'Rank', range(1, len(filtered_df) + 1))
+
+     # Add a Model Name column that shows the index (actual model name)
+     filtered_df['Model Name'] = filtered_df.index
+
+     return filtered_df
+
+ def format_display_dataframe(filtered_df, selected_tasks):
+     """
+     Create and format the display DataFrame for the leaderboard table
+
+     Args:
+         filtered_df (pandas.DataFrame): The filtered DataFrame
+         selected_tasks (list): List of selected task names
+
+     Returns:
+         tuple: (pandas.DataFrame, list) - The display DataFrame and the metric columns
+     """
+     # Create a fixed display DataFrame with only the model info
+     display_df = filtered_df[['Rank', 'Model Name', 'Model Type']].copy()
+
+     # Format the rank column with medals
+     medal_ranks = {1: "πŸ₯‡ 1", 2: "πŸ₯ˆ 2", 3: "πŸ₯‰ 3"}
+     display_df['Rank'] = display_df['Rank'].apply(lambda x: medal_ranks.get(x, str(x)))
+
+     # Add metrics columns (Selected Overall and individual tasks)
+     metric_columns = ['Selected Overall'] + selected_tasks
+     for col in metric_columns:
+         if col in filtered_df.columns:
+             # Format numeric columns to 3 decimal places
+             if filtered_df[col].dtype in ['float64', 'float32']:
+                 display_df[col] = filtered_df[col].apply(lambda x: f"{x:.3f}" if isinstance(x, (int, float)) else x)
+             else:
+                 display_df[col] = filtered_df[col]
+
+     # Rename "Selected Overall" to "Metric Average" in display_df
+     if "Selected Overall" in display_df.columns:
+         display_df = display_df.rename(columns={"Selected Overall": "Metric Average"})
+         # Also update the metric_columns list to reflect the rename
+         metric_columns = ['Metric Average'] + selected_tasks
+
+     return display_df, metric_columns
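End to end, the helpers above take a metric file to a render-ready table. A sketch of the full flow (the task and model-type selections are illustrative):

```python
from src.utils.config import metrics_config
from src.utils.data_loader import (
    load_metric_data,
    process_data,
    filter_and_prepare_data,
    format_display_dataframe,
)

selected_tasks = ["Llm Merging", "Machine Unlearning"]

# Load the raw scores and reshape them into a models-by-tasks DataFrame.
df = process_data(load_metric_data(metrics_config["Margin to Human"]["file"]))

# Filter to the user's selections, rank by the average, and format for display.
filtered = filter_and_prepare_data(df, selected_tasks, ["Closed Source", "Open Weights"])
display_df, metric_columns = format_display_dataframe(filtered, selected_tasks)
```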
tiered_models_data.csv DELETED
@@ -1,23 +0,0 @@
- tier,model,factuality_score,hallucination_score,avg_tokens,avg_factual_units,avg_undecidable_units,avg_unsupported_units
- Tier 1: Hard,πŸ”’ GPT4-o,75.65,0.64,563.15,24.01,4.62,1.01
- Tier 1: Hard,πŸ”’ Gemini1.5-Pro,73.78,0.68,517.31,22.25,4.48,1.13
- Tier 1: Hard,πŸ”‘ Llama3.1-70B-Instruct,70.07,0.89,532.41,27.17,5.67,2.13
- Tier 1: Hard,πŸ”‘ Llama3.1-405B-Instruct,68.59,0.93,551.28,26.71,6.19,2.2
- Tier 1: Hard,πŸ”’ Claude-3.5-Sonnet 🚨,74.95,0.65,395.77,22.64,4.03,1.19
- Tier 1: Hard,πŸ”’ CommandR+ 🚨,73.15,0.71,440.93,23.55,4.51,1.4
- Tier 1: Hard,πŸ”‘ Mistral-Large-2 🚨,75.19,0.67,485.58,23.21,4.09,1.36
- Tier 2: Moderate,πŸ”’ GPT4-o,80.72,0.5,624.67,24.42,3.59,0.89
- Tier 2: Moderate,πŸ”’ Gemini1.5-Pro,78.02,0.57,565.97,22.16,3.71,0.97
- Tier 2: Moderate,πŸ”‘ Llama3.1-70B-Instruct,75.76,0.71,607.44,25.35,4.33,1.76
- Tier 2: Moderate,πŸ”‘ Llama3.1-405B-Instruct,75.05,0.7,599.3,25.24,4.74,1.41
- Tier 2: Moderate,πŸ”’ Claude-3.5-Sonnet 🚨,79.92,0.54,414.32,22.15,3.32,1.09
- Tier 2: Moderate,πŸ”’ CommandR+ 🚨,80.71,0.52,483.32,24.1,3.17,1.09
- Tier 2: Moderate,πŸ”‘ Mistral-Large-2 🚨,79.97,0.52,528.44,22.65,3.21,1.02
- Tier 3: Easy,πŸ”’ GPT4-o,91.63,0.26,640.84,29.29,2.01,0.53
- Tier 3: Easy,πŸ”’ Gemini1.5-Pro,89.86,0.31,551.81,25.6,1.88,0.71
- Tier 3: Easy,πŸ”‘ Llama3.1-70B-Instruct,89.3,0.33,607.75,31.38,2.08,0.83
- Tier 3: Easy,πŸ”‘ Llama3.1-405B-Instruct,86.57,0.4,599.87,30.12,2.88,0.85
- Tier 3: Easy,πŸ”’ Claude-3.5-Sonnet 🚨,89.61,0.3,411.2,26.72,1.49,0.81
- Tier 3: Easy,πŸ”’ CommandR+ 🚨,91.65,0.25,499.06,27.95,1.57,0.54
- Tier 3: Easy,πŸ”‘ Mistral-Large-2 🚨,92.0,0.25,523.57,27.8,1.8,0.55
-
verifact_data.csv DELETED
@@ -1,25 +0,0 @@
- tier,model,FactBench,Reddit,Overall
- F1,GPT4o,80.93,42.76,67.41
- F1,Claude 3.5-Sonnet,75.68,42.90,63.65
- F1,Gemini 1.5-Flash,77.38,40.26,64.10
- F1,Llama3.1-8b,60.71,28.86,48.62
- F1,Llama3.1-70b,65.83,38.61,55.12
- F1,Llama3.1-405B,73.23,38.98,60.61
- F1,Qwen2.5-8b,69.23,37.25,55.78
- F1,Qwen2.5-32b,71.31,37.34,60.00
- Recall,GPT4o,77.13,30.06,57.93
- Recall,Claude 3.5-Sonnet,69.35,30.69,53.58
- Recall,Gemini 1.5-Flash,70.71,27.67,53.16
- Recall,Llama3.1-8b,54.28,20.39,40.46
- Recall,Llama3.1-70b,58.00,29.31,46.30
- Recall,Llama3.1-405B,68.40,28.00,51.92
- Recall,Qwen2.5-8b,58.66,26.01,45.34
- Recall,Qwen2.5-32b,62.77,25.38,47.52
- Precision,GPT4o,85.11,74.04,80.59
- Precision,Claude 3.5-Sonnet,83.28,71.25,78.37
- Precision,Gemini 1.5-Flash,85.45,73.87,80.72
- Precision,Llama3.1-8b,68.87,49.36,60.91
- Precision,Llama3.1-70b,76.05,56.54,68.09
- Precision,Llama3.1-405B,78.80,64.10,72.80
- Precision,Qwen2.5-8b,77.18,65.58,72.45
- Precision,Qwen2.5-32b,82.74,70.60,77.79
verifact_logo.png DELETED
Binary file (63.1 kB)
 
verifact_steps.png DELETED

Git LFS Details

- SHA256: f5574607f26cff315614cea74bdca918b4d3661f3b13025cd9920603d173b58f
- Pointer size: 131 Bytes
- Size of remote file: 533 kB