Ahmed Ahmed committed on
Commit 1bac1ed · 1 Parent(s): 36b1a23
Files changed (2)
  1. logs.txt +87 -366
  2. src/leaderboard/read_evals.py +3 -5
logs.txt CHANGED
@@ -1,393 +1,114 @@
- Searching for result files in: ./eval-results
- Found 7 result files
-
- Processing file: ./eval-results/EleutherAI/results_EleutherAI_gpt-neo-1.3B_20250726_010247.json
-
- config.json: 0%| | 0.00/1.35k [00:00<?, ?B/s]
- config.json: 100%|██████████| 1.35k/1.35k [00:00<00:00, 17.2MB/s]
- Created result object for: EleutherAI/gpt-neo-1.3B
- Added new result for EleutherAI_gpt-neo-1.3B_float16
-
- Processing file: ./eval-results/openai-community/results_openai-community_gpt2_20250725_231201.json
-
- config.json: 0%| | 0.00/665 [00:00<?, ?B/s]
- config.json: 100%|██████████| 665/665 [00:00<00:00, 8.83MB/s]
- Created result object for: openai-community/gpt2
- Added new result for openai-community_gpt2_float16
-
- Processing file: ./eval-results/openai-community/results_openai-community_gpt2_20250725_233155.json
- Created result object for: openai-community/gpt2
- Updated existing result for openai-community_gpt2_float16
-
- Processing file: ./eval-results/openai-community/results_openai-community_gpt2_20250725_235115.json
- Created result object for: openai-community/gpt2
- Updated existing result for openai-community_gpt2_float16
-
- Processing file: ./eval-results/openai-community/results_openai-community_gpt2_20250725_235748.json
- Created result object for: openai-community/gpt2
- Updated existing result for openai-community_gpt2_float16
-
- Processing file: ./eval-results/openai-community/results_openai-community_gpt2_20250726_000358.json
- Created result object for: openai-community/gpt2
- Updated existing result for openai-community_gpt2_float16
-
- Processing file: ./eval-results/openai-community/results_openai-community_gpt2_20250726_000650.json
- Created result object for: openai-community/gpt2
- Updated existing result for openai-community_gpt2_float16
-
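The "Added new result" / "Updated existing result" lines above show the seven files being merged down to two entries, keyed by model and precision. A minimal sketch of that merge, assuming a plain dict keyed by `<org>_<model>_<precision>` (the real code uses the EvalResult class in src/leaderboard/read_evals.py; the helper below is illustrative):

```python
import json

def collect_results(filepaths: list[str]) -> dict[str, dict]:
    """Merge per-run result files so each model/precision keeps one entry."""
    eval_results: dict[str, dict] = {}
    for path in sorted(filepaths):  # timestamped names sort chronologically
        with open(path) as f:
            data = json.load(f)
        config = data["config"]
        key = (config["model_name"].replace("/", "_")
               + "_" + config["model_dtype"].removeprefix("torch."))
        if key in eval_results:
            # "Updated existing result ...": a later run overwrites the scores
            eval_results[key]["results"].update(data["results"])
        else:
            # "Added new result ..."
            eval_results[key] = data
    return eval_results
```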
- Processing 2 evaluation results

- Converting result to dict for: EleutherAI/gpt-neo-1.3B

- === PROCESSING RESULT TO_DICT ===
- Processing result for model: EleutherAI/gpt-neo-1.3B
- Raw results: {'perplexity': 5.9609375}
- Model precision: Precision.float16
- Model type: ModelType.PT
- Weight type: WeightType.Original
- Available tasks: ['task0']
- Looking for task: perplexity in results
- Found score for perplexity: 5.9609375
- Converted score: 82.1477223263516
- Calculated average score: 82.1477223263516
- Created base data_dict with 13 columns
- Added task score: Perplexity = 5.9609375
- Final data dict has 14 columns: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']
- === END PROCESSING RESULT TO_DICT ===
- Successfully converted and added result
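The "Converted score" values are consistent with mapping perplexity onto a 0–100 scale as 100 − 10·ln(perplexity): 100 − 10·ln(5.9609375) ≈ 82.1477223, and the same formula reproduces the other logged scores. A sketch of that conversion, inferred from the logged pairs rather than taken from the source:

```python
import math

def perplexity_to_score(perplexity: float) -> float:
    """Inferred mapping: lower perplexity -> higher 0-100 score."""
    return 100.0 - 10.0 * math.log(perplexity)

# Reproduces the logged pairs within float tolerance.
assert abs(perplexity_to_score(5.9609375) - 82.1477223263516) < 1e-5
assert abs(perplexity_to_score(20.663532257080078) - 69.7162958010531) < 1e-5
```

With a single task ('task0'), the logged "average score" is simply this value.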

- Converting result to dict for: openai-community/gpt2

- === PROCESSING RESULT TO_DICT ===
- Processing result for model: openai-community/gpt2
- Raw results: {'perplexity': 20.663532257080078}
- Model precision: Precision.float16
- Model type: ModelType.PT
- Weight type: WeightType.Original
- Available tasks: ['task0']
- Looking for task: perplexity in results
- Found score for perplexity: 20.663532257080078
- Converted score: 69.7162958010531
- Calculated average score: 69.7162958010531
- Created base data_dict with 13 columns
- Added task score: Perplexity = 20.663532257080078
- Final data dict has 14 columns: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']
- === END PROCESSING RESULT TO_DICT ===
- Successfully converted and added result

- Returning 2 processed results

- Found 2 raw results
- Processing result 1/2: EleutherAI/gpt-neo-1.3B

- === PROCESSING RESULT TO_DICT ===
- Processing result for model: EleutherAI/gpt-neo-1.3B
- Raw results: {'perplexity': 5.9609375}
- Model precision: Precision.float16
- Model type: ModelType.PT
- Weight type: WeightType.Original
- Available tasks: ['task0']
- Looking for task: perplexity in results
- Found score for perplexity: 5.9609375
- Converted score: 82.1477223263516
- Calculated average score: 82.1477223263516
- Created base data_dict with 13 columns
- Added task score: Perplexity = 5.9609375
- Final data dict has 14 columns: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']
- === END PROCESSING RESULT TO_DICT ===
- Successfully processed result 1/2: EleutherAI/gpt-neo-1.3B
- Processing result 2/2: openai-community/gpt2

- === PROCESSING RESULT TO_DICT ===
- Processing result for model: openai-community/gpt2
- Raw results: {'perplexity': 20.663532257080078}
- Model precision: Precision.float16
- Model type: ModelType.PT
- Weight type: WeightType.Original
- Available tasks: ['task0']
- Looking for task: perplexity in results
- Found score for perplexity: 20.663532257080078
- Converted score: 69.7162958010531
- Calculated average score: 69.7162958010531
- Created base data_dict with 13 columns
- Added task score: Perplexity = 20.663532257080078
- Final data dict has 14 columns: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']
- === END PROCESSING RESULT TO_DICT ===
- Successfully processed result 2/2: openai-community/gpt2

- Converted to 2 JSON records
- Sample record keys: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']

- Created DataFrame with columns: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']
- DataFrame shape: (2, 14)

- Sorted DataFrame by average

- Selected and rounded columns

- Final DataFrame shape after filtering: (2, 12)
- Final columns: ['T', 'Model', 'Average ⬆️', 'Perplexity', 'Type', 'Architecture', 'Precision', 'Hub License', '#Params (B)', 'Hub ❤️', 'Available on the hub', 'Model sha']
- === FINAL RESULT: DataFrame with 2 rows and 12 columns ===
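The DataFrame steps above condense to a few pandas calls: build from records, sort by the average column, then select and round the display columns. A sketch assuming the column names from the log:

```python
import pandas as pd

COLS = ['T', 'Model', 'Average ⬆️', 'Perplexity', 'Type', 'Architecture',
        'Precision', 'Hub License', '#Params (B)', 'Hub ❤️',
        'Available on the hub', 'Model sha']

def build_leaderboard_df(json_records: list[dict]) -> pd.DataFrame:
    df = pd.DataFrame.from_records(json_records)            # (2, 14)
    df = df.sort_values(by='Average ⬆️', ascending=False)   # "Sorted DataFrame by average"
    df = df[COLS].round(decimals=2)  # drops eval_name/Weight type -> (2, 12)
    return df
```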

- === Initializing Leaderboard ===
- DataFrame shape: (2, 12)
- DataFrame columns: ['T', 'Model', 'Average ⬆️', 'Perplexity', 'Type', 'Architecture', 'Precision', 'Hub License', '#Params (B)', 'Hub ❤️', 'Available on the hub', 'Model sha']
- * Running on local URL: http://0.0.0.0:7860, with SSR ⚡ (experimental, to disable set `ssr=False` in `launch()`)

- To create a public link, set `share=True` in `launch()`.

- === RUNNING PERPLEXITY TEST ===
- Model: openai-community/gpt2-large
- Revision: main
- Precision: float16
- Starting dynamic evaluation for openai-community/gpt2-large
- Running perplexity evaluation...
- Loading model: openai-community/gpt2-large (revision: main)
- Loading tokenizer...

- tokenizer_config.json: 0%| | 0.00/26.0 [00:00<?, ?B/s]
- tokenizer_config.json: 100%|██████████| 26.0/26.0 [00:00<00:00, 183kB/s]

- config.json: 0%| | 0.00/666 [00:00<?, ?B/s]
- config.json: 100%|██████████| 666/666 [00:00<00:00, 7.11MB/s]

- vocab.json: 0%| | 0.00/1.04M [00:00<?, ?B/s]
- vocab.json: 100%|██████████| 1.04M/1.04M [00:00<00:00, 45.7MB/s]
-
- merges.txt: 0%| | 0.00/456k [00:00<?, ?B/s]
- merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 44.9MB/s]
-
- tokenizer.json: 0%| | 0.00/1.36M [00:00<?, ?B/s]
- tokenizer.json: 100%|██████████| 1.36M/1.36M [00:00<00:00, 25.3MB/s]
- Tokenizer loaded successfully
- Loading model...
-
- model.safetensors: 0%| | 0.00/3.25G [00:00<?, ?B/s]
- model.safetensors: 0%| | 3.99M/3.25G [00:01<18:26, 2.93MB/s]
- model.safetensors: 4%|▍ | 138M/3.25G [00:02<00:47, 65.1MB/s]
- model.safetensors: 7%|▋ | 235M/3.25G [00:03<00:46, 65.4MB/s]
- model.safetensors: 28%|██▊ | 905M/3.25G [00:05<00:09, 258MB/s]
- model.safetensors: 46%|████▋ | 1.51G/3.25G [00:06<00:04, 360MB/s]
- model.safetensors: 71%|███████ | 2.31G/3.25G [00:07<00:01, 484MB/s]
- model.safetensors: 98%|█████████▊| 3.18G/3.25G [00:08<00:00, 593MB/s]
- model.safetensors: 100%|██████████| 3.25G/3.25G [00:08<00:00, 390MB/s]
-
- generation_config.json: 0%| | 0.00/124 [00:00<?, ?B/s]
- generation_config.json: 100%|██████████| 124/124 [00:00<00:00, 1.04MB/s]
- Model loaded successfully
- Tokenizing input text...
- Tokenized input shape: torch.Size([1, 141])
- Moved inputs to device: cpu
- Running forward pass...
- `loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.
- Calculated loss: 2.1944427490234375
- Final perplexity: 8.974998474121094
- Perplexity evaluation completed: 8.974998474121094
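The logged loss → perplexity pair (2.1944… → 8.9749…) is the standard exp(mean causal-LM loss) from a single forward pass over the evaluation text. A minimal sketch, assuming some fixed `text` (the actual evaluation text is not shown in the log):

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def compute_perplexity(model_name: str, revision: str, text: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, revision=revision,
        torch_dtype=torch.float16,  # float16 as in the log; float32 is safer on CPU
    )
    inputs = tokenizer(text, return_tensors="pt")  # e.g. torch.Size([1, 141])
    with torch.no_grad():
        # Passing the input ids as labels makes the model return the
        # shifted causal-LM loss, e.g. 2.1944427490234375 here.
        outputs = model(**inputs, labels=inputs["input_ids"])
    return math.exp(outputs.loss.item())  # -> 8.974998474121094
```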
- Created result structure: {'config': {'model_dtype': 'torch.float16', 'model_name': 'openai-community/gpt2-large', 'model_sha': 'main'}, 'results': {'perplexity': {'perplexity': 8.974998474121094}}}
- Saving result to: ./eval-results/openai-community/results_openai-community_gpt2-large_20250726_013038.json
- Result file saved locally
- Uploading to HF dataset: ahmedsqrd/results
- Upload completed successfully
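A sketch of the save-and-upload step, assuming the timestamped path layout visible in the log and a plain huggingface_hub upload (the helper name is illustrative):

```python
import json
import os
from datetime import datetime

from huggingface_hub import HfApi

def save_and_upload_result(result: dict, model_name: str,
                           results_dir: str = "./eval-results",
                           dataset_id: str = "ahmedsqrd/results") -> str:
    org, model = model_name.split("/")
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    rel_path = f"{org}/results_{org}_{model}_{stamp}.json"
    local_path = os.path.join(results_dir, rel_path)
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    with open(local_path, "w") as f:
        json.dump(result, f, indent=2)  # "Result file saved locally"
    HfApi().upload_file(                # "Uploading to HF dataset: ahmedsqrd/results"
        path_or_fileobj=local_path,
        path_in_repo=rel_path,
        repo_id=dataset_id,
        repo_type="dataset",
    )
    return local_path
```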
- Evaluation result - Success: True, Result: 8.974998474121094
- Attempting to refresh leaderboard...
- === REFRESH LEADERBOARD DEBUG ===
- Refreshing leaderboard data...

  === GET_LEADERBOARD_DF DEBUG ===
  Starting leaderboard creation...
  Looking for results in: ./eval-results
- Expected columns: ['T', 'Model', 'Average ⬆️', 'Perplexity', 'Type', 'Architecture', 'Precision', 'Hub License', '#Params (B)', 'Hub ❤️', 'Available on the hub', 'Model sha']
  Benchmark columns: ['Perplexity']

  Searching for result files in: ./eval-results
- Found 8 result files
-
- Processing file: ./eval-results/EleutherAI/results_EleutherAI_gpt-neo-1.3B_20250726_010247.json
- Created result object for: EleutherAI/gpt-neo-1.3B
- Added new result for EleutherAI_gpt-neo-1.3B_float16
-
- Processing file: ./eval-results/openai-community/results_openai-community_gpt2_20250725_231201.json
- Created result object for: openai-community/gpt2
- Added new result for openai-community_gpt2_float16
-
- Processing file: ./eval-results/openai-community/results_openai-community_gpt2_20250725_233155.json
- Created result object for: openai-community/gpt2
- Updated existing result for openai-community_gpt2_float16
-
- Processing file: ./eval-results/openai-community/results_openai-community_gpt2_20250725_235115.json
- Created result object for: openai-community/gpt2
- Updated existing result for openai-community_gpt2_float16
-
- Processing file: ./eval-results/openai-community/results_openai-community_gpt2_20250725_235748.json
- Created result object for: openai-community/gpt2
- Updated existing result for openai-community_gpt2_float16
-
- Processing file: ./eval-results/openai-community/results_openai-community_gpt2_20250726_000358.json
- Created result object for: openai-community/gpt2
- Updated existing result for openai-community_gpt2_float16
-
- Processing file: ./eval-results/openai-community/results_openai-community_gpt2_20250726_000650.json
- Created result object for: openai-community/gpt2
- Updated existing result for openai-community_gpt2_float16
-
- Processing file: ./eval-results/openai-community/results_openai-community_gpt2-large_20250726_013038.json
- Created result object for: openai-community/gpt2-large
- Added new result for openai-community_gpt2-large_float16
-
- Processing 3 evaluation results
-
- Converting result to dict for: EleutherAI/gpt-neo-1.3B
-
- === PROCESSING RESULT TO_DICT ===
- Processing result for model: EleutherAI/gpt-neo-1.3B
- Raw results: {'perplexity': 5.9609375}
- Model precision: Precision.float16
- Model type: ModelType.PT
- Weight type: WeightType.Original
- Available tasks: ['task0']
- Looking for task: perplexity in results
- Found score for perplexity: 5.9609375
- Converted score: 82.1477223263516
- Calculated average score: 82.1477223263516
- Created base data_dict with 13 columns
- Added task score: Perplexity = 5.9609375
- Final data dict has 14 columns: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']
- === END PROCESSING RESULT TO_DICT ===
- Successfully converted and added result
-
- Converting result to dict for: openai-community/gpt2
-
- === PROCESSING RESULT TO_DICT ===
- Processing result for model: openai-community/gpt2
- Raw results: {'perplexity': 20.663532257080078}
- Model precision: Precision.float16
- Model type: ModelType.PT
- Weight type: WeightType.Original
- Available tasks: ['task0']
- Looking for task: perplexity in results
- Found score for perplexity: 20.663532257080078
- Converted score: 69.7162958010531
- Calculated average score: 69.7162958010531
- Created base data_dict with 13 columns
- Added task score: Perplexity = 20.663532257080078
- Final data dict has 14 columns: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']
- === END PROCESSING RESULT TO_DICT ===
- Successfully converted and added result
-
- Converting result to dict for: openai-community/gpt2-large
-
- === PROCESSING RESULT TO_DICT ===
- Processing result for model: openai-community/gpt2-large
- Raw results: {'perplexity': 8.974998474121094}
- Model precision: Precision.float16
- Model type: ModelType.PT
- Weight type: WeightType.Original
- Available tasks: ['task0']
- Looking for task: perplexity in results
- Found score for perplexity: 8.974998474121094
- Converted score: 78.05557235640035
- Calculated average score: 78.05557235640035
- Created base data_dict with 13 columns
- Added task score: Perplexity = 8.974998474121094
- Final data dict has 14 columns: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']
- === END PROCESSING RESULT TO_DICT ===
- Successfully converted and added result
-
- Returning 3 processed results
-
- Found 3 raw results
- Processing result 1/3: EleutherAI/gpt-neo-1.3B
-
- === PROCESSING RESULT TO_DICT ===
- Processing result for model: EleutherAI/gpt-neo-1.3B
- Raw results: {'perplexity': 5.9609375}
- Model precision: Precision.float16
- Model type: ModelType.PT
- Weight type: WeightType.Original
- Available tasks: ['task0']
- Looking for task: perplexity in results
- Found score for perplexity: 5.9609375
- Converted score: 82.1477223263516
- Calculated average score: 82.1477223263516
- Created base data_dict with 13 columns
- Added task score: Perplexity = 5.9609375
- Final data dict has 14 columns: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']
- === END PROCESSING RESULT TO_DICT ===
- Successfully processed result 1/3: EleutherAI/gpt-neo-1.3B
- Processing result 2/3: openai-community/gpt2
-
- === PROCESSING RESULT TO_DICT ===
- Processing result for model: openai-community/gpt2
- Raw results: {'perplexity': 20.663532257080078}
- Model precision: Precision.float16
- Model type: ModelType.PT
- Weight type: WeightType.Original
- Available tasks: ['task0']
- Looking for task: perplexity in results
- Found score for perplexity: 20.663532257080078
- Converted score: 69.7162958010531
- Calculated average score: 69.7162958010531
- Created base data_dict with 13 columns
- Added task score: Perplexity = 20.663532257080078
- Final data dict has 14 columns: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']
- === END PROCESSING RESULT TO_DICT ===
- Successfully processed result 2/3: openai-community/gpt2
- Processing result 3/3: openai-community/gpt2-large
-
- === PROCESSING RESULT TO_DICT ===
- Processing result for model: openai-community/gpt2-large
- Raw results: {'perplexity': 8.974998474121094}
- Model precision: Precision.float16
- Model type: ModelType.PT
- Weight type: WeightType.Original
- Available tasks: ['task0']
- Looking for task: perplexity in results
- Found score for perplexity: 8.974998474121094
- Converted score: 78.05557235640035
- Calculated average score: 78.05557235640035
- Created base data_dict with 13 columns
- Added task score: Perplexity = 8.974998474121094
- Final data dict has 14 columns: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']
- === END PROCESSING RESULT TO_DICT ===
- Successfully processed result 3/3: openai-community/gpt2-large
-
- Converted to 3 JSON records
- Sample record keys: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']
-
- Created DataFrame with columns: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']
- DataFrame shape: (3, 14)
-
- Sorted DataFrame by average
-
- Selected and rounded columns
-
- Final DataFrame shape after filtering: (3, 12)
- Final columns: ['T', 'Model', 'Average ⬆️', 'Perplexity', 'Type', 'Architecture', 'Precision', 'Hub License', '#Params (B)', 'Hub ❤️', 'Available on the hub', 'Model sha']
- === FINAL RESULT: DataFrame with 3 rows and 12 columns ===
- get_leaderboard_df returned: <class 'pandas.core.frame.DataFrame'>
- DataFrame shape: (3, 12)
- DataFrame columns: ['T', 'Model', 'Average ⬆️', 'Perplexity', 'Type', 'Architecture', 'Precision', 'Hub License', '#Params (B)', 'Hub ❤️', 'Available on the hub', 'Model sha']
- DataFrame empty: False
- Final DataFrame for leaderboard - Shape: (3, 12), Columns: ['T', 'Model', 'Average ⬆️', 'Perplexity', 'Type', 'Architecture', 'Precision', 'Hub License', '#Params (B)', 'Hub ❤️', 'Available on the hub', 'Model sha']
- Creating leaderboard component...
-
- === Initializing Leaderboard ===
- DataFrame shape: (3, 12)
- DataFrame columns: ['T', 'Model', 'Average ⬆️', 'Perplexity', 'Type', 'Architecture', 'Precision', 'Hub License', '#Params (B)', 'Hub ❤️', 'Available on the hub', 'Model sha']
- Leaderboard component created successfully
- Leaderboard refresh successful
- Traceback (most recent call last):
-   File "/usr/local/lib/python3.10/site-packages/gradio/queueing.py", line 625, in process_events
-     response = await route_utils.call_process_api(
-   File "/usr/local/lib/python3.10/site-packages/gradio/route_utils.py", line 322, in call_process_api
-     output = await app.get_blocks().process_api(
-   File "/usr/local/lib/python3.10/site-packages/gradio/blocks.py", line 2106, in process_api
-     data = await self.postprocess_data(block_fn, result["prediction"], state)
-   File "/usr/local/lib/python3.10/site-packages/gradio/blocks.py", line 1899, in postprocess_data
-     state[block._id] = block.__class__(**kwargs)
-   File "/usr/local/lib/python3.10/site-packages/gradio/component_meta.py", line 181, in wrapper
-     return fn(self, **kwargs)
-   File "/usr/local/lib/python3.10/site-packages/gradio_leaderboard/leaderboard.py", line 126, in __init__
-     raise ValueError("Leaderboard component must have a value set.")
- ValueError: Leaderboard component must have a value set.
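The traceback shows gradio's postprocess step re-instantiating the Leaderboard from a callback's return value; gradio_leaderboard raises unless `value` is set. A hedged sketch of the usual fix, making the refresh callback always return a component whose value is a DataFrame (get_leaderboard_df, EVAL_RESULTS_PATH and COLS are assumed names from this repo):

```python
import pandas as pd
from gradio_leaderboard import Leaderboard

def refresh_leaderboard() -> Leaderboard:
    # Assumed repo helpers: get_leaderboard_df, EVAL_RESULTS_PATH, COLS.
    df = get_leaderboard_df(EVAL_RESULTS_PATH)
    if df is None or df.empty:
        # Never hand gradio a component without a value, or postprocess
        # raises "Leaderboard component must have a value set."
        df = pd.DataFrame(columns=COLS)
    return Leaderboard(value=df)
```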

+ NCHMARK_COLS: ['Perplexity']
+ === END COLUMN SETUP ===
+ 🔧 CHECKING MODEL TRACING AVAILABILITY...
+ - Model tracing path: /home/user/app/src/evaluation/../../model-tracing
+ - Path exists: True
+ - main.py exists: True
+ 🎯 Final MODEL_TRACING_AVAILABLE = True
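A plausible reconstruction of the probe behind these four lines: resolve a sibling model-tracing checkout relative to src/evaluation/ and check that its main.py exists (paths taken from the log, variable names assumed):

```python
import os

MODEL_TRACING_PATH = os.path.join(
    os.path.dirname(__file__), "..", "..", "model-tracing"
)  # -> /home/user/app/src/evaluation/../../model-tracing
MODEL_TRACING_AVAILABLE = (
    os.path.isdir(MODEL_TRACING_PATH)
    and os.path.isfile(os.path.join(MODEL_TRACING_PATH, "main.py"))
)
print(f"🎯 Final MODEL_TRACING_AVAILABLE = {MODEL_TRACING_AVAILABLE}")
```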

+ .gitattributes: 0%| | 0.00/2.46k [00:00<?, ?B/s]
+ .gitattributes: 100%|██████████| 2.46k/2.46k [00:00<00:00, 10.1MB/s]

+ (…)therAI_gpt-neo-1.3B_20250726_010247.json: 0%| | 0.00/202 [00:00<?, ?B/s]
+ (…)therAI_gpt-neo-1.3B_20250726_010247.json: 100%|██████████| 202/202 [00:00<00:00, 748kB/s]

+ (…)s_facebook_opt-125m_20250726_020655.json: 0%| | 0.00/205 [00:00<?, ?B/s]
+ (…)s_facebook_opt-125m_20250726_020655.json: 100%|██████████| 205/205 [00:00<00:00, 909kB/s]

+ (…)s_facebook_opt-350m_20250726_021737.json: 0%| | 0.00/205 [00:00<?, ?B/s]
+ (…)s_facebook_opt-350m_20250726_021737.json: 100%|██████████| 205/205 [00:00<00:00, 850kB/s]

+ (…)ommunity_gpt2-large_20250726_013038.json: 0%| | 0.00/214 [00:00<?, ?B/s]
+ (…)ommunity_gpt2-large_20250726_013038.json: 100%|██████████| 214/214 [00:00<00:00, 1.03MB/s]

+ (…)mmunity_gpt2-medium_20250726_015555.json: 0%| | 0.00/216 [00:00<?, ?B/s]
+ (…)mmunity_gpt2-medium_20250726_015555.json: 100%|██████████| 216/216 [00:00<00:00, 730kB/s]

+ (…)enai-community_gpt2_20250725_231201.json: 0%| | 0.00/209 [00:00<?, ?B/s]
+ (…)enai-community_gpt2_20250725_231201.json: 100%|██████████| 209/209 [00:00<00:00, 533kB/s]

+ (…)enai-community_gpt2_20250725_233155.json: 0%| | 0.00/209 [00:00<?, ?B/s]
+ (…)enai-community_gpt2_20250725_233155.json: 100%|██████████| 209/209 [00:00<00:00, 905kB/s]

+ (…)enai-community_gpt2_20250725_235115.json: 0%| | 0.00/209 [00:00<?, ?B/s]
+ (…)enai-community_gpt2_20250725_235115.json: 100%|██████████| 209/209 [00:00<00:00, 801kB/s]

+ (…)enai-community_gpt2_20250725_235748.json: 0%| | 0.00/209 [00:00<?, ?B/s]
+ (…)enai-community_gpt2_20250725_235748.json: 100%|██████████| 209/209 [00:00<00:00, 856kB/s]

+ (…)enai-community_gpt2_20250726_000358.json: 0%| | 0.00/209 [00:00<?, ?B/s]
+ (…)enai-community_gpt2_20250726_000358.json: 100%|██████████| 209/209 [00:00<00:00, 696kB/s]

+ (…)enai-community_gpt2_20250726_000650.json: 0%| | 0.00/209 [00:00<?, ?B/s]
+ (…)enai-community_gpt2_20250726_000650.json: 100%|██████████| 209/209 [00:00<00:00, 792kB/s]

+ (…)enai-community_gpt2_20250726_015147.json: 0%| | 0.00/209 [00:00<?, ?B/s]
+ (…)enai-community_gpt2_20250726_015147.json: 100%|██████████| 209/209 [00:00<00:00, 1.12MB/s]

+ 🚀 STARTING GRADIO APP INITIALIZATION
+ 📊 Initializing allowed models...

+ 🚀 INITIALIZING ALLOWED MODELS
+ 📋 Models to initialize: ['lmsys/vicuna-7b-v1.5', 'ibm-granite/granite-7b-base', 'EleutherAI/llemma_7b']

+ 🧹 CLEANING NON-ALLOWED RESULT FILES
+ 🗑️ Removing non-allowed model result: ./eval-results/EleutherAI/results_EleutherAI_gpt-neo-1.3B_20250726_010247.json (model: EleutherAI/gpt-neo-1.3B)
+ 🗑️ Removing non-allowed model result: ./eval-results/facebook/results_facebook_opt-125m_20250726_020655.json (model: facebook/opt-125m)
+ 🗑️ Removing non-allowed model result: ./eval-results/facebook/results_facebook_opt-350m_20250726_021737.json (model: facebook/opt-350m)
+ 🗑️ Removing non-allowed model result: ./eval-results/openai-community/results_openai-community_gpt2-large_20250726_013038.json (model: openai-community/gpt2-large)
+ 🗑️ Removing non-allowed model result: ./eval-results/openai-community/results_openai-community_gpt2-medium_20250726_015555.json (model: openai-community/gpt2-medium)
+ 🗑️ Removing non-allowed model result: ./eval-results/openai-community/results_openai-community_gpt2_20250725_231201.json (model: openai-community/gpt2)
+ 🗑️ Removing non-allowed model result: ./eval-results/openai-community/results_openai-community_gpt2_20250725_233155.json (model: openai-community/gpt2)
+ 🗑️ Removing non-allowed model result: ./eval-results/openai-community/results_openai-community_gpt2_20250725_235115.json (model: openai-community/gpt2)
+ 🗑️ Removing non-allowed model result: ./eval-results/openai-community/results_openai-community_gpt2_20250725_235748.json (model: openai-community/gpt2)
+ 🗑️ Removing non-allowed model result: ./eval-results/openai-community/results_openai-community_gpt2_20250726_000358.json (model: openai-community/gpt2)
+ 🗑️ Removing non-allowed model result: ./eval-results/openai-community/results_openai-community_gpt2_20250726_000650.json (model: openai-community/gpt2)
+ 🗑️ Removing non-allowed model result: ./eval-results/openai-community/results_openai-community_gpt2_20250726_015147.json (model: openai-community/gpt2)
+ ✅ Removed 12 non-allowed result files

+ 🔧 CREATING RESULT FILE FOR: lmsys/vicuna-7b-v1.5
+ 📁 Result file path: ./eval-results/lmsys_vicuna_7b_v1.5_float16.json
+ ✅ Created result file: ./eval-results/lmsys_vicuna_7b_v1.5_float16.json

+ 🔧 CREATING RESULT FILE FOR: ibm-granite/granite-7b-base
+ 📁 Result file path: ./eval-results/ibm_granite_granite_7b_base_float16.json
+ ✅ Created result file: ./eval-results/ibm_granite_granite_7b_base_float16.json

+ 🔧 CREATING RESULT FILE FOR: EleutherAI/llemma_7b
+ 📁 Result file path: ./eval-results/EleutherAI_llemma_7b_float16.json
+ ✅ Created result file: ./eval-results/EleutherAI_llemma_7b_float16.json
+ ✅ Initialized 3 model result files
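A sketch of the whole initialization pass implied by this block: delete result files whose model is not on the allow-list, then write one placeholder result file per allowed model. The filename rule reproduces the logged paths; the placeholder schema is an assumption:

```python
import json
import os

ALLOWED_MODELS = [
    "lmsys/vicuna-7b-v1.5",
    "ibm-granite/granite-7b-base",
    "EleutherAI/llemma_7b",
]

def init_allowed_models(results_dir: str = "./eval-results") -> None:
    # 1. Purge results for models outside the allow-list.
    removed = 0
    for root, _, files in os.walk(results_dir):
        for file in files:
            if not file.endswith(".json"):
                continue
            path = os.path.join(root, file)
            with open(path) as f:
                model = json.load(f)["config"]["model_name"]
            if model not in ALLOWED_MODELS:
                os.remove(path)
                removed += 1
    print(f"✅ Removed {removed} non-allowed result files")

    # 2. One placeholder per allowed model, e.g.
    #    lmsys/vicuna-7b-v1.5 -> ./eval-results/lmsys_vicuna_7b_v1.5_float16.json
    for model in ALLOWED_MODELS:
        safe_name = model.replace("/", "_").replace("-", "_")
        path = os.path.join(results_dir, f"{safe_name}_float16.json")
        placeholder = {
            "config": {"model_dtype": "torch.float16",
                       "model_name": model, "model_sha": "main"},
            "results": {},  # filled in once an evaluation runs (schema assumed)
        }
        with open(path, "w") as f:
            json.dump(placeholder, f, indent=2)
    print(f"✅ Initialized {len(ALLOWED_MODELS)} model result files")
```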
+ 📊 Creating initial results DataFrame...

+ 📊 CREATE_RESULTS_DATAFRAME CALLED

  === GET_LEADERBOARD_DF DEBUG ===
  Starting leaderboard creation...
  Looking for results in: ./eval-results
+ Expected columns: ['T', 'Model', 'Average ⬆️', 'Perplexity', 'Match P-Value ⬇️', 'Type', 'Architecture', 'Precision', 'Hub License', '#Params (B)', 'Hub ❤️', 'Available on the hub', 'Model sha']
  Benchmark columns: ['Perplexity']

  Searching for result files in: ./eval-results
+ Found 0 result files
+
+ Processing 0 evaluation results
+
+ Returning 0 processed results
+
+ Found 0 raw results
+ No raw data found, creating empty DataFrame
+ Creating empty fallback DataFrame...
+ Empty DataFrame created with columns: ['T', 'Model', 'Average ⬆️', 'Perplexity', 'Match P-Value ⬇️', 'Type', 'Architecture', 'Precision', 'Hub License', '#Params (B)', 'Hub ❤️', 'Available on the hub', 'Model sha']
+ 📋 Retrieved leaderboard df: (0, 13)
+ ⚠️ DataFrame is None or empty, returning empty DataFrame
+ Initial DataFrame created with shape: (0, 6)
+ 📋 Columns: ['Model', 'Perplexity', 'Match P-Value', 'Average Score', 'Type', 'Precision']
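The (0, 13) → (0, 6) transition suggests the full leaderboard frame is reduced to a smaller display frame, with an empty fallback when no results exist; a sketch under that assumption (column renames inferred from the two logged column lists):

```python
import pandas as pd

DISPLAY_COLS = ["Model", "Perplexity", "Match P-Value",
                "Average Score", "Type", "Precision"]

def to_display_df(leaderboard_df: "pd.DataFrame | None") -> pd.DataFrame:
    if leaderboard_df is None or leaderboard_df.empty:
        return pd.DataFrame(columns=DISPLAY_COLS)  # shape (0, 6)
    renamed = leaderboard_df.rename(columns={
        "Match P-Value ⬇️": "Match P-Value",
        "Average ⬆️": "Average Score",
    })
    return renamed[DISPLAY_COLS]
```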
+ 🎨 Creating Gradio interface...
+ 🎯 GRADIO INTERFACE SETUP COMPLETE
+ 🚀 LAUNCHING GRADIO APP WITH MODEL TRACING INTEGRATION
+ 📊 Features enabled:
+ - Perplexity evaluation
+ - Model trace p-value computation (vs GPT-2 base)
+ - Match statistic with alignment
+ 🎉 Ready to accept requests!
+ * Running on local URL: http://0.0.0.0:7860, with SSR ⚡ (experimental, to disable set `ssr=False` in `launch()`)
src/leaderboard/read_evals.py CHANGED
@@ -192,12 +192,10 @@ def get_raw_eval_results(results_path: str) -> list[EvalResult]:
     model_result_filepaths = []
 
     for root, _, files in os.walk(results_path):
-        # We should only have json files in model results
-        if len(files) == 0 or any([not f.endswith(".json") for f in files]):
-            continue
-
+        # Process all JSON files, regardless of other files in the directory
         for file in files:
-            model_result_filepaths.append(os.path.join(root, file))
+            if file.endswith(".json"):
+                model_result_filepaths.append(os.path.join(root, file))
 
     sys.stderr.write(f"Found {len(model_result_filepaths)} result files\n")
     sys.stderr.flush()
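For reference, the scan loop after this change, as a standalone runnable sketch (the wrapper name find_result_files is illustrative; in the repo the loop lives inside get_raw_eval_results):

```python
import os
import sys

def find_result_files(results_path: str) -> list[str]:
    """Collect every *.json under results_path, ignoring other files."""
    model_result_filepaths = []
    for root, _, files in os.walk(results_path):
        # Process all JSON files, regardless of other files in the directory
        for file in files:
            if file.endswith(".json"):
                model_result_filepaths.append(os.path.join(root, file))
    sys.stderr.write(f"Found {len(model_result_filepaths)} result files\n")
    sys.stderr.flush()
    return model_result_filepaths
```

The old version skipped an entire directory as soon as it contained any non-JSON file, so a stray file such as the downloaded .gitattributes could make the scan report "Found 0 result files" even when results were present; filtering per file avoids that.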