tosi-n committed on
Commit
b262a9b
·
verified ·
1 Parent(s): cae9ecf

Upload folder using huggingface_hub

Files changed (5)
  1. .github/workflows/update_space.yml +20 -0
  2. README.md +124 -6
  3. app.py +497 -0
  4. jenesys.jpg +0 -0
  5. requirements.txt +4 -0
.github/workflows/update_space.yml ADDED
@@ -0,0 +1,20 @@
+ name: Sync to Hugging Face Space
+ on:
+   push:
+     branches: [main]
+   workflow_dispatch:
+
+ jobs:
+   sync:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v3
+         with:
+           fetch-depth: 0
+       - name: Push to HF Space
+         env:
+           HF_TOKEN: ${{ secrets.HF_TOKEN }}
+         run: |
+           git config --global user.email "[email protected]"
+           git config --global user.name "GitHub Actions"
+           git push https://jenesys-ai:[email protected]/spaces/jenesys-ai/ai_bookkeeper_leaderboard main
README.md CHANGED
@@ -1,12 +1,130 @@
  ---
- title: AI Bookkeeper Leaderboard
- emoji: ⚑
- colorFrom: gray
- colorTo: green
  sdk: gradio
- sdk_version: 5.12.0
  app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: AI_Bookkeeper_Leaderboard
+ emoji: 📊
+ colorFrom: blue
+ colorTo: purple
  sdk: gradio
+ sdk_version: 4.44.1
  app_file: app.py
  pinned: false
+ license: mit
  ---

+ # AI Bookkeeper Leaderboard
+
+ A comprehensive benchmark for evaluating AI models on accounting document processing tasks. It focuses on real-world accounting scenarios and provides detailed metrics across key capabilities.
+
+ [View Live Demo](https://huggingface.co/spaces/jenesys-ai/ai_bookkeeper_leaderboard)
+
+ ## Models Evaluated
+
+ - Ark II (Jenesys AI) - 17.94s inference time
+ - Ark I (Jenesys AI) - 7.955s inference time
+ - Claude-3-5-Sonnet (Anthropic) - 26.51s inference time
+ - GPT-4o (OpenAI) - 19.88s inference time
+
+ ## Categories and Raw Data Points
+
+ The benchmark evaluates models across four main categories, each with specific raw data points:
+
+ 1. **Document Understanding** (25%)
+    - Invoice ID Detection
+    - Date Field Recognition
+    - Line Items Total
+    Average = (Invoice ID + Date + Line Items Total) / 3
+
+ 2. **Data Extraction** (25%)
+    - Supplier Information
+    - Line Items Quantity
+    - Line Items Description
+    - VAT Number
+    - Line Items Total
+    Average = (Supplier + Quantity + Description + VAT_Number + Total) / 5
+
+ 3. **Bookkeeping Intelligence** (25%)
+    - Discount Total
+    - Line Items VAT
+    - VAT Exclusive Amount
+    - VAT Number Validation
+    - Discount Verification
+    Average = (Discount + VAT_Items + VAT_Exclusive + VAT_Number + Discount_Verification) / 5
+
+ 4. **Error Handling** (25%)
+    - Mean Accuracy (direct measure)
+
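+ For example, Ark II's Document Understanding score is the mean of its three raw data points: (0.733 + 0.887 + 0.803) / 3 ≈ 0.808, i.e. 80.8%. With each category weighted at 25%, Ark II's weighted overall score works out to 0.25 × (0.808 + 0.749 + 0.730 + 0.718) ≈ 0.751.
+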
+ ## Model Performance
+
+ The values in parentheses are the raw data points for each category, in the order listed above.
+
+ ### Ark II
+ - Document Understanding: 80.8% (0.733, 0.887, 0.803)
+ - Data Extraction: 74.9% (0.735, 0.882, 0.555, 0.768, 0.803)
+ - Bookkeeping Intelligence: 73.0% (0.800, 0.590, 0.694, 0.768, 0.800)
+ - Error Handling: 71.8%
+
+ ### Ark I
+ - Document Understanding: 78.5% (0.747, 0.905, 0.703)
+ - Data Extraction: 70.9% (0.792, 0.811, 0.521, 0.719, 0.703)
+ - Bookkeeping Intelligence: 56.9% (0.600, 0.434, 0.491, 0.719, 0.600)
+ - Error Handling: 64.1%
+
+ ### Claude-3-5-Sonnet
+ - Document Understanding: 70.4% (0.773, 0.806, 0.533)
+ - Data Extraction: 60.9% (0.706, 0.597, 0.504, 0.708, 0.533)
+ - Bookkeeping Intelligence: 62.8% (0.600, 0.524, 0.706, 0.708, 0.600)
+ - Error Handling: 67.5%
+
+ ### GPT-4o
+ - Document Understanding: 69.6% (0.600, 0.917, 0.571)
+ - Data Extraction: 68.9% (0.818, 0.722, 0.619, 0.714, 0.571)
+ - Bookkeeping Intelligence: 25.5% (0.000, 0.313, 0.250, 0.714, 0.000)
+ - Error Handling: 68.3%
+
+ ## Key Findings
+
+ - Ark II leads in overall performance, particularly in document understanding (80.8%)
+ - Ark I shows strong performance relative to its size, especially in document understanding (78.5%)
+ - Claude-3-5-Sonnet maintains consistent performance across categories
+ - GPT-4o is competitive in document understanding and data extraction but struggles with bookkeeping intelligence tasks
+ - Ark I delivers the fastest inference time (7.955s)
+
+ ## Interactive Dashboard Features
+
+ The dashboard provides several interactive visualizations:
+
+ 1. **Overall Leaderboard**: Comprehensive view of all models' performance metrics
+ 2. **Category Comparison**: Bar chart comparing all models across the four main categories
+ 3. **Combined Radar Chart**: Multi-model comparison showing relative strengths and weaknesses
+ 4. **Detailed Metrics**: Interactive comparison table showing differences between the selected model and Ark II
+
+ ## Running the Leaderboard
+
+ 1. Install dependencies:
+ ```bash
+ pip install gradio pandas plotly
+ ```
+
+ 2. Run the app:
+ ```bash
+ python app.py
+ ```
+
+ 3. Open the provided URL in your browser to view the interactive dashboard.
+
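+ Alternatively, the dependencies can be installed from the `requirements.txt` included in this repo:
+ ```bash
+ pip install -r requirements.txt
+ ```
+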
+ ## Visualization Features
+
+ - Color-coded performance indicators
+ - Comparative analysis with Ark II as the baseline
+ - Interactive model selection for detailed comparisons
+ - Multi-model radar chart for performance pattern analysis
+ - Dynamic updates of comparative metrics
+
+ ## Contributing
+
+ To add new model evaluations:
+ 1. Add model scores following the established format in the `MODELS` dictionary (a sketch of an entry is shown below)
+ 2. Include all required metrics for each category
+ 3. Provide model metadata (version, type, provider, size, inference time)
+ 4. Follow the existing structure in `app.py`
+
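+ A minimal sketch of a new entry (the model name and all scores below are placeholders, not benchmark results), mirroring the structure of the existing entries in `app.py`:
+
+ ```python
+ MODELS["Example-Model"] = {
+     "version": "example-model-v1",
+     "type": "Text + Vision",
+     "provider": "Example Provider",
+     "inference_time": "12.34s",
+     "scores": {
+         # "Overall" is stored for reference; the dashboard recomputes category
+         # averages from the raw metrics and skips the "Overall" key.
+         "Document Understanding": {
+             "Invoice ID": 0.0, "Date of Invoice": 0.0, "Line Items Total": 0.0,
+             "Overall": 0.0
+         },
+         "Data Extraction": {
+             "Supplier": 0.0, "Line Items Quantity": 0.0, "Line Items Description": 0.0,
+             "VAT Number": 0.0, "Line Items Total": 0.0, "Overall": 0.0
+         },
+         "Bookkeeping Intelligence": {
+             "Discount Total": 0.0, "Line Items VAT": 0.0, "VAT Exclusive": 0.0,
+             "VAT Number": 0.0, "Discount Verification": 0.0, "Overall": 0.0
+         },
+         "Error Handling": {
+             "Mean Accuracy": 0.0, "Overall": 0.0
+         }
+     }
+ }
+ ```
+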
+ ## License
+
+ MIT License
app.py ADDED
@@ -0,0 +1,497 @@
+ import gradio as gr
+ import pandas as pd
+ import plotly.graph_objects as go
+ import plotly.express as px
+ from datetime import datetime
+ import os
+ import base64
+
+ # Define the benchmark categories and their component metrics
+ CATEGORIES = {
+     "Document Understanding": {
+         "metrics": [
+             "Invoice ID Detection",
+             "Date Field Recognition",
+             "Address Block Parsing",
+             "Table Structure Recognition"
+         ],
+         "weight": 0.25
+     },
+     "Data Extraction": {
+         "metrics": [
+             "Line Item Extraction",
+             "Numerical Value Accuracy",
+             "Text Field Accuracy",
+             "Field Completeness"
+         ],
+         "weight": 0.25
+     },
+     "Bookkeeping Intelligence": {
+         "metrics": [
+             "VAT Calculation",
+             "Total Reconciliation",
+             "Tax Code Assignment",
+             "Account Classification"
+         ],
+         "weight": 0.25
+     },
+     "Error Handling": {
+         "metrics": [
+             "Validation Rules",
+             "Inconsistency Detection",
+             "Missing Data Handling",
+             "Format Validation"
+         ],
+         "weight": 0.25
+     }
+ }
+
+ # Updated benchmark data with real metrics
+ MODELS = {
+     "Ark II": {
+         "version": "ark-ii-v1",
+         "type": "Text + Vision",
+         "provider": "Jenesys AI",
+         "inference_time": "17.94s",
+         "scores": {
+             "Document Understanding": {
+                 "Invoice ID": 0.733,
+                 "Date of Invoice": 0.887,
+                 "Line Items Total": 0.803,
+                 "Overall": 0.808
+             },
+             "Data Extraction": {
+                 "Supplier": 0.735,
+                 "Line Items Quantity": 0.882,
+                 "Line Items Description": 0.555,
+                 "VAT Number": 0.768,
+                 "Line Items Total": 0.803,
+                 "Overall": 0.749
+             },
+             "Bookkeeping Intelligence": {
+                 "Discount Total": 0.800,
+                 "Line Items VAT": 0.590,
+                 "VAT Exclusive": 0.694,
+                 "VAT Number": 0.768,
+                 "Discount Verification": 0.800,
+                 "Overall": 0.730
+             },
+             "Error Handling": {
+                 "Mean Accuracy": 0.718,
+                 "Overall": 0.718
+             }
+         }
+     },
+     "Claude-3-5-Sonnet": {
+         "version": "claude-3-5-sonnet-20241022",
+         "type": "Text + Vision",
+         "provider": "Anthropic",
+         "inference_time": "26.51s",
+         "scores": {
+             "Document Understanding": {
+                 "Invoice ID": 0.773,
+                 "Date of Invoice": 0.806,
+                 "Line Items Total": 0.533,
+                 "Overall": 0.704
+             },
+             "Data Extraction": {
+                 "Supplier": 0.706,
+                 "Line Items Quantity": 0.597,
+                 "Line Items Description": 0.504,
+                 "VAT Number": 0.708,
+                 "Line Items Total": 0.533,
+                 "Overall": 0.609
+             },
+             "Bookkeeping Intelligence": {
+                 "Discount Total": 0.600,
+                 "Line Items VAT": 0.524,
+                 "VAT Exclusive": 0.706,
+                 "VAT Number": 0.708,
+                 "Discount Verification": 0.600,
+                 "Overall": 0.628
+             },
+             "Error Handling": {
+                 "Mean Accuracy": 0.675,
+                 "Overall": 0.675
+             }
+         }
+     },
+     "GPT-4o": {
+         "version": "gpt-4o",
+         "type": "Text + Vision",
+         "provider": "OpenAI",
+         "inference_time": "19.88s",
+         "scores": {
+             "Document Understanding": {
+                 "Invoice ID": 0.600,
+                 "Date of Invoice": 0.917,
+                 "Line Items Total": 0.571,
+                 "Overall": 0.696
+             },
+             "Data Extraction": {
+                 "Supplier": 0.818,
+                 "Line Items Quantity": 0.722,
+                 "Line Items Description": 0.619,
+                 "VAT Number": 0.714,
+                 "Line Items Total": 0.571,
+                 "Overall": 0.689
+             },
+             "Bookkeeping Intelligence": {
+                 "Discount Total": 0.000,
+                 "Line Items VAT": 0.313,
+                 "VAT Exclusive": 0.250,
+                 "VAT Number": 0.714,
+                 "Discount Verification": 0.000,
+                 "Overall": 0.255
+             },
+             "Error Handling": {
+                 "Mean Accuracy": 0.683,
+                 "Overall": 0.683
+             }
+         }
+     },
+     "Ark I": {
+         "version": "ark-i-v1",
+         "type": "Text + Vision",
+         "provider": "Jenesys AI",
+         "inference_time": "7.955s",
+         "scores": {
+             "Document Understanding": {
+                 "Invoice ID": 0.747,
+                 "Date of Invoice": 0.905,
+                 "Line Items Total": 0.703,
+                 "Overall": 0.785
+             },
+             "Data Extraction": {
+                 "Supplier": 0.792,
+                 "Line Items Quantity": 0.811,
+                 "Line Items Description": 0.521,
+                 "VAT Number": 0.719,
+                 "Line Items Total": 0.703,
+                 "Overall": 0.709
+             },
+             "Bookkeeping Intelligence": {
+                 "Discount Total": 0.600,
+                 "Line Items VAT": 0.434,
+                 "VAT Exclusive": 0.491,
+                 "VAT Number": 0.719,
+                 "Discount Verification": 0.600,
+                 "Overall": 0.569
+             },
+             "Error Handling": {
+                 "Mean Accuracy": 0.641,
+                 "Overall": 0.641
+             }
+         }
+     }
+ }
+
+ def calculate_category_score(scores):
+     """Calculate average score for a category's metrics."""
+     # Skip 'Overall' when calculating average
+     metrics = {k: v for k, v in scores.items() if k != 'Overall'}
+     return sum(metrics.values()) / len(metrics)
+
+ def calculate_overall_score(model_data):
+     """Calculate the weighted average score across all categories."""
+     category_scores = {}
+     for category, metrics in model_data["scores"].items():
+         # Skip 'Overall' when calculating
+         category_metrics = {k: v for k, v in metrics.items() if k != 'Overall'}
+         category_scores[category] = sum(category_metrics.values()) / len(category_metrics) * CATEGORIES[category]["weight"]
+     return sum(category_scores.values())
+
+ def create_leaderboard_df():
+     """Create a DataFrame for the leaderboard with detailed metrics."""
+     data = []
+     for model_name, model_info in MODELS.items():
+         # Calculate category scores
+         category_scores = {
+             category: calculate_category_score(metrics)
+             for category, metrics in model_info["scores"].items()
+         }
+
+         # Use Error Handling score as Average Score
+         error_handling_score = calculate_category_score(model_info["scores"]["Error Handling"])
+
+         row = {
+             "Model": model_name,
+             "Version": model_info["version"],
+             "Type": model_info["type"],
+             "Provider": model_info["provider"],
+             "Average Score": error_handling_score,  # Using Error Handling score
+             **category_scores
+         }
+         data.append(row)
+
+     df = pd.DataFrame(data)
+     return df.sort_values("Average Score", ascending=False)
+
+
+ def create_category_comparison():
+     """Create a bar chart comparing all models across categories."""
+     df = create_leaderboard_df()
+     df_melted = df.melt(
+         id_vars=["Model"],
+         value_vars=list(CATEGORIES.keys()),
+         var_name="Category",
+         value_name="Score"
+     )
+
+     fig = px.bar(
+         df_melted,
+         x="Category",
+         y="Score",
+         color="Model",
+         barmode="group",
+         title="Model Performance by Category",
+         range_y=[0, 1.0]
+     )
+
+     fig.update_layout(
+         xaxis_title="Category",
+         yaxis_title="Score",
+         legend_title="Model",
+         font=dict(size=14),
+         title=dict(
+             text="Model Performance by Category",
+             x=0.5,
+             y=0.95,
+             xanchor='center',
+             yanchor='top',
+             font=dict(size=20)
+         ),
+         yaxis=dict(
+             tickmode='array',
+             ticktext=['0%', '20%', '40%', '60%', '80%', '100%'],
+             tickvals=[0, 0.2, 0.4, 0.6, 0.8, 1.0],
+             gridcolor='rgba(0, 0, 0, 0.1)',
+             zeroline=True,
+             zerolinecolor='rgba(0, 0, 0, 0.2)',
+             zerolinewidth=1
+         ),
+         xaxis=dict(
+             tickangle=-45,
+             gridcolor='rgba(0, 0, 0, 0.1)'
+         ),
+         bargap=0.2,
+         bargroupgap=0.1,
+         paper_bgcolor='rgba(255, 255, 255, 0.9)',
+         plot_bgcolor='rgba(255, 255, 255, 0.9)',
+         margin=dict(t=100, b=100, l=100, r=20),
+         showlegend=True,
+         legend=dict(
+             yanchor="top",
+             y=1,
+             xanchor="left",
+             x=1.02,
+             bgcolor='rgba(255, 255, 255, 0.9)',
+             bordercolor='rgba(0, 0, 0, 0.1)',
+             borderwidth=1
+         )
+     )
+
+     return fig
+
+
+
+ def create_combined_radar_chart():
+     """Create a radar chart showing all models together."""
+     try:
+         import plotly.graph_objects as go
+
+         categories = list(CATEGORIES.keys())
+
+         # Define colors for each model
+         colors = {
+             "Ark II": "rgb(99, 110, 250)",            # Blue
+             "Claude-3-5-Sonnet": "rgb(239, 85, 59)",  # Red
+             "GPT-4o": "rgb(0, 204, 150)",             # Green
+             "Ark I": "rgb(171, 99, 250)"              # Purple
+         }
+
+         fig = go.Figure()
+
+         # Add trace for each model
+         for model_name, color in colors.items():
+             model_data = MODELS[model_name]
+             values = []
+
+             for category in categories:
+                 metrics = {k: v for k, v in model_data["scores"][category].items() if k != 'Overall'}
+                 if category == "Error Handling":
+                     values.append(metrics.get("Mean Accuracy", 0.0))
+                 else:
+                     values.append(sum(metrics.values()) / len(metrics) if metrics else 0.0)
+
+             fig.add_trace(go.Scatterpolar(
+                 r=values + [values[0]],
+                 theta=categories + [categories[0]],
+                 fill='none',
+                 line=dict(color=color, width=2),
+                 name=model_name
+             ))
+
+         # Update layout
+         fig.update_layout(
+             polar=dict(
+                 radialaxis=dict(
+                     visible=True,
+                     range=[0, 1.0],
+                     tickmode='array',
+                     ticktext=['0%', '20%', '40%', '60%', '80%', '100%'],
+                     tickvals=[0, 0.2, 0.4, 0.6, 0.8, 1.0],
+                     gridcolor='rgba(0, 0, 0, 0.1)',
+                     linecolor='rgba(0, 0, 0, 0.1)'
+                 ),
+                 angularaxis=dict(
+                     gridcolor='rgba(0, 0, 0, 0.1)',
+                     linecolor='rgba(0, 0, 0, 0.1)'
+                 ),
+                 bgcolor='rgba(255, 255, 255, 0.9)'
+             ),
+             showlegend=True,
+             paper_bgcolor='rgba(255, 255, 255, 0.9)',
+             plot_bgcolor='rgba(255, 255, 255, 0.9)',
+             title=dict(
+                 text="Model Performance Comparison",
+                 x=0.5,
+                 y=0.95,
+                 xanchor='center',
+                 yanchor='top',
+                 font=dict(size=20)
+             ),
+             legend=dict(
+                 yanchor="top",
+                 y=1,
+                 xanchor="left",
+                 x=1.02
+             ),
+             margin=dict(t=100, b=100, l=100, r=100)
+         )
+
+         return fig
+     except Exception as e:
+         print(f"Error creating radar chart: {str(e)}")
+         return go.Figure()
+
+ def create_comparison_metrics_df(model_name):
+     """Create a DataFrame showing detailed metrics with comparisons."""
+     base_model = "Ark II"
+     data = []
+
+     base_data = MODELS[base_model]["scores"]
+     compare_data = MODELS[model_name]["scores"]
+
+     for category in CATEGORIES.keys():
+         base_metrics = {k: v for k, v in base_data[category].items() if k != 'Overall'}
+         compare_metrics = {k: v for k, v in compare_data[category].items() if k != 'Overall'}
+
+         for metric in base_metrics.keys():
+             if metric in compare_metrics:
+                 base_value = base_metrics[metric]
+                 compare_value = compare_metrics[metric]
+                 diff = compare_value - base_value
+
+                 data.append({
+                     "Category": category,
+                     "Metric": metric,
+                     f"{model_name} Score": compare_value,
+                     f"{base_model} Score": base_value,
+                     "Difference": diff,
+                     "Better/Worse": "↑" if diff > 0 else "↓" if diff < 0 else "="
+                 })
+
+     df = pd.DataFrame(data)
+     return df
+
+ def update_model_details(model_name):
+     """Update the detailed metrics view for a selected model."""
+     try:
+         df = create_comparison_metrics_df(model_name)
+         return [df, create_combined_radar_chart()]
+     except Exception as e:
+         print(f"Error in update_model_details: {str(e)}")
+         return [pd.DataFrame(), go.Figure()]
+
+ # Load logo as base64
+ def get_logo_html():
+     logo_path = os.path.join(os.path.dirname(__file__), "jenesys.jpg")
+     with open(logo_path, "rb") as f:
+         encoded_logo = base64.b64encode(f.read()).decode()
+     return f'<img src="data:image/jpeg;base64,{encoded_logo}" style="height: 50px; margin-right: 10px;">'
+
+ # Create the Gradio interface
+ with gr.Blocks(title="AI Bookkeeper Leaderboard") as demo:
+     gr.Markdown(f"""
+ <div style="display: flex; align-items: center; margin-bottom: 1rem;">
+     {get_logo_html()}
+     <h1 style="margin: 0;">AI Bookkeeper Leaderboard</h1>
+ </div>
+ """)
+
+     gr.Markdown(f"Last updated: {datetime.now().strftime('%Y-%m-%d')}")
+
+     gr.Markdown("""
+ ## About the Benchmark 📊
+
+ This benchmark evaluates Large Vision Language Models on their ability to process and understand bookkeeping documents across four main categories:
+
+ 1. **Document Understanding (25%)**: Ability to parse and understand document structure
+ 2. **Data Extraction (25%)**: Accuracy in extracting specific data points
+ 3. **Bookkeeping Intelligence (25%)**: Understanding of bookkeeping concepts, calculations and general ledger accounting
+ 4. **Error Handling (25%)**: Ability to detect and handle inconsistencies
+
+ Each metric is scored from 0 to 1, where:
+ - 0.90-1.00 = Excellent
+ - 0.80-0.89 = Good
+ - 0.70-0.79 = Acceptable
+ - < 0.70 = Needs improvement
+
+ """)
+
+     with gr.Row():
+         leaderboard = gr.DataFrame(
+             create_leaderboard_df(),
+             label="Overall Leaderboard",
+             height=200
+         )
+
+     with gr.Row():
+         with gr.Column(scale=1, min_width=1200):
+             category_plot = gr.Plot(
+                 value=create_category_comparison()
+             )
+
+     with gr.Row():
+         with gr.Column(scale=1):
+             model_selector = gr.Dropdown(
+                 choices=[m for m in list(MODELS.keys()) if m != "Ark II"],
+                 label="Select Model to Compare with Ark II",
+                 value="Claude-3-5-Sonnet",
+                 interactive=True
+             )
+
+     with gr.Row():
+         with gr.Column(scale=2):
+             metrics_table = gr.DataFrame(
+                 create_comparison_metrics_df("Claude-3-5-Sonnet"),
+                 label="Comparison Metrics (vs Ark II)",
+                 height=400
+             )
+
+     with gr.Row():
+         with gr.Column(scale=1, min_width=1200):
+             radar_chart = gr.Plot(value=create_combined_radar_chart())
+
+     # Update callback
+     model_selector.change(
+         fn=update_model_details,
+         inputs=[model_selector],
+         outputs=[metrics_table, radar_chart]
+     )
+
+
+
+ if __name__ == "__main__":
+     demo.launch(share=True)
jenesys.jpg ADDED
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ gradio>=4.0.0
+ pandas>=2.0.0
+ plotly>=5.0.0
+ numpy>=1.24.0