---
title: AI_Bookkeeper_Leaderboard
emoji: 📊
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
license: mit
---

AI Bookkeeper Leaderboard

A comprehensive benchmark for evaluating AI models on accounting document processing tasks. This benchmark focuses on real-world accounting scenarios and provides detailed metrics across key capabilities.

View Live Demo

Models Evaluated

  • Ark II (Jenesys AI) - 17.94s inference time
  • Ark I (Jenesys AI) - 7.955s inference time
  • Claude-3-5-Sonnet (Anthropic) - 26.51s inference time
  • GPT-4o (OpenAI) - 19.88s inference time

Categories and Raw Data Points

The benchmark evaluates models across four equally weighted main categories, each computed from specific raw data points (a short scoring sketch in Python follows the list):

  1. Document Understanding (25%)

    • Invoice ID Detection
    • Date Field Recognition
    • Line Items Total
    • Average = (Invoice ID + Date + Line Items Total) / 3
  2. Data Extraction (25%)

    • Supplier Information
    • Line Items Quantity
    • Line Items Description
    • VAT Number
    • Line Items Total
    • Average = (Supplier + Quantity + Description + VAT_Number + Total) / 5
  3. Bookkeeping Intelligence (25%)

    • Discount Total
    • Line Items VAT
    • VAT Exclusive Amount
    • VAT Number Validation
    • Discount Verification
    • Average = (Discount + VAT_Items + VAT_Exclusive + VAT_Number + Discount_Verification) / 5
  4. Error Handling (25%)

    • Mean Accuracy (direct measure)
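
As a concrete illustration, here is a minimal Python sketch of this aggregation, assuming each category score is the unweighted mean of its raw data points. The dictionary keys and function names are illustrative, not taken from app.py; the raw values are Ark II's, from the tables below:

    # Each category score is the unweighted mean of its raw data points.
    # Raw values below are Ark II's (see Model Performance); names are illustrative.
    RAW_POINTS = {
        "Document Understanding": [0.733, 0.887, 0.803],         # invoice ID, date, line items total
        "Data Extraction": [0.735, 0.882, 0.555, 0.768, 0.803],  # supplier, qty, description, VAT no., total
        "Bookkeeping Intelligence": [0.800, 0.590, 0.694, 0.768, 0.800],
        "Error Handling": [0.718],                               # mean accuracy, direct measure
    }

    def category_score(points):
        """Unweighted mean of a category's raw data points."""
        return sum(points) / len(points)

    for name, points in RAW_POINTS.items():
        print(f"{name}: {category_score(points):.1%}")
    # Document Understanding: 80.8%, Data Extraction: 74.9%, ...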

Model Performance

Each category score below is followed by its raw data points in parentheses, in the order listed above.

Ark II

  • Document Understanding: 80.8% (0.733, 0.887, 0.803)
  • Data Extraction: 74.9% (0.735, 0.882, 0.555, 0.768, 0.803)
  • Bookkeeping Intelligence: 73.0% (0.800, 0.590, 0.694, 0.768, 0.800)
  • Error Handling: 71.8%

Ark I

  • Document Understanding: 78.5% (0.747, 0.905, 0.703)
  • Data Extraction: 70.9% (0.792, 0.811, 0.521, 0.719, 0.703)
  • Bookkeeping Intelligence: 56.9% (0.600, 0.434, 0.491, 0.719, 0.600)
  • Error Handling: 64.1%

Claude-3-5-Sonnet

  • Document Understanding: 70.4% (0.773, 0.806, 0.533)
  • Data Extraction: 60.9% (0.706, 0.597, 0.504, 0.708, 0.533)
  • Bookkeeping Intelligence: 62.8% (0.600, 0.524, 0.706, 0.708, 0.600)
  • Error Handling: 67.5%

GPT-4o

  • Document Understanding: 69.6% (0.600, 0.917, 0.571)
  • Data Extraction: 68.9% (0.818, 0.722, 0.619, 0.714, 0.571)
  • Bookkeeping Intelligence: 25.5% (0.000, 0.313, 0.250, 0.714, 0.000)
  • Error Handling: 68.3%
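
Because the four categories carry equal 25% weights, an overall score per model is simply the mean of its category scores. The figures below are computed from this README's numbers as an illustration, not an official leaderboard column:

    # Overall score = mean of the four equally weighted category scores above.
    CATEGORY_SCORES = {
        "Ark II":            [0.808, 0.749, 0.730, 0.718],
        "Ark I":             [0.785, 0.709, 0.569, 0.641],
        "Claude-3-5-Sonnet": [0.704, 0.609, 0.628, 0.675],
        "GPT-4o":            [0.696, 0.689, 0.255, 0.683],
    }
    for model, scores in CATEGORY_SCORES.items():
        print(f"{model}: {sum(scores) / len(scores):.1%}")
    # Ark II: 75.1%, Ark I: 67.6%, Claude-3-5-Sonnet: 65.4%, GPT-4o: 58.1%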

Key Findings

  • Ark II leads in overall performance, particularly in document understanding (80.8%)
  • Ark I shows strong performance relative to its size, especially in document understanding (78.5%)
  • Claude-3-5-Sonnet maintains consistent performance across categories
  • GPT-4o shows competitive performance in document understanding and data extraction but struggles with bookkeeping intelligence tasks
  • Ark I achieves impressive efficiency with the fastest inference time (7.955s)

Interactive Dashboard Features

The dashboard provides several interactive visualizations:

  1. Overall Leaderboard: Comprehensive view of all models' performance metrics
  2. Category Comparison: Bar chart comparing all models across the four main categories
  3. Combined Radar Chart: Multi-model comparison showing relative strengths and weaknesses (a Plotly sketch follows this list)
  4. Detailed Metrics: Interactive comparison table showing differences between the selected model and Ark II
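
For readers who want to reproduce the radar view, here is a minimal Plotly sketch using the category scores above; it approximates the dashboard's chart and is not the exact code from app.py:

    # Multi-model radar chart over the four categories, sketched with Plotly.
    import plotly.graph_objects as go

    CATEGORIES = ["Document Understanding", "Data Extraction",
                  "Bookkeeping Intelligence", "Error Handling"]
    SCORES = {
        "Ark II": [0.808, 0.749, 0.730, 0.718],
        "GPT-4o": [0.696, 0.689, 0.255, 0.683],
    }

    fig = go.Figure()
    for name, scores in SCORES.items():
        # Repeat the first point so each polygon closes.
        fig.add_trace(go.Scatterpolar(r=scores + scores[:1],
                                      theta=CATEGORIES + CATEGORIES[:1],
                                      fill="toself", name=name))
    fig.update_layout(polar=dict(radialaxis=dict(range=[0, 1])))
    fig.show()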

Running the Leaderboard

  1. Install dependencies:

    pip install gradio pandas plotly
    
  2. Run the app:

    python app.py
    
  3. Open the provided URL in your browser to view the interactive dashboard.

Visualization Features

  • Color-coded performance indicators
  • Comparative analysis with Ark II as baseline
  • Interactive model selection for detailed comparisons
  • Multi-model radar chart for performance pattern analysis
  • Dynamic updates of comparative metrics

Contributing

To add new model evaluations:

  1. Add model scores following the established format of the MODELS dictionary (an illustrative entry follows this list)
  2. Include all required metrics for each category
  3. Provide model metadata (version, type, provider, size, inference time)
  4. Follow the existing structure in app.py
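
The exact schema lives in app.py; the entry below is a purely hypothetical sketch of what a new model might look like, with all field names and values invented for illustration:

    # Hypothetical MODELS entry; match the real schema in app.py before submitting.
    MODELS["My-New-Model"] = {
        "provider": "Example Labs",   # metadata from step 3
        "version": "1.0",
        "type": "multimodal",
        "size": "unknown",
        "inference_time_s": 12.3,
        # Raw data points per category, in the order listed above
        "document_understanding": [0.75, 0.88, 0.70],
        "data_extraction": [0.80, 0.81, 0.55, 0.72, 0.70],
        "bookkeeping_intelligence": [0.60, 0.45, 0.50, 0.72, 0.60],
        "error_handling": 0.65,
    }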

License

MIT License