---
title: AI_Bookkeeper_Leaderboard
emoji: π
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
license: mit
---
# AI Bookkeeper Leaderboard
A comprehensive benchmark for evaluating AI models on accounting document processing tasks. This benchmark focuses on real-world accounting scenarios and provides detailed metrics across key capabilities.
## Models Evaluated
- Ark II (Jenesys AI) - 17.94s inference time
- Ark I (Jenesys AI) - 7.955s inference time
- Claude-3-5-Sonnet (Anthropic) - 26.51s inference time
- GPT-4o (OpenAI) - 19.88s inference time
## Categories and Raw Data Points
The benchmark evaluates models across four main categories, each with specific raw data points. Each category score is the average of its raw data points; a short scoring sketch follows these lists.
### Document Understanding (25%)
- Invoice ID Detection
- Date Field Recognition
- Line Items Total
- Average = (Invoice ID + Date + Line Items Total) / 3
### Data Extraction (25%)
- Supplier Information
- Line Items Quantity
- Line Items Description
- VAT Number
- Line Items Total
- Average = (Supplier + Quantity + Description + VAT_Number + Total) / 5
### Bookkeeping Intelligence (25%)
- Discount Total
- Line Items VAT
- VAT Exclusive Amount
- VAT Number Validation
- Discount Verification
- Average = (Discount + VAT_Items + VAT_Exclusive + VAT_Number + Discount_Verification) / 5
### Error Handling (25%)
- Mean Accuracy (direct measure)
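The scoring arithmetic is easy to reproduce. Below is a minimal Python sketch; the key names are illustrative, not the actual schema used in `app.py`, and the values shown are Ark II's raw scores from the tables that follow.

```python
# Minimal sketch of the category scoring described above.
# Key names are illustrative; app.py defines the actual schema.
raw_scores = {
    "document_understanding": [0.733, 0.887, 0.803],          # Ark II values
    "data_extraction": [0.735, 0.882, 0.555, 0.768, 0.803],
    "bookkeeping_intelligence": [0.800, 0.590, 0.694, 0.768, 0.800],
    "error_handling": [0.718],                                # direct measure
}

# Each category score is the unweighted mean of its raw data points.
category_scores = {name: sum(points) / len(points)
                   for name, points in raw_scores.items()}

# The four categories are weighted equally (25% each) for the overall score.
overall = sum(category_scores.values()) / len(category_scores)

for name, score in category_scores.items():
    print(f"{name}: {score:.1%}")   # e.g. document_understanding: 80.8%
print(f"overall: {overall:.1%}")
```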
## Model Performance

Category scores are the averages defined above; the raw data points behind each score are shown in parentheses.
### Ark II
- Document Understanding: 80.8% (0.733, 0.887, 0.803)
- Data Extraction: 74.9% (0.735, 0.882, 0.555, 0.768, 0.803)
- Bookkeeping Intelligence: 73.0% (0.800, 0.590, 0.694, 0.768, 0.800)
- Error Handling: 71.8%
### Ark I
- Document Understanding: 78.5% (0.747, 0.905, 0.703)
- Data Extraction: 70.9% (0.792, 0.811, 0.521, 0.719, 0.703)
- Bookkeeping Intelligence: 56.9% (0.600, 0.434, 0.491, 0.719, 0.600)
- Error Handling: 64.1%
### Claude-3-5-Sonnet
- Document Understanding: 70.4% (0.773, 0.806, 0.533)
- Data Extraction: 60.9% (0.706, 0.597, 0.504, 0.708, 0.533)
- Bookkeeping Intelligence: 62.8% (0.600, 0.524, 0.706, 0.708, 0.600)
- Error Handling: 67.5%
### GPT-4o
- Document Understanding: 69.6% (0.600, 0.917, 0.571)
- Data Extraction: 68.9% (0.818, 0.722, 0.619, 0.714, 0.571)
- Bookkeeping Intelligence: 25.5% (0.000, 0.313, 0.250, 0.714, 0.000)
- Error Handling: 68.3%
## Key Findings
- Ark II leads in overall performance, particularly in document understanding (80.8%)
- Ark I shows strong performance relative to its size, especially in document understanding (78.5%)
- Claude-3-5-Sonnet maintains consistent performance across categories
- GPT-4o shows competitive performance in document understanding and data extraction but struggles with bookkeeping intelligence tasks
- Ark I achieves impressive efficiency with the fastest inference time (7.955s)
## Interactive Dashboard Features
The dashboard provides several interactive visualizations:
- Overall Leaderboard: Comprehensive view of all models' performance metrics
- Category Comparison: Bar chart comparing all models across the four main categories
- Combined Radar Chart: Multi-model comparison showing relative strengths and weaknesses (see the sketch after this list)
- Detailed Metrics: Interactive comparison table showing differences between selected model and Ark II
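As a rough illustration of how such a multi-model radar chart can be drawn with Plotly (the actual figure construction lives in `app.py`; the scores below are taken from the tables above):

```python
import plotly.graph_objects as go

categories = ["Document Understanding", "Data Extraction",
              "Bookkeeping Intelligence", "Error Handling"]
scores = {
    "Ark II": [0.808, 0.749, 0.730, 0.718],
    "GPT-4o": [0.696, 0.689, 0.255, 0.683],
}

fig = go.Figure()
for model, values in scores.items():
    # Repeat the first point so each polygon closes cleanly.
    fig.add_trace(go.Scatterpolar(
        r=values + values[:1],
        theta=categories + categories[:1],
        fill="toself",
        name=model,
    ))
fig.update_layout(polar=dict(radialaxis=dict(range=[0, 1])))
fig.show()
```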
## Running the Leaderboard

Install dependencies:

```bash
pip install gradio pandas plotly
```

Run the app:

```bash
python app.py
```

Open the provided URL in your browser to view the interactive dashboard.
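For orientation, a stripped-down Gradio app of this shape looks roughly like the sketch below; `app.py` contains the actual implementation, so the layout here is only a stand-in.

```python
import gradio as gr
import pandas as pd

# Stand-in leaderboard table; app.py defines the real layout and charts.
df = pd.DataFrame({
    "Model": ["Ark II", "Ark I", "Claude-3-5-Sonnet", "GPT-4o"],
    "Document Understanding": [0.808, 0.785, 0.704, 0.696],
    "Error Handling": [0.718, 0.641, 0.675, 0.683],
})

with gr.Blocks(title="AI Bookkeeper Leaderboard") as demo:
    gr.Markdown("# AI Bookkeeper Leaderboard")
    gr.Dataframe(value=df, interactive=False)

if __name__ == "__main__":
    demo.launch()  # prints the local URL to open in a browser
```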
## Visualization Features
- Color-coded performance indicators
- Comparative analysis with Ark II as baseline
- Interactive model selection for detailed comparisons
- Multi-model radar chart for performance pattern analysis
- Dynamic updates of comparative metrics
## Contributing
To add new model evaluations:
- Add model scores following the established format in the `MODELS` dictionary (a hypothetical entry is sketched below)
- Include all required metrics for each category
- Provide model metadata (version, type, provider, size, inference time)
- Follow the existing structure in `app.py`
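A hypothetical entry might look like the following; the actual field names and score layout are whatever `app.py`'s `MODELS` dictionary already uses, so treat everything below as a placeholder.

```python
# Hypothetical MODELS entry; mirror the real schema in app.py.
MODELS = {
    "New-Model-v1": {
        "provider": "Example AI",                       # metadata
        "version": "1.0",
        "inference_time_s": 12.3,
        "document_understanding": [0.75, 0.88, 0.70],   # raw data points
        "data_extraction": [0.78, 0.80, 0.55, 0.72, 0.70],
        "bookkeeping_intelligence": [0.60, 0.45, 0.50, 0.72, 0.60],
        "error_handling": 0.65,                         # direct mean accuracy
    },
}
```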
## License
MIT License