---
title: AI_Bookkeeper_Leaderboard
emoji: 📊
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
license: mit
---

AI Bookkeeper Leaderboard

A comprehensive benchmark for evaluating AI models on accounting document processing tasks. This benchmark focuses on real-world accounting scenarios and provides detailed metrics across key capabilities.

View Live Demo

Models Evaluated

  • Ark II (Jenesys AI) - 17.94s inference time
  • Ark I (Jenesys AI) - 7.955s inference time
  • Claude-3-5-Sonnet (Anthropic) - 26.51s inference time
  • GPT-4o (OpenAI) - 19.88s inference time

Categories and Raw Data Points

The benchmark evaluates models across four equally weighted main categories, each computed from specific raw data points (a short scoring sketch in Python follows the list):

  1. Document Understanding (25%)

    • Invoice ID Detection
    • Date Field Recognition
    • Line Items Total
    • Average = (Invoice ID + Date + Line Items Total) / 3
  2. Data Extraction (25%)

    • Supplier Information
    • Line Items Quantity
    • Line Items Description
    • VAT Number
    • Line Items Total
    • Average = (Supplier + Quantity + Description + VAT_Number + Total) / 5
  3. Bookkeeping Intelligence (25%)

    • Discount Total
    • Line Items VAT
    • VAT Exclusive Amount
    • VAT Number Validation
    • Discount Verification
    • Average = (Discount + VAT_Items + VAT_Exclusive + VAT_Number + Discount_Verification) / 5
  4. Error Handling (25%)

    • Mean Accuracy (direct measure)
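
As a concrete illustration, here is a minimal Python sketch of this aggregation, assuming each category score is the unweighted mean of its raw data points. The dictionary keys and function names are illustrative, not taken from app.py; the raw values are Ark II's, from the tables below:

    # Each category score is the unweighted mean of its raw data points.
    # Raw values below are Ark II's (see Model Performance); names are illustrative.
    RAW_POINTS = {
        "Document Understanding": [0.733, 0.887, 0.803],         # invoice ID, date, line items total
        "Data Extraction": [0.735, 0.882, 0.555, 0.768, 0.803],  # supplier, qty, description, VAT no., total
        "Bookkeeping Intelligence": [0.800, 0.590, 0.694, 0.768, 0.800],
        "Error Handling": [0.718],                               # mean accuracy, direct measure
    }

    def category_score(points):
        """Unweighted mean of a category's raw data points."""
        return sum(points) / len(points)

    for name, points in RAW_POINTS.items():
        print(f"{name}: {category_score(points):.1%}")
    # Document Understanding: 80.8%, Data Extraction: 74.9%, ...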

Model Performance

Each category score below is followed by its raw data points in parentheses, in the order listed above.

Ark II

  • Document Understanding: 80.8% (0.733, 0.887, 0.803)
  • Data Extraction: 74.9% (0.735, 0.882, 0.555, 0.768, 0.803)
  • Bookkeeping Intelligence: 73.0% (0.800, 0.590, 0.694, 0.768, 0.800)
  • Error Handling: 71.8%

Ark I

  • Document Understanding: 78.5% (0.747, 0.905, 0.703)
  • Data Extraction: 70.9% (0.792, 0.811, 0.521, 0.719, 0.703)
  • Bookkeeping Intelligence: 56.9% (0.600, 0.434, 0.491, 0.719, 0.600)
  • Error Handling: 64.1%

Claude-3-5-Sonnet

  • Document Understanding: 70.4% (0.773, 0.806, 0.533)
  • Data Extraction: 60.9% (0.706, 0.597, 0.504, 0.708, 0.533)
  • Bookkeeping Intelligence: 62.8% (0.600, 0.524, 0.706, 0.708, 0.600)
  • Error Handling: 67.5%

GPT-4o

  • Document Understanding: 69.6% (0.600, 0.917, 0.571)
  • Data Extraction: 68.9% (0.818, 0.722, 0.619, 0.714, 0.571)
  • Bookkeeping Intelligence: 25.5% (0.000, 0.313, 0.250, 0.714, 0.000)
  • Error Handling: 68.3%
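
Because the four categories carry equal 25% weights, an overall score per model is simply the mean of its category scores. The figures below are computed from this README's numbers as an illustration, not an official leaderboard column:

    # Overall score = mean of the four equally weighted category scores above.
    CATEGORY_SCORES = {
        "Ark II":            [0.808, 0.749, 0.730, 0.718],
        "Ark I":             [0.785, 0.709, 0.569, 0.641],
        "Claude-3-5-Sonnet": [0.704, 0.609, 0.628, 0.675],
        "GPT-4o":            [0.696, 0.689, 0.255, 0.683],
    }
    for model, scores in CATEGORY_SCORES.items():
        print(f"{model}: {sum(scores) / len(scores):.1%}")
    # Ark II: 75.1%, Ark I: 67.6%, Claude-3-5-Sonnet: 65.4%, GPT-4o: 58.1%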

Key Findings

  • Ark II leads in overall performance, particularly in document understanding (80.8%)
  • Ark I shows strong performance relative to its size, especially in document understanding (78.5%)
  • Claude-3-5-Sonnet maintains consistent performance across categories
  • GPT-4o shows competitive performance in document understanding and data extraction but struggles with bookkeeping intelligence tasks
  • Ark I achieves impressive efficiency with the fastest inference time (7.955s)

Interactive Dashboard Features

The dashboard provides several interactive visualizations:

  1. Overall Leaderboard: Comprehensive view of all models' performance metrics
  2. Category Comparison: Bar chart comparing all models across the four main categories
  3. Combined Radar Chart: Multi-model comparison showing relative strengths and weaknesses (a Plotly sketch follows this list)
  4. Detailed Metrics: Interactive comparison table showing differences between the selected model and Ark II
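
For readers who want to reproduce the radar view, here is a minimal Plotly sketch using the category scores above; it approximates the dashboard's chart and is not the exact code from app.py:

    # Multi-model radar chart over the four categories, sketched with Plotly.
    import plotly.graph_objects as go

    CATEGORIES = ["Document Understanding", "Data Extraction",
                  "Bookkeeping Intelligence", "Error Handling"]
    SCORES = {
        "Ark II": [0.808, 0.749, 0.730, 0.718],
        "GPT-4o": [0.696, 0.689, 0.255, 0.683],
    }

    fig = go.Figure()
    for name, scores in SCORES.items():
        # Repeat the first point so each polygon closes.
        fig.add_trace(go.Scatterpolar(r=scores + scores[:1],
                                      theta=CATEGORIES + CATEGORIES[:1],
                                      fill="toself", name=name))
    fig.update_layout(polar=dict(radialaxis=dict(range=[0, 1])))
    fig.show()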

Running the Leaderboard

  1. Install dependencies:

    pip install gradio pandas plotly
    
  2. Run the app:

    python app.py
    
  3. Open the provided URL in your browser to view the interactive dashboard.

Visualization Features

  • Color-coded performance indicators
  • Comparative analysis with Ark II as baseline
  • Interactive model selection for detailed comparisons
  • Multi-model radar chart for performance pattern analysis
  • Dynamic updates of comparative metrics

Contributing

To add new model evaluations:

  1. Add model scores following the established format of the MODELS dictionary (an illustrative entry follows this list)
  2. Include all required metrics for each category
  3. Provide model metadata (version, type, provider, size, inference time)
  4. Follow the existing structure in app.py
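
The exact schema lives in app.py; the entry below is a purely hypothetical sketch of what a new model might look like, with all field names and values invented for illustration:

    # Hypothetical MODELS entry; match the real schema in app.py before submitting.
    MODELS["My-New-Model"] = {
        "provider": "Example Labs",   # metadata from step 3
        "version": "1.0",
        "type": "multimodal",
        "size": "unknown",
        "inference_time_s": 12.3,
        # Raw data points per category, in the order listed above
        "document_understanding": [0.75, 0.88, 0.70],
        "data_extraction": [0.80, 0.81, 0.55, 0.72, 0.70],
        "bookkeeping_intelligence": [0.60, 0.45, 0.50, 0.72, 0.60],
        "error_handling": 0.65,
    }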

License

MIT License