---
title: DataEngEval
emoji: πŸ₯‡
colorFrom: green
colorTo: indigo
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: The Benchmarking Hub for Data Engineering + AI
tags:
  - leaderboard
  - evaluation
  - sql
  - code-generation
  - data-engineering
---

DataEngEval

A comprehensive evaluation platform for AI models across SQL generation and code generation. Compare model performance with standardized metrics on real-world datasets including NYC Taxi queries, Python algorithms, and Go web services.

πŸš€ Features

  • Multi-use-case evaluation: SQL generation, Python code, Go services
  • Real-world datasets: NYC Taxi, sorting algorithms, HTTP handlers, concurrency patterns
  • Comprehensive metrics: Correctness, execution success, syntax validation, performance
  • Remote inference: Uses Hugging Face Inference API (no local model downloads)
  • Mock mode: Works without API keys for demos

🎯 Current Use Cases

SQL Generation

  • Dataset: NYC Taxi Small
  • Dialects: Presto, BigQuery, Snowflake
  • Metrics: Correctness, execution, result matching, dialect compliance

Code Generation

  • Python: Algorithms, data structures, object-oriented programming
  • Go: Web services, concurrency, HTTP handlers
  • Metrics: Syntax correctness, compilation success, execution success, code quality
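
As a rough illustration of how the syntax and compilation checks might be computed for Python submissions (a minimal sketch; the actual logic lives in src/scoring.py and may differ):

import ast

def check_python_syntax(source: str) -> dict:
    """Return binary syntax/compilation scores for a generated Python snippet."""
    try:
        tree = ast.parse(source)  # Syntax check: does the code parse?
    except SyntaxError:
        return {"syntax_correct": 0, "compiles": 0}
    try:
        compile(tree, "<generated>", "exec")  # Compilation check: can it be byte-compiled?
        return {"syntax_correct": 1, "compiles": 1}
    except (ValueError, TypeError):
        return {"syntax_correct": 1, "compiles": 0}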

πŸ—οΈ Project Structure

dataeng-leaderboard/
β”œβ”€β”€ app.py                     # Main Gradio application
β”œβ”€β”€ requirements.txt           # Dependencies for Hugging Face Spaces
β”œβ”€β”€ config/                    # Configuration files
β”‚   β”œβ”€β”€ app.yaml              # App settings
β”‚   β”œβ”€β”€ models.yaml           # Model configurations
β”‚   β”œβ”€β”€ metrics.yaml          # Scoring weights
β”‚   └── use_cases.yaml        # Use case definitions
β”œβ”€β”€ src/                      # Source code modules
β”‚   β”œβ”€β”€ evaluator.py          # Dataset management and evaluation
β”‚   β”œβ”€β”€ models_registry.py    # Model configuration and interfaces
β”‚   β”œβ”€β”€ scoring.py            # Metrics computation
β”‚   └── utils/                # Utility functions
β”œβ”€β”€ tasks/                    # Multi-use-case datasets
β”‚   β”œβ”€β”€ sql_generation/      # SQL generation tasks
β”‚   β”œβ”€β”€ code_generation/      # Code generation tasks
β”‚   └── documentation/       # Documentation tasks
β”œβ”€β”€ prompts/                  # SQL generation templates
└── test/                     # Test files

πŸš€ Quick Start

Running on Hugging Face Spaces

  1. Fork this Space: Click "Fork" on the Hugging Face Space
  2. Configure: Add your HF_TOKEN as a secret in Space settings (optional)
  3. Deploy: The Space will automatically build and deploy
  4. Use: Access the Space URL to start evaluating models

Running Locally

  1. Clone this repository:
git clone <repository-url>
cd dataeng-leaderboard
  2. Install dependencies:
pip install -r requirements.txt
  3. Set up environment variables (optional):
export HF_TOKEN="your_huggingface_token"  # For Hugging Face models
  4. Run the application:
gradio app.py

πŸ“Š Usage

Evaluating Models

  1. Select Dataset: Choose from available datasets (NYC Taxi)
  2. Choose Dialect: Select target SQL dialect (Presto, BigQuery, Snowflake)
  3. Pick Test Case: Select a specific natural language question to evaluate
  4. Select Models: Choose one or more models to evaluate
  5. Run Evaluation: Click "Run Evaluation" to generate SQL and compute metrics
  6. View Results: See individual results and updated leaderboard

Understanding Metrics

The platform computes several metrics for each evaluation:

  • Correctness (Exact): Binary score (0/1) for exact result match
  • Execution Success: Binary score (0/1) for successful SQL execution
  • Result Match F1: F1 score for partial result matching
  • Latency: Response time in milliseconds
  • Readability: Score based on SQL structure and formatting
  • Dialect Compliance: Binary score (0/1) for successful SQL transpilation
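
Dialect compliance, for example, can be checked by attempting to transpile the generated SQL into the target dialect. A minimal sketch, assuming a transpiler such as sqlglot (the project may implement this check differently):

import sqlglot

def dialect_compliance(sql: str, dialect: str) -> int:
    """Score 1 if the SQL transpiles cleanly into the target dialect, else 0."""
    try:
        sqlglot.transpile(sql, write=dialect)
        return 1
    except sqlglot.errors.ParseError:
        return 0

print(dialect_compliance("SELECT passenger_count, AVG(fare_amount) FROM trips GROUP BY 1", "presto"))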

The Composite Score combines all metrics using the weights below (a minimal scoring sketch follows the list):

  • Correctness: 40%
  • Execution Success: 25%
  • Result Match F1: 15%
  • Dialect Compliance: 10%
  • Readability: 5%
  • Latency: 5%
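
A rough sketch of how such a weighted composite might be computed (the authoritative weights live in config/metrics.yaml and the real logic in src/scoring.py; the latency normalization below is an illustrative assumption):

# Weights as listed above; the authoritative values live in config/metrics.yaml.
WEIGHTS = {
    "correctness": 0.40,
    "execution_success": 0.25,
    "result_match_f1": 0.15,
    "dialect_compliance": 0.10,
    "readability": 0.05,
    "latency": 0.05,
}

def composite_score(metrics: dict, max_latency_ms: float = 10_000.0) -> float:
    """Combine per-metric scores (0-1 scale) into a single weighted number."""
    # Latency is reported in milliseconds; map it to 0-1 where faster is better
    # (this normalization is an assumption for illustration).
    latency_score = max(0.0, 1.0 - metrics["latency_ms"] / max_latency_ms)
    scores = {**metrics, "latency": latency_score}
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())

# Example: a fully correct, fast response scores close to 1.0.
print(composite_score({
    "correctness": 1, "execution_success": 1, "result_match_f1": 1.0,
    "dialect_compliance": 1, "readability": 0.8, "latency_ms": 900,
}))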

βš™οΈ Configuration

Adding New Models

Edit config/models.yaml to add new models:

models:
  - name: "Your Model Name"
    provider: "huggingface"
    model_id: "your/model-id"
    params:
      max_new_tokens: 512
      temperature: 0.1
    description: "Description of your model"
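
Because inference is remote (see Features above), an entry like this would typically be served through the Hugging Face Inference API. A minimal sketch of such a call using the huggingface_hub client (the project's actual request path in src/models_registry.py may differ):

import os
from huggingface_hub import InferenceClient

# Mirrors the models.yaml entry above; HF_TOKEN is the optional secret from setup.
client = InferenceClient(model="your/model-id", token=os.getenv("HF_TOKEN"))
sql = client.text_generation(
    "Return the average fare per passenger count from the trips table.",
    max_new_tokens=512,   # params.max_new_tokens
    temperature=0.1,      # params.temperature
)
print(sql)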

Adding New Datasets

  1. Create a new folder under tasks/ (e.g., tasks/my_dataset/)
  2. Add three required files:

  • schema.sql: Database schema definition
  • loader.py: Database creation script (a minimal sketch follows this list)
  • cases.yaml: Test cases with questions and reference SQL
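
As a rough illustration of what loader.py could look like, assuming a local SQLite database (the storage backend is an assumption here; adapt it to whatever engine your dataset targets):

import sqlite3
from pathlib import Path

DATASET_DIR = Path(__file__).parent

def load(db_path: str = "my_dataset.db") -> sqlite3.Connection:
    """Create the database from schema.sql and return an open connection."""
    conn = sqlite3.connect(db_path)
    conn.executescript((DATASET_DIR / "schema.sql").read_text())
    # Insert whatever sample rows the test cases in cases.yaml rely on here.
    conn.commit()
    return conn

if __name__ == "__main__":
    load()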

🀝 Contributing

Adding New Features

  1. Fork the repository
  2. Create a feature branch
  3. Implement your changes
  4. Test thoroughly
  5. Submit a pull request

Testing

Run the test suite:

python run_tests.py

πŸ“„ License

This project is licensed under the Apache-2.0 License.

πŸ™ Acknowledgments