---
title: DataEngEval
emoji: 🔥
colorFrom: green
colorTo: indigo
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: The Benchmarking Hub for Data Engineering + AI
tags:
  - leaderboard
  - evaluation
  - sql
  - code-generation
  - data-engineering
---
# DataEngEval

A comprehensive evaluation platform for AI models across SQL generation and code generation. Compare model performance with standardized metrics on real-world datasets, including NYC Taxi queries, Python algorithms, and Go web services.
## Features
- Multi-use-case evaluation: SQL generation, Python code, Go services
- Real-world datasets: NYC Taxi, sorting algorithms, HTTP handlers, concurrency patterns
- Comprehensive metrics: Correctness, execution success, syntax validation, performance
- Remote inference: Uses Hugging Face Inference API (no local model downloads)
- Mock mode: Works without API keys for demos
## Current Use Cases
### SQL Generation
- Dataset: NYC Taxi Small
- Dialects: Presto, BigQuery, Snowflake
- Metrics: Correctness, execution, result matching, dialect compliance
### Code Generation
- Python: Algorithms, data structures, object-oriented programming
- Go: Web services, concurrency, HTTP handlers
- Metrics: Syntax correctness, compilation success, execution success, code quality
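To make the code-generation metrics concrete, here is a minimal sketch of how a syntax and execution check could work for Python submissions. This is an illustration only, not the repo's actual `src/scoring.py` implementation; the function name and the `dict` shape are assumptions.

```python
import ast

def evaluate_python_snippet(code: str) -> dict:
    """Toy metric sketch: syntax check via ast.parse, execution check via exec."""
    metrics = {"syntax_ok": 0, "execution_ok": 0}
    # Syntax correctness: does the snippet parse at all?
    try:
        ast.parse(code)
        metrics["syntax_ok"] = 1
    except SyntaxError:
        return metrics  # unparseable code cannot execute either
    # Execution success: does it run without raising?
    try:
        exec(compile(code, "<snippet>", "exec"), {"__name__": "__snippet__"})
        metrics["execution_ok"] = 1
    except Exception:
        pass
    return metrics
```

A real harness would also sandbox execution and enforce timeouts; `exec` on untrusted model output is only acceptable in a trusted demo context.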
## Project Structure

```
dataeng-leaderboard/
├── app.py                 # Main Gradio application
├── requirements.txt       # Dependencies for Hugging Face Spaces
├── config/                # Configuration files
│   ├── app.yaml           # App settings
│   ├── models.yaml        # Model configurations
│   ├── metrics.yaml       # Scoring weights
│   └── use_cases.yaml     # Use case definitions
├── src/                   # Source code modules
│   ├── evaluator.py       # Dataset management and evaluation
│   ├── models_registry.py # Model configuration and interfaces
│   ├── scoring.py         # Metrics computation
│   └── utils/             # Utility functions
├── tasks/                 # Multi-use-case datasets
│   ├── sql_generation/    # SQL generation tasks
│   ├── code_generation/   # Code generation tasks
│   └── documentation/     # Documentation tasks
├── prompts/               # SQL generation templates
└── test/                  # Test files
```
## Quick Start

### Running on Hugging Face Spaces

- Fork this Space: Click "Fork" on the Hugging Face Space
- Configure: Add your `HF_TOKEN` as a secret in the Space settings (optional)
- Deploy: The Space will build and deploy automatically
- Use: Open the Space URL to start evaluating models
### Running Locally

- Clone this repository:

  ```bash
  git clone <repository-url>
  cd dataeng-leaderboard
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables (optional):

  ```bash
  export HF_TOKEN="your_huggingface_token"  # For Hugging Face models
  ```

- Run the application:

  ```bash
  gradio app.py
  ```
## Usage

### Evaluating Models
- Select Dataset: Choose from available datasets (NYC Taxi)
- Choose Dialect: Select target SQL dialect (Presto, BigQuery, Snowflake)
- Pick Test Case: Select a specific natural language question to evaluate
- Select Models: Choose one or more models to evaluate
- Run Evaluation: Click "Run Evaluation" to generate SQL and compute metrics
- View Results: See individual results and updated leaderboard
### Understanding Metrics
The platform computes several metrics for each evaluation:
- Correctness (Exact): Binary score (0/1) for exact result match
- Execution Success: Binary score (0/1) for successful SQL execution
- Result Match F1: F1 score for partial result matching
- Latency: Response time in milliseconds
- Readability: Score based on SQL structure and formatting
- Dialect Compliance: Binary score (0/1) for successful SQL transpilation
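As a hedged sketch of the Result Match F1 metric (not necessarily how `src/scoring.py` computes it), the score can be derived from precision and recall over result rows; using sets means duplicates and row order are ignored:

```python
def result_match_f1(predicted_rows, reference_rows) -> float:
    """F1 over query result rows: harmonic mean of precision and recall."""
    pred, ref = set(predicted_rows), set(reference_rows)
    if not pred or not ref:
        # Both empty counts as a perfect match; one-sided empty is a miss.
        return 1.0 if pred == ref else 0.0
    tp = len(pred & ref)               # rows present in both results
    precision = tp / len(pred)
    recall = tp / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Rows are assumed hashable here (e.g., tuples as returned by a DuckDB cursor).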
The Composite Score combines all metrics using the following weights:
- Correctness: 40%
- Execution Success: 25%
- Result Match F1: 15%
- Dialect Compliance: 10%
- Readability: 5%
- Latency: 5%
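The weighting above amounts to a simple weighted sum, sketched below. The dictionary keys are illustrative; in the repo the weights would live in `config/metrics.yaml` (an assumption), and latency would first need normalizing to [0, 1] with lower-is-better inverted:

```python
# Weights mirror the table above (they sum to 1.0).
WEIGHTS = {
    "correctness": 0.40,
    "execution_success": 0.25,
    "result_match_f1": 0.15,
    "dialect_compliance": 0.10,
    "readability": 0.05,
    "latency": 0.05,
}

def composite_score(metrics: dict) -> float:
    """Weighted sum of per-metric scores, each assumed normalized to [0, 1]."""
    return sum(WEIGHTS[name] * metrics.get(name, 0.0) for name in WEIGHTS)
```

A model that only achieves an exact-correctness hit, with every other metric at zero, would therefore score 0.40.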
## Configuration

### Adding New Models

Edit `config/models.yaml` to add a new model:
```yaml
models:
  - name: "Your Model Name"
    provider: "huggingface"
    model_id: "your/model-id"
    params:
      max_new_tokens: 512
      temperature: 0.1
    description: "Description of your model"
```
### Adding New Datasets

- Create a new folder under `tasks/` (e.g., `tasks/my_dataset/`)
- Add three required files:
  - `schema.sql`: Database schema definition
  - `loader.py`: Database creation script
  - `cases.yaml`: Test cases with questions and reference SQL
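The exact schema of `cases.yaml` is defined by the repo; purely as an illustration, an entry might pair a natural-language question with its reference SQL like this (all field names here are hypothetical, not the repo's schema):

```yaml
# Hypothetical cases.yaml entry — field names are illustrative only.
cases:
  - id: total_trips_2023
    question: "How many taxi trips were recorded in 2023?"
    reference_sql: |
      SELECT COUNT(*) AS trip_count
      FROM trips
      WHERE year = 2023;
```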
## Contributing

### Adding New Features
- Fork the repository
- Create a feature branch
- Implement your changes
- Test thoroughly
- Submit a pull request
### Testing

Run the test suite:

```bash
python run_tests.py
```
## License
This project is licensed under the Apache-2.0 License.
## Acknowledgments
- Built with Gradio
- SQL transpilation powered by sqlglot
- Database execution using DuckDB
- Model APIs from Hugging Face
- Deployed on Hugging Face Spaces