---
license: mit
language:
  - ar
base_model:
  - Qwen/Qwen2.5-1.5B-Instruct
pipeline_tag: text2text-generation
library_name: transformers
tags:
  - Text-To-SQL
  - Arabic
  - Spider
  - SQL
---

# Model Card for Arabic Text-To-SQL (OsamaMo)

## Model Details

### Model Description

This model is fine-tuned on the Spider dataset with Arabic-translated questions for the Text-To-SQL task. It is based on [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) and was trained with LoRA on Kaggle for 15 hours on a P100 GPU (8 GB).

- **Developed by:** Osama Mohamed (OsamaMo)
- **Funded by:** Self-funded
- **Shared by:** Osama Mohamed
- **Model type:** Text-to-SQL fine-tuned model
- **Language(s):** Arabic (ar)
- **License:** MIT
- **Finetuned from:** Qwen/Qwen2.5-1.5B-Instruct

### Model Sources

- **Repository:** [OsamaMo/Arabic_Text-To-SQL_using_Qwen2.5-1.5B](https://huggingface.co/OsamaMo/Arabic_Text-To-SQL_using_Qwen2.5-1.5B)

## Uses

### Direct Use

This model is intended for converting Arabic natural language questions into SQL queries. It can be used for database querying in Arabic-speaking applications.

### Downstream Use

The model can be further fine-tuned for specific databases or for Arabic dialect adaptation.

### Out-of-Scope Use

- The model only generates SQL text; it is not intended to execute SQL queries directly.
- It is not recommended for NLP tasks unrelated to database querying.

## Bias, Risks, and Limitations

- The model may generate incorrect or unoptimized SQL queries.
- Bias may be introduced by the dataset translation and by the base model's pretraining data.

### Recommendations

- Validate generated SQL queries before executing them.
- Ensure compatibility with the target database schema.

## How to Get Started with the Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
base_model_id = "Qwen/Qwen2.5-1.5B-Instruct"
finetuned_model_id = "OsamaMo/Arabic_Text-To-SQL_using_Qwen2.5-1.5B"

# Load the base model and attach the fine-tuned LoRA adapter
# (loading the adapter requires `peft` to be installed).
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model.load_adapter(finetuned_model_id)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)


def generate_resp(messages):
    # Render the chat messages with Qwen's chat template.
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    # Greedy decoding; pass the full inputs so the attention mask is used,
    # and explicitly disable the sampling parameters.
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=1024,
        do_sample=False, top_k=None, temperature=None, top_p=None,
    )

    # Strip the prompt tokens, keeping only the newly generated ones.
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
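
A minimal usage sketch follows. The toy schema, the Arabic question, and the exact system-prompt wording are illustrative assumptions; match the system prompt to the format used during training for best results.

```python
# Hypothetical schema and question, for illustration only.
schema = "CREATE TABLE singer (singer_id INT, name TEXT, country TEXT, age INT);"
question = "ما هي أسماء المغنين الذين تزيد أعمارهم عن 30 عاماً؟"  # "What are the names of singers older than 30?"

messages = [
    # Assumed system-prompt wording; adapt it to your training format.
    {"role": "system", "content": (
        "Convert the user's Arabic question into a SQL query for the "
        "following database schema:\n" + schema
    )},
    {"role": "user", "content": question},
]

print(generate_resp(messages))
# The model is trained to answer with the SQL wrapped in a markdown code
# block, e.g.:
# ```sql
# SELECT name FROM singer WHERE age > 30
# ```
```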

## Training Details

### Training Data

- **Dataset:** Spider (translated into Arabic)
- **Preprocessing:** Questions were translated into Arabic while the SQL queries were kept unchanged.
- **Training format:**
  - A system instruction guiding Arabic-to-SQL conversion.
  - The database schema provided for context.
  - Arabic user questions mapped to their correct SQL output.
  - Outputs are strictly formatted SQL queries enclosed in markdown code blocks (a small extraction helper is sketched below).
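
Because the model wraps its answer in a fenced code block, downstream code must recover the bare SQL string. A minimal sketch, assuming the fence is labeled `sql`:

```python
import re

def extract_sql(response: str) -> str:
    """Pull the SQL statement out of a markdown-fenced model response."""
    match = re.search(r"```sql\s*(.*?)\s*```", response, flags=re.DOTALL)
    # Fall back to the raw response if no fenced block is found.
    return match.group(1).strip() if match else response.strip()
```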

### Training Procedure

#### Training Hyperparameters

- **Batch size:** 1 (per device)
- **Gradient accumulation:** 4 steps (effective batch size 4)
- **Learning rate:** 1.0e-4
- **Epochs:** 3
- **Scheduler:** cosine
- **Warmup ratio:** 0.1
- **Precision:** bf16
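
Training was run with LLaMA-Factory, and the exact configuration file is not reproduced here. As a rough equivalent, the hyperparameters above map onto a `peft`/`transformers` setup like the sketch below; the LoRA rank, alpha, target modules, and dataset loading are assumptions, not values from the original run.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

# Assumed LoRA settings; the original rank/alpha/targets were not reported.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Hyperparameters taken from the list above.
args = TrainingArguments(
    output_dir="arabic-text2sql",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=1.0e-4,
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    save_steps=500,  # checkpoint every 500 steps, as reported below
)

# trainer = Trainer(model=model, args=args, train_dataset=...)  # dataset omitted
# trainer.train()
```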

#### Speeds, Sizes, Times

- **Training time:** 15 hours on an NVIDIA P100 (8 GB)
- **Checkpointing:** every 500 steps

## Evaluation

### Testing Data

- **Validation dataset:** Spider validation set (translated into Arabic)

### Metrics

- **Exact Match (EM):** whether the predicted SQL matches the gold query after normalization.
- **Execution Accuracy (EX):** whether the predicted and gold queries return the same results when executed against the database (illustrated below).
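
To make the metrics concrete, the sketch below shows a simplified execution-accuracy check against a SQLite database. It is a stand-in for the official Spider evaluation scripts, not the evaluation code used for this model.

```python
import sqlite3
from collections import Counter

def execution_match(db_path: str, pred_sql: str, gold_sql: str) -> bool:
    """Return True if the predicted and gold queries yield the same rows."""
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as a miss
    finally:
        conn.close()
    # Order-insensitive comparison: row order is unspecified without ORDER BY.
    return Counter(pred_rows) == Counter(gold_rows)
```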

### Results

- The model generates competitive SQL for Arabic questions on the translated Spider validation set; quantitative EM/EX scores have not yet been reported.
- Further testing is required to assess robustness.

## Environmental Impact

- **Hardware Type:** NVIDIA Tesla P100 (8 GB)
- **Hours used:** 15
- **Cloud Provider:** Kaggle
- **Carbon Emitted:** estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute)

## Technical Specifications

### Model Architecture and Objective

- Transformer-based Qwen2.5-1.5B architecture.
- Fine-tuned for the Text-to-SQL task using LoRA.

### Compute Infrastructure

- **Hardware:** Kaggle P100 GPU (8 GB VRAM)
- **Software:** Python, Transformers, LLaMA-Factory, Hugging Face Hub

## Citation

If you use this model, please cite:

```bibtex
@misc{OsamaMo_ArabicSQL,
  author       = {Osama Mohamed},
  title        = {Arabic Text-To-SQL Model},
  year         = {2024},
  howpublished = {\url{https://huggingface.co/OsamaMo/Arabic_Text-To-SQL_using_Qwen2.5-1.5B}}
}
```

## Model Card Contact

For questions, contact Osama Mohamed via Hugging Face (OsamaMo).