---
license: mit
language:
  - ar
base_model:
  - Qwen/Qwen2.5-1.5B-Instruct
pipeline_tag: text2text-generation
library_name: transformers
tags:
  - Text-To-SQL
  - Arabic
  - Spider
  - SQL
---

# Model Card for Arabic Text-To-SQL (OsamaMo)

## Model Details

### Model Description

This model is fine-tuned on the Spider dataset with Arabic-translated questions for the Text-To-SQL task. It is based on [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) and was trained with LoRA on Kaggle for 15 hours on a P100 GPU (8 GB).

- **Developed by:** Osama Mohamed (OsamaMo)
- **Funded by:** Self-funded
- **Shared by:** Osama Mohamed
- **Model type:** Text-to-SQL fine-tuned model
- **Language(s):** Arabic (ar)
- **License:** MIT
- **Finetuned from:** Qwen/Qwen2.5-1.5B-Instruct

### Model Sources

- **Repository:** [OsamaMo/Arabic_Text-To-SQL_using_Qwen2.5-1.5B](https://huggingface.co/OsamaMo/Arabic_Text-To-SQL_using_Qwen2.5-1.5B)

## Uses

### Direct Use

This model is intended for converting Arabic natural language questions into SQL queries. It can be used for database querying in Arabic-speaking applications.

### Downstream Use

The model can be further fine-tuned for specific databases or for Arabic dialect adaptation.

### Out-of-Scope Use

- The model only generates SQL text; it is not intended to execute SQL queries directly.
- It is not recommended for NLP tasks unrelated to database querying.

## Bias, Risks, and Limitations

- The model may generate incorrect or unoptimized SQL queries.
- Bias may be introduced by the dataset translation and by the base model's pretraining data.

### Recommendations

- Validate generated SQL queries before executing them.
- Ensure compatibility with the target database schema.

## How to Get Started with the Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
base_model_id = "Qwen/Qwen2.5-1.5B-Instruct"
finetuned_model_id = "OsamaMo/Arabic_Text-To-SQL_using_Qwen2.5-1.5B"

# Load the base model and attach the fine-tuned LoRA adapter
# (loading the adapter requires `peft` to be installed).
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model.load_adapter(finetuned_model_id)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)


def generate_resp(messages):
    # Render the chat messages with Qwen's chat template.
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    # Greedy decoding; pass the full inputs so the attention mask is used,
    # and explicitly disable the sampling parameters.
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=1024,
        do_sample=False, top_k=None, temperature=None, top_p=None,
    )

    # Strip the prompt tokens, keeping only the newly generated ones.
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
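
A minimal usage sketch follows. The toy schema, the Arabic question, and the exact system-prompt wording are illustrative assumptions; match the system prompt to the format used during training for best results.

```python
# Hypothetical schema and question, for illustration only.
schema = "CREATE TABLE singer (singer_id INT, name TEXT, country TEXT, age INT);"
question = "ما هي أسماء المغنين الذين تزيد أعمارهم عن 30 عاماً؟"  # "What are the names of singers older than 30?"

messages = [
    # Assumed system-prompt wording; adapt it to your training format.
    {"role": "system", "content": (
        "Convert the user's Arabic question into a SQL query for the "
        "following database schema:\n" + schema
    )},
    {"role": "user", "content": question},
]

print(generate_resp(messages))
# The model is trained to answer with the SQL wrapped in a markdown code
# block, e.g.:
# ```sql
# SELECT name FROM singer WHERE age > 30
# ```
```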

## Training Details

### Training Data

- **Dataset:** Spider (translated into Arabic)
- **Preprocessing:** Questions were translated into Arabic while the SQL queries were kept unchanged.
- **Training format:**
  - A system instruction guiding Arabic-to-SQL conversion.
  - The database schema provided for context.
  - Arabic user questions mapped to their correct SQL output.
  - Outputs are strictly formatted SQL queries enclosed in markdown code blocks (a small extraction helper is sketched below).
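
Because the model wraps its answer in a fenced code block, downstream code must recover the bare SQL string. A minimal sketch, assuming the fence is labeled `sql`:

```python
import re

def extract_sql(response: str) -> str:
    """Pull the SQL statement out of a markdown-fenced model response."""
    match = re.search(r"```sql\s*(.*?)\s*```", response, flags=re.DOTALL)
    # Fall back to the raw response if no fenced block is found.
    return match.group(1).strip() if match else response.strip()
```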

### Training Procedure

#### Training Hyperparameters

- **Batch size:** 1 (per device)
- **Gradient accumulation:** 4 steps (effective batch size 4)
- **Learning rate:** 1.0e-4
- **Epochs:** 3
- **Scheduler:** cosine
- **Warmup ratio:** 0.1
- **Precision:** bf16
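
Training was run with LLaMA-Factory, and the exact configuration file is not reproduced here. As a rough equivalent, the hyperparameters above map onto a `peft`/`transformers` setup like the sketch below; the LoRA rank, alpha, target modules, and dataset loading are assumptions, not values from the original run.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

# Assumed LoRA settings; the original rank/alpha/targets were not reported.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Hyperparameters taken from the list above.
args = TrainingArguments(
    output_dir="arabic-text2sql",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=1.0e-4,
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    save_steps=500,  # checkpoint every 500 steps, as reported below
)

# trainer = Trainer(model=model, args=args, train_dataset=...)  # dataset omitted
# trainer.train()
```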

#### Speeds, Sizes, Times

- **Training time:** 15 hours on an NVIDIA P100 (8 GB)
- **Checkpointing:** every 500 steps

## Evaluation

### Testing Data

- **Validation dataset:** Spider validation set (translated into Arabic)

### Metrics

- **Exact Match (EM):** whether the predicted SQL matches the gold query after normalization.
- **Execution Accuracy (EX):** whether the predicted and gold queries return the same results when executed against the database (illustrated below).
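
To make the metrics concrete, the sketch below shows a simplified execution-accuracy check against a SQLite database. It is a stand-in for the official Spider evaluation scripts, not the evaluation code used for this model.

```python
import sqlite3
from collections import Counter

def execution_match(db_path: str, pred_sql: str, gold_sql: str) -> bool:
    """Return True if the predicted and gold queries yield the same rows."""
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as a miss
    finally:
        conn.close()
    # Order-insensitive comparison: row order is unspecified without ORDER BY.
    return Counter(pred_rows) == Counter(gold_rows)
```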

### Results

- The model generates competitive SQL for Arabic questions on the translated Spider validation set; quantitative EM/EX scores have not yet been reported.
- Further testing is required to assess robustness.

## Environmental Impact

- **Hardware Type:** NVIDIA Tesla P100 (8 GB)
- **Hours used:** 15
- **Cloud Provider:** Kaggle
- **Carbon Emitted:** estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute)

## Technical Specifications

### Model Architecture and Objective

- Transformer-based Qwen2.5-1.5B architecture.
- Fine-tuned for the Text-to-SQL task using LoRA.

### Compute Infrastructure

- **Hardware:** Kaggle P100 GPU (8 GB VRAM)
- **Software:** Python, Transformers, LLaMA-Factory, Hugging Face Hub

## Citation

If you use this model, please cite:

```bibtex
@misc{OsamaMo_ArabicSQL,
  author       = {Osama Mohamed},
  title        = {Arabic Text-To-SQL Model},
  year         = {2024},
  howpublished = {\url{https://huggingface.co/OsamaMo/Arabic_Text-To-SQL_using_Qwen2.5-1.5B}}
}
```

## Model Card Contact

For questions, contact Osama Mohamed via Hugging Face (OsamaMo).