Model Card for Arabic Text-To-SQL (OsamaMo)

Model Details

Model Description

This model is fine-tuned on the Spider dataset with Arabic-translated questions for the Text-To-SQL task. It is based on Qwen/Qwen2.5-1.5B-Instruct and trained using LoRA on Kaggle for 15 hours on a P100 8GB GPU.

Developed by: Osama Mohamed (OsamaMo)
Funded by: Self-funded
Shared by: Osama Mohamed
Model type: Text-to-SQL fine-tuned model
Language(s): Arabic (ar)
License: MIT
Finetuned from: Qwen/Qwen2.5-1.5B-Instruct

Model Sources

Repository: Hugging Face Model Hub
Dataset: Spider (translated to Arabic)
Training Script: LLaMA-Factory

Uses

Direct Use

This model is intended for converting Arabic natural language questions into SQL queries. It can be used for database querying in Arabic-speaking applications.

Downstream Use

Can be fine-tuned further for specific databases or Arabic dialect adaptations.

Out-of-Scope Use

The model is not intended for direct execution of SQL queries.
Not recommended for non-database-related NLP tasks.

Bias, Risks, and Limitations

The model might generate incorrect or non-optimized SQL queries.
Bias may exist due to dataset translations and model pretraining data.

Recommendations

Validate generated SQL queries before execution.
Ensure compatibility with specific database schemas.

How to Get Started with the Model

Load Model

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import re

device = "cuda" if torch.cuda.is_available() else "cpu"
base_model_id = "Qwen/Qwen2.5-1.5B-Instruct"
finetuned_model_id = "OsamaMo/Arabic_Text-To-SQL_using_Qwen2.5-1.5B"

# Load the base model and adapter for fine-tuning
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16
)
model.load_adapter(finetuned_model_id)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)

def generate_resp(messages):
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)
    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=1024,
        do_sample=False,  temperature= False,
    )
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response

Example Usage


# Production-ready system message for SQL generation
system_message = (
    "You are a highly advanced Arabic text-to-SQL converter. Your mission is to Understand first the db schema and reltions between it and then accurately transform Arabic "
    "natural language queries into SQL queries with precision and clarity.\n"
)

def get_sql_query(db_schema, arabic_query):
    # Construct the instruction message including the DB schema and the Arabic query
    instruction_message = "\n".join([
        "## DB-Schema:",
        db_schema,
        "",
        "## User-Prompt:",
        arabic_query,
        "# Output SQL:",
        "```SQL"
    ])
    
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": instruction_message}
    ]
    
    response = generate_resp(messages)
    
    # Extract the SQL query from the response using a regex to capture text within the ```sql markdown block
    match = re.search(r"```sql\s*(.*?)\s*```", response, re.DOTALL | re.IGNORECASE)
    if match:
        sql_query = match.group(1).strip()
        return sql_query
    else:
        return response.strip()

# Example usage:
example_db_schema = r'''{
    'Pharmcy': 
          CREATE TABLE `purchase` (
            `BARCODE` varchar(20) NOT NULL,
            `NAME` varchar(50) NOT NULL,
            `TYPE` varchar(20) NOT NULL,
            `COMPANY_NAME` varchar(20) NOT NULL,
            `QUANTITY` int NOT NULL,
            `PRICE` double NOT NULL,
            `AMOUNT` double NOT NULL,
            PRIMARY KEY (`BARCODE`),
            KEY `fkr3` (`COMPANY_NAME`),
            CONSTRAINT `fkr3` FOREIGN KEY (`COMPANY_NAME`) REFERENCES `company` (`NAME`) ON DELETE CASCADE ON UPDATE CASCADE
          ) ENGINE=InnoDB DEFAULT CHARSET=latin1

          CREATE TABLE `sales` (
            `BARCODE` varchar(20) NOT NULL,
            `NAME` varchar(50) NOT NULL,
            `TYPE` varchar(10) NOT NULL,
            `DOSE` varchar(10) NOT NULL,
            `QUANTITY` int NOT NULL,
            `PRICE` double NOT NULL,
            `AMOUNT` double NOT NULL,
            `DATE` varchar(15) NOT NULL
          ) ENGINE=InnoDB DEFAULT CHARSET=latin1

          CREATE TABLE `users` (
            `ID` int NOT NULL,
            `NAME` varchar(50) NOT NULL,
            `DOB` varchar(20) NOT NULL,
            `ADDRESS` varchar(100) NOT NULL,
            `PHONE` varchar(20) NOT NULL,
            `SALARY` double NOT NULL,
            `PASSWORD` varchar(20) NOT NULL,
            PRIMARY KEY (`ID`)
          ) ENGINE=InnoDB DEFAULT CHARSET=latin1

          CREATE TABLE `history_sales` (
            `USER_NAME` varchar(20) NOT NULL,
            `BARCODE` varchar(20) NOT NULL,
            `NAME` varchar(50) NOT NULL,
            `TYPE` varchar(10) NOT NULL,
            `DOSE` varchar(10) NOT NULL,
            `QUANTITY` int NOT NULL,
            `PRICE` double NOT NULL,
            `AMOUNT` double NOT NULL,
            `DATE` varchar(15) NOT NULL,
            `TIME` varchar(20) NOT NULL
          ) ENGINE=InnoDB DEFAULT CHARSET=latin1

          CREATE TABLE `expiry` (
            `PRODUCT_NAME` varchar(50) NOT NULL,
            `PRODUCT_CODE` varchar(20) NOT NULL,
            `DATE_OF_EXPIRY` varchar(10) NOT NULL,
            `QUANTITY_REMAIN` int NOT NULL
          ) ENGINE=InnoDB DEFAULT CHARSET=latin1

          CREATE TABLE `drugs` (
            `NAME` varchar(50) NOT NULL,
            `TYPE` varchar(20) NOT NULL,
            `BARCODE` varchar(20) NOT NULL,
            `DOSE` varchar(10) NOT NULL,
            `CODE` varchar(10) NOT NULL,
            `COST_PRICE` double NOT NULL,
            `SELLING_PRICE` double NOT NULL,
            `EXPIRY` varchar(20) NOT NULL,
            `COMPANY_NAME` varchar(50) NOT NULL,
            `PRODUCTION_DATE` date NOT NULL,
            `EXPIRATION_DATE` date NOT NULL,
            `PLACE` varchar(20) NOT NULL,
            `QUANTITY` int NOT NULL,
            PRIMARY KEY (`BARCODE`)
          ) ENGINE=InnoDB DEFAULT CHARSET=latin1

          CREATE TABLE `company` (
            `NAME` varchar(50) NOT NULL,
            `ADDRESS` varchar(50) NOT NULL,
            `PHONE` varchar(20) NOT NULL,
            PRIMARY KEY (`NAME`)
          ) ENGINE=InnoDB DEFAULT CHARSET=latin1

          Answer the following questions about this schema:
}'''

example_arabic_query = "اريد الباركود الخاص بدواء يبداء اسمه بحرف 's'"

sql_result = get_sql_query(example_db_schema, example_arabic_query)
print("استعلام SQL الناتج:")
print(sql_result)

Training Details

Training Data

Dataset: Spider (translated into Arabic)
Preprocessing: Questions converted to Arabic while keeping SQL queries unchanged.
Training format:
- System instruction guiding Arabic-to-SQL conversion.
- Database schema provided for context.
- Arabic user queries mapped to correct SQL output.
- Output is strictly formatted SQL queries enclosed in markdown code blocks.

Training Procedure

Training Hyperparameters

Batch size: 1 (per device)
Gradient accumulation: 4 steps
Learning rate: 1.0e-4
Epochs: 3
Scheduler: Cosine
Warmup ratio: 0.1
Precision: bf16

Speeds, Sizes, Times

Training time: 15 hours on NVIDIA P100 8GB
Checkpointing every: 500 steps

Evaluation

Testing Data

Validation dataset: Spider validation set (translated to Arabic)

Metrics

Exact Match (EM) for SQL correctness
Execution Accuracy (EX) on databases

Results

Model achieved competitive SQL generation accuracy for Arabic queries.
Further testing required for robustness.

Environmental Impact

Hardware Type: NVIDIA Tesla P100 8GB
Hours used: 15
Cloud Provider: Kaggle
Carbon Emitted: Estimated using ML Impact Calculator

Technical Specifications

Model Architecture and Objective

Transformer-based Qwen2.5-1.5B architecture.
Fine-tuned for Text-to-SQL task using LoRA.

Compute Infrastructure

Hardware: Kaggle P100 GPU (8GB VRAM)
Software: Python, Transformers, LLaMA-Factory, Hugging Face Hub

Citation

If you use this model, please cite:

@misc{OsamaMo_ArabicSQL,
  author = {Osama Mohamed},
  title = {Arabic Text-To-SQL Model},
  year = {2024},
  howpublished = {\url{https://huggingface.co/OsamaMo/Arabic_Text-To-SQL}}
}

Model Card Contact

For questions, contact Osama Mohamed via Hugging Face (OsamaMo).

OsamaMo
/

Arabic_Text-To-SQL_using_Qwen2.5-1.5B