Model Card for Arabic Text-To-SQL (OsamaMo)
Model Details
Model Description
This model is fine-tuned on the Spider dataset with Arabic-translated questions for the Text-To-SQL task. It is based on Qwen/Qwen2.5-1.5B-Instruct and trained using LoRA on Kaggle for 15 hours on a P100 8GB GPU.
- Developed by: Osama Mohamed (OsamaMo)
- Funded by: Self-funded
- Shared by: Osama Mohamed
- Model type: Text-to-SQL fine-tuned model
- Language(s): Arabic (ar)
- License: MIT
- Finetuned from: Qwen/Qwen2.5-1.5B-Instruct
Model Sources
- Repository: Hugging Face Model Hub
- Dataset: Spider (translated to Arabic)
- Training Script: LLaMA-Factory
Uses
Direct Use
This model is intended for converting Arabic natural language questions into SQL queries. It can be used for database querying in Arabic-speaking applications.
Downstream Use
Can be fine-tuned further for specific databases or Arabic dialect adaptations.
Out-of-Scope Use
- The model is not intended for direct execution of SQL queries.
- Not recommended for non-database-related NLP tasks.
Bias, Risks, and Limitations
- The model might generate incorrect or non-optimized SQL queries.
- Bias may exist due to dataset translations and model pretraining data.
Recommendations
- Validate generated SQL queries before execution.
- Ensure compatibility with specific database schemas.
How to Get Started with the Model
Load Model
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import re
device = "cuda" if torch.cuda.is_available() else "cpu"
base_model_id = "Qwen/Qwen2.5-1.5B-Instruct"
finetuned_model_id = "OsamaMo/Arabic_Text-To-SQL_using_Qwen2.5-1.5B"
# Load the base model and adapter for fine-tuning
model = AutoModelForCausalLM.from_pretrained(
base_model_id,
device_map="auto",
torch_dtype=torch.bfloat16
)
model.load_adapter(finetuned_model_id)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
def generate_resp(messages):
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)
generated_ids = model.generate(
model_inputs.input_ids,
max_new_tokens=1024,
do_sample=False, temperature= False,
)
generated_ids = [
output_ids[len(input_ids):]
for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
return response
Example Usage
# Production-ready system message for SQL generation
system_message = (
"You are a highly advanced Arabic text-to-SQL converter. Your mission is to Understand first the db schema and reltions between it and then accurately transform Arabic "
"natural language queries into SQL queries with precision and clarity.\n"
)
def get_sql_query(db_schema, arabic_query):
# Construct the instruction message including the DB schema and the Arabic query
instruction_message = "\n".join([
"## DB-Schema:",
db_schema,
"",
"## User-Prompt:",
arabic_query,
"# Output SQL:",
"```SQL"
])
messages = [
{"role": "system", "content": system_message},
{"role": "user", "content": instruction_message}
]
response = generate_resp(messages)
# Extract the SQL query from the response using a regex to capture text within the ```sql markdown block
match = re.search(r"```sql\s*(.*?)\s*```", response, re.DOTALL | re.IGNORECASE)
if match:
sql_query = match.group(1).strip()
return sql_query
else:
return response.strip()
# Example usage:
example_db_schema = r'''{
'Pharmcy':
CREATE TABLE `purchase` (
`BARCODE` varchar(20) NOT NULL,
`NAME` varchar(50) NOT NULL,
`TYPE` varchar(20) NOT NULL,
`COMPANY_NAME` varchar(20) NOT NULL,
`QUANTITY` int NOT NULL,
`PRICE` double NOT NULL,
`AMOUNT` double NOT NULL,
PRIMARY KEY (`BARCODE`),
KEY `fkr3` (`COMPANY_NAME`),
CONSTRAINT `fkr3` FOREIGN KEY (`COMPANY_NAME`) REFERENCES `company` (`NAME`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=latin1
CREATE TABLE `sales` (
`BARCODE` varchar(20) NOT NULL,
`NAME` varchar(50) NOT NULL,
`TYPE` varchar(10) NOT NULL,
`DOSE` varchar(10) NOT NULL,
`QUANTITY` int NOT NULL,
`PRICE` double NOT NULL,
`AMOUNT` double NOT NULL,
`DATE` varchar(15) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
CREATE TABLE `users` (
`ID` int NOT NULL,
`NAME` varchar(50) NOT NULL,
`DOB` varchar(20) NOT NULL,
`ADDRESS` varchar(100) NOT NULL,
`PHONE` varchar(20) NOT NULL,
`SALARY` double NOT NULL,
`PASSWORD` varchar(20) NOT NULL,
PRIMARY KEY (`ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
CREATE TABLE `history_sales` (
`USER_NAME` varchar(20) NOT NULL,
`BARCODE` varchar(20) NOT NULL,
`NAME` varchar(50) NOT NULL,
`TYPE` varchar(10) NOT NULL,
`DOSE` varchar(10) NOT NULL,
`QUANTITY` int NOT NULL,
`PRICE` double NOT NULL,
`AMOUNT` double NOT NULL,
`DATE` varchar(15) NOT NULL,
`TIME` varchar(20) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
CREATE TABLE `expiry` (
`PRODUCT_NAME` varchar(50) NOT NULL,
`PRODUCT_CODE` varchar(20) NOT NULL,
`DATE_OF_EXPIRY` varchar(10) NOT NULL,
`QUANTITY_REMAIN` int NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
CREATE TABLE `drugs` (
`NAME` varchar(50) NOT NULL,
`TYPE` varchar(20) NOT NULL,
`BARCODE` varchar(20) NOT NULL,
`DOSE` varchar(10) NOT NULL,
`CODE` varchar(10) NOT NULL,
`COST_PRICE` double NOT NULL,
`SELLING_PRICE` double NOT NULL,
`EXPIRY` varchar(20) NOT NULL,
`COMPANY_NAME` varchar(50) NOT NULL,
`PRODUCTION_DATE` date NOT NULL,
`EXPIRATION_DATE` date NOT NULL,
`PLACE` varchar(20) NOT NULL,
`QUANTITY` int NOT NULL,
PRIMARY KEY (`BARCODE`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
CREATE TABLE `company` (
`NAME` varchar(50) NOT NULL,
`ADDRESS` varchar(50) NOT NULL,
`PHONE` varchar(20) NOT NULL,
PRIMARY KEY (`NAME`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
Answer the following questions about this schema:
}'''
example_arabic_query = "اريد الباركود الخاص بدواء يبداء اسمه بحرف 's'"
sql_result = get_sql_query(example_db_schema, example_arabic_query)
print("استعلام SQL الناتج:")
print(sql_result)
Training Details
Training Data
- Dataset: Spider (translated into Arabic)
- Preprocessing: Questions converted to Arabic while keeping SQL queries unchanged.
- Training format:
- System instruction guiding Arabic-to-SQL conversion.
- Database schema provided for context.
- Arabic user queries mapped to correct SQL output.
- Output is strictly formatted SQL queries enclosed in markdown code blocks.
Training Procedure
Training Hyperparameters
- Batch size: 1 (per device)
- Gradient accumulation: 4 steps
- Learning rate: 1.0e-4
- Epochs: 3
- Scheduler: Cosine
- Warmup ratio: 0.1
- Precision: bf16
Speeds, Sizes, Times
- Training time: 15 hours on NVIDIA P100 8GB
- Checkpointing every: 500 steps
Evaluation
Testing Data
- Validation dataset: Spider validation set (translated to Arabic)
Metrics
- Exact Match (EM) for SQL correctness
- Execution Accuracy (EX) on databases
Results
- Model achieved competitive SQL generation accuracy for Arabic queries.
- Further testing required for robustness.
Environmental Impact
- Hardware Type: NVIDIA Tesla P100 8GB
- Hours used: 15
- Cloud Provider: Kaggle
- Carbon Emitted: Estimated using ML Impact Calculator
Technical Specifications
Model Architecture and Objective
- Transformer-based Qwen2.5-1.5B architecture.
- Fine-tuned for Text-to-SQL task using LoRA.
Compute Infrastructure
- Hardware: Kaggle P100 GPU (8GB VRAM)
- Software: Python, Transformers, LLaMA-Factory, Hugging Face Hub
Citation
If you use this model, please cite:
@misc{OsamaMo_ArabicSQL,
author = {Osama Mohamed},
title = {Arabic Text-To-SQL Model},
year = {2024},
howpublished = {\url{https://huggingface.co/OsamaMo/Arabic_Text-To-SQL}}
}
Model Card Contact
For questions, contact Osama Mohamed via Hugging Face (OsamaMo).
Inference Providers
NEW
This model is not currently available via any of the supported Inference Providers.