---
license: mit
language:
- ar
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
pipeline_tag: text2text-generation
library_name: transformers
tags:
- Text-To-SQL
- Arabic
- Spider
- SQL
---

# Model Card for Arabic Text-To-SQL (OsamaMo)

## Model Details

### Model Description

This model is fine-tuned on the Spider dataset with Arabic-translated questions for the Text-To-SQL task. It is based on **Qwen/Qwen2.5-1.5B-Instruct** and was trained with LoRA on Kaggle for 15 hours on a **P100 GPU**.

- **Developed by:** Osama Mohamed ([OsamaMo](https://huggingface.co/OsamaMo))
- **Funded by:** Self-funded
- **Shared by:** Osama Mohamed
- **Model type:** Text-to-SQL fine-tuned model
- **Language(s):** Arabic (ar)
- **License:** MIT
- **Finetuned from:** Qwen/Qwen2.5-1.5B-Instruct

### Model Sources

- **Repository:** [Hugging Face Model Hub](https://huggingface.co/OsamaMo/Arabic_Text-To-SQL)
- **Dataset:** Spider (translated to Arabic)
- **Training Script:** [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)

## Uses

### Direct Use

This model is intended for converting **Arabic natural language questions** into SQL queries. It can be used for database querying in Arabic-speaking applications.

### Downstream Use

The model can be fine-tuned further for specific databases or for Arabic dialect adaptations.

### Out-of-Scope Use

- The model is **not** intended to execute SQL queries directly.
- It is not recommended for NLP tasks unrelated to databases.

## Bias, Risks, and Limitations

- The model may generate incorrect or non-optimized SQL queries.
- Bias may exist due to the translated dataset and the base model's pretraining data.

### Recommendations

- Validate generated SQL queries before execution (see the sketch below).
- Ensure compatibility with the target database schema.
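
As a minimal way to act on the first recommendation, the sketch below validates a generated query against an in-memory SQLite copy of the target schema using `EXPLAIN`, which parses and plans a statement without executing it. The helper and schema are illustrative, not part of this model's tooling; Spider's databases are SQLite, so this check matches the training distribution.

```python
import sqlite3

def is_valid_sql(query: str, schema_sql: str) -> bool:
    """Check that `query` parses against the schema, without executing it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)    # build an empty copy of the target schema
        conn.execute(f"EXPLAIN {query}")  # parses and plans; raises on invalid SQL
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

# Hypothetical schema and queries, for illustration:
schema = "CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);"
print(is_valid_sql("SELECT name FROM singer WHERE age > 30", schema))  # True
print(is_valid_sql("SELEC name FROM singer", schema))                  # False
```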

## How to Get Started with the Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
base_model_id = "Qwen/Qwen2.5-1.5B-Instruct"
finetuned_model_id = "OsamaMo/Arabic_Text-To-SQL_using_Qwen2.5-1.5B"

# Load the base model, then attach the LoRA adapter (requires `peft`).
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16
)
model.load_adapter(finetuned_model_id)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)

def generate_resp(messages):
    # Render the chat messages with Qwen's chat template.
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    # Greedy decoding; sampling parameters are explicitly disabled.
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=1024,
        do_sample=False, top_k=None, temperature=None, top_p=None,
    )

    # Strip the prompt tokens so only the completion remains.
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
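
A hypothetical invocation, to show the expected message structure: the system-prompt wording and schema formatting below are illustrative, not the exact strings used in training. Since the model wraps its answer in a markdown code block, a small regex pulls the bare SQL out of the response.

```python
import re

# Illustrative schema and prompt; adapt both to your own database.
schema = "CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);"
messages = [
    {"role": "system", "content": "Convert the user's Arabic question into a SQL query "
                                  f"for the following database schema:\n{schema}"},
    # "What are the names of singers older than thirty?"
    {"role": "user", "content": "ما هي أسماء المغنين الذين تزيد أعمارهم عن ثلاثين عاما؟"},
]

response = generate_resp(messages)

# Extract the query from the markdown code block in the response.
match = re.search(r"```(?:sql)?\s*(.*?)```", response, re.DOTALL)
sql = match.group(1).strip() if match else response.strip()
print(sql)  # e.g. SELECT name FROM singer WHERE age > 30
```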

## Training Details

### Training Data

- Dataset: **Spider (translated into Arabic)**
- Preprocessing: questions were translated into Arabic while the SQL queries were kept unchanged.
- Training format (see the sketch below):
  - A system instruction guiding Arabic-to-SQL conversion.
  - The database schema provided as context.
  - Arabic user questions mapped to their gold SQL queries.
  - Outputs strictly formatted as SQL queries enclosed in markdown code blocks.
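
To make the format concrete, here is one training pair in chat-message form. The instruction wording and schema are assumptions for illustration; the card does not quote the exact prompt, and LLaMA-Factory's on-disk dataset format uses its own field names.

```python
# One illustrative training example (prompt wording assumed, not quoted from the card).
example = {
    "messages": [
        {
            "role": "system",
            "content": (
                "You are a Text-To-SQL assistant. Given a database schema and an Arabic "
                "question, reply with a single SQL query in a markdown code block.\n\n"
                "Schema:\nCREATE TABLE singer (singer_id INTEGER PRIMARY KEY, "
                "name TEXT, age INTEGER);"
            ),
        },
        # "How many singers are there?"
        {"role": "user", "content": "كم عدد المغنين؟"},
        {"role": "assistant", "content": "```sql\nSELECT COUNT(*) FROM singer\n```"},
    ]
}
```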

### Training Procedure

#### Training Hyperparameters

- **Batch size:** 1 (per device)
- **Gradient accumulation:** 4 steps (effective batch size 4)
- **Learning rate:** 1.0e-4
- **Epochs:** 3
- **Scheduler:** Cosine
- **Warmup ratio:** 0.1
- **Precision:** bf16
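
For readers reproducing the run outside LLaMA-Factory, the listed values map roughly onto the `transformers`/`peft` sketch below. The LoRA rank, alpha, and target modules are assumptions (typical small-model defaults); the card does not state them.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

# LoRA settings are NOT given on the card; r/alpha/targets are assumed defaults.
lora_cfg = LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_cfg)

args = TrainingArguments(
    output_dir="arabic-text2sql",
    per_device_train_batch_size=1,  # batch size 1 per device
    gradient_accumulation_steps=4,  # effective batch size 4
    learning_rate=1e-4,
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    save_steps=500,                 # checkpoint every 500 steps
)
```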

#### Speeds, Sizes, Times

- **Training time:** 15 hours on an **NVIDIA P100 (16GB)**
- **Checkpointing:** every 500 steps

## Evaluation

### Testing Data

- Validation dataset: Spider validation set (translated to Arabic)

### Metrics

- Exact Match (EM) for SQL correctness
- Execution Accuracy (EX) on the databases
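
As a rough illustration of EM, the check below compares predicted and gold queries after light normalization. Note that the official Spider evaluation is component-based and more forgiving (it ignores, e.g., ordering inside clauses); this simplified stand-in is stricter.

```python
import re

def exact_match(pred: str, gold: str) -> bool:
    """Naive exact match: lowercase, collapse whitespace, drop a trailing semicolon."""
    norm = lambda q: re.sub(r"\s+", " ", q.strip().lower()).rstrip(";")
    return norm(pred) == norm(gold)

print(exact_match("SELECT name FROM singer", "select name\nfrom singer;"))  # True
```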

### Results

- The model achieved **competitive SQL generation accuracy** on Arabic questions.
- Further testing is required to assess robustness.

## Environmental Impact

- **Hardware Type:** NVIDIA Tesla P100 (16GB)
- **Hours used:** 15
- **Cloud Provider:** Kaggle
- **Carbon Emitted:** Not measured; it can be estimated with the [ML Impact Calculator](https://mlco2.github.io/impact#compute)

## Technical Specifications

### Model Architecture and Objective

- Transformer-based **Qwen2.5-1.5B** architecture.
- Fine-tuned for the Text-to-SQL task using LoRA.

### Compute Infrastructure

- **Hardware:** Kaggle P100 GPU (16GB VRAM)
- **Software:** Python, Transformers, LLaMA-Factory, Hugging Face Hub

## Citation

If you use this model, please cite:

```bibtex
@misc{OsamaMo_ArabicSQL,
  author       = {Osama Mohamed},
  title        = {Arabic Text-To-SQL Model},
  year         = {2024},
  howpublished = {\url{https://huggingface.co/OsamaMo/Arabic_Text-To-SQL}}
}
```

## Model Card Contact

For questions, contact **Osama Mohamed** via Hugging Face ([OsamaMo](https://huggingface.co/OsamaMo)).