---
license: mit
language:
- ar
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
pipeline_tag: text2text-generation
library_name: transformers
tags:
- Text-To-SQL
- Arabic
- Spider
- SQL
---

# Model Card for Arabic Text-To-SQL (OsamaMo)

## Model Details

### Model Description

This model is fine-tuned on the Spider dataset with Arabic-translated questions for the Text-To-SQL task. It is based on **Qwen/Qwen2.5-1.5B-Instruct** and was trained with LoRA on Kaggle for 15 hours on a **P100 GPU**.

- **Developed by:** Osama Mohamed ([OsamaMo](https://huggingface.co/OsamaMo))
- **Funded by:** Self-funded
- **Shared by:** Osama Mohamed
- **Model type:** Text-to-SQL fine-tuned model
- **Language(s):** Arabic (ar)
- **License:** MIT
- **Finetuned from:** Qwen/Qwen2.5-1.5B-Instruct

### Model Sources

- **Repository:** [Hugging Face Model Hub](https://huggingface.co/OsamaMo/Arabic_Text-To-SQL)
- **Dataset:** Spider (translated to Arabic)
- **Training Script:** [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)

## Uses

### Direct Use

This model is intended for converting **Arabic natural language questions** into SQL queries. It can be used for database querying in Arabic-speaking applications.

### Downstream Use

The model can be fine-tuned further for specific databases or for Arabic dialect adaptations.

### Out-of-Scope Use

- The model is **not** intended to execute SQL queries directly.
- It is not recommended for NLP tasks unrelated to databases.

## Bias, Risks, and Limitations

- The model may generate incorrect or non-optimized SQL queries.
- Bias may exist due to the translated dataset and the base model's pretraining data.

### Recommendations

- Validate generated SQL queries before execution (see the sketch below).
- Ensure compatibility with the target database schema.
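
As a minimal way to act on the first recommendation, the sketch below validates a generated query against an in-memory SQLite copy of the target schema using `EXPLAIN`, which parses and plans a statement without executing it. The helper and schema are illustrative, not part of this model's tooling; Spider's databases are SQLite, so this check matches the training distribution.

```python
import sqlite3

def is_valid_sql(query: str, schema_sql: str) -> bool:
    """Check that `query` parses against the schema, without executing it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)    # build an empty copy of the target schema
        conn.execute(f"EXPLAIN {query}")  # parses and plans; raises on invalid SQL
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

# Hypothetical schema and queries, for illustration:
schema = "CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);"
print(is_valid_sql("SELECT name FROM singer WHERE age > 30", schema))  # True
print(is_valid_sql("SELEC name FROM singer", schema))                  # False
```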

## How to Get Started with the Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
base_model_id = "Qwen/Qwen2.5-1.5B-Instruct"
finetuned_model_id = "OsamaMo/Arabic_Text-To-SQL_using_Qwen2.5-1.5B"

# Load the base model, then attach the LoRA adapter (requires `peft`).
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16
)
model.load_adapter(finetuned_model_id)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)

def generate_resp(messages):
    # Render the chat messages with Qwen's chat template.
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    # Greedy decoding; sampling parameters are explicitly disabled.
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=1024,
        do_sample=False, top_k=None, temperature=None, top_p=None,
    )

    # Strip the prompt tokens so only the completion remains.
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
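
A hypothetical invocation, to show the expected message structure: the system-prompt wording and schema formatting below are illustrative, not the exact strings used in training. Since the model wraps its answer in a markdown code block, a small regex pulls the bare SQL out of the response.

```python
import re

# Illustrative schema and prompt; adapt both to your own database.
schema = "CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);"
messages = [
    {"role": "system", "content": "Convert the user's Arabic question into a SQL query "
                                  f"for the following database schema:\n{schema}"},
    # "What are the names of singers older than thirty?"
    {"role": "user", "content": "ما هي أسماء المغنين الذين تزيد أعمارهم عن ثلاثين عاما؟"},
]

response = generate_resp(messages)

# Extract the query from the markdown code block in the response.
match = re.search(r"```(?:sql)?\s*(.*?)```", response, re.DOTALL)
sql = match.group(1).strip() if match else response.strip()
print(sql)  # e.g. SELECT name FROM singer WHERE age > 30
```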

## Training Details

### Training Data

- Dataset: **Spider (translated into Arabic)**
- Preprocessing: questions were translated into Arabic while the SQL queries were kept unchanged.
- Training format (see the sketch below):
  - A system instruction guiding Arabic-to-SQL conversion.
  - The database schema provided as context.
  - Arabic user questions mapped to their gold SQL queries.
  - Outputs strictly formatted as SQL queries enclosed in markdown code blocks.
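
To make the format concrete, here is one training pair in chat-message form. The instruction wording and schema are assumptions for illustration; the card does not quote the exact prompt, and LLaMA-Factory's on-disk dataset format uses its own field names.

```python
# One illustrative training example (prompt wording assumed, not quoted from the card).
example = {
    "messages": [
        {
            "role": "system",
            "content": (
                "You are a Text-To-SQL assistant. Given a database schema and an Arabic "
                "question, reply with a single SQL query in a markdown code block.\n\n"
                "Schema:\nCREATE TABLE singer (singer_id INTEGER PRIMARY KEY, "
                "name TEXT, age INTEGER);"
            ),
        },
        # "How many singers are there?"
        {"role": "user", "content": "كم عدد المغنين؟"},
        {"role": "assistant", "content": "```sql\nSELECT COUNT(*) FROM singer\n```"},
    ]
}
```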

### Training Procedure

#### Training Hyperparameters

- **Batch size:** 1 (per device)
- **Gradient accumulation:** 4 steps (effective batch size 4)
- **Learning rate:** 1.0e-4
- **Epochs:** 3
- **Scheduler:** Cosine
- **Warmup ratio:** 0.1
- **Precision:** bf16
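
For readers reproducing the run outside LLaMA-Factory, the listed values map roughly onto the `transformers`/`peft` sketch below. The LoRA rank, alpha, and target modules are assumptions (typical small-model defaults); the card does not state them.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

# LoRA settings are NOT given on the card; r/alpha/targets are assumed defaults.
lora_cfg = LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_cfg)

args = TrainingArguments(
    output_dir="arabic-text2sql",
    per_device_train_batch_size=1,  # batch size 1 per device
    gradient_accumulation_steps=4,  # effective batch size 4
    learning_rate=1e-4,
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    save_steps=500,                 # checkpoint every 500 steps
)
```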

#### Speeds, Sizes, Times

- **Training time:** 15 hours on an **NVIDIA P100 (16GB)**
- **Checkpointing:** every 500 steps

## Evaluation

### Testing Data

- Validation dataset: Spider validation set (translated to Arabic)

### Metrics

- Exact Match (EM) for SQL correctness
- Execution Accuracy (EX) on the databases
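
As a rough illustration of EM, the check below compares predicted and gold queries after light normalization. Note that the official Spider evaluation is component-based and more forgiving (it ignores, e.g., ordering inside clauses); this simplified stand-in is stricter.

```python
import re

def exact_match(pred: str, gold: str) -> bool:
    """Naive exact match: lowercase, collapse whitespace, drop a trailing semicolon."""
    norm = lambda q: re.sub(r"\s+", " ", q.strip().lower()).rstrip(";")
    return norm(pred) == norm(gold)

print(exact_match("SELECT name FROM singer", "select name\nfrom singer;"))  # True
```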

### Results

- The model achieved **competitive SQL generation accuracy** on Arabic questions.
- Further testing is required to assess robustness.

## Environmental Impact

- **Hardware Type:** NVIDIA Tesla P100 (16GB)
- **Hours used:** 15
- **Cloud Provider:** Kaggle
- **Carbon Emitted:** Not measured; it can be estimated with the [ML Impact Calculator](https://mlco2.github.io/impact#compute)

## Technical Specifications

### Model Architecture and Objective

- Transformer-based **Qwen2.5-1.5B** architecture.
- Fine-tuned for the Text-to-SQL task using LoRA.

### Compute Infrastructure

- **Hardware:** Kaggle P100 GPU (16GB VRAM)
- **Software:** Python, Transformers, LLaMA-Factory, Hugging Face Hub

## Citation

If you use this model, please cite:

```bibtex
@misc{OsamaMo_ArabicSQL,
  author       = {Osama Mohamed},
  title        = {Arabic Text-To-SQL Model},
  year         = {2024},
  howpublished = {\url{https://huggingface.co/OsamaMo/Arabic_Text-To-SQL}}
}
```

## Model Card Contact

For questions, contact **Osama Mohamed** via Hugging Face ([OsamaMo](https://huggingface.co/OsamaMo)).