---
library_name: transformers
tags:
  - machine-translation
  - tamil
  - colloquial-tamil
  - nlp
---

# Model Card for Shrav20/colloquial-tamil-mt

## 📌 Model Summary
This is a **Machine Translation (MT) model** that translates between **English and colloquial Tamil** in both directions. Unlike traditional Tamil MT models, which focus on formal written Tamil, it produces the **natural spoken Tamil** used in everyday conversation.

## 📊 Model Details

- **Developed by:** Shrav20
- **Funded by:** Independent
- **Shared by:** Shrav20
- **Model Type:** Sequence-to-Sequence (Seq2Seq) Translation
- **Architecture:** Based on **M2M100 (Facebook’s Multilingual MT Model)**, fine-tuned for colloquial Tamil.
- **Languages Supported:**
  - `English → Tamil (Colloquial)`
  - `Tamil (Colloquial) → English`
- **License:** MIT
- **Fine-tuned from:** `facebook/m2m100_418M`

---

## 🛠 Model Usage
### 🔹 Direct Use
You can use this model for colloquial Tamil translation in conversational AI, subtitles, and chatbots.

#### Example Code:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shrav20/colloquial-tamil-mt")
model = AutoModelForSeq2SeqLM.from_pretrained("Shrav20/colloquial-tamil-mt")

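def translate(text, src_lang="en", tgt_lang="ta"):
    # Standard M2M100 usage: set the source language on the tokenizer ...
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    # ... and force the target language token as the first generated token.
    output = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id(tgt_lang))
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Example translation (English → colloquial Tamil)
print(translate("The pharmacy is near the bus stop."))  # Output: "Bus stop pakkathula pharmacy iruku."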
```

---

## 📖 Training Details

### 📌 Training Dataset
- This model is fine-tuned on the **Shrav20/colloquial-tamil** dataset.
- **Sources:**
  - `sangeethat/colloquial`
  - AI-generated data
  - Internet-scraped content
  - Manually verified colloquial sentences

### 🛠 Training Hyperparameters
- `Batch Size:` 16
- `Learning Rate:` 5e-5
- `Epochs:` 3
- `Optimizer:` AdamW
- `Precision:` fp16 (mixed precision)
- `LoRA Adapters:` Enabled for efficient fine-tuning
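
For readers who want to set up a comparable run, the hyperparameters above can be sketched as keyword arguments for `Seq2SeqTrainingArguments` from `transformers`. This is an illustrative mapping, not the actual training script: the helper names, the `output_dir` value, the reading of batch size as per-device, and the choice of `optim="adamw_torch"` are all assumptions. The LoRA settings (rank, alpha, target modules) are not stated in this card, so they are left out.

```python
def training_kwargs(output_dir="colloquial-tamil-mt-out"):
    """Return this card's hyperparameters as Seq2SeqTrainingArguments kwargs.

    `output_dir` is a hypothetical path chosen for illustration.
    """
    return {
        "output_dir": output_dir,
        "per_device_train_batch_size": 16,  # Batch Size: 16 (assumed per device)
        "learning_rate": 5e-5,              # Learning Rate: 5e-5
        "num_train_epochs": 3,              # Epochs: 3
        "optim": "adamw_torch",             # AdamW (exact variant is an assumption)
        "fp16": True,                       # mixed precision; needs a supported GPU
    }


def build_training_args():
    # Imported lazily so the dictionary above can be inspected
    # even when transformers is not installed.
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(**training_kwargs())


if __name__ == "__main__":
    for name, value in training_kwargs().items():
        print(f"{name} = {value}")
```

Keeping the values in a plain dictionary makes them easy to log or compare between runs before any trainer object is created.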

---

## 📊 Evaluation

### 📌 Testing Data & Metrics
- **Dataset:** 5,000 colloquial Tamil-English sentence pairs
- **Evaluation Metrics:**
  - BLEU Score: **28.5**
  - METEOR Score: **34.1**
  - TER: **41.2** 
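
Unlike BLEU and METEOR, TER is an error rate, so lower is better: it counts the edits needed to turn a hypothesis into its reference, divided by the reference length. The sketch below is a simplified, pure-Python illustration that uses word-level edit distance and omits TER's phrase-shift operation. It is not the evaluation script behind the scores above, and the function name `simple_ter` is hypothetical. For reported scores, a maintained implementation such as the one in `sacrebleu` would normally be used.

```python
def simple_ter(hypothesis: str, reference: str) -> float:
    """Word-level edit distance divided by reference length, as a percentage.

    Simplification: real TER also allows block shifts; this sketch does not.
    """
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the hypothesis words
    # consumed so far and the first j reference words.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h
                            curr[j - 1] + 1,      # insert r
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "Bus stop pakkathula pharmacy iruku."
    print(simple_ter("Bus stop pakkathula pharmacy iruku.", reference))   # 0.0
    print(simple_ter("Bus stand pakkathula pharmacy iruku.", reference))  # 20.0
```

In the second call one of five reference words differs, which gives one substitution and a rate of 20.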

### 📌 Example Outputs
| English | Tamil (Colloquial) |
|---------|--------------------|
| The pharmacy is near the bus stop. | Bus stop pakkathula pharmacy iruku. |
| Take this medicine after food. | Food saptadhukku apram intha medicine eduthukungo. |
| Train tickets for tomorrow are available. | Naalaikku train tickets available iruku. |

---

## 🚨 Bias, Risks, and Limitations
- **Dialectal Bias:** The model is trained on a specific style of spoken Tamil and may not generalize to all Tamil dialects.
- **Data Noise:** Some of the AI-generated training data may not be fully accurate.
- **Context Sensitivity:** The model struggles with complex sentence structures and ambiguous meanings.

---

## 💡 How to Contribute
- If you find issues or have improvements, feel free to open a **GitHub issue** or contribute data via Hugging Face!

📩 **Contact:** Shrav20 via Hugging Face discussions.

---

## 📝 Citation
If you use this model, please cite:
```bibtex
@misc{shrav20colloquial,
  author = {Shrav20},
  title = {Colloquial Tamil Machine Translation Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Shrav20/colloquial-tamil-mt}
}
```

---

## 🌱 Future Improvements
- ✅ More diverse datasets
- ✅ Better handling of Tamil-English code-mixing
- ✅ Improved sentence fluency with **RLHF** (Reinforcement Learning from Human Feedback)