---
library_name: transformers
tags:
- machine-translation
- tamil
- colloquial-tamil
- nlp
---

# Model Card for Shrav20/colloquial-tamil-mt

## 📌 Model Summary

This model is a **Machine Translation (MT) model** for translating **English to colloquial Tamil** and vice versa. Unlike traditional Tamil MT models, which focus on formal Tamil, this model generates translations in the **natural spoken Tamil** used in everyday conversation.

## 📊 Model Details

- **Developed by:** Shrav20
- **Funded by:** Independent
- **Shared by:** Shrav20
- **Model Type:** Sequence-to-Sequence (Seq2Seq) Translation
- **Architecture:** Based on **M2M100 (Facebook’s Multilingual MT Model)**, fine-tuned for colloquial Tamil.
- **Languages Supported:**
  - `English → Tamil (Colloquial)`
  - `Tamil (Colloquial) → English`
- **License:** MIT
- **Fine-tuned from:** `facebook/m2m100_418M`

---

## 🛠 Model Usage

### 🔹 Direct Use

You can use this model for colloquial Tamil translation in conversational AI, subtitles, and chatbots.

#### Example Code:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shrav20/colloquial-tamil-mt")
model = AutoModelForSeq2SeqLM.from_pretrained("Shrav20/colloquial-tamil-mt")

def translate(text, src_lang="en", tgt_lang="ta"):
    # M2M100 tokenizers need the source language set before encoding.
    # Assumes this checkpoint keeps the base M2M100 tokenizer and language codes.
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    # Force the first generated token to be the target-language tag.
    output = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id(tgt_lang))
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Example Translation
print(translate("The pharmacy is near the bus stop."))
# Output: "Bus stop pakkathula pharmacy iruku."
```

---

## 📖 Training Details

### 📌 Training Dataset

- This model is fine-tuned on the **Shrav20/colloquial-tamil** dataset.
- **Sources:**
  - `sangeethat/colloquial`
  - AI-generated data
  - Internet-scraped content
  - Manually verified colloquial sentences

### 🛠 Training Hyperparameters

- `Batch Size:` 16
- `Learning Rate:` 5e-5
- `Epochs:` 3
- `Optimizer:` AdamW
- `Precision:` fp16 (mixed precision)
- `LoRA Adapters:` Enabled for efficient fine-tuning

---

## 📊 Evaluation

### 📌 Testing Data & Metrics

- **Dataset:** 5,000 colloquial Tamil-English sentence pairs
- **Evaluation Metrics:**
  - BLEU Score: **28.5**
  - METEOR Score: **34.1**
  - TER: **41.2**

### 📌 Example Outputs

| English | Tamil (Colloquial) |
|---------|--------------------|
| The pharmacy is near the bus stop. | Bus stop pakkathula pharmacy iruku. |
| Take this medicine after food. | Food saptadhukku apram intha medicine eduthukungo. |
| Train tickets for tomorrow are available. | Naalaikku train tickets available iruku. |

---

## 🚨 Bias, Risks, and Limitations

- **Dialectal Bias:** The model is trained on a specific style of spoken Tamil and may not generalize to all Tamil dialects.
- **Data Noise:** Some of the AI-generated training content may not be fully accurate.
- **Context Sensitivity:** The model struggles with complex sentence structures and ambiguous meanings.

---

## 💡 How to Contribute

- If you find issues or have improvements, feel free to open a **GitHub issue** or contribute data via Hugging Face!

📩 **Contact:** Shrav20 via Hugging Face discussions.

---

## 📝 Citation

If you use this model, please cite:

```
@misc{shrav20colloquial,
  author = {Shrav20},
  title = {Colloquial Tamil Machine Translation Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Shrav20/colloquial-tamil-mt}
}
```

---

## 🌱 Future Improvements

- ✅ More diverse datasets
- ✅ Better handling of Tamil-English code-mixing
- ✅ Improved sentence fluency with **RLHF** (Reinforcement Learning from Human Feedback)
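
---

## 🧪 Illustrative BLEU Sketch

The Evaluation section reports a BLEU score without naming the tool used to compute it. For readers unfamiliar with the metric, the following is a minimal, dependency-free sketch of corpus-level BLEU. It is an illustration only, not the script used for this card: the function names (`ngrams`, `corpus_bleu`), the whitespace tokenization, the single reference per sentence, and the absence of smoothing are all assumptions, so its scores are not comparable to the reported 28.5.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (0-100), one reference per hypothesis.

    Assumptions: whitespace tokenization and no smoothing, so results
    will differ from standard tools and from the figure in this card.
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Counter intersection clips each count by the reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output shorter than the references.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    # Reference taken from the Example Outputs table; the hypothesis is
    # a made-up variant used purely for illustration.
    refs = ["Bus stop pakkathula pharmacy iruku."]
    hyps = ["Bus stop pakkathula pharmacy irukku."]
    print(f"BLEU: {corpus_bleu(hyps, refs):.1f}")
```

For reporting numbers that others can compare against, an established implementation with a documented tokenizer is preferable to a hand-rolled one like this.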