Khmer mT5 Summarization Model

📌 Introduction

This repository contains a fine-tuned mT5 model for Khmer text summarization. The model is based on Google's mT5-small and fine-tuned on a dataset of Khmer text and corresponding summaries.

Fine-tuning was performed using the Hugging Face Trainer API, optimizing the model to generate concise and meaningful summaries of Khmer text.


🚀 Model Details

  • Base Model: google/mt5-small
  • Fine-tuned for: Khmer text summarization
  • Training Dataset: kimleang123/khmer-text-dataset
  • Framework: Hugging Face transformers
  • Task Type: Sequence-to-Sequence (Seq2Seq)
  • Input: Khmer text (articles, paragraphs, or documents)
  • Output: Summarized Khmer text
  • Training Hardware: GPU (Tesla T4)
  • Evaluation Metric: ROUGE Score
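
The Seq2Seq setup listed above implies a preprocessing step that pairs each prefixed input with a tokenized target summary. The sketch below shows one common way to do this; it is not taken from the original training script. The column names `text` and `summary` and the 150-token target limit are assumptions, while the `summarize: ` prefix and the 512-token input limit match the inference code in this README.

```python
from typing import Callable, Dict, List

PREFIX = "summarize: "  # same task prefix the inference examples in this README use


def preprocess(batch: Dict[str, List[str]], tokenizer: Callable,
               text_key: str = "text", summary_key: str = "summary",
               max_input_len: int = 512, max_target_len: int = 150) -> dict:
    """Turn a batch of raw records into model inputs plus `labels`.

    `text_key` and `summary_key` are hypothetical column names; check the
    dataset card for the real ones before using this.
    """
    # Prepend the task prefix to every source document
    sources = [PREFIX + t for t in batch[text_key]]
    model_inputs = tokenizer(sources, max_length=max_input_len, truncation=True)
    # Tokenize the reference summaries as decoder targets
    targets = tokenizer(text_target=batch[summary_key],
                        max_length=max_target_len, truncation=True)
    model_inputs["labels"] = targets["input_ids"]
    return model_inputs
```

With a Hugging Face dataset this would typically be applied as `dataset.map(lambda batch: preprocess(batch, tokenizer), batched=True)`.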

🔧 Installation & Setup

1️⃣ Install Dependencies

Ensure you have transformers, torch, and datasets installed. The mT5 tokenizer may also need sentencepiece:

pip install transformers torch datasets sentencepiece

2️⃣ Load the Model

To load and use the fine-tuned model:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "songhieng/khmer-mt5-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

📌 How to Use

1️⃣ Using Python Code

# Uses the `tokenizer` and `model` loaded in "Load the Model" above
def summarize_khmer(text, max_length=150):
    # The model expects the "summarize: " task prefix
    input_text = f"summarize: {text}"
    # Inputs longer than 512 tokens are truncated
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
    summary_ids = model.generate(**inputs, max_length=max_length, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# "Cambodia has a population of about 16 million, and it is a country in Southeast Asia."
khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarize_khmer(khmer_text)
print("🔹 Khmer Summary:", summary)
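
Because the tokenizer call truncates at 512 tokens, anything beyond that point in a long document never reaches the model. One possible workaround, sketched below, is to split the text at the Khmer full stop (។), pack whole sentences into chunks, summarize each chunk, and join the results. This is a simple heuristic and not part of the original model code; the `max_chars` budget is an assumed stand-in for the token limit and would need tuning against the real tokenizer.

```python
import re
from typing import Callable, List

KHAN = "\u17d4"  # "។", the Khmer full stop


def split_sentences(text: str) -> List[str]:
    """Split on the Khmer full stop, keeping the mark attached to its sentence."""
    parts = re.findall(f"[^{KHAN}]+{KHAN}?", text)
    return [p.strip() for p in parts if p.strip()]


def chunk_sentences(sentences: List[str], max_chars: int = 1000) -> List[str]:
    """Greedily pack whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks


def summarize_long(text: str, summarize: Callable[[str], str], max_chars: int = 1000) -> str:
    """Summarize each chunk with `summarize` and join the partial summaries."""
    chunks = chunk_sentences(split_sentences(text), max_chars)
    return " ".join(summarize(chunk) for chunk in chunks)
```

For example, `summarize_long(long_text, summarize_khmer)` would reuse the function defined above for each chunk. The joined output can itself be long, so a second summarization pass over it may be worthwhile.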

2️⃣ Using Hugging Face Pipeline

For a simpler approach:

from transformers import pipeline

summarizer = pipeline("summarization", model="songhieng/khmer-mt5-summarization")
# "Cambodia has a population of about 16 million, and it is a country in Southeast Asia."
khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarizer(khmer_text, max_length=150, min_length=30, do_sample=False)
print("🔹 Khmer Summary:", summary[0]['summary_text'])

3️⃣ Deploy as an API using FastAPI

You can create a simple API for summarization:

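from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the model once at startup rather than on every request
model_name = "songhieng/khmer-mt5-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

app = FastAPI()

class SummarizeRequest(BaseModel):
    text: str  # Khmer text to summarize, sent in the JSON request body

@app.post("/summarize/")
def summarize(request: SummarizeRequest):
    inputs = tokenizer(f"summarize: {request.text}", return_tensors="pt", truncation=True, max_length=512)
    summary_ids = model.generate(**inputs, max_length=150, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return {"summary": summary}

# Requires: pip install fastapi uvicorn
# Run with: uvicorn filename:app --reload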

📊 Model Evaluation

The model was evaluated using ROUGE scores, which measure how similar the generated summaries are to the ground truth summaries.

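import numpy as np
import evaluate  # pip install evaluate rouge_score

# datasets.load_metric has been removed from recent datasets releases; evaluate.load replaces it
rouge = evaluate.load("rouge")

def compute_metrics(pred):
    pred_ids = pred.predictions
    # If the data collator pads labels with -100, map those back to the pad token before decoding
    labels_ids = np.where(pred.label_ids != -100, pred.label_ids, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)

# `trainer` is the Trainer instance used for fine-tuning, created with compute_metrics=compute_metrics
trainer.evaluate()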
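
To make the ROUGE description above concrete, here is a minimal, dependency-free sketch of ROUGE-1 (unigram overlap) between a generated summary and a reference. It is an illustration only, not the scoring code used for this model. The character-level tokenizer is an assumption chosen because Khmer is written without spaces between words; it is crude (combining marks count as separate tokens), and a proper Khmer word segmenter would give more meaningful scores. Note also that the default tokenizer in the `rouge_score` package is geared to English and may discard Khmer characters, so passing a custom tokenizer is worth considering when scoring Khmer text.

```python
from collections import Counter
from typing import Callable, Dict, List


def char_tokens(text: str) -> List[str]:
    """Hypothetical tokenizer: every non-whitespace character is one token."""
    return [ch for ch in text if not ch.isspace()]


def rouge1(prediction: str, reference: str,
           tokenize: Callable[[str], List[str]] = char_tokens) -> Dict[str, float]:
    """ROUGE-1 precision, recall, and F1 from clipped unigram overlap."""
    pred = Counter(tokenize(prediction))
    ref = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count per token (clipped matches)
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Precision is the share of generated tokens that also appear in the reference, recall is the share of reference tokens recovered, and F1 is their harmonic mean.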

💾 Saving & Uploading the Model

After fine-tuning, the model was uploaded to Hugging Face Hub:

model.push_to_hub("songhieng/khmer-mt5-summarization")
tokenizer.push_to_hub("songhieng/khmer-mt5-summarization")

To download it later:

model = AutoModelForSeq2SeqLM.from_pretrained("songhieng/khmer-mt5-summarization")
tokenizer = AutoTokenizer.from_pretrained("songhieng/khmer-mt5-summarization")

🎯 Summary

Feature           | Details
Base Model        | google/mt5-small
Task              | Summarization
Language          | Khmer (ខ្មែរ)
Dataset           | kimleang123/khmer-text-dataset
Framework         | Hugging Face Transformers
Evaluation Metric | ROUGE Score
Deployment        | Hugging Face Model Hub, API (FastAPI), Python Code

🀝 Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request if you have improvements to suggest.

📬 Contact

If you have any questions, feel free to reach out via Hugging Face Discussions or create an issue in the repository.

📌 Built for the Khmer NLP Community 🇰🇭 🚀
