Myanmar Written/Spoken Style Classifier

License: MIT

This repository contains a machine learning model that classifies Myanmar text into two categories: "written" (formal style) and "spoken" (informal style). The model is built using scikit-learn, employing TF-IDF for text vectorization and Logistic Regression for classification.

Model Description

This model distinguishes formal written Myanmar (ရေးဟန်) from informal spoken Myanmar (ပြောဟန်). TF-IDF weights each term by how often it appears in a sentence and how rare it is across the corpus, and Logistic Regression learns a linear decision boundary over those weights.

  • Model: Logistic Regression
  • Vectorizer: TF-IDF (Term Frequency-Inverse Document Frequency)
  • Programming Language: Python
  • Library: scikit-learn
  • Author: kalixlouiis
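
As an illustration of how the two components listed above fit together, the following is a minimal sketch rather than the author's training script. The toy sentences, the `build_pipeline` helper, and the character n-gram settings are assumptions made for this example; this README does not document the vectorizer's actual settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy sentences written for illustration only (not taken from the real dataset).
# Label convention follows this README: 0 = spoken, 1 = written.
toy_texts = [
    "သူသည် ကျောင်းသို့ သွားသည်။",
    "ယနေ့ မိုးရွာမည် ထင်သည်။",
    "ကျွန်ုပ်သည် စာအုပ်ဖတ်သည်။",
    "သူ ကျောင်းကို သွားတယ်။",
    "ဒီနေ့ မိုးရွာမယ် ထင်တယ်။",
    "ကျွန်တော် စာအုပ်ဖတ်တယ်။",
]
toy_labels = [1, 1, 1, 0, 0, 0]


def build_pipeline():
    """Hypothetical helper: TF-IDF features followed by Logistic Regression."""
    return make_pipeline(
        # Assumption: character n-grams, which need no Myanmar word segmenter.
        TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
        # L2 is scikit-learn's default penalty, so it is not passed explicitly.
        LogisticRegression(C=100, solver="liblinear"),
    )


toy_pipeline = build_pipeline()
toy_pipeline.fit(toy_texts, toy_labels)
toy_predictions = toy_pipeline.predict(toy_texts)
print(list(toy_predictions))
```

Character n-grams are used here only because they work without a word segmenter; the published vectorizer may tokenize differently.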

Dataset

The model was trained on the kalixlouiis/myanmar-written-spoken-classification dataset available on Hugging Face Datasets:

  • Dataset: kalixlouiis/myanmar-written-spoken-classification
  • Description: The dataset consists of Myanmar sentences labeled as either "written" or "spoken". It contains 2400 sentences, equally split (1200 each) between the two categories. The dataset has been preprocessed to remove duplicate entries.
  • Dataset split: 80/20 (train/test), stratified by label.
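
A stratified 80/20 split of this kind can be sketched with scikit-learn's `train_test_split`. The placeholder sentences and the `random_state` value are assumptions; the seed used for the published model is not documented here.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Placeholder data standing in for the real dataset: 1200 "spoken" (0) and
# 1200 "written" (1) items, matching the counts stated above.
texts = [f"sentence {i}" for i in range(2400)]
labels = [0] * 1200 + [1] * 1200

train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts,
    labels,
    test_size=0.2,     # 80/20 split
    stratify=labels,   # keep the class ratio identical in both parts
    random_state=42,   # assumption: the original seed is not documented
)

print(len(train_texts), len(test_texts))
print(Counter(test_labels))
```

With exactly 2,400 balanced sentences, this produces 1,920 training and 480 test sentences, with 240 test sentences per class.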

Model Performance

The model achieved the following performance on the test set:

  • Accuracy: 1.0000

  • Classification Report:

                  precision    recall  f1-score   support
    
       spoken (0)       1.00      1.00      1.00       240
      written (1)       1.00      1.00      1.00       241
    
         accuracy                           1.00       481
        macro avg       1.00      1.00      1.00       481
     weighted avg       1.00      1.00      1.00       481
    
  • Confusion Matrix:

                        Predicted spoken    Predicted written
      True spoken             240                   0
      True written              0                 241

  • Interpretation: The model classified all 481 test sentences correctly, so precision, recall, and F1-score are 1.00 for both classes.
  • Important Note: A perfect score on a small test set should be read with caution. It may indicate overfitting, or a test set that closely resembles the training data, rather than robust generalization. Further evaluation, such as cross-validation and testing on a larger, more diverse set of sentences, is recommended. Note also that the test support (481) is slightly more than 20% of the 2,400 sentences stated above (480), so one of the two figures is likely approximate.
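
The cross-validation recommended above could be run roughly as follows. This is a sketch under assumptions: the toy sentences, the character n-gram vectorizer settings, and the 3-fold configuration are illustrative and are not taken from the original training code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy sentences for illustration only; 0 = spoken, 1 = written.
toy_texts = [
    "သူသည် ကျောင်းသို့ သွားသည်။",
    "ယနေ့ မိုးရွာမည် ထင်သည်။",
    "ကျွန်ုပ်သည် စာအုပ်ဖတ်သည်။",
    "သူ ကျောင်းကို သွားတယ်။",
    "ဒီနေ့ မိုးရွာမယ် ထင်တယ်။",
    "ကျွန်တော် စာအုပ်ဖတ်တယ်။",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# Putting the vectorizer inside the pipeline means it is refit on each
# training fold, so no vocabulary statistics leak from the held-out fold.
pipeline = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),  # assumed settings
    LogisticRegression(C=100, solver="liblinear"),
)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
cv_scores = cross_val_score(pipeline, toy_texts, toy_labels, cv=cv, scoring="accuracy")

print("Per-fold accuracy:", cv_scores)
print("Mean accuracy:", cv_scores.mean())
```

On the real dataset, a larger number of folds (for example 5 or 10) and the spread of the per-fold scores would give a better picture than a single 80/20 split.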

Hyperparameters

The following hyperparameters were used for the Logistic Regression model, determined through GridSearchCV:

  • C: 100
  • penalty: 'l2'
  • solver: 'liblinear'
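
A search of this kind could look roughly like the sketch below. The grid values, the 3-fold setting, the vectorizer settings, and the toy sentences are assumptions; this README reports only the selected values. In scikit-learn, `C` is the inverse of the regularization strength, so `C = 100` corresponds to weak regularization.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Toy sentences for illustration only; 0 = spoken, 1 = written.
toy_texts = [
    "သူသည် ကျောင်းသို့ သွားသည်။",
    "ယနေ့ မိုးရွာမည် ထင်သည်။",
    "ကျွန်ုပ်သည် စာအုပ်ဖတ်သည်။",
    "သူ ကျောင်းကို သွားတယ်။",
    "ဒီနေ့ မိုးရွာမယ် ထင်တယ်။",
    "ကျွန်တော် စာအုပ်ဖတ်တယ်။",
]
toy_labels = [1, 1, 1, 0, 0, 0]

pipeline = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),  # assumed settings
    LogisticRegression(solver="liblinear"),
)

# Hypothetical grid: the values actually searched are not documented.
# make_pipeline names the step after the lowercased class name.
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(toy_texts, toy_labels)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```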

How to Use

  1. Prerequisites:

    • Python 3.6 or higher
    • scikit-learn
    • joblib
    • pandas (optional, for loading data from CSV)
    • huggingface_hub (optional, for downloading the model from the Hugging Face Hub)
  2. Installation (if needed):

    pip install scikit-learn joblib pandas huggingface_hub
    
  3. Loading the Model and Vectorizer:

    import joblib
    import os
    
    # Replace with the actual path to your model files.  If you've uploaded
    # to the Hugging Face Hub, see the next section.
    MODEL_DIR = "path/to/your/model"  # Example: "myanmar-written-spoken-classifier"
    
    model = joblib.load(os.path.join(MODEL_DIR, "model.joblib"))
    vectorizer = joblib.load(os.path.join(MODEL_DIR, "vectorizer.joblib"))
    
  4. Classifying New Text:

    def classify_text(text, model, vectorizer):
        """Classifies a Myanmar sentence as written or spoken.
    
        Args:
            text: The input sentence (string).
            model: The loaded Logistic Regression model.
            vectorizer: The loaded TF-IDF vectorizer.
    
        Returns:
            "written" or "spoken" (string).
        """
        text_tfidf = vectorizer.transform([text])
        prediction = model.predict(text_tfidf)[0]
        return "written" if prediction == 1 else "spoken"
    
    # Example usage:
    new_text = "ယနေ့ မိုးရွာမည် ထင်သည်။"  # "I think it will rain today."
    result = classify_text(new_text, model, vectorizer)
    print(f"'{new_text}' is classified as: {result}")
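
To classify several sentences at once and see how confident the model is, the same idea can be extended as sketched below. The `classify_batch` helper is hypothetical, and the small model trained here is only a stand-in so that the example runs without the published files; in practice you would pass the `model` and `vectorizer` loaded in step 3.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def classify_batch(texts, model, vectorizer):
    """Hypothetical helper: classify several sentences and report a confidence.

    Returns a list of (text, label, confidence) tuples.
    """
    cleaned = [t.strip() for t in texts if isinstance(t, str)]
    if len(cleaned) != len(texts) or any(not t for t in cleaned):
        raise ValueError("every input must be a non-empty string")

    probabilities = model.predict_proba(vectorizer.transform(cleaned))
    # Columns of predict_proba follow model.classes_; find label 1 ("written").
    written_column = list(model.classes_).index(1)

    results = []
    for text, row in zip(cleaned, probabilities):
        p_written = float(row[written_column])
        label = "written" if p_written >= 0.5 else "spoken"
        results.append((text, label, max(p_written, 1.0 - p_written)))
    return results


# Stand-in model trained on toy sentences (illustration only; 0 = spoken, 1 = written).
toy_texts = [
    "သူသည် ကျောင်းသို့ သွားသည်။",
    "ကျွန်ုပ်သည် စာအုပ်ဖတ်သည်။",
    "သူ ကျောင်းကို သွားတယ်။",
    "ကျွန်တော် စာအုပ်ဖတ်တယ်။",
]
toy_labels = [1, 1, 0, 0]
toy_vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))  # assumed settings
toy_model = LogisticRegression(C=100, solver="liblinear")
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

batch_results = classify_batch(["သူသည် စာဖတ်သည်။", "သူ စာဖတ်တယ်။"], toy_model, toy_vectorizer)
for text, label, confidence in batch_results:
    print(f"{text} -> {label} ({confidence:.2f})")
```

Rejecting empty input up front avoids silently classifying an all-zero feature vector, which would otherwise still produce a label.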
    

How to Load from Hugging Face Hub

from huggingface_hub import hf_hub_download
import joblib
import os

# Replace "kalixlouiis/myanmar-written-spoken-classifier" with your actual repo ID.
REPO_ID = "kalixlouiis/myanmar-written-spoken-classifier"
MODEL_FILENAME = "model.joblib"
VECTORIZER_FILENAME = "vectorizer.joblib"
LOCAL_DIR = "downloaded_model"  # Or any directory you prefer

# Create the local directory if it doesn't exist
os.makedirs(LOCAL_DIR, exist_ok=True)

# Download the model and vectorizer. hf_hub_download returns the local path
# of each downloaded file.
try:
    model_path = hf_hub_download(repo_id=REPO_ID, filename=MODEL_FILENAME, local_dir=LOCAL_DIR)
    vectorizer_path = hf_hub_download(repo_id=REPO_ID, filename=VECTORIZER_FILENAME, local_dir=LOCAL_DIR)
    print("Model and vectorizer downloaded successfully.")
except Exception as e:
    # Stop here: without the files, the joblib.load calls below would fail.
    raise SystemExit(f"Error downloading files: {e}")

# Load the model and vectorizer
model = joblib.load(model_path)
vectorizer = joblib.load(vectorizer_path)

# Now you can use the model and vectorizer as shown in the previous section.

Ethical Considerations

  • Bias: The model's performance and predictions may reflect biases present in the training data. It's crucial to be aware of potential biases, especially if the dataset is limited in scope or representation.
  • Fairness: Evaluate the model's performance across different demographic groups (e.g., speakers from different regions) to ensure fairness.
  • Transparency: This README provides transparency about the model's architecture, training data, and performance.
  • Accountability: I, kalixlouiis, am the creator of this model and am responsible for addressing any issues or concerns related to its use.

Future Improvements

  • Overfitting Mitigation: Report cross-validated scores alongside the single test split, and try stronger regularization (in scikit-learn, a smaller C than the selected value of 100).
  • Dataset Expansion: Increase the size and diversity of the training data to improve generalization.
  • Error Analysis: Analyze misclassified examples to identify areas for improvement.
  • Explore Alternative Models: Consider experimenting with other machine learning models, including deep learning approaches (e.g., Transformers), for potentially improved performance.
  • Add example usage: Include additional examples demonstrating how to use the model, including handling potential errors.
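
The error analysis mentioned above could start from something as simple as the sketch below. The `find_misclassified` helper and the example labels are hypothetical; on the reported test set the list would be empty, so this becomes useful once the model is evaluated on new data.

```python
def find_misclassified(texts, true_labels, predicted_labels):
    """Hypothetical helper: collect the sentences whose prediction was wrong."""
    names = {0: "spoken", 1: "written"}
    return [
        {"text": text, "expected": names[t], "predicted": names[p]}
        for text, t, p in zip(texts, true_labels, predicted_labels)
        if t != p
    ]


# Made-up labels and predictions, for illustration only.
example_texts = ["sentence A", "sentence B", "sentence C"]
example_true = [1, 0, 1]
example_pred = [1, 1, 1]

misclassified = find_misclassified(example_texts, example_true, example_pred)
for row in misclassified:
    print(row)
```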

License

This project is licensed under the MIT License - see the LICENSE file for details.
