Myanmar Written/Spoken Style Classifier

This repository contains a machine learning model that classifies Myanmar text into two categories: "written" (formal style) and "spoken" (informal style). The model is built using scikit-learn, employing TF-IDF for text vectorization and Logistic Regression for classification.

Model Description

This model distinguishes formal written Myanmar (ရေးဟန်) from informal spoken Myanmar (ပြောဟန်). TF-IDF converts each sentence into a vector that weights terms by how often they occur in the sentence and how rare they are across the corpus; Logistic Regression then learns a linear decision boundary over those vectors.

Model: Logistic Regression
Vectorizer: TF-IDF (Term Frequency-Inverse Document Frequency)
Programming Language: Python
Library: scikit-learn
Author: kalixlouiis
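
To make the TF-IDF step concrete, here is a minimal plain-Python sketch that computes TF-IDF weights by hand for a toy corpus. It uses the smoothed IDF and L2 normalization that scikit-learn's TfidfVectorizer applies by default. The whitespace tokenizer is an assumption made for this sketch, since the vectorizer settings used for this model are not documented here, and the second and third sentences are made-up illustrations.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


# Toy corpus. Whitespace tokenization is an assumption made for this sketch.
docs = [
    "ယနေ့ မိုးရွာမည် ထင်သည်။".split(),
    "ယနေ့ နေသာမည် ထင်သည်။".split(),
    "ဒီနေ့ မိုးရွာမယ် ထင်တယ်။".split(),
]
vectors = tfidf_vectors(docs)
for term, weight in sorted(vectors[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}\t{weight:.3f}")
```

Terms that appear in fewer documents receive a larger IDF, so within a sentence they end up with a higher weight than terms shared across the corpus.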

Dataset

The model was trained on the kalixlouiis/myanmar-written-spoken-classification dataset available on Hugging Face Datasets:

Dataset: kalixlouiis/myanmar-written-spoken-classification
Description: The dataset consists of Myanmar sentences labeled as either "written" or "spoken". It contains 2400 sentences, equally split (1200 each) between the two categories. The dataset has been preprocessed to remove duplicate entries.
Dataset split: 80/20 (train/test), stratified by label.
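
A stratified split keeps the class ratio the same in both parts. The sketch below illustrates the idea in plain Python; it is not the author's code, and the seed and function name are assumptions. In scikit-learn the same effect is typically obtained with `train_test_split(..., test_size=0.2, stratify=labels)`.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) preserving the label ratio."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# 1200 "spoken" (0) and 1200 "written" (1) labels, as in the dataset description.
labels = [0] * 1200 + [1] * 1200
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 1920 480
```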

Model Performance

The model achieved the following performance on the test set:

Accuracy: 1.0000

Classification Report:

              precision    recall  f1-score   support

   spoken (0)       1.00      1.00      1.00       240
  written (1)       1.00      1.00      1.00       241

     accuracy                           1.00       481
    macro avg       1.00      1.00      1.00       481
 weighted avg       1.00      1.00      1.00       481

Confusion Matrix:

                 Predicted spoken   Predicted written
True spoken                   240                   0
True written                    0                 241
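
The figures in the classification report follow directly from the confusion matrix. The following sketch recomputes them in plain Python; the function name is illustrative, and the matrix values are the ones reported above.

```python
def per_class_metrics(matrix, class_names):
    """Compute precision, recall, and F1 per class from a square confusion matrix.

    matrix[i][j] is the number of samples with true class i predicted as class j.
    """
    metrics = {}
    for k, name in enumerate(class_names):
        true_positive = matrix[k][k]
        predicted_k = sum(row[k] for row in matrix)  # column sum
        actual_k = sum(matrix[k])                    # row sum
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[name] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": actual_k}
    return metrics


confusion = [[240, 0],   # true spoken
             [0, 241]]   # true written
total = sum(map(sum, confusion))
accuracy = sum(confusion[k][k] for k in range(len(confusion))) / total
report = per_class_metrics(confusion, ["spoken", "written"])
print(f"accuracy={accuracy:.4f}", report["written"])
```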

Interpretation: The model classifies every test sentence correctly, so precision, recall, and F1-score are 1.00 for both classes.
Important Note: A perfect score on a single held-out split is not conclusive. It may reflect overfitting or a test set that closely resembles the training data, so further evaluation such as cross-validation is recommended to check that the model generalizes to unseen data. A larger, more diverse test set would also help.
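
As a rough illustration of the cross-validation suggested above, the sketch below runs stratified k-fold evaluation in plain Python. The `fit_predict` callback and the majority-class placeholder are assumptions made for this example; in practice the callback would fit the TF-IDF vectorizer and Logistic Regression on the training folds only. scikit-learn offers `StratifiedKFold` and `cross_val_score` for the same purpose.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, balancing labels across folds."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def cross_validate(texts, labels, fit_predict, k=5, seed=0):
    """Return the accuracy on each held-out fold.

    fit_predict(train_texts, train_labels, test_texts) must return predicted labels.
    """
    scores = []
    for held_out in stratified_folds(labels, k, seed):
        held_out_set = set(held_out)
        train = [i for i in range(len(labels)) if i not in held_out_set]
        predictions = fit_predict([texts[i] for i in train],
                                  [labels[i] for i in train],
                                  [texts[i] for i in held_out])
        correct = sum(p == labels[i] for p, i in zip(predictions, held_out))
        scores.append(correct / len(held_out))
    return scores


def majority_baseline(train_texts, train_labels, test_texts):
    """Placeholder model: always predicts the most common training label."""
    most_common = max(set(train_labels), key=train_labels.count)
    return [most_common] * len(test_texts)


# Placeholder data: 50 samples per class with dummy texts.
labels = [0] * 50 + [1] * 50
texts = [f"sentence {i}" for i in range(100)]
print(cross_validate(texts, labels, majority_baseline))
```

A large gap between fold scores, or between the cross-validated mean and the single-split score, would be a sign that the reported result depends on the particular split.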

Hyperparameters

The following hyperparameters were used for the Logistic Regression model, determined through GridSearchCV:

C: 100
penalty: 'l2'
solver: 'liblinear'
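
For reference, with penalty 'l2' scikit-learn's LogisticRegression minimizes an objective equivalent, up to constant scaling, to the one below. It is written for labels y_i in {-1, +1}, following the scikit-learn user guide rather than anything stated by the model's author. C is the inverse regularization strength, so C = 100 applies a comparatively weak penalty.

```latex
\min_{w,\,b} \; \frac{1}{2}\, w^{\top} w
  \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(x_i^{\top} w + b\right)\right)\right)
```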

How to Use

Prerequisites:
- Python 3.6 or higher
- scikit-learn
- joblib
- pandas (optional, for loading data from CSV)
- huggingface_hub (optional, for downloading the model from the Hugging Face Hub)

Installation (if needed):
```
pip install scikit-learn joblib pandas huggingface_hub
```

Loading the Model and Vectorizer:

```python
import os

import joblib

# Replace with the actual path to your model files. If the files are hosted
# on the Hugging Face Hub, see "How to Load from Hugging Face Hub" below.
MODEL_DIR = "path/to/your/model"  # Example: "myanmar-written-spoken-classifier"

model = joblib.load(os.path.join(MODEL_DIR, "model.joblib"))
vectorizer = joblib.load(os.path.join(MODEL_DIR, "vectorizer.joblib"))
```

Classifying New Text:

```python
def classify_text(text, model, vectorizer):
    """Classifies a Myanmar sentence as written or spoken.

    Args:
        text: The input sentence (string).
        model: The loaded Logistic Regression model.
        vectorizer: The loaded TF-IDF vectorizer.

    Returns:
        "written" or "spoken" (string).
    """
    text_tfidf = vectorizer.transform([text])
    prediction = model.predict(text_tfidf)[0]
    return "written" if prediction == 1 else "spoken"

# Example usage:
new_text = "ယနေ့ မိုးရွာမည် ထင်သည်။"
result = classify_text(new_text, model, vectorizer)
print(f"'{new_text}' is classified as: {result}")
```
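
If a confidence score is useful alongside the label, a variant along these lines may help. It is a sketch, not part of the published model: the function name is hypothetical, and it assumes the label encoding from the classification report (0 = spoken, 1 = written). It relies on `predict_proba` and `classes_`, which scikit-learn's LogisticRegression provides, and rejects empty input.

```python
def classify_with_confidence(text, model, vectorizer):
    """Return (label, probability) for one Myanmar sentence.

    Assumes the label encoding from the classification report:
    0 = spoken, 1 = written.
    """
    if not isinstance(text, str) or not text.strip():
        raise ValueError("text must be a non-empty string")
    features = vectorizer.transform([text])
    probabilities = model.predict_proba(features)[0]
    # model.classes_ gives the label that each probability column refers to.
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    label = "written" if model.classes_[best] == 1 else "spoken"
    return label, float(probabilities[best])
```

With the model and vectorizer loaded as described earlier, `classify_with_confidence(new_text, model, vectorizer)` returns a pair such as `("written", p)`, where `p` is the predicted probability of the chosen class.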

How to Load from Hugging Face Hub

```python
import os

import joblib
from huggingface_hub import hf_hub_download

# Replace "kalixlouiis/myanmar-written-spoken-classifier" with your actual repo ID.
REPO_ID = "kalixlouiis/myanmar-written-spoken-classifier"
MODEL_FILENAME = "model.joblib"
VECTORIZER_FILENAME = "vectorizer.joblib"
LOCAL_DIR = "downloaded_model"  # Or any directory you prefer

# Create the local directory if it doesn't exist
os.makedirs(LOCAL_DIR, exist_ok=True)

# Download the model and vectorizer. hf_hub_download returns the local path
# of each downloaded file.
try:
    model_path = hf_hub_download(repo_id=REPO_ID, filename=MODEL_FILENAME, local_dir=LOCAL_DIR)
    vectorizer_path = hf_hub_download(repo_id=REPO_ID, filename=VECTORIZER_FILENAME, local_dir=LOCAL_DIR)
    print("Model and vectorizer downloaded successfully.")
except Exception as e:
    print(f"Error downloading files: {e}")
    # Re-raise: the files cannot be loaded below if the download failed.
    raise

# Load the model and vectorizer
model = joblib.load(model_path)
vectorizer = joblib.load(vectorizer_path)

# Now you can use the model and vectorizer as shown in the previous section.
```

Ethical Considerations

Bias: The model's performance and predictions may reflect biases present in the training data. It's crucial to be aware of potential biases, especially if the dataset is limited in scope or representation.
Fairness: Evaluate the model's performance across different demographic groups (e.g., speakers from different regions) to ensure fairness.
Transparency: This README provides transparency about the model's architecture, training data, and performance.
Accountability: I, kalixlouiis, am the creator of this model and am responsible for addressing any issues or concerns related to its use.
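
The fairness check mentioned above can be sketched as a per-group accuracy breakdown. The dataset is not described as carrying group metadata, so the region tags and labels below are invented purely for illustration, and the function name is hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(true_labels, predicted_labels, groups):
    """Return {group: accuracy} for parallel lists of labels and group tags."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, prediction, group in zip(true_labels, predicted_labels, groups):
        total[group] += 1
        correct[group] += int(truth == prediction)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical data: the region tags are invented for illustration only.
y_true = [1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 1, 0]
region = ["A", "A", "A", "B", "B", "B"]
print(accuracy_by_group(y_true, y_pred, region))
```

A noticeably lower accuracy for one group would suggest that its writing or speaking style is under-represented in the training data.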

Future Improvements

Overfitting Mitigation: Validate with k-fold cross-validation on the full dataset and consider stronger regularization (a smaller C) to address potential overfitting.
Dataset Expansion: Increase the size and diversity of the training data to improve generalization.
Error Analysis: Analyze misclassified examples to identify areas for improvement.
Explore Alternative Models: Consider experimenting with other machine learning models, including deep learning approaches (e.g., Transformers), for potentially improved performance.
Additional Examples: Include more usage examples, including how to handle potential errors.
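
For the error analysis item, a small helper like the following could collect wrong predictions for manual review once the model is run on new labeled data. It is a sketch with a hypothetical function name, and it assumes the 0 = spoken, 1 = written encoding from the classification report. The predictions in the example are made up to show the output format.

```python
def misclassified_examples(texts, true_labels, predicted_labels):
    """Return (text, true_name, predicted_name) for every wrong prediction."""
    names = {0: "spoken", 1: "written"}  # encoding from the classification report
    return [
        (text, names[truth], names[prediction])
        for text, truth, prediction in zip(texts, true_labels, predicted_labels)
        if truth != prediction
    ]


# Made-up predictions, only to show the output format.
sample_texts = ["ယနေ့ မိုးရွာမည် ထင်သည်။", "placeholder sentence"]
errors = misclassified_examples(sample_texts, [1, 0], [0, 0])
for text, truth, prediction in errors:
    print(f"{text!r}: true={truth}, predicted={prediction}")
```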

License

This project is licensed under the MIT License - see the LICENSE file for details.
