Myanmar Written/Spoken Style Classifier
This repository contains a machine learning model that classifies Myanmar text into two categories: "written" (formal style) and "spoken" (informal style). The model is built using scikit-learn, employing TF-IDF for text vectorization and Logistic Regression for classification.
Model Description
This model distinguishes formal written Myanmar (ရေးဟန်) from informal spoken Myanmar (ပြောဟန်). It represents each sentence with TF-IDF features, which weight a term by how often it appears in the sentence and how rare it is across the corpus, and classifies those features with Logistic Regression.
- Model: Logistic Regression
- Vectorizer: TF-IDF (Term Frequency-Inverse Document Frequency)
- Programming Language: Python
- Library: scikit-learn
- Author: kalixlouiis
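As a rough illustration of the architecture listed above, the following sketch chains a TF-IDF vectorizer and a Logistic Regression classifier with scikit-learn. The toy sentences, the names `toy_texts` and `toy_labels`, and the default vectorizer settings are assumptions made for illustration; this README does not state how the actual vectorizer was configured.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: ASCII tokens stand in for real Myanmar sentences.
toy_texts = ["aa bb formal", "cc dd formal", "aa bb casual", "cc dd casual"]
toy_labels = [1, 1, 0, 0]  # 1 = written, 0 = spoken (same encoding as the model)

# TF-IDF turns each sentence into a weighted term vector;
# Logistic Regression then learns one weight per term.
pipeline = make_pipeline(
    TfidfVectorizer(),  # default settings, assumed for this sketch
    LogisticRegression(C=100, solver="liblinear"),  # L2 is the default penalty
)
pipeline.fit(toy_texts, toy_labels)

prediction = pipeline.predict(["ee ff formal"])[0]
print("written" if prediction == 1 else "spoken")
```

Wrapping both steps in one pipeline keeps the vectorizer and classifier together, so the same object can be fitted, evaluated, and saved as a unit.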
Dataset
The model was trained on the kalixlouiis/myanmar-written-spoken-classification dataset, available on Hugging Face Datasets:
- Dataset: kalixlouiis/myanmar-written-spoken-classification
- Description: The dataset consists of Myanmar sentences labeled as either "written" or "spoken". It contains 2400 sentences, equally split (1200 each) between the two categories. The dataset has been preprocessed to remove duplicate entries.
- Dataset split: 80/20 (train/test), stratified by label.
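A stratified 80/20 split like the one described above can be sketched with scikit-learn's `train_test_split`. The placeholder sentences and the `random_state` value are assumptions for illustration; the seed used for the actual split is not stated here.

```python
from sklearn.model_selection import train_test_split

# Hypothetical placeholder data: 10 spoken (0) and 10 written (1) sentences.
texts = [f"sentence {i}" for i in range(20)]
labels = [0] * 10 + [1] * 10

# stratify=labels keeps the class ratio the same in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.2,     # 80/20 train/test
    stratify=labels,
    random_state=42,   # assumed seed, only for reproducibility of this sketch
)

print(len(X_train), len(X_test))
```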
Model Performance
The model achieved the following performance on the test set:
Accuracy: 1.0000
Classification Report:
              precision    recall  f1-score   support
  spoken (0)       1.00      1.00      1.00       240
 written (1)       1.00      1.00      1.00       241
    accuracy                           1.00       481
   macro avg       1.00      1.00      1.00       481
weighted avg       1.00      1.00      1.00       481
Confusion Matrix:
                Predicted
                spoken | written |
True spoken  |     240 |       0 |
     written |       0 |     241 |
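The figures in the classification report follow directly from the confusion matrix. The sketch below recomputes accuracy, precision, recall, and F1 from the four counts reported above; the helper name `per_class_metrics` is hypothetical.

```python
# Rows are true classes, columns are predicted classes.
# Index 0 = spoken, index 1 = written (counts taken from the matrix above).
confusion = [
    [240, 0],
    [0, 241],
]

def per_class_metrics(cm, k):
    """Return (precision, recall, f1) for class index k."""
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(len(cm))) - tp  # predicted k, truly another class
    fn = sum(cm[k]) - tp                             # truly k, predicted another class
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(len(confusion))) / total

print(f"accuracy: {accuracy:.4f}")
for idx, name in enumerate(["spoken", "written"]):
    p, r, f = per_class_metrics(confusion, idx)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```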
- Interpretation: The model demonstrates perfect accuracy on the test set, correctly classifying all sentences. Precision, recall, and F1-score are all 1.00 for both classes, indicating excellent performance in distinguishing between written and spoken styles.
- Important Note: Perfect test accuracy does not by itself show that the model generalizes. It may indicate overfitting to this dataset or a test set that closely resembles the training data. Further evaluation is recommended, such as k-fold cross-validation on the full dataset and testing on a larger, more diverse set of sentences.
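One way to run the cross-validation suggested above is sketched below. It is not the author's evaluation code: the toy data, the 5-fold setting, and the seed are assumptions. The vectorizer sits inside the pipeline so that TF-IDF statistics are computed from the training folds only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the real dataset.
texts = [f"w{i} formal" for i in range(10)] + [f"s{i} casual" for i in range(10)]
labels = [1] * 10 + [0] * 10  # 1 = written, 0 = spoken

pipeline = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(C=100, solver="liblinear"),
)

# Stratified folds preserve the class balance in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")

print(f"mean accuracy: {scores.mean():.3f} (std: {scores.std():.3f})")
```

A large gap between fold scores, or between cross-validated and single-split accuracy, would be a sign that the single test split is not representative.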
Hyperparameters
The following hyperparameters were used for the Logistic Regression model, determined through GridSearchCV:
- C: 100
- penalty: 'l2'
- solver: 'liblinear'
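A grid search of this kind might look like the sketch below. The parameter grid, the fold count, and the toy data are assumptions; the grid actually searched is not listed in this README.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the real training set.
texts = [f"w{i} formal" for i in range(10)] + [f"s{i} casual" for i in range(10)]
labels = [1] * 10 + [0] * 10  # 1 = written, 0 = spoken

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name,
# hence the "logisticregression__" prefix. The values are assumed.
param_grid = {
    "logisticregression__C": [0.1, 1, 10, 100],
    "logisticregression__solver": ["liblinear", "lbfgs"],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```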
How to Use
Prerequisites:
- Python 3.6 or higher
- scikit-learn
- joblib
- pandas (optional, for loading data from CSV)
- huggingface_hub (optional, for downloading the model from the Hugging Face Hub)
Installation (if needed):
pip install scikit-learn joblib pandas huggingface_hub
Loading the Model and Vectorizer:
import joblib
import os

# Replace with the actual path to your model files. If you've uploaded
# to the Hugging Face Hub, see "How to Load from Hugging Face Hub" below.
MODEL_DIR = "path/to/your/model"  # Example: "myanmar-written-spoken-classifier"

model = joblib.load(os.path.join(MODEL_DIR, "model.joblib"))
vectorizer = joblib.load(os.path.join(MODEL_DIR, "vectorizer.joblib"))
Classifying New Text:
def classify_text(text, model, vectorizer):
    """Classifies a Myanmar sentence as written or spoken.

    Args:
        text: The input sentence (string).
        model: The loaded Logistic Regression model.
        vectorizer: The loaded TF-IDF vectorizer.

    Returns:
        "written" or "spoken" (string).
    """
    text_tfidf = vectorizer.transform([text])
    prediction = model.predict(text_tfidf)[0]
    return "written" if prediction == 1 else "spoken"

# Example usage:
new_text = "<Myanmar sentence to classify>"  # replace with your own text
result = classify_text(new_text, model, vectorizer)
print(f"'{new_text}' is classified as: {result}")
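When classifying several sentences, it can help to see how confident the model is. The sketch below uses `predict_proba` to return a label and the probability of the "written" class for each input. The function name `classify_batch` is hypothetical, and the tiny model trained inline is only a stand-in for the loaded `model` and `vectorizer`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def classify_batch(texts, model, vectorizer):
    """Return (text, label, probability_of_written) for each input sentence."""
    features = vectorizer.transform(texts)
    probabilities = model.predict_proba(features)
    # Find the column that holds class 1 (written).
    written_col = list(model.classes_).index(1)
    results = []
    for text, row in zip(texts, probabilities):
        p_written = float(row[written_col])
        label = "written" if p_written > 0.5 else "spoken"
        results.append((text, label, p_written))
    return results

# Stand-in model (assumption): replace with the objects loaded via joblib.
toy_texts = ["aa formal", "bb formal", "aa casual", "bb casual"]
toy_labels = [1, 1, 0, 0]  # 1 = written, 0 = spoken
vectorizer = TfidfVectorizer().fit(toy_texts)
model = LogisticRegression(C=100, solver="liblinear").fit(
    vectorizer.transform(toy_texts), toy_labels
)

results = classify_batch(["cc formal", "cc casual"], model, vectorizer)
for text, label, p_written in results:
    print(f"{text!r}: {label} (p_written={p_written:.2f})")
```

Probabilities close to 0.5 flag sentences the model is unsure about, which are good candidates for manual review.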
How to Load from Hugging Face Hub
from huggingface_hub import hf_hub_download
import joblib
import os
# Replace "kalixlouiis/myanmar-written-spoken-classifier" with your actual repo ID.
REPO_ID = "kalixlouiis/myanmar-written-spoken-classifier"
MODEL_FILENAME = "model.joblib"
VECTORIZER_FILENAME = "vectorizer.joblib"
LOCAL_DIR = "downloaded_model" # Or any directory you prefer
# Create the local directory if it doesn't exist
os.makedirs(LOCAL_DIR, exist_ok=True)
# Download the model and vectorizer
try:
    hf_hub_download(repo_id=REPO_ID, filename=MODEL_FILENAME, local_dir=LOCAL_DIR)
    hf_hub_download(repo_id=REPO_ID, filename=VECTORIZER_FILENAME, local_dir=LOCAL_DIR)
    print("Model and vectorizer downloaded successfully.")
except Exception as e:
    print(f"Error downloading files: {e}")
    # Re-raise so the load step below does not run against missing files.
    raise
# Load the model and vectorizer
model = joblib.load(os.path.join(LOCAL_DIR, MODEL_FILENAME))
vectorizer = joblib.load(os.path.join(LOCAL_DIR, VECTORIZER_FILENAME))
# Now you can use the model and vectorizer as shown in the previous section.
Ethical Considerations
- Bias: The model's performance and predictions may reflect biases present in the training data. It's crucial to be aware of potential biases, especially if the dataset is limited in scope or representation.
- Fairness: Evaluate the model's performance across different demographic groups (e.g., speakers from different regions) to ensure fairness.
- Transparency: This README provides transparency about the model's architecture, training data, and performance.
- Accountability: I, kalixlouiis, am the creator of this model and am responsible for addressing any issues or concerns related to its use.
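The fairness check mentioned above can be approximated by computing accuracy separately for each group. The sketch below assumes a per-sentence group annotation is available; the group names and sample values are hypothetical, since the dataset described here does not list such a field.

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Return a dict mapping each group to its accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}

# Hypothetical labels, predictions, and group annotations.
y_true = [1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0]
groups = ["region_a", "region_a", "region_a", "region_b", "region_b", "region_b"]

per_group = accuracy_by_group(y_true, y_pred, groups)
for group, acc in sorted(per_group.items()):
    print(f"{group}: {acc:.2f}")
```

A noticeably lower score for one group would suggest that group is under-represented in the training data.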
Future Improvements
- Overfitting Mitigation: Report k-fold cross-validation scores and try stronger regularization (a smaller C; the current value of 100 regularizes only weakly) to address potential overfitting.
- Dataset Expansion: Increase the size and diversity of the training data to improve generalization.
- Error Analysis: Analyze misclassified examples to identify areas for improvement.
- Explore Alternative Models: Consider experimenting with other machine learning models, including deep learning approaches (e.g., Transformers), for potentially improved performance.
- Add example usage: Include additional examples demonstrating how to use the model, including handling potential errors.
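As a starting point for the error analysis item above, the sketch below collects every sentence whose predicted label differs from its true label. The function name `find_misclassified` and the sample inputs are hypothetical; the label order follows the 0 = spoken, 1 = written encoding used by the model.

```python
def find_misclassified(texts, y_true, y_pred, label_names=("spoken", "written")):
    """Return one record per sentence where the prediction is wrong."""
    errors = []
    for text, truth, pred in zip(texts, y_true, y_pred):
        if truth != pred:
            errors.append({
                "text": text,
                "true": label_names[truth],
                "predicted": label_names[pred],
            })
    return errors

# Hypothetical example inputs.
sample_texts = ["sentence one", "sentence two", "sentence three"]
sample_true = [1, 0, 1]
sample_pred = [1, 1, 1]

errors = find_misclassified(sample_texts, sample_true, sample_pred)
for err in errors:
    print(f"{err['text']!r}: true={err['true']} predicted={err['predicted']}")
```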
License
This project is licensed under the MIT License - see the LICENSE file for details.