Gaussian Mixture Model (GMM) for Imbalanced Classification
This project implements a Gaussian Mixture Model (GMM)-based classifier designed to handle extremely imbalanced classification problems. It simulates real-world imbalance scenarios and benchmarks the classifier on three public datasets.
Problem Statement
Many real-world classification tasks (e.g., fraud detection, rare disease diagnosis) suffer from minority class scarcity. Classifiers trained on such data often learn decision boundaries biased toward the majority class and miss minority examples.
This project demonstrates how GMM-based generative classifiers, when combined with imbalance handling (e.g., undersampling), can improve minority class detection, especially in low-data regimes.
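To make the idea of a GMM-based generative classifier concrete, here is a minimal sketch that fits one mixture per class and assigns each sample to the class whose mixture gives the highest log-likelihood. The class name `PerClassGMM`, its parameters, and the toy data are assumptions for illustration; the actual logic in `gmm_classifier.py` may differ.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


class PerClassGMM:
    """Hypothetical generative classifier: one GaussianMixture per class."""

    def __init__(self, n_components=2, random_state=0):
        self.n_components = n_components
        self.random_state = random_state

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.classes_ = np.unique(y)
        self.models_ = {}
        self.log_priors_ = {}
        for c in self.classes_:
            X_c = X[y == c]
            # Cap the component count so a tiny minority class can still be fitted.
            k = min(self.n_components, len(X_c))
            self.models_[c] = GaussianMixture(
                n_components=k,
                covariance_type="diag",
                reg_covar=1e-3,
                random_state=self.random_state,
            ).fit(X_c)
            # Empirical class frequency, used only when use_priors=True.
            self.log_priors_[c] = np.log(len(X_c) / len(X))
        return self

    def predict(self, X, use_priors=False):
        X = np.asarray(X)
        # One column per class: log p(x | c), optionally plus log p(c).
        scores = np.column_stack([
            self.models_[c].score_samples(X)
            + (self.log_priors_[c] if use_priors else 0.0)
            for c in self.classes_
        ])
        return self.classes_[np.argmax(scores, axis=1)]


# Toy imbalanced data (assumption): 500 majority rows, 20 minority rows.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(500, 2)),
    rng.normal(6.0, 1.0, size=(20, 2)),
])
y = np.array([0] * 500 + [1] * 20)

clf = PerClassGMM().fit(X, y)
pred = clf.predict(X)
minority_recall = np.mean(pred[y == 1] == 1)
probe = clf.predict(np.array([[0.0, 0.0], [6.0, 6.0]]))
```

Leaving `use_priors=False` compares class likelihoods directly, so the rare class is not penalized for being rare. Setting it to `True` gives the usual Bayes rule, which favors the majority class when the likelihoods are close.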
Datasets Used
- Breast Cancer Wisconsin Dataset (sklearn.datasets.load_breast_cancer)
- Credit Card Fraud Detection (OpenML 42175)
- Adult Income Dataset (OpenML 1590)
Key Features
- GMM classifier per class
- Controlled imbalance sampling
- Evaluation: F1-macro, balanced accuracy
- Multi-dataset benchmark
- Hugging Face integration for model sharing
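As an illustration of controlled imbalance sampling, the sketch below subsamples one class to a fixed number of rows. The same helper can simulate a rare minority class or undersample the majority before training. The function name `subsample_class` and its parameters are hypothetical and are not taken from `imbalance_sampler.py`.

```python
import numpy as np


def subsample_class(X, y, label, n_keep, seed=0):
    """Keep at most `n_keep` random rows of class `label`; keep all other rows."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    idx_label = np.flatnonzero(y == label)
    idx_other = np.flatnonzero(y != label)
    n_keep = min(n_keep, len(idx_label))
    kept = rng.choice(idx_label, size=n_keep, replace=False)
    # Sort so the surviving rows keep their original relative order.
    idx = np.sort(np.concatenate([idx_other, kept]))
    return X[idx], y[idx]


# Toy data (assumption): 200 rows per class; row i holds the values [2i, 2i + 1].
X = np.arange(800, dtype=float).reshape(400, 2)
y = np.array([0] * 200 + [1] * 200)

# Simulate a 20:1 imbalance by keeping only 10 minority rows.
X_imb, y_imb = subsample_class(X, y, label=1, n_keep=10)

# Undersample the majority to 30 rows before training.
X_bal, y_bal = subsample_class(X_imb, y_imb, label=0, n_keep=30)
```

Fixing `seed` makes each imbalance scenario reproducible across benchmark runs.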
Usage
Install dependencies
pip install -r requirements.txt
Run benchmark
python benchmark.py
Output
- Classification report
- Confusion matrix
- Balanced accuracy
- F1-score (macro)
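The four outputs listed above can all be produced with `sklearn.metrics`. The sketch below uses made-up labels to show why balanced accuracy and macro F1 are more informative than plain accuracy on imbalanced data. It is an illustration only and does not reproduce `evaluate.py`.

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
)

# Made-up labels (assumption): 8 majority samples, 2 minority samples.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print(classification_report(y_true, y_pred, digits=3))

cm = confusion_matrix(y_true, y_pred)                  # [[7, 1], [1, 1]]
acc = accuracy_score(y_true, y_pred)                   # 0.8
bal_acc = balanced_accuracy_score(y_true, y_pred)      # 0.6875
macro_f1 = f1_score(y_true, y_pred, average="macro")   # 0.6875
```

Plain accuracy is 0.8 even though half of the minority samples are missed. Balanced accuracy averages the per-class recalls (0.875 and 0.5), so the weak minority recall pulls the score down to 0.6875.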
Project Structure
gmm-minority-classification/
├── gmm_classifier.py         # GMM model logic
├── data_loader.py            # Dataset loaders (3 total)
├── imbalance_sampler.py      # Undersampling function
├── benchmark.py              # Multi-dataset test harness
├── evaluate.py               # Metric evaluation functions
├── push_to_huggingface.py    # Upload model to HF hub
├── requirements.txt
├── README.md
└── .gitignore
Research & Citations
This project builds on the following research:
GMM & Probabilistic Models
- Dempster et al. (1977): Maximum likelihood via EM
- Bishop, C. (2006): Pattern Recognition and Machine Learning
- McLachlan & Peel (2000): Finite Mixture Models
- Reynolds et al. (2009): Gaussian Mixture Modeling for Classification
- Bouveyron et al. (2007): High-dimensional GMM classification
Imbalanced Classification
- Chawla et al. (2002): SMOTE
- He & Garcia (2009): Learning from Imbalanced Data
- Japkowicz (2000): The Class Imbalance Problem: A Historical Perspective
- Buda et al. (2018): A systematic study of class imbalance
- Liu et al. (2009): EasyEnsemble and BalanceCascade
Evaluation Metrics
- Sokolova & Lapalme (2009): A systematic analysis of performance measures
- Van Rijsbergen (1979): Information Retrieval (F-measure origin)
Dataset Papers
- Dua & Graff (2019): UCI Machine Learning Repository
- Lichman (2013): Adult Dataset
- Dal Pozzolo et al. (2015): Credit Card Fraud Dataset
Recent Works & Variants
- Loquercio et al. (2020): Generative Models for Anomaly Detection
- Roy et al. (2022): GMM on Tabular Data
- Fuchs et al. (2023): Robust GMM Variants
- Ren et al. (2023): Mixture of Experts for Class Imbalance
- Guo et al. (2021): Bayesian GMMs in Skewed Data
- Cao et al. (2021): Confidence-aware GMMs
- Wang et al. (2023): Deep Mixture Models for Rare Class Learning
- Han et al. (2022): Label Noise and GMM
- Kim et al. (2022): Hybrid GMM for Multi-Class Tabular Data
- Cortes et al. (2025): Margin-aware Mixture Models
Push to Hugging Face
To publish the trained GMM:
huggingface-cli login
python push_to_huggingface.py
You can also create the model repository from the command line:
huggingface-cli repo create gmm-imbalance-model --type=model
Authors
Saurav Singla
github.com/sauravsingla
Pretrained GMM Model
We provide a pretrained Gaussian Mixture Model as gmm_pretrained_model.pkl inside this repository.
Load Model in Python
from joblib import load
# Load from local file
model_bundle = load("gmm_pretrained_model.pkl")
scaler = model_bundle["scaler"]
model_0 = model_bundle["model_0"]
model_1 = model_bundle["model_1"]
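# Predict: X_new is a 2D array of new samples with the same feature columns used in training
X_scaled = scaler.transform(X_new)
# Per-sample log-likelihood under each class model (assuming scikit-learn GaussianMixture objects)
score_0 = model_0.score_samples(X_scaled)
score_1 = model_1.score_samples(X_scaled)
# Assign class 1 wherever its model gives the higher log-likelihood
y_pred = (score_1 > score_0).astype(int)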
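For context, a bundle with the same three keys (`scaler`, `model_0`, `model_1`) could be produced along the following lines. This is a sketch under assumptions: the synthetic data, the single-component mixtures, and the file name `gmm_bundle_demo.pkl` are illustrative, and the shipped `gmm_pretrained_model.pkl` may have been trained differently.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic training data (assumption): 300 majority rows, 30 minority rows.
rng = np.random.default_rng(0)
X_train = np.vstack([
    rng.normal(0.0, 1.0, size=(300, 3)),
    rng.normal(4.0, 1.0, size=(30, 3)),
])
y_train = np.array([0] * 300 + [1] * 30)

# Fit the scaler on all rows, then one mixture per class on the scaled rows.
scaler = StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)
bundle = {
    "scaler": scaler,
    "model_0": GaussianMixture(n_components=1, random_state=0).fit(X_scaled[y_train == 0]),
    "model_1": GaussianMixture(n_components=1, random_state=0).fit(X_scaled[y_train == 1]),
}

# Save to a temporary directory and load it back.
path = os.path.join(tempfile.mkdtemp(), "gmm_bundle_demo.pkl")
dump(bundle, path)
restored = load(path)

# Score two new points with the restored bundle.
X_new = np.array([[0.0, 0.0, 0.0], [4.0, 4.0, 4.0]])
X_new_scaled = restored["scaler"].transform(X_new)
y_pred = (
    restored["model_1"].score_samples(X_new_scaled)
    > restored["model_0"].score_samples(X_new_scaled)
).astype(int)
```

Storing the scaler together with the models keeps preprocessing and scoring in sync, since the mixtures are only meaningful on inputs scaled the same way as the training data.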
Load from Hugging Face
from huggingface_hub import hf_hub_download
from joblib import load
# Download the bundle from the Hub (replace YOUR_USERNAME with your account name)
model_path = hf_hub_download(repo_id="YOUR_USERNAME/gmm-imbalance-model", filename="gmm_pretrained_model.pkl")
model_bundle = load(model_path)