# Toxic Comment Classification using Deep Learning

A multilingual toxic comment classification system built on a language-aware XLM-RoBERTa transformer, with mixed-precision, language-weighted training.

## Architecture Overview

### Core Components

#### LanguageAwareTransformer
- Base: XLM-RoBERTa Large
- Custom language-aware attention mechanism
- Gating mechanism for feature fusion
- Language-specific dropout rates
- Support for 7 languages with English fallback (see the sketch below)
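A minimal sketch of how these pieces fit together, assuming a pooled CLS representation, a 64-dimensional language embedding, and illustrative per-language dropout rates (the actual layer layout lives in `model/language_aware_transformer.py`):

```python
import torch
import torch.nn as nn
from transformers import XLMRobertaModel

LANG2ID = {"en": 0, "ru": 1, "tr": 2, "es": 3, "fr": 4, "it": 5, "pt": 6}

class LanguageAwareTransformer(nn.Module):
    """Sketch: XLM-RoBERTa Large backbone with a language embedding fused into the head."""

    def __init__(self, num_labels=6, lang_embed_dim=64, base_dropout=0.1):
        super().__init__()
        self.backbone = XLMRobertaModel.from_pretrained("xlm-roberta-large")
        hidden = self.backbone.config.hidden_size            # 1024 for the large model
        self.lang_embed = nn.Embedding(len(LANG2ID), lang_embed_dim)
        # Hypothetical per-language dropout rates; unknown IDs fall back to the base rate.
        self.lang_dropout = {0: 0.10, 1: 0.12, 2: 0.12, 3: 0.11, 4: 0.11, 5: 0.12, 6: 0.12}
        self.base_dropout = base_dropout
        self.classifier = nn.Sequential(
            nn.Linear(hidden + lang_embed_dim, 512),
            nn.GELU(),
            nn.Linear(512, num_labels),
        )

    def forward(self, input_ids, attention_mask, lang_ids):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]                  # [batch, hidden] (CLS token)
        lang_vec = self.lang_embed(lang_ids)                  # [batch, lang_embed_dim]
        # Language-specific dropout (one rate per batch for simplicity).
        p = self.lang_dropout.get(int(lang_ids[0]), self.base_dropout)
        pooled = nn.functional.dropout(pooled, p=p, training=self.training)
        return self.classifier(torch.cat([pooled, lang_vec], dim=-1))   # [batch, 6]
```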
#### ToxicDataset
- Efficient caching system
- Language ID mapping
- Memory pinning for CUDA optimization
- Automatic handling of missing values (see the sketch below)
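A minimal sketch of such a dataset, assuming a pandas DataFrame in the input format described under Data Processing; the column names and language mapping come from this README, while the cache layout is illustrative:

```python
import os
import torch
import pandas as pd
from torch.utils.data import Dataset

LANG2ID = {"en": 0, "ru": 1, "tr": 2, "es": 3, "fr": 4, "it": 5, "pt": 6}
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

class ToxicDataset(Dataset):
    """Tokenize once, cache the resulting tensors, and map language codes to IDs."""

    def __init__(self, df, tokenizer, max_len=128, cache_path=None):
        if cache_path is not None and os.path.exists(cache_path):
            # Reuse previously processed tensors instead of re-tokenizing.
            self.input_ids, self.attention_mask, self.lang_ids, self.labels = torch.load(cache_path)
            return
        texts = df["comment_text"].fillna("").astype(str).tolist()        # handle missing text
        self.labels = torch.tensor(df[LABELS].fillna(0).values, dtype=torch.float)
        self.lang_ids = torch.tensor(
            [LANG2ID.get(lang, LANG2ID["en"]) for lang in df["lang"].fillna("en")],
            dtype=torch.long,
        )
        enc = tokenizer(texts, truncation=True, max_length=max_len,
                        padding="max_length", return_tensors="pt")
        self.input_ids, self.attention_mask = enc["input_ids"], enc["attention_mask"]
        if cache_path is not None:
            torch.save((self.input_ids, self.attention_mask, self.lang_ids, self.labels), cache_path)

    def __len__(self):
        return self.labels.shape[0]

    def __getitem__(self, idx):
        return {"input_ids": self.input_ids[idx], "attention_mask": self.attention_mask[idx],
                "lang_ids": self.lang_ids[idx], "labels": self.labels[idx]}
```

Memory pinning itself is a `DataLoader` setting, e.g. `DataLoader(dataset, batch_size=32, pin_memory=True, num_workers=4)`, paired with `tensor.to(device, non_blocking=True)` in the training loop.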
#### Training System
- Mixed precision training (BF16/FP16; see the loop sketch after this list)
- Gradient accumulation
- Language-aware loss weighting
- Distributed training support
- Automatic threshold optimization
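A condensed sketch of that training loop, assuming the model and batch layout from the sketches above; plain BCE stands in here for the language-weighted focal loss described under Training Features:

```python
import torch
import torch.nn.functional as F

def train_one_epoch(model, loader, optimizer, scheduler, device,
                    grad_accum_steps=2, max_grad_norm=1.0):
    """Mixed-precision training with gradient accumulation and clipping."""
    amp_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    scaler = torch.cuda.amp.GradScaler(enabled=(amp_dtype == torch.float16))
    model.train()
    optimizer.zero_grad(set_to_none=True)

    for step, batch in enumerate(loader):
        batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
        with torch.autocast(device_type="cuda", dtype=amp_dtype):
            logits = model(batch["input_ids"], batch["attention_mask"], batch["lang_ids"])
            # Divide by the accumulation steps so the accumulated gradient matches a large batch.
            loss = F.binary_cross_entropy_with_logits(logits, batch["labels"]) / grad_accum_steps
        scaler.scale(loss).backward()

        if (step + 1) % grad_accum_steps == 0:
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            scaler.step(optimizer)
            scaler.update()
            scheduler.step()
            optimizer.zero_grad(set_to_none=True)
```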
### Key Features

#### Language Awareness
- Language-specific embeddings
- Dynamic dropout rates per language
- Language-aware attention mechanism
- Automatic fallback to English for unsupported languages
#### Performance Optimization
- Gradient checkpointing
- Memory-efficient attention (see the snippet after this list)
- Automatic mixed precision
- Caching system for processed data
- CUDA optimization with memory pinning
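Two of these optimizations in isolation, as a sketch (function names are illustrative, not the repository's API):

```python
import torch.nn.functional as F
from transformers import XLMRobertaModel

def build_memory_efficient_backbone(name="xlm-roberta-large"):
    """Gradient checkpointing trades extra compute for a much smaller activation footprint."""
    backbone = XLMRobertaModel.from_pretrained(name)
    backbone.gradient_checkpointing_enable()
    return backbone

def attend(q, k, v, attn_mask=None):
    """Memory-efficient attention: PyTorch's SDPA dispatches to flash / memory-efficient
    kernels when available. q, k, v: [batch, heads, seq_len, head_dim]."""
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```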
#### Training Features
- Weighted focal loss with language awareness (see the sketch after this list)
- Dynamic threshold optimization
- Early stopping with patience
- Gradient flow monitoring
- Comprehensive metric tracking
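A sketch of a language-weighted multi-label focal loss; the per-language weights are assumed to be precomputed (e.g. from inverse language frequency), `gamma=2.0` is a typical choice rather than a value from this repository, and the smoothing default matches the `label_smoothing` hyperparameter below:

```python
import torch
import torch.nn.functional as F

def language_weighted_focal_loss(logits, targets, lang_ids, lang_weights,
                                 gamma=2.0, label_smoothing=0.01):
    """logits, targets: [batch, num_labels]; lang_ids: [batch];
    lang_weights: [num_languages] tensor of per-language loss weights."""
    targets = targets * (1 - label_smoothing) + 0.5 * label_smoothing
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                    # model's probability for the (smoothed) target
    focal = (1 - p_t) ** gamma * bce         # down-weight easy examples
    per_sample = focal.mean(dim=1)           # [batch]
    return (per_sample * lang_weights[lang_ids]).mean()
```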
## Data Processing

### Input Format

```python
{
    'comment_text': str,     # The text to classify
    'lang': str,             # Language code (en, ru, tr, es, fr, it, pt)
    'toxic': int,            # Binary labels for each category
    'severe_toxic': int,
    'obscene': int,
    'threat': int,
    'insult': int,
    'identity_hate': int
}
```
### Language Support
- Primary: en, ru, tr, es, fr, it, pt
- Default fallback: en (English)
- Language ID mapping: `{en: 0, ru: 1, tr: 2, es: 3, fr: 4, it: 5, pt: 6}` (see the helper below)
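A small helper capturing the fallback rule:

```python
LANG2ID = {"en": 0, "ru": 1, "tr": 2, "es": 3, "fr": 4, "it": 5, "pt": 6}

def lang_to_id(lang):
    """Map a language code to its ID; unsupported or missing codes fall back to English."""
    return LANG2ID.get((lang or "en").strip().lower(), LANG2ID["en"])
```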
## Model Architecture

### Base Model
- XLM-RoBERTa Large
- Hidden size: 1024
- Attention heads: 16
- Max sequence length: 128 (see the loading snippet below)
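These numbers can be confirmed directly from the Hugging Face checkpoint (a quick sanity check, not part of the training code):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
backbone = AutoModel.from_pretrained("xlm-roberta-large")

enc = tokenizer("An example comment", truncation=True, max_length=128,
                padding="max_length", return_tensors="pt")

print(backbone.config.hidden_size)           # 1024
print(backbone.config.num_attention_heads)   # 16
print(enc["input_ids"].shape)                # torch.Size([1, 128])
```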
### Custom Components

- **Language-Aware Classifier**
  - Input: hidden states `[batch_size, hidden_size]`
  - Language embeddings: `[batch_size, 64]`
  - Projection: `hidden_size + 64 -> 512`
  - Output: 6 toxicity predictions
- **Language-Aware Attention** (see the sketch after this list)
  - Input: hidden states + language embeddings
  - Scaled dot-product attention
  - Gating mechanism for feature fusion
  - Memory-efficient implementation
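A minimal sketch of the attention-plus-gating fusion; the exact layer layout in `language_aware_transformer.py` may differ, and the layer names here are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageAwareAttention(nn.Module):
    """Language embeddings form the query; a gate blends the attended features with CLS."""

    def __init__(self, hidden_size=1024, lang_embed_dim=64):
        super().__init__()
        self.query = nn.Linear(lang_embed_dim, hidden_size)
        self.gate = nn.Linear(hidden_size + lang_embed_dim, hidden_size)

    def forward(self, hidden_states, attention_mask, lang_embeds):
        # hidden_states: [batch, seq_len, hidden]; lang_embeds: [batch, lang_embed_dim]
        q = self.query(lang_embeds).unsqueeze(1)              # [batch, 1, hidden]
        mask = attention_mask[:, None, None, :].bool()        # [batch, 1, 1, seq_len]
        # Memory-efficient scaled dot-product attention over the token sequence.
        pooled = F.scaled_dot_product_attention(
            q.unsqueeze(1), hidden_states.unsqueeze(1), hidden_states.unsqueeze(1),
            attn_mask=mask,
        ).squeeze(1).squeeze(1)                                # [batch, hidden]
        # Gating mechanism: blend the attended features with the CLS token.
        cls = hidden_states[:, 0]
        g = torch.sigmoid(self.gate(torch.cat([cls, lang_embeds], dim=-1)))
        return g * pooled + (1 - g) * cls                      # [batch, hidden]
```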
## Training Configuration

### Hyperparameters

```json
{
    "batch_size": 32,
    "grad_accum_steps": 2,
    "epochs": 4,
    "lr": 2e-5,
    "weight_decay": 0.01,
    "warmup_ratio": 0.1,
    "label_smoothing": 0.01,
    "model_dropout": 0.1,
    "freeze_layers": 2
}
```
### Optimization
- Optimizer: AdamW
- Learning rate scheduler: cosine with warmup (see the setup snippet after this list)
- Mixed precision: BF16/FP16
- Gradient clipping: 1.0
- Gradient accumulation steps: 2
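A sketch of the corresponding setup using the hyperparameters listed above (the helper name is illustrative):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

def build_optimizer_and_scheduler(model, num_training_steps,
                                  lr=2e-5, weight_decay=0.01, warmup_ratio=0.1):
    """AdamW plus a cosine schedule with linear warmup over the first 10% of steps."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_ratio * num_training_steps),
        num_training_steps=num_training_steps,
    )
    return optimizer, scheduler

# Effective batch size = batch_size * grad_accum_steps = 32 * 2 = 64
# num_training_steps ≈ (len(train_loader) // grad_accum_steps) * epochs
```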
## Metrics and Monitoring

### Training Metrics
- Loss (per language)
- AUC-ROC (macro)
- Precision, Recall, F1
- Language-specific metrics
- Gradient norms
- Memory usage
### Validation Metrics
- AUC-ROC (per class and language)
- Optimal thresholds per language (see the sketch after this list)
- Critical class performance (threat, identity_hate)
- Distribution shift monitoring
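A sketch of the per-class threshold search on validation predictions; running it on each language slice yields language-specific thresholds (the grid and the F1 criterion are assumptions):

```python
import numpy as np
from sklearn.metrics import f1_score

def optimal_thresholds(y_true, y_prob, grid=np.linspace(0.05, 0.95, 19)):
    """y_true, y_prob: [num_samples, num_classes]; returns one threshold per class."""
    thresholds = []
    for c in range(y_true.shape[1]):
        scores = [f1_score(y_true[:, c], (y_prob[:, c] >= t).astype(int), zero_division=0)
                  for t in grid]
        thresholds.append(float(grid[int(np.argmax(scores))]))
    return thresholds
```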
## Usage

### Training

```bash
python model/train.py
```

### Inference

```python
from model.predict import predict_toxicity

results = predict_toxicity(
    text="Your text here",
    model=model,
    tokenizer=tokenizer,
    config=config
)
```
## Code Structure

```
model/
├── language_aware_transformer.py   # Core model architecture
├── train.py                        # Training loop and utilities
├── predict.py                      # Inference utilities
├── evaluation/
│   ├── evaluate.py                 # Evaluation functions
│   └── threshold_optimizer.py      # Dynamic threshold optimization
├── data/
│   └── sampler.py                  # Custom sampling strategies
└── training_config.py              # Configuration management
```
## AI/ML Specific Notes

### Tensor Shapes
- Input IDs: `[batch_size, seq_len]`
- Attention mask: `[batch_size, seq_len]`
- Language IDs: `[batch_size]`
- Hidden states: `[batch_size, seq_len, hidden_size]`
- Language embeddings: `[batch_size, embed_dim]` (see the shape checks below)
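A minimal set of assertions for these invariants, assuming the batch dictionary produced by the dataset sketch above:

```python
import torch

def check_batch_shapes(batch, seq_len=128, num_labels=6):
    """Fail fast on the shape and dtype invariants listed above."""
    b = batch["input_ids"].shape[0]
    assert batch["input_ids"].shape == (b, seq_len)
    assert batch["attention_mask"].shape == (b, seq_len)
    assert batch["lang_ids"].shape == (b,)         # one language ID per sample
    assert batch["lang_ids"].dtype == torch.long   # nn.Embedding expects int64 indices
    assert batch["labels"].shape == (b, num_labels)
```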
### Critical Components
- Language ID handling in forward pass
- Attention mask shape management
- Memory-efficient attention implementation
- Gradient flow in language-aware components
### Performance Considerations
- Cache management for processed data
- Memory pinning for GPU transfers
- Gradient accumulation for large batches
- Language-specific dropout rates
### Error Handling
- Language ID validation (see the sketch after this list)
- Shape compatibility checks
- Gradient norm monitoring
- Device placement verification
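A sketch of how these checks can be made explicit at the top of the forward/predict path (names are illustrative):

```python
import torch

NUM_LANGUAGES = 7  # en, ru, tr, es, fr, it, pt

def validate_inputs(input_ids, attention_mask, lang_ids, model):
    """Defensive checks: language IDs, shape compatibility, and device placement."""
    if lang_ids.dtype != torch.long:
        lang_ids = lang_ids.long()
    # Out-of-range language IDs fall back to English (ID 0) instead of crashing the embedding lookup.
    lang_ids = torch.where((lang_ids < 0) | (lang_ids >= NUM_LANGUAGES),
                           torch.zeros_like(lang_ids), lang_ids)
    if input_ids.shape != attention_mask.shape:
        raise ValueError(f"Shape mismatch: {tuple(input_ids.shape)} vs {tuple(attention_mask.shape)}")
    device = next(model.parameters()).device
    return input_ids.to(device), attention_mask.to(device), lang_ids.to(device)
```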
## Notes for AI Systems
When modifying the code:
- Maintain language ID handling in forward pass
- Preserve attention mask shape management
- Keep device consistency checks
- Handle BatchEncoding serialization carefully under PyTorch 2.6+ (`torch.load` now defaults to `weights_only=True`)
Key attention points:
- Language ID tensor shape and type
- Attention mask broadcasting
- Memory-efficient attention implementation
- Gradient flow through language-aware components
Common pitfalls:
- Incorrect attention mask shapes
- Language ID type mismatches
- Memory leaks in caching
- Device inconsistencies