|
# Toxic Comment Classification using Deep Learning |
|
|
|
A multilingual toxic comment classification system using language-aware transformers and advanced deep learning techniques. |
|
|
|
## Architecture Overview
|
|
|
### Core Components |
|
|
|
1. **LanguageAwareTransformer**
   - Base: XLM-RoBERTa Large
   - Custom language-aware attention mechanism
   - Gating mechanism for feature fusion
   - Language-specific dropout rates
   - Support for 7 languages with English fallback

2. **ToxicDataset**
   - Efficient caching system
   - Language ID mapping
   - Memory pinning for CUDA optimization
   - Automatic handling of missing values

3. **Training System**
   - Mixed precision training (BF16/FP16)
   - Gradient accumulation
   - Language-aware loss weighting
   - Distributed training support
   - Automatic threshold optimization
|
|
|
### Key Features |
|
|
|
- **Language Awareness**
  - Language-specific embeddings
  - Dynamic dropout rates per language
  - Language-aware attention mechanism
  - Automatic fallback to English for unsupported languages

- **Performance Optimization**
  - Gradient checkpointing (see the sketches after this list)
  - Memory-efficient attention
  - Automatic mixed precision
  - Caching system for processed data
  - CUDA optimization with memory pinning

- **Training Features**
  - Weighted focal loss with language awareness (see the sketches after this list)
  - Dynamic threshold optimization
  - Early stopping with patience
  - Gradient flow monitoring
  - Comprehensive metric tracking
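
The weighted, language-aware focal loss listed above is not spelled out in this README; the following is a minimal sketch of one plausible shape for it. The per-language weight tensor `lang_weights`, the focusing parameter `gamma`, and the function name are assumptions, not code taken from `train.py`.

```python
import torch
import torch.nn.functional as F

def language_aware_focal_loss(logits, targets, lang_ids, lang_weights, gamma=2.0):
    """Binary focal loss, scaled per sample by a language-specific weight.

    logits, targets: [batch_size, num_labels] (targets as 0/1 floats)
    lang_ids:        [batch_size] integer language IDs
    lang_weights:    [num_languages] tensor, e.g. torch.ones(7)  # assumed
    """
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                   # probability assigned to the true class
    focal = (1.0 - p_t) ** gamma * bce      # down-weight easy examples
    per_sample = focal.mean(dim=1)          # average over the 6 labels
    weights = lang_weights[lang_ids]        # [batch_size]
    return (weights * per_sample).mean()
```

Likewise, the gradient-checkpointing and memory-pinning bullets typically map onto standard PyTorch / transformers calls such as the following; `model` and `train_dataset` are illustrative placeholders.

```python
from torch.utils.data import DataLoader

model.gradient_checkpointing_enable()   # trade extra compute for lower activation memory
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,
    pin_memory=True,                    # pinned host memory speeds up CPU-to-GPU transfers
)
```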
|
|
|
## Data Processing
|
|
|
### Input Format |
|
```python
{
    'comment_text': str,    # The text to classify
    'lang': str,            # Language code (en, ru, tr, es, fr, it, pt)
    'toxic': int,           # Binary labels for each category
    'severe_toxic': int,
    'obscene': int,
    'threat': int,
    'insult': int,
    'identity_hate': int
}
```
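
For concreteness, a single record matching this schema might look like the following; the text and label values are purely illustrative.

```python
example = {
    'comment_text': "Este es un comentario de ejemplo.",
    'lang': 'es',
    'toxic': 0,
    'severe_toxic': 0,
    'obscene': 0,
    'threat': 0,
    'insult': 0,
    'identity_hate': 0,
}
```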
|
|
|
### Language Support |
|
- Primary: en, ru, tr, es, fr, it, pt
- Default fallback: en (English)
- Language ID mapping: `{en: 0, ru: 1, tr: 2, es: 3, fr: 4, it: 5, pt: 6}`
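
The `ToxicDataset` described above handles the language ID mapping, the English fallback, and missing values. The sketch below is an illustrative reading of that description rather than the actual class; the column names follow the input format shown earlier, and everything else is assumed.

```python
import pandas as pd
import torch
from torch.utils.data import Dataset

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
LANG2ID = {"en": 0, "ru": 1, "tr": 2, "es": 3, "fr": 4, "it": 5, "pt": 6}

class ToxicDataset(Dataset):
    def __init__(self, df: pd.DataFrame, tokenizer, max_length: int = 128):
        # Handle missing values: empty text, default language English
        self.df = df.fillna({"comment_text": "", "lang": "en"})
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        enc = self.tokenizer(
            row["comment_text"],
            truncation=True,
            max_length=self.max_length,
            padding="max_length",
            return_tensors="pt",
        )
        return {
            "input_ids": enc["input_ids"].squeeze(0),
            "attention_mask": enc["attention_mask"].squeeze(0),
            # Unsupported languages fall back to English
            "lang_ids": torch.tensor(LANG2ID.get(row["lang"], LANG2ID["en"])),
            "labels": torch.tensor(row[LABELS].astype(float).values, dtype=torch.float),
        }
```

Memory pinning for CUDA would then come from the `DataLoader` (`pin_memory=True`) rather than the dataset itself.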
|
|
|
## Model Architecture
|
|
|
### Base Model |
|
- XLM-RoBERTa Large
- Hidden size: 1024
- Attention heads: 16
- Max sequence length: 128
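
Loading the base encoder with the Hugging Face `transformers` library looks roughly like this; the checkpoint name `xlm-roberta-large` matches the model named above, while the surrounding code is only a usage sketch.

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
encoder = AutoModel.from_pretrained("xlm-roberta-large")

print(encoder.config.hidden_size)           # 1024
print(encoder.config.num_attention_heads)   # 16
```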
|
|
|
### Custom Components |
|
|
|
1. **Language-Aware Classifier** (see the sketch after this list)

   ```text
   - Input: Hidden states [batch_size, hidden_size]
   - Language embeddings: [batch_size, 64]
   - Projection: hidden_size + 64 -> 512
   - Output: 6 toxicity predictions
   ```
|
|
|
2. **Language-Aware Attention** (see the sketch after this list)

   ```text
   - Input: Hidden states + Language embeddings
   - Scaled dot product attention
   - Gating mechanism for feature fusion
   - Memory-efficient implementation
   ```
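
The two sketches below are minimal PyTorch illustrations of the shapes and mechanisms described above, not the code in `language_aware_transformer.py`; class names, layer sizes beyond those stated, and the exact gating formula are assumptions.

A classifier head matching the stated shapes (hidden size 1024, 64-dim language embeddings, projection to 512, 6 outputs):

```python
import torch
import torch.nn as nn

class LanguageAwareClassifier(nn.Module):
    def __init__(self, hidden_size=1024, lang_embed_dim=64, num_languages=7, num_labels=6):
        super().__init__()
        self.lang_embeddings = nn.Embedding(num_languages, lang_embed_dim)
        self.projection = nn.Sequential(
            nn.Linear(hidden_size + lang_embed_dim, 512),
            nn.GELU(),
            nn.Dropout(0.1),
        )
        self.out = nn.Linear(512, num_labels)

    def forward(self, pooled, lang_ids):
        # pooled:   [batch_size, hidden_size]
        # lang_ids: [batch_size]
        lang_emb = self.lang_embeddings(lang_ids)        # [batch_size, 64]
        fused = torch.cat([pooled, lang_emb], dim=-1)    # [batch_size, hidden_size + 64]
        return self.out(self.projection(fused))          # [batch_size, 6] logits
```

And one plausible way to combine scaled dot-product attention with a gate that fuses language features:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageAwareAttention(nn.Module):
    def __init__(self, hidden_size=1024, lang_embed_dim=64, num_heads=16):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.qkv = nn.Linear(hidden_size, 3 * hidden_size)
        self.lang_proj = nn.Linear(lang_embed_dim, hidden_size)
        self.gate = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, hidden, lang_emb, attention_mask=None):
        # hidden: [B, T, H], lang_emb: [B, 64], attention_mask: [B, T] with 1 = keep
        B, T, H = hidden.shape
        q, k, v = self.qkv(hidden).chunk(3, dim=-1)
        shape = (B, T, self.num_heads, self.head_dim)
        q, k, v = (x.view(shape).transpose(1, 2) for x in (q, k, v))
        mask = attention_mask[:, None, None, :].bool() if attention_mask is not None else None
        # Memory-efficient scaled dot-product attention (PyTorch 2.x)
        attn = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
        attn = attn.transpose(1, 2).reshape(B, T, H)
        # Gate fuses the attention output with the projected language embedding
        lang = self.lang_proj(lang_emb).unsqueeze(1).expand(-1, T, -1)
        g = torch.sigmoid(self.gate(torch.cat([attn, lang], dim=-1)))
        return g * attn + (1.0 - g) * lang
```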
|
|
|
## Training Configuration
|
|
|
### Hyperparameters |
|
```python
{
    "batch_size": 32,
    "grad_accum_steps": 2,
    "epochs": 4,
    "lr": 2e-5,
    "weight_decay": 0.01,
    "warmup_ratio": 0.1,
    "label_smoothing": 0.01,
    "model_dropout": 0.1,
    "freeze_layers": 2
}
```
|
|
|
### Optimization |
|
- Optimizer: AdamW
- Learning rate scheduler: Cosine with warmup
- Mixed precision: BF16/FP16
- Gradient clipping: 1.0
- Gradient accumulation steps: 2
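
Wiring these settings together with standard PyTorch / `transformers` utilities would look roughly like the sketch below; it assumes the hyperparameter dict above is available as `cfg`, that `model`, `train_loader`, and `num_training_steps` already exist, and that `compute_loss` is a placeholder helper returning a scalar loss.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=cfg["lr"], weight_decay=cfg["weight_decay"])
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(cfg["warmup_ratio"] * num_training_steps),
    num_training_steps=num_training_steps,
)

for step, batch in enumerate(train_loader):
    # BF16 autocast; an FP16 setup would add a torch.cuda.amp.GradScaler
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = compute_loss(model, batch)            # assumed helper
    (loss / cfg["grad_accum_steps"]).backward()
    if (step + 1) % cfg["grad_accum_steps"] == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient clipping at 1.0
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```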
|
|
|
## Metrics and Monitoring
|
|
|
### Training Metrics |
|
- Loss (per language)
- AUC-ROC (macro)
- Precision, Recall, F1
- Language-specific metrics
- Gradient norms
- Memory usage
|
|
|
### Validation Metrics |
|
- AUC-ROC (per class and language)
- Optimal thresholds per language (see the sketch below)
- Critical class performance (threat, identity_hate)
- Distribution shift monitoring
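
The per-language threshold selection is not documented in detail here; one common approach, sketched below under that assumption, is a grid search that maximizes F1 on validation predictions for each language and class.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_thresholds_per_language(probs, labels, lang_ids, grid=np.linspace(0.05, 0.95, 19)):
    """probs, labels: [n_samples, n_classes]; lang_ids: [n_samples] integer language IDs."""
    thresholds = {}
    for lang in np.unique(lang_ids):
        mask = lang_ids == lang
        per_class = []
        for c in range(probs.shape[1]):
            scores = [f1_score(labels[mask, c], probs[mask, c] >= t, zero_division=0) for t in grid]
            per_class.append(float(grid[int(np.argmax(scores))]))
        thresholds[int(lang)] = per_class
    return thresholds
```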
|
|
|
## Usage
|
|
|
### Training |
|
```bash
python model/train.py
```
|
|
|
### Inference |
|
```python
from model.predict import predict_toxicity

results = predict_toxicity(
    text="Your text here",
    model=model,
    tokenizer=tokenizer,
    config=config
)
```
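
Independent of `predict_toxicity` (whose exact return format is not documented here), raw inference amounts to tokenizing, running the model, and applying a sigmoid plus per-class thresholds. The sketch below assumes a model that accepts `input_ids`, `attention_mask`, and `lang_ids` and returns logits, reuses the `LANG2ID` mapping from the dataset sketch above, and takes a per-label `thresholds` dict; all of these are assumptions.

```python
import torch

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

@torch.no_grad()
def classify(text, lang, model, tokenizer, thresholds, device="cuda"):
    enc = tokenizer(text, truncation=True, max_length=128, return_tensors="pt").to(device)
    lang_ids = torch.tensor([LANG2ID.get(lang, LANG2ID["en"])], device=device)
    logits = model(input_ids=enc["input_ids"],
                   attention_mask=enc["attention_mask"],
                   lang_ids=lang_ids)
    probs = torch.sigmoid(logits).squeeze(0).tolist()
    # Return probability and thresholded decision per label
    return {label: (p, p >= thresholds[label]) for label, p in zip(LABELS, probs)}
```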
|
|
|
## Code Structure
|
|
|
```
model/
├── language_aware_transformer.py   # Core model architecture
├── train.py                        # Training loop and utilities
├── predict.py                      # Inference utilities
├── evaluation/
│   ├── evaluate.py                 # Evaluation functions
│   └── threshold_optimizer.py      # Dynamic threshold optimization
├── data/
│   └── sampler.py                  # Custom sampling strategies
└── training_config.py              # Configuration management
```
|
|
|
## AI/ML Specific Notes
|
|
|
1. **Tensor Shapes**
   - Input IDs: [batch_size, seq_len]
   - Attention Mask: [batch_size, seq_len]
   - Language IDs: [batch_size]
   - Hidden States: [batch_size, seq_len, hidden_size]
   - Language Embeddings: [batch_size, embed_dim]
|
|
|
2. **Critical Components**
   - Language ID handling in forward pass
   - Attention mask shape management
   - Memory-efficient attention implementation
   - Gradient flow in language-aware components
|
|
|
3. **Performance Considerations**
   - Cache management for processed data
   - Memory pinning for GPU transfers
   - Gradient accumulation for large batches
   - Language-specific dropout rates
|
|
|
4. **Error Handling** (see the validation sketch after this list)
   - Language ID validation
   - Shape compatibility checks
   - Gradient norm monitoring
   - Device placement verification
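
A minimal sketch of the kind of input validation these notes call for, assuming the batch layout used elsewhere in this README; the function name and defaults are illustrative.

```python
import torch

def validate_batch(batch, num_languages=7, device=torch.device("cuda")):
    input_ids, mask, lang_ids = batch["input_ids"], batch["attention_mask"], batch["lang_ids"]
    # Shape compatibility: mask must match input_ids, one language ID per sample
    assert input_ids.shape == mask.shape, f"mask {mask.shape} != input_ids {input_ids.shape}"
    assert lang_ids.dim() == 1 and lang_ids.size(0) == input_ids.size(0)
    # Language ID validation: integer dtype and within the supported range
    assert lang_ids.dtype in (torch.int64, torch.int32)
    assert int(lang_ids.min()) >= 0 and int(lang_ids.max()) < num_languages
    # Device placement verification
    for name, t in (("input_ids", input_ids), ("attention_mask", mask), ("lang_ids", lang_ids)):
        assert t.device == device, f"{name} on {t.device}, expected {device}"
```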
|
|
|
## Notes for AI Systems
|
|
|
1. When modifying the code:
   - Maintain language ID handling in the forward pass
   - Preserve attention mask shape management
   - Keep device consistency checks
   - Handle BatchEncoding security in PyTorch 2.6+ (see the loading sketch at the end of these notes)
|
|
|
2. Key attention points:
   - Language ID tensor shape and type
   - Attention mask broadcasting
   - Memory-efficient attention implementation
   - Gradient flow through language-aware components
|
|
|
3. Common pitfalls:
   - Incorrect attention mask shapes
   - Language ID type mismatches
   - Memory leaks in caching
   - Device inconsistencies
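
The "BatchEncoding security" note most likely refers to PyTorch 2.6 switching `torch.load` to `weights_only=True` by default, which rejects non-allow-listed classes such as the tokenizer's `BatchEncoding` if they were pickled into a checkpoint. A sketch under that assumption (the checkpoint path is illustrative):

```python
import torch
from transformers.tokenization_utils_base import BatchEncoding

# PyTorch 2.6+ defaults to weights_only=True; allow-list BatchEncoding before loading
torch.serialization.add_safe_globals([BatchEncoding])
checkpoint = torch.load("checkpoint.pt", map_location="cpu")

# Alternatively, for checkpoints from a trusted source only:
# checkpoint = torch.load("checkpoint.pt", map_location="cpu", weights_only=False)
```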
|
|