# Toxic Comment Classification using Deep Learning

A multilingual toxic comment classification system using language-aware transformers and advanced deep learning techniques.

## 🏗️ Architecture Overview

### Core Components

1. **LanguageAwareTransformer**
   - Base: XLM-RoBERTa Large
   - Custom language-aware attention mechanism
   - Gating mechanism for feature fusion
   - Language-specific dropout rates
   - Support for 7 languages with English fallback

2. **ToxicDataset**
   - Efficient caching system
   - Language ID mapping
   - Memory pinning for CUDA optimization
   - Automatic handling of missing values

3. **Training System**
   - Mixed precision training (BF16/FP16)
   - Gradient accumulation
   - Language-aware loss weighting
   - Distributed training support
   - Automatic threshold optimization

### Key Features

- **Language Awareness**
  - Language-specific embeddings
  - Dynamic dropout rates per language
  - Language-aware attention mechanism
  - Automatic fallback to English for unsupported languages

- **Performance Optimization**
  - Gradient checkpointing
  - Memory-efficient attention
  - Automatic mixed precision
  - Caching system for processed data
  - CUDA optimization with memory pinning

- **Training Features**
  - Weighted focal loss with language awareness
  - Dynamic threshold optimization
  - Early stopping with patience
  - Gradient flow monitoring
  - Comprehensive metric tracking

## 📊 Data Processing

### Input Format

```python
{
    'comment_text': str,    # The text to classify
    'lang': str,            # Language code (en, ru, tr, es, fr, it, pt)
    'toxic': int,           # Binary labels for each category
    'severe_toxic': int,
    'obscene': int,
    'threat': int,
    'insult': int,
    'identity_hate': int
}
```

### Language Support

- Primary: en, ru, tr, es, fr, it, pt
- Default fallback: en (English)
- Language ID mapping: `{en: 0, ru: 1, tr: 2, es: 3, fr: 4, it: 5, pt: 6}`

## 🚀 Model Architecture

### Base Model

- XLM-RoBERTa Large
- Hidden size: 1024
- Attention heads: 16
- Max sequence length: 128

### Custom Components

1. **Language-Aware Classifier**
   - Input: hidden states `[batch_size, hidden_size]`
   - Language embeddings: `[batch_size, 64]`
   - Projection: `hidden_size + 64 -> 512`
   - Output: 6 toxicity predictions (see the sketch after this list)

2. **Language-Aware Attention**
   - Input: hidden states + language embeddings
   - Scaled dot product attention
   - Gating mechanism for feature fusion
   - Memory-efficient implementation
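As a reference for the shapes listed above, here is a minimal sketch of how such a language-aware classifier head can be wired up in PyTorch. The class name, activation choice, and dropout placement are illustrative assumptions, not the exact implementation in `model/language_aware_transformer.py`.

```python
import torch
import torch.nn as nn


class LanguageAwareClassifierSketch(nn.Module):
    """Illustrative only: fuses pooled hidden states with a learned language embedding."""

    def __init__(self, hidden_size: int = 1024, num_languages: int = 7,
                 lang_embed_dim: int = 64, num_labels: int = 6, dropout: float = 0.1):
        super().__init__()
        self.lang_embed = nn.Embedding(num_languages, lang_embed_dim)
        # Projection described above: hidden_size + 64 -> 512 -> 6 toxicity logits
        self.head = nn.Sequential(
            nn.Linear(hidden_size + lang_embed_dim, 512),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(512, num_labels),
        )

    def forward(self, pooled_hidden: torch.Tensor, lang_ids: torch.Tensor) -> torch.Tensor:
        # pooled_hidden: [batch_size, hidden_size]; lang_ids: [batch_size] (torch.long)
        lang_vec = self.lang_embed(lang_ids)                  # [batch_size, 64]
        fused = torch.cat([pooled_hidden, lang_vec], dim=-1)  # [batch_size, hidden_size + 64]
        return self.head(fused)                               # [batch_size, 6] logits
```

For example, with a hidden size of 1024 and a batch of 8 comments, this head maps an `[8, 1024]` pooled representation plus `[8]` language IDs to `[8, 6]` toxicity logits.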
## 🛠️ Training Configuration

### Hyperparameters

```python
{
    "batch_size": 32,
    "grad_accum_steps": 2,
    "epochs": 4,
    "lr": 2e-5,
    "weight_decay": 0.01,
    "warmup_ratio": 0.1,
    "label_smoothing": 0.01,
    "model_dropout": 0.1,
    "freeze_layers": 2
}
```

### Optimization

- Optimizer: AdamW
- Learning rate scheduler: Cosine with warmup
- Mixed precision: BF16/FP16
- Gradient clipping: 1.0
- Gradient accumulation steps: 2

## 📈 Metrics and Monitoring

### Training Metrics

- Loss (per language)
- AUC-ROC (macro)
- Precision, Recall, F1
- Language-specific metrics
- Gradient norms
- Memory usage

### Validation Metrics

- AUC-ROC (per class and language)
- Optimal thresholds per language
- Critical class performance (threat, identity_hate)
- Distribution shift monitoring

## 🔧 Usage

### Training

```bash
python model/train.py
```

### Inference

```python
from model.predict import predict_toxicity

results = predict_toxicity(
    text="Your text here",
    model=model,
    tokenizer=tokenizer,
    config=config
)
```

## 🔍 Code Structure

```
model/
├── language_aware_transformer.py   # Core model architecture
├── train.py                        # Training loop and utilities
├── predict.py                      # Inference utilities
├── evaluation/
│   ├── evaluate.py                 # Evaluation functions
│   └── threshold_optimizer.py      # Dynamic threshold optimization
├── data/
│   └── sampler.py                  # Custom sampling strategies
└── training_config.py              # Configuration management
```

## 🤖 AI/ML Specific Notes

1. **Tensor Shapes**
   - Input IDs: `[batch_size, seq_len]`
   - Attention mask: `[batch_size, seq_len]`
   - Language IDs: `[batch_size]`
   - Hidden states: `[batch_size, seq_len, hidden_size]`
   - Language embeddings: `[batch_size, embed_dim]`

2. **Critical Components**
   - Language ID handling in the forward pass
   - Attention mask shape management
   - Memory-efficient attention implementation
   - Gradient flow in language-aware components

3. **Performance Considerations**
   - Cache management for processed data
   - Memory pinning for GPU transfers
   - Gradient accumulation for large batches
   - Language-specific dropout rates

4. **Error Handling**
   - Language ID validation
   - Shape compatibility checks
   - Gradient norm monitoring
   - Device placement verification

## 📝 Notes for AI Systems

1. When modifying the code:
   - Maintain language ID handling in the forward pass
   - Preserve attention mask shape management
   - Keep device consistency checks
   - Handle BatchEncoding security in PyTorch 2.6+

2. Key attention points:
   - Language ID tensor shape and type
   - Attention mask broadcasting
   - Memory-efficient attention implementation
   - Gradient flow through language-aware components

3. Common pitfalls (see the sketch below for the language ID and device checks):
   - Incorrect attention mask shapes
   - Language ID type mismatches
   - Memory leaks in caching
   - Device inconsistencies
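To make the language ID and device pitfalls concrete, here is a minimal, hypothetical sketch using the language ID mapping from the Data Processing section. The helper names are illustrative and do not correspond to actual functions in this repository.

```python
import torch

# Language ID mapping from the Data Processing section; English (0) is the fallback.
LANG_TO_ID = {"en": 0, "ru": 1, "tr": 2, "es": 3, "fr": 4, "it": 5, "pt": 6}


def encode_lang_ids(langs, device):
    """Map language codes to integer IDs, falling back to English for unsupported codes."""
    ids = [LANG_TO_ID.get(lang, LANG_TO_ID["en"]) for lang in langs]
    # IDs must be int64 (torch.long) to index nn.Embedding, and must live on the same
    # device as the model to avoid the device inconsistencies listed as pitfalls above.
    return torch.tensor(ids, dtype=torch.long, device=device)


def check_batch_device(batch, expected):
    """Raise if any tensor in the batch is not on the expected device."""
    for name, value in batch.items():
        if torch.is_tensor(value) and value.device != expected:
            raise RuntimeError(f"{name} is on {value.device}, expected {expected}")


if __name__ == "__main__":
    # "de" is unsupported, so it falls back to English (ID 0).
    print(encode_lang_ids(["en", "tr", "de"], torch.device("cpu")))  # tensor([0, 2, 0])
```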