Deeptanshuu committed
Commit bc3c436 · 1 Parent(s): 85a8c27
Files changed (1)
  1. readme.md +17 -217
readme.md CHANGED
@@ -1,217 +1,17 @@
- # Toxic Comment Classification using Deep Learning
-
- A multilingual toxic comment classification system using language-aware transformers and advanced deep learning techniques.
-
- ## 🏗️ Architecture Overview
-
- ### Core Components
-
- 1. **LanguageAwareTransformer**
- Base: XLM-RoBERTa Large
- Custom language-aware attention mechanism
- Gating mechanism for feature fusion
- Language-specific dropout rates
- Support for 7 languages with English fallback
-
- 2. **ToxicDataset**
- Efficient caching system
- Language ID mapping
- Memory pinning for CUDA optimization
- Automatic handling of missing values
-
- 3. **Training System**
- Mixed precision training (BF16/FP16)
- Gradient accumulation
- Language-aware loss weighting
- Distributed training support
- Automatic threshold optimization
-
- ### Key Features
-
- **Language Awareness**
- Language-specific embeddings
- Dynamic dropout rates per language
- Language-aware attention mechanism
- Automatic fallback to English for unsupported languages
-
- **Performance Optimization**
- Gradient checkpointing
- Memory-efficient attention
- Automatic mixed precision
- Caching system for processed data
- CUDA optimization with memory pinning
-
- **Training Features**
- Weighted focal loss with language awareness
- Dynamic threshold optimization
- Early stopping with patience
- Gradient flow monitoring
- - Comprehensive metric tracking
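The weighted focal loss with language awareness is the least standard piece above; a minimal sketch of one way such a loss can look is given below (the `gamma` value and the per-sample `lang_weights` lookup are illustrative assumptions, not values taken from `train.py`).

```python
import torch
import torch.nn.functional as F

def weighted_focal_loss(logits, targets, lang_weights, gamma=2.0):
    # logits: [batch_size, 6], targets: float tensor [batch_size, 6]
    # lang_weights: [batch_size], one weight per sample looked up from its language
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                  # probability assigned to the true label
    focal = (1.0 - p_t) ** gamma * bce     # down-weight easy examples
    return (lang_weights.unsqueeze(1) * focal).mean()
```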
-
- ## 📊 Data Processing
-
- ### Input Format
- ```python
- {
- 'comment_text': str, # The text to classify
- 'lang': str, # Language code (en, ru, tr, es, fr, it, pt)
- 'toxic': int, # Binary labels for each category
- 'severe_toxic': int,
- 'obscene': int,
- 'threat': int,
- 'insult': int,
- 'identity_hate': int
- }
- ```
-
- ### Language Support
- Primary: en, ru, tr, es, fr, it, pt
- Default fallback: en (English)
- - Language ID mapping: {en: 0, ru: 1, tr: 2, es: 3, fr: 4, it: 5, pt: 6}
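A minimal sketch of this mapping with the English fallback (the helper name `get_lang_id` is an illustrative assumption, not the repository's API):

```python
LANG_ID_MAP = {"en": 0, "ru": 1, "tr": 2, "es": 3, "fr": 4, "it": 5, "pt": 6}

def get_lang_id(lang: str) -> int:
    # Unsupported or missing language codes fall back to English (id 0).
    return LANG_ID_MAP.get(lang, LANG_ID_MAP["en"])
```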
-
- ## 🚀 Model Architecture
-
- ### Base Model
- XLM-RoBERTa Large
- Hidden size: 1024
- Attention heads: 16
- Max sequence length: 128
-
- ### Custom Components
-
- 1. **Language-Aware Classifier**
- ```python
- Input: Hidden states [batch_size, hidden_size]
- Language embeddings: [batch_size, 64]
- Projection: hidden_size + 64 -> 512
- Output: 6 toxicity predictions
- ```
-
- 2. **Language-Aware Attention**
- ```python
- Input: Hidden states + Language embeddings
- Scaled dot product attention
- Gating mechanism for feature fusion
- Memory-efficient implementation
- ```
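Read together, the classifier component amounts to a concatenate-then-project head over the pooled encoder output; a minimal PyTorch sketch under the shapes listed above (class and attribute names are illustrative, not those in `language_aware_transformer.py`):

```python
import torch
import torch.nn as nn

class LanguageAwareHead(nn.Module):
    def __init__(self, hidden_size=1024, lang_embed_dim=64, num_labels=6, num_languages=7):
        super().__init__()
        self.lang_embed = nn.Embedding(num_languages, lang_embed_dim)
        self.proj = nn.Sequential(
            nn.Linear(hidden_size + lang_embed_dim, 512),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(512, num_labels),
        )

    def forward(self, pooled, lang_ids):
        # pooled: [batch_size, hidden_size], lang_ids: [batch_size] (long)
        lang_vec = self.lang_embed(lang_ids)            # [batch_size, 64]
        fused = torch.cat([pooled, lang_vec], dim=-1)   # [batch_size, hidden_size + 64]
        return self.proj(fused)                         # [batch_size, 6] toxicity logits
```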
-
- ## 🛠️ Training Configuration
-
- ### Hyperparameters
- ```python
- {
- "batch_size": 32,
- "grad_accum_steps": 2,
- "epochs": 4,
- "lr": 2e-5,
- "weight_decay": 0.01,
- "warmup_ratio": 0.1,
- "label_smoothing": 0.01,
- "model_dropout": 0.1,
- "freeze_layers": 2
- }
- ```
-
- ### Optimization
- Optimizer: AdamW
- Learning rate scheduler: Cosine with warmup
- Mixed precision: BF16/FP16
- Gradient clipping: 1.0
- - Gradient accumulation steps: 2
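A minimal sketch of this optimization setup using the hyperparameters above, assuming the Hugging Face `transformers` scheduler helper (`model`, `train_loader`, and `compute_loss` are placeholders):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
steps_per_epoch = len(train_loader) // 2                 # grad_accum_steps = 2
total_steps = steps_per_epoch * 4                        # epochs = 4
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),             # warmup_ratio = 0.1
    num_training_steps=total_steps,
)

for step, batch in enumerate(train_loader):
    loss = compute_loss(model, batch) / 2                # scale for gradient accumulation
    loss.backward()
    if (step + 1) % 2 == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping: 1.0
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```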
-
- ## 📈 Metrics and Monitoring
-
- ### Training Metrics
- Loss (per language)
- AUC-ROC (macro)
- Precision, Recall, F1
- Language-specific metrics
- Gradient norms
- Memory usage
-
- ### Validation Metrics
- AUC-ROC (per class and language)
- Optimal thresholds per language
- Critical class performance (threat, identity_hate)
- - Distribution shift monitoring
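Per-language threshold optimization can be as simple as a grid sweep that maximizes F1 on validation predictions; a minimal sketch of that idea (not the actual `threshold_optimizer.py` implementation):

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true, y_prob, grid=np.linspace(0.05, 0.95, 19)):
    # y_true, y_prob: 1-D arrays for one label within one language slice
    scores = [f1_score(y_true, (y_prob >= t).astype(int), zero_division=0) for t in grid]
    return float(grid[int(np.argmax(scores))])
```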
-
- ## 🔧 Usage
-
- ### Training
- ```bash
- python model/train.py
- ```
-
- ### Inference
- ```python
- from model.predict import predict_toxicity
-
- results = predict_toxicity(
- text="Your text here",
- model=model,
- tokenizer=tokenizer,
- config=config
- )
- ```
-
- ## 🔍 Code Structure
-
- ```
- model/
- ├── language_aware_transformer.py # Core model architecture
- ├── train.py # Training loop and utilities
- ├── predict.py # Inference utilities
- ├── evaluation/
- │ ├── evaluate.py # Evaluation functions
- │ └── threshold_optimizer.py # Dynamic threshold optimization
- ├── data/
- │ └── sampler.py # Custom sampling strategies
- └── training_config.py # Configuration management
- ```
-
- ## 🤖 AI/ML Specific Notes
-
- 1. **Tensor Shapes**
- Input IDs: [batch_size, seq_len]
- Attention Mask: [batch_size, seq_len]
- Language IDs: [batch_size]
- Hidden States: [batch_size, seq_len, hidden_size]
- - Language Embeddings: [batch_size, embed_dim]
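For reference, a dummy batch with those shapes and dtypes (batch size, sequence length, and token id range are arbitrary here):

```python
import torch

batch_size, seq_len = 4, 128
input_ids = torch.randint(0, 1000, (batch_size, seq_len))            # [batch_size, seq_len], long token ids
attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)   # [batch_size, seq_len]
lang_ids = torch.zeros(batch_size, dtype=torch.long)                 # [batch_size], 0 = English
```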
-
- 2. **Critical Components**
- Language ID handling in forward pass
- Attention mask shape management
- Memory-efficient attention implementation
- Gradient flow in language-aware components
-
- 3. **Performance Considerations**
- Cache management for processed data
- Memory pinning for GPU transfers
- Gradient accumulation for large batches
- Language-specific dropout rates
-
- 4. **Error Handling**
- Language ID validation
- Shape compatibility checks
- Gradient norm monitoring
- Device placement verification
-
- ## 📝 Notes for AI Systems
-
- 1. When modifying the code:
- Maintain language ID handling in forward pass
- Preserve attention mask shape management
- Keep device consistency checks
- Handle BatchEncoding security in PyTorch 2.6+
-
- 2. Key attention points:
- Language ID tensor shape and type
- Attention mask broadcasting
- Memory-efficient attention implementation
- Gradient flow through language-aware components
-
- 3. Common pitfalls:
- Incorrect attention mask shapes
- Language ID type mismatches
- Memory leaks in caching
- Device inconsistencies
 
+ ---
+ datasets:
+ - textdetox/multilingual_toxicity_dataset
+ language:
+ - en
+ - it
+ - ru
+ - ae
+ - es
+ - tr
+ metrics:
+ - accuracy
+ - f1
+ base_model:
+ - FacebookAI/xlm-roberta-large
+ pipeline_tag: text-classification
+ ---