Deeptanshuu committed
Commit bc3c436 · 1 Parent(s): 85a8c27
Files changed (1)
  1. readme.md +17 -217
readme.md CHANGED
@@ -1,217 +1,17 @@
- # Toxic Comment Classification using Deep Learning
-
- A multilingual toxic comment classification system using language-aware transformers and advanced deep learning techniques.
-
- ## 🏗️ Architecture Overview
-
- ### Core Components
-
- 1. **LanguageAwareTransformer**
- Base: XLM-RoBERTa Large
- Custom language-aware attention mechanism
- Gating mechanism for feature fusion
- Language-specific dropout rates
- Support for 7 languages with English fallback
-
- 2. **ToxicDataset**
- Efficient caching system
- Language ID mapping
- Memory pinning for CUDA optimization
- Automatic handling of missing values
-
- 3. **Training System**
- Mixed precision training (BF16/FP16)
- Gradient accumulation
- Language-aware loss weighting
- Distributed training support
- Automatic threshold optimization
-
- ### Key Features
-
- **Language Awareness**
- Language-specific embeddings
- Dynamic dropout rates per language
- Language-aware attention mechanism
- Automatic fallback to English for unsupported languages
-
- **Performance Optimization**
- Gradient checkpointing
- Memory-efficient attention
- Automatic mixed precision
- Caching system for processed data
- CUDA optimization with memory pinning
-
- **Training Features**
- Weighted focal loss with language awareness
- Dynamic threshold optimization
- Early stopping with patience
- Gradient flow monitoring
- - Comprehensive metric tracking
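The weighted focal loss with language awareness is the least standard piece above; a minimal sketch of one way such a loss can look is given below (the `gamma` value and the per-sample `lang_weights` lookup are illustrative assumptions, not values taken from `train.py`).

```python
import torch
import torch.nn.functional as F

def weighted_focal_loss(logits, targets, lang_weights, gamma=2.0):
    # logits: [batch_size, 6], targets: float tensor [batch_size, 6]
    # lang_weights: [batch_size], one weight per sample looked up from its language
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                  # probability assigned to the true label
    focal = (1.0 - p_t) ** gamma * bce     # down-weight easy examples
    return (lang_weights.unsqueeze(1) * focal).mean()
```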
-
- ## 📊 Data Processing
-
- ### Input Format
- ```python
- {
- 'comment_text': str, # The text to classify
- 'lang': str, # Language code (en, ru, tr, es, fr, it, pt)
- 'toxic': int, # Binary labels for each category
- 'severe_toxic': int,
- 'obscene': int,
- 'threat': int,
- 'insult': int,
- 'identity_hate': int
- }
- ```
-
- ### Language Support
- Primary: en, ru, tr, es, fr, it, pt
- Default fallback: en (English)
- - Language ID mapping: {en: 0, ru: 1, tr: 2, es: 3, fr: 4, it: 5, pt: 6}
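A minimal sketch of this mapping with the English fallback (the helper name `get_lang_id` is an illustrative assumption, not the repository's API):

```python
LANG_ID_MAP = {"en": 0, "ru": 1, "tr": 2, "es": 3, "fr": 4, "it": 5, "pt": 6}

def get_lang_id(lang: str) -> int:
    # Unsupported or missing language codes fall back to English (id 0).
    return LANG_ID_MAP.get(lang, LANG_ID_MAP["en"])
```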
-
- ## 🚀 Model Architecture
-
- ### Base Model
- XLM-RoBERTa Large
- Hidden size: 1024
- Attention heads: 16
- Max sequence length: 128
-
- ### Custom Components
-
- 1. **Language-Aware Classifier**
- ```python
- Input: Hidden states [batch_size, hidden_size]
- Language embeddings: [batch_size, 64]
- Projection: hidden_size + 64 -> 512
- Output: 6 toxicity predictions
- ```
-
- 2. **Language-Aware Attention**
- ```python
- Input: Hidden states + Language embeddings
- Scaled dot product attention
- Gating mechanism for feature fusion
- Memory-efficient implementation
- ```
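Read together, the classifier component amounts to a concatenate-then-project head over the pooled encoder output; a minimal PyTorch sketch under the shapes listed above (class and attribute names are illustrative, not those in `language_aware_transformer.py`):

```python
import torch
import torch.nn as nn

class LanguageAwareHead(nn.Module):
    def __init__(self, hidden_size=1024, lang_embed_dim=64, num_labels=6, num_languages=7):
        super().__init__()
        self.lang_embed = nn.Embedding(num_languages, lang_embed_dim)
        self.proj = nn.Sequential(
            nn.Linear(hidden_size + lang_embed_dim, 512),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(512, num_labels),
        )

    def forward(self, pooled, lang_ids):
        # pooled: [batch_size, hidden_size], lang_ids: [batch_size] (long)
        lang_vec = self.lang_embed(lang_ids)            # [batch_size, 64]
        fused = torch.cat([pooled, lang_vec], dim=-1)   # [batch_size, hidden_size + 64]
        return self.proj(fused)                         # [batch_size, 6] toxicity logits
```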
-
- ## 🛠️ Training Configuration
-
- ### Hyperparameters
- ```python
- {
- "batch_size": 32,
- "grad_accum_steps": 2,
- "epochs": 4,
- "lr": 2e-5,
- "weight_decay": 0.01,
- "warmup_ratio": 0.1,
- "label_smoothing": 0.01,
- "model_dropout": 0.1,
- "freeze_layers": 2
- }
- ```
-
- ### Optimization
- Optimizer: AdamW
- Learning rate scheduler: Cosine with warmup
- Mixed precision: BF16/FP16
- Gradient clipping: 1.0
- - Gradient accumulation steps: 2
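A minimal sketch of this optimization setup using the hyperparameters above, assuming the Hugging Face `transformers` scheduler helper (`model`, `train_loader`, and `compute_loss` are placeholders):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
steps_per_epoch = len(train_loader) // 2                 # grad_accum_steps = 2
total_steps = steps_per_epoch * 4                        # epochs = 4
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),             # warmup_ratio = 0.1
    num_training_steps=total_steps,
)

for step, batch in enumerate(train_loader):
    loss = compute_loss(model, batch) / 2                # scale for gradient accumulation
    loss.backward()
    if (step + 1) % 2 == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping: 1.0
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```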
-
- ## 📈 Metrics and Monitoring
-
- ### Training Metrics
- Loss (per language)
- AUC-ROC (macro)
- Precision, Recall, F1
- Language-specific metrics
- Gradient norms
- Memory usage
-
- ### Validation Metrics
- AUC-ROC (per class and language)
- Optimal thresholds per language
- Critical class performance (threat, identity_hate)
- - Distribution shift monitoring
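Per-language threshold optimization can be as simple as a grid sweep that maximizes F1 on validation predictions; a minimal sketch of that idea (not the actual `threshold_optimizer.py` implementation):

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true, y_prob, grid=np.linspace(0.05, 0.95, 19)):
    # y_true, y_prob: 1-D arrays for one label within one language slice
    scores = [f1_score(y_true, (y_prob >= t).astype(int), zero_division=0) for t in grid]
    return float(grid[int(np.argmax(scores))])
```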
-
- ## 🔧 Usage
-
- ### Training
- ```bash
- python model/train.py
- ```
-
- ### Inference
- ```python
- from model.predict import predict_toxicity
-
- results = predict_toxicity(
- text="Your text here",
- model=model,
- tokenizer=tokenizer,
- config=config
- )
- ```
-
- ## 🔍 Code Structure
-
- ```
- model/
- ├── language_aware_transformer.py # Core model architecture
- ├── train.py # Training loop and utilities
- ├── predict.py # Inference utilities
- ├── evaluation/
- │ ├── evaluate.py # Evaluation functions
- │ └── threshold_optimizer.py # Dynamic threshold optimization
- ├── data/
- │ └── sampler.py # Custom sampling strategies
- └── training_config.py # Configuration management
- ```
-
- ## 🤖 AI/ML Specific Notes
-
- 1. **Tensor Shapes**
- Input IDs: [batch_size, seq_len]
- Attention Mask: [batch_size, seq_len]
- Language IDs: [batch_size]
- Hidden States: [batch_size, seq_len, hidden_size]
- - Language Embeddings: [batch_size, embed_dim]
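For reference, a dummy batch with those shapes and dtypes (batch size, sequence length, and token id range are arbitrary here):

```python
import torch

batch_size, seq_len = 4, 128
input_ids = torch.randint(0, 1000, (batch_size, seq_len))            # [batch_size, seq_len], long token ids
attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)   # [batch_size, seq_len]
lang_ids = torch.zeros(batch_size, dtype=torch.long)                 # [batch_size], 0 = English
```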
-
- 2. **Critical Components**
- Language ID handling in forward pass
- Attention mask shape management
- Memory-efficient attention implementation
- Gradient flow in language-aware components
-
- 3. **Performance Considerations**
- Cache management for processed data
- Memory pinning for GPU transfers
- Gradient accumulation for large batches
- Language-specific dropout rates
-
- 4. **Error Handling**
- Language ID validation
- Shape compatibility checks
- Gradient norm monitoring
- Device placement verification
-
- ## 📝 Notes for AI Systems
-
- 1. When modifying the code:
- Maintain language ID handling in forward pass
- Preserve attention mask shape management
- Keep device consistency checks
- Handle BatchEncoding security in PyTorch 2.6+
-
- 2. Key attention points:
- Language ID tensor shape and type
- Attention mask broadcasting
- Memory-efficient attention implementation
- Gradient flow through language-aware components
-
- 3. Common pitfalls:
- Incorrect attention mask shapes
- Language ID type mismatches
- Memory leaks in caching
- Device inconsistencies
 
+ ---
+ datasets:
+ - textdetox/multilingual_toxicity_dataset
+ language:
+ - en
+ - it
+ - ru
+ - ae
+ - es
+ - tr
+ metrics:
+ - accuracy
+ - f1
+ base_model:
+ - FacebookAI/xlm-roberta-large
+ pipeline_tag: text-classification
+ ---