---
library_name: transformers
language:
- en
- fr
- it
- es
- ru
- uk
- tt
- ar
- hi
- ja
- zh
- he
- am
- de
license: openrail++
datasets:
- textdetox/multilingual_toxicity_dataset
metrics:
- f1
base_model:
- google-bert/bert-base-multilingual-cased
pipeline_tag: text-classification
tags:
- toxic
---

## Multilingual Toxicity Classifier for 15 Languages (2025)

This is an instance of [bert-base-multilingual-cased](https://huggingface.co/google-bert/bert-base-multilingual-cased) fine-tuned for binary toxicity classification on our updated (2025) dataset [textdetox/multilingual_toxicity_dataset](https://huggingface.co/datasets/textdetox/multilingual_toxicity_dataset).
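
A rough reproduction of such fine-tuning might look like the sketch below. Hyperparameters are purely illustrative, and the per-language splits and `text`/`toxic` column names of the dataset are assumptions, not details stated in this card.

```python
from datasets import load_dataset, concatenate_datasets
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Assumption: each language is a separate split with `text` and `toxic` columns.
raw = load_dataset("textdetox/multilingual_toxicity_dataset")
train = concatenate_datasets([raw[lang] for lang in raw.keys()])

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-multilingual-cased", num_labels=2
)

def preprocess(batch):
    enc = tokenizer(batch["text"], truncation=True, max_length=256)
    enc["labels"] = batch["toxic"]
    return enc

train = train.map(preprocess, batched=True, remove_columns=train.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mbert-toxicity",
        per_device_train_batch_size=32,  # illustrative value, not from the card
        num_train_epochs=3,              # illustrative value, not from the card
        learning_rate=2e-5,              # illustrative value, not from the card
    ),
    train_dataset=train,
    tokenizer=tokenizer,
)
trainer.train()
```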

The model now covers 15 languages from various language families:

| Language  | Code | F1 Score |
|-----------|------|---------|
| English   | en   | 0.9035  |
| Russian   | ru   | 0.9224  |
| Ukrainian | uk   | 0.9461  |
| German    | de   | 0.5181  |
| Spanish   | es   | 0.7291  |
| Arabic    | ar   | 0.5139  |
| Amharic   | am   | 0.6316  |
| Hindi     | hi   | 0.7268  |
| Chinese   | zh   | 0.6703  |
| Italian   | it   | 0.6485  |
| French    | fr   | 0.9125  |
| Hinglish  | hin  | 0.6850  |
| Hebrew    | he   | 0.8686  |
| Japanese  | ja   | 0.8644  |
| Tatar     | tt   | 0.6170  |

## How to use

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('textdetox/bert-multilingual-toxicity-classifier')
model = AutoModelForSequenceClassification.from_pretrained('textdetox/bert-multilingual-toxicity-classifier')

# Tokenize the input text
batch = tokenizer("You are amazing!", return_tensors="pt")

with torch.no_grad():
    output = model(**batch)

# idx 0 for neutral, idx 1 for toxic
prediction = output.logits.argmax(dim=-1).item()
```
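
Beyond a single sentence, you can score a small batch and turn the logits into probabilities with a softmax. The snippet below reuses the `tokenizer` and `model` from the example above and assumes the same label order (index 0 = neutral, index 1 = toxic).

```python
texts = [
    "You are amazing!",        # en, presumably neutral
    "Ты просто ужасен!",       # ru, "You are just awful!", presumably toxic
]

inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# Softmax over the two classes; column 1 is the toxic probability
probs = torch.softmax(logits, dim=-1)
for text, p in zip(texts, probs):
    print(f"{text!r}: toxic probability = {p[1].item():.3f}")
```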

## Citation
The model is prepared for [TextDetox 2025 Shared Task](https://pan.webis.de/clef25/pan25-web/text-detoxification.html) evaluation.

The citation will be added soon.