---
library_name: transformers
language:
  - en
  - fr
  - it
  - es
  - ru
  - uk
  - tt
  - ar
  - hi
  - ja
  - zh
  - he
  - am
  - de
license: openrail++
datasets:
  - textdetox/multilingual_toxicity_dataset
metrics:
  - f1
base_model:
  - google-bert/bert-base-multilingual-cased
pipeline_tag: text-classification
tags:
  - toxic
---

# Multilingual Toxicity Classifier for 15 Languages (2025)

This is an instance of [bert-base-multilingual-cased](https://huggingface.co/google-bert/bert-base-multilingual-cased) fine-tuned on a binary toxicity classification task using our updated (2025) dataset [textdetox/multilingual_toxicity_dataset](https://huggingface.co/datasets/textdetox/multilingual_toxicity_dataset).

The model now covers 15 languages from various language families:

| Language  | Code | F1 Score |
|-----------|------|----------|
| English   | en   | 0.9035   |
| Russian   | ru   | 0.9224   |
| Ukrainian | uk   | 0.9461   |
| German    | de   | 0.5181   |
| Spanish   | es   | 0.7291   |
| Arabic    | ar   | 0.5139   |
| Amharic   | am   | 0.6316   |
| Hindi     | hi   | 0.7268   |
| Chinese   | zh   | 0.6703   |
| Italian   | it   | 0.6485   |
| French    | fr   | 0.9125   |
| Hinglish  | hin  | 0.6850   |
| Hebrew    | he   | 0.8686   |
| Japanese  | ja   | 0.8644   |
| Tatar     | tt   | 0.6170   |
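For programmatic use of these results, the per-language scores can be kept as a plain mapping. This is a minimal sketch; the dict simply mirrors the table above, and the 0.85 threshold is an arbitrary illustration:

```python
# Per-language F1 scores, copied from the table above.
f1_scores = {
    "en": 0.9035, "ru": 0.9224, "uk": 0.9461, "de": 0.5181,
    "es": 0.7291, "ar": 0.5139, "am": 0.6316, "hi": 0.7268,
    "zh": 0.6703, "it": 0.6485, "fr": 0.9125, "hin": 0.6850,
    "he": 0.8686, "ja": 0.8644, "tt": 0.6170,
}

# Languages where the classifier scores above 0.85 F1.
strong = sorted(code for code, f1 in f1_scores.items() if f1 > 0.85)
print(strong)  # ['en', 'fr', 'he', 'ja', 'ru', 'uk']
```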

## How to use

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('textdetox/bert-multilingual-toxicity-classifier')
model = AutoModelForSequenceClassification.from_pretrained('textdetox/bert-multilingual-toxicity-classifier')

batch = tokenizer("You are amazing!", return_tensors="pt")

with torch.no_grad():
    output = model(**batch)

# idx 0 for neutral, idx 1 for toxic
predicted_class = output.logits.argmax(dim=-1).item()
```
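The model returns raw logits; turning them into class probabilities takes a softmax over the two classes. A framework-agnostic sketch of that step (the logit values below are made up for illustration):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one sentence: idx 0 = neutral, idx 1 = toxic.
logits = [2.3, -1.7]
p_neutral, p_toxic = softmax(logits)
print(f"toxic probability: {p_toxic:.4f}")
```

In practice the same result comes from `torch.softmax(output.logits, dim=-1)` on the snippet above.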

## Citation

The model is prepared for TextDetox 2025 Shared Task evaluation.

A citation entry will be added soon.