---
pipeline_tag: text-classification
language:
- multilingual
license: apache-2.0
library_name: transformers
---
|
|
|
# Model Description |
|
|
|
This model was built by translating the FineWeb-Edu annotations into 15 languages using the best proprietary LLM for translation in the world: Tower LLM 70B.
|
|
|
The translation model excels at translating entire documents, making it the perfect fit for translating the texts we use to train our classifier.
|
|
|
The classifier is trained on English, German, Spanish, Japanese, Chinese, Russian, Hindi, Czech, Ukrainian, Icelandic, Portuguese, French, Dutch, Italian and Korean. Since it is built on top of [mdeberta-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base), it should also generalize to other languages.
|
|
|
## Running the Model:
|
To run inference, first install the following libraries:
|
```
pip install transformers[torch]
pip install datasets
pip install pandas
pip install tqdm
```
|
|
|
After installing those libraries, you can run the following code:
|
|
|
```python
import pandas as pd
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from tqdm import tqdm

device = "cuda"
path = "Unbabel/mfineweb-edu-classifier"
model = AutoModelForSequenceClassification.from_pretrained(
    path,
    device_map=device,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(path, use_fast=True)

def get_model_outputs(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512).to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
    score = outputs.logits
    prob = torch.sigmoid(outputs.binary_logits)
    return score.cpu(), prob.cpu()

def batchify_texts(texts, batch_size):
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

# TODO: replace the next line with the texts you want to classify
texts = LIST_WITH_TEXTS_TO_CLASSIFY
batch_size = 64  # Adjust based on your available memory and model capacity
num_batches = (len(texts) + batch_size - 1) // batch_size

all_scores = []
all_probs = []
with tqdm(total=num_batches, dynamic_ncols=True) as pbar:
    for batch_num, batch in enumerate(batchify_texts(texts, batch_size), 1):
        score, probs = get_model_outputs(batch)
        all_scores.append(score)
        all_probs.append(probs)
        pbar.set_description(f"Processing Batch {batch_num}/{num_batches}")
        pbar.update(1)

# SCORES is the output of the regression head and reflects the
# educational score of the text.
scores = torch.cat(all_scores, dim=0).squeeze()

# BINARY_PRED is the output of the classification head that tells
# whether a text has an acceptable educational score or not.
# NOTE: converting the regression scores into binary predictions is also
# possible (see the sketch after this code block).
all_probs = torch.cat(all_probs, dim=0).squeeze()
binary_pred = (all_probs >= 0.5).numpy().astype(int)
```
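
The regression head returns continuous scores rather than class labels. Below is a minimal sketch of two post-processing options, assuming `scores` comes from the snippet above: rounding and clamping into 0-5 integer labels (the recipe used by the original FineWeb-Edu classifier, and presumably how the per-class reports below were produced), and thresholding the scores directly into binary predictions as the NOTE in the code mentions. The threshold of 3 is an assumption borrowed from FineWeb-Edu's "educational" cut-off, not a value shipped with this model.

```python
import torch

# Round and clamp the continuous scores into 0-5 integer labels
# (the recipe used by the original FineWeb-Edu classifier).
int_scores = scores.float().round().clamp(0, 5).to(torch.int64)

# Alternative binary prediction derived from the regression head alone.
# NOTE: the threshold of 3 is an assumption borrowed from FineWeb-Edu's
# "educational" cut-off, not a value shipped with this model.
binary_from_scores = (scores >= 3).to(torch.int64)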
|
|
|
## English Results: |
|
|
|
When testing the model on an English partition with 37,537 samples, the results are comparable to the original FineWeb-Edu classifier.
|
|
|
Regression head results: |
|
```
              precision    recall  f1-score   support

           0       0.80      0.53      0.64      5130
           1       0.80      0.88      0.83     21602
           2       0.63      0.58      0.61      7849
           3       0.54      0.62      0.58      2310
           4       0.62      0.48      0.54       645
           5       0.00      0.00      0.00         1

    accuracy                           0.74     37537
   macro avg       0.56      0.51      0.53     37537
weighted avg       0.74      0.74      0.74     37537
```
|
|
|
Binary head results: |
|
```
              precision    recall  f1-score   support

           0       0.98      0.97      0.98     34581
           1       0.71      0.74      0.73      2956

    accuracy                           0.96     37537
   macro avg       0.85      0.86      0.85     37537
weighted avg       0.96      0.96      0.96     37537
```
|
|
|
## Multilingual Results: |
|
|
|
If we evaluate on the same texts translated into the 15 different languages, the results are almost identical!
|
|
|
Regression head results: |
|
```
              precision    recall  f1-score   support

           0       0.80      0.50      0.61      5130
           1       0.79      0.87      0.83     21602
           2       0.61      0.58      0.59      7849
           3       0.52      0.61      0.56      2310
           4       0.61      0.38      0.47       645
           5       0.00      0.00      0.00         1

    accuracy                           0.73     37537
   macro avg       0.55      0.49      0.51     37537
weighted avg       0.73      0.73      0.73     37537
```
|
|
|
Binary head results: |
|
```
              precision    recall  f1-score   support

           0       0.98      0.97      0.97     34581
           1       0.70      0.71      0.71      2956

    accuracy                           0.95     37537
   macro avg       0.84      0.84      0.84     37537
weighted avg       0.95      0.95      0.95     37537
```
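
The tables above are in scikit-learn's `classification_report` format. Below is a minimal sketch of how such reports can be reproduced, assuming you have gold labels for your texts; `gold_scores` and `gold_binary` are hypothetical names for those labels, while `scores`, `int_scores` and `binary_pred` come from the inference snippets above.

```python
from sklearn.metrics import classification_report

# Regression head: compare rounded 0-5 predictions against gold 0-5 labels.
print(classification_report(gold_scores, int_scores))

# Binary head: compare thresholded predictions against gold 0/1 labels.
print(classification_report(gold_binary, binary_pred))
```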
|
|