---
pipeline_tag: text-classification
language:
- multilingual
license: apache-2.0
library_name: transformers
---

# Model Description

This model was built by translating the FineWeb-Edu annotations into 15 languages using Tower LLM 70B, a proprietary LLM for translation.

The translation model excels at translating entire documents, which makes it a good fit for translating the texts used to train our classifier.

The classifier is trained on English, German, Spanish, Japanese, Chinese, Russian, Hindi, Czech, Ukrainian, Icelandic, Portuguese, French, Dutch, Italian and Korean. Since it is built on top of [mdeberta-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base), it should be able to generalize to other languages.

## Running Model:
To run inference, install the following libraries:
```
pip install transformers[torch]
pip install datasets
pip install pandas
pip install tqdm
```

After installing those libraries, you can run the following code:

```python
import pandas as pd
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from tqdm import tqdm


device = "cuda"
path = "Unbabel/mfineweb-edu-classifier"
model = AutoModelForSequenceClassification.from_pretrained(
    path, 
    device_map=device, 
    trust_remote_code=True, 
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(path, use_fast=True)

def get_model_outputs(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512).to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
        score = outputs.logits
        prob = torch.sigmoid(outputs.binary_logits)
    return score.cpu(), prob.cpu()

def batchify_texts(texts, batch_size):
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

# TODO: replace the next line with the texts you want to classify
texts = LIST_WITH_TEXTS_TO_CLASSIFY
batch_size = 64  # Adjust based on your available memory and model capacity
num_batches = (len(texts) + batch_size - 1) // batch_size

all_scores = []
all_probs = []
with tqdm(total=num_batches, dynamic_ncols=True) as pbar:
    for batch_num, batch in enumerate(batchify_texts(texts, batch_size), 1):
        score, probs = get_model_outputs(batch)
        all_scores.append(score)
        all_probs.append(probs)
        pbar.set_description(f"Processing Batch {batch_num}/{num_batches}")
        pbar.update(1)

# `scores` is the output of the regression head and reflects the
# educational score of each text.
scores = torch.cat(all_scores, dim=0).squeeze(-1)

# `binary_pred` is derived from the classification head, which predicts
# whether a text has an acceptable educational score.
# NOTE: Converting the regression scores into binary predictions is also possible.
all_probs = torch.cat(all_probs, dim=0).squeeze(-1)
binary_pred = (all_probs >= 0.5).numpy().astype(int)
```
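
The note in the snippet says that regression scores can also be turned into binary predictions. The following is a minimal sketch of one way to do that, not necessarily the procedure used to train or evaluate this model. It assumes the scores live on a 0 to 5 scale (as in the results tables of this card) and that a rounded score of 3 or more counts as acceptable. The threshold is an assumption inferred from the support counts reported in this card, where the positive binary class (2,956 samples) matches the combined support of classes 3, 4 and 5. The helper names and example values are hypothetical.

```python
def score_to_label(score: float) -> int:
    """Clamp a raw regression score to [0, 5] and round it to an integer class."""
    return int(round(max(0.0, min(score, 5.0))))


def score_to_binary(score: float, threshold: int = 3) -> int:
    """Return 1 if the rounded score reaches the threshold, else 0.

    threshold=3 is an assumption, not a value documented by the model authors.
    """
    return int(score_to_label(score) >= threshold)


# Hypothetical raw scores; with the inference snippet you could use
# `scores.float().tolist()` instead.
example_scores = [-0.3, 0.9, 2.4, 2.6, 3.7, 5.8]
labels = [score_to_label(s) for s in example_scores]
binary = [score_to_binary(s) for s in example_scores]

print(labels)
print(binary)
```

Clamping before rounding keeps out-of-range regression outputs inside the label set, so a score slightly below 0 or above 5 still maps to a valid class.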

## English Results:

When tested on an English partition with 37,537 samples, the model achieves results comparable to the original FineWeb-Edu classifier.

Regression head results:
```
              precision    recall  f1-score   support

           0       0.80      0.53      0.64      5130
           1       0.80      0.88      0.83     21602
           2       0.63      0.58      0.61      7849
           3       0.54      0.62      0.58      2310
           4       0.62      0.48      0.54       645
           5       0.00      0.00      0.00         1

    accuracy                           0.74     37537
   macro avg       0.56      0.51      0.53     37537
weighted avg       0.74      0.74      0.74     37537
```

Binary head results:
```
              precision    recall  f1-score   support

           0       0.98      0.97      0.98     34581
           1       0.71      0.74      0.73      2956

    accuracy                           0.96     37537
   macro avg       0.85      0.86      0.85     37537
weighted avg       0.96      0.96      0.96     37537
```
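
To help read the two reports together, here is a small sketch that recomputes a few aggregate numbers from the English tables above. The per-class values are copied from the regression-head report and are already rounded to two decimals, so the recomputed averages are approximate. The variable names are illustrative only.

```python
# Per-class precision and support copied from the English regression-head report.
precision = {0: 0.80, 1: 0.80, 2: 0.63, 3: 0.54, 4: 0.62, 5: 0.00}
support = {0: 5130, 1: 21602, 2: 7849, 3: 2310, 4: 645, 5: 1}

# The macro average is the unweighted mean over classes, so the single
# class-5 sample (precision 0.00) weighs as much as class 1 (21,602 samples).
macro_all = sum(precision.values()) / len(precision)
macro_without_5 = sum(p for c, p in precision.items() if c != 5) / (len(precision) - 1)

# Classes 3, 4 and 5 together match the support of the positive binary class.
positives = support[3] + support[4] + support[5]
negatives = support[0] + support[1] + support[2]

print(f"macro precision, all classes: {macro_all:.3f}")
print(f"macro precision, without class 5: {macro_without_5:.3f}")
print(f"positives: {positives}, negatives: {negatives}")
```

The gap between the two macro values shows that the low macro averages in the regression report are driven largely by class 5, which has a single sample. The weighted averages are a more representative summary here.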

## Multilingual Results:

When we evaluate on the same texts translated into 15 different languages, the results are almost identical.

Regression head results:
```
              precision    recall  f1-score   support

           0       0.80      0.50      0.61      5130
           1       0.79      0.87      0.83     21602
           2       0.61      0.58      0.59      7849
           3       0.52      0.61      0.56      2310
           4       0.61      0.38      0.47       645
           5       0.00      0.00      0.00         1

    accuracy                           0.73     37537
   macro avg       0.55      0.49      0.51     37537
weighted avg       0.73      0.73      0.73     37537
```

Binary head results:
```
              precision    recall  f1-score   support

           0       0.98      0.97      0.97     34581
           1       0.70      0.71      0.71      2956

    accuracy                           0.95     37537
   macro avg       0.84      0.84      0.84     37537
weighted avg       0.95      0.95      0.95     37537
```