trokhymovych committed on
Commit
c075c16
·
verified ·
1 Parent(s): 11d5fa3

Create README.md

Files changed (1)
  1. README.md +192 -0
README.md ADDED
@@ -0,0 +1,192 @@
+ ---
+ license: mit
+ language:
+ - yi
+ - xh
+ - fy
+ - cy
+ - vi
+ - uz
+ - ug
+ - ur
+ - uk
+ - tr
+ - th
+ - te
+ - ta
+ - sv
+ - sw
+ - su
+ - es
+ - so
+ - sl
+ - sk
+ - si
+ - sd
+ - sr
+ - gd
+ - sa
+ - ru
+ - ro
+ - pa
+ - pt
+ - pl
+ - fa
+ - ps
+ - om
+ - or
+ - 'no'
+ - ne
+ - mn
+ - mr
+ - ml
+ - ms
+ - mg
+ - mk
+ - lt
+ - lv
+ - la
+ - lo
+ - ky
+ - ku
+ - ko
+ - km
+ - kk
+ - kn
+ - jv
+ - ja
+ - it
+ - ga
+ - id
+ - is
+ - hu
+ - hi
+ - he
+ - ha
+ - gu
+ - el
+ - de
+ - ka
+ - gl
+ - fr
+ - fi
+ - tl
+ - et
+ - eo
+ - en
+ - nl
+ - da
+ - cs
+ - hr
+ - zh
+ - ca
+ - my
+ - bg
+ - br
+ - bs
+ - bn
+ - be
+ - eu
+ - az
+ - as
+ - hy
+ - ar
+ - am
+ - af
+ - sq
+ pipeline_tag: text-classification
+ ---
+
+ # Open Multilingual Text Readability Scoring Model (TRank)
+
+ [![DOI:10.48550/arXiv.2406.01835](https://zenodo.org/badge/DOI/10.48550/arXiv.2406.01835.svg)](https://doi.org/10.48550/arXiv.2406.01835)
+ [![Readability Experiments repo](https://img.shields.io/badge/GitLab-repo-orange)](https://gitlab.wikimedia.org/repos/research/readability-experiments)
+
+ ## Overview
+
+ This repository contains TRank, an open multilingual readability scoring model presented in the ACL 2024 paper **An Open Multilingual System for Scoring Readability of Wikipedia**.
+ The model scores the readability of text across multiple languages.
+
+ ## Features
+
+ - **Multilingual Support**: Scores text in multiple languages (see the `language` list in the model card metadata).
+ - **Pairwise Ranking**: Trained using a Siamese architecture with Margin Ranking Loss to differentiate and rank texts from hardest to simplest.
+ - **Long Context Window**: Built on a Longformer-based base model, supporting inputs of up to 4096 tokens.
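
To make the pairwise objective concrete, here is a minimal pure-Python sketch of a margin ranking loss for a single pair of scores. It assumes that a higher score means a harder text and uses an illustrative margin of 0.5; neither that convention, the margin value, nor the function name `margin_ranking_loss` is taken from the TRank training code. In a Siamese setup the same network would produce both scores, and the formula below corresponds to `torch.nn.MarginRankingLoss` with target `y = 1`.

```python
def margin_ranking_loss(score_hard: float, score_easy: float, margin: float = 0.5) -> float:
    """Hinge-style loss for one (harder, easier) pair of readability scores.

    The loss is zero once the harder text outscores the easier one by at
    least `margin`; otherwise it grows linearly with the shortfall.
    """
    return max(0.0, margin - (score_hard - score_easy))


# Correctly ordered pair with a wide gap: no penalty.
well_ranked = margin_ranking_loss(score_hard=2.0, score_easy=0.5)
# Pair scored in the wrong order: penalty grows with the size of the violation.
mis_ranked = margin_ranking_loss(score_hard=0.25, score_easy=1.0)
print(well_ranked, mis_ranked)
```

Only the difference between the two scores enters the loss, which is why the resulting scores are best read as a relative ordering rather than as values on a fixed scale.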
+
+ ## Model Training
+
+ The model training implementation can be found in the [Readability Experiments repo](https://gitlab.wikimedia.org/repos/research/readability-experiments).
+
+ ## Usage example
+ ```python
+ import torch
+ import torch.nn as nn
+ from huggingface_hub import PyTorchModelHubMixin
+ from transformers import AutoModel, AutoTokenizer
+
+ # Define the model: a base encoder followed by a single-output linear head
+ BASE_MODEL = "Peltarion/xlm-roberta-longformer-base-4096"
+
+
+ class ReadabilityModel(nn.Module, PyTorchModelHubMixin):
+     def __init__(self, model_name=BASE_MODEL):
+         super().__init__()
+         self.model = AutoModel.from_pretrained(model_name)
+         self.drop = nn.Dropout(p=0.2)
+         self.fc = nn.Linear(768, 1)
+
+     def forward(self, ids, mask):
+         out = self.model(input_ids=ids, attention_mask=mask,
+                          output_hidden_states=False)
+         # out[1] is the pooled output of the base model
+         out = self.drop(out[1])
+         outputs = self.fc(out)
+
+         return outputs
+
+
+ # Load the model:
+ model = ReadabilityModel.from_pretrained("trokhymovych/TRank_readability")
+
+ # Load the tokenizer:
+ tokenizer = AutoTokenizer.from_pretrained("trokhymovych/TRank_readability")
+
+ # Set the model to evaluation mode (disables dropout)
+ model.eval()
+
+ # Example input text
+ input_text = "This is an example sentence to evaluate readability."
+
+ # Tokenize the input text
+ inputs = tokenizer(
+     input_text,
+     add_special_tokens=True,
+     max_length=512,
+     truncation=True,
+     padding='max_length',
+     return_tensors='pt'
+ )
+ ids = inputs['input_ids']
+ mask = inputs['attention_mask']
+
+ # Make prediction
+ with torch.no_grad():
+     outputs = model(ids, mask)
+     readability_score = outputs.item()
+
+ # Print the input text and the readability score
+ print(f"Input Text: {input_text}")
+ print(f"Readability Score: {readability_score}")
+ ```
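
Because the model is trained with a pairwise ranking objective, a single score is easiest to interpret relative to the scores of other texts. The sketch below orders texts by score using any callable that maps a string to a float. `rank_by_score` and `toy_score` are hypothetical names introduced for illustration, and the sketch assumes that higher scores mean harder text; in practice the callable would wrap the tokenizer and `model(ids, mask).item()` steps from the usage example.

```python
from typing import Callable, List, Tuple


def rank_by_score(texts: List[str], score_fn: Callable[[str], float]) -> List[Tuple[str, float]]:
    """Return (text, score) pairs sorted from highest to lowest score."""
    scored = [(text, score_fn(text)) for text in texts]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_score(text: str) -> float:
    """Stand-in scorer (average word length) so the sketch runs without the model."""
    words = text.split()
    return sum(len(word) for word in words) / len(words)


texts = [
    "The cat sat on the mat.",
    "Photosynthesis converts electromagnetic radiation into chemical energy.",
]
ranking = rank_by_score(texts, toy_score)
for text, score in ranking:
    print(f"{score:.2f}  {text}")
```

Passing the scorer as a callable keeps the ranking logic independent of how the score is computed, so the toy scorer can be swapped for the real model without changing `rank_by_score`.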
+
+
+ ## Citation
+ Preprint:
+ ```bibtex
+ @misc{trokhymovych2024openmultilingualscoringreadability,
+       title={An Open Multilingual System for Scoring Readability of Wikipedia},
+       author={Mykola Trokhymovych and Indira Sen and Martin Gerlach},
+       year={2024},
+       eprint={2406.01835},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL},
+       url={https://arxiv.org/abs/2406.01835},
+ }
+ ```
+
+