+---
+license: mit
+language:
+- yi
+- xh
+- fy
+- cy
+- vi
+- uz
+- ug
+- ur
+- uk
+- tr
+- th
+- te
+- ta
+- sv
+- sw
+- su
+- es
+- so
+- sl
+- sk
+- si
+- sd
+- sr
+- gd
+- sa
+- ru
+- ro
+- pa
+- pt
+- pl
+- fa
+- ps
+- om
+- or
+- 'no'
+- ne
+- mn
+- mr
+- ml
+- ms
+- mg
+- mk
+- lt
+- lv
+- la
+- lo
+- ky
+- ku
+- ko
+- km
+- kk
+- kn
+- jv
+- ja
+- it
+- ga
+- id
+- is
+- hu
+- hi
+- he
+- ha
+- gu
+- el
+- de
+- ka
+- gl
+- fr
+- fi
+- tl
+- et
+- eo
+- en
+- nl
+- da
+- cs
+- hr
+- zh
+- ca
+- my
+- bg
+- br
+- bs
+- bn
+- be
+- eu
+- az
+- as
+- hy
+- ar
+- am
+- af
+- sq
+pipeline_tag: text-classification
+---
+# Open Multilingual Text Readability Scoring Model (TRank)
+[![DOI:10.48550/arXiv.2406.01835](https://zenodo.org/badge/DOI/10.48550/arXiv.2406.01835.svg)](https://doi.org/10.48550/arXiv.2406.01835)
+[![Readability Experiments repo](https://img.shields.io/badge/GitLab-repo-orange)](https://gitlab.wikimedia.org/repos/research/readability-experiments)
+## Overview
+This repository contains TRank, an open multilingual readability scoring model presented in the ACL'24 paper **An Open Multilingual System for Scoring Readability of Wikipedia**.
+The model takes a text as input and outputs a single numeric readability score.
+## Features
+- **Multilingual Support**: Scores text in any of the languages listed in the model card metadata using a single model.
+- **Pairwise Ranking**: Trained using a Siamese architecture with Margin Ranking Loss to differentiate and rank texts from hardest to simplest.
+- **Long Context Window**: Inherits the Longformer architecture of the base model (`Peltarion/xlm-roberta-longformer-base-4096`), supporting inputs of up to 4096 tokens.
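The pairwise ranking objective above can be sketched in plain Python. This is a minimal illustration, not the training code from the Readability Experiments repo: `toy_score` is a hypothetical stand-in for the shared (Siamese) scorer, the margin of 0.5 is an arbitrary value, and the convention that the harder text should receive the higher score is an assumption made for the example. The loss follows the standard margin ranking formulation, `max(0, -y * (s1 - s2) + margin)`.

```python
def margin_ranking_loss(score_a, score_b, target, margin=0.5):
    """Standard margin ranking loss for one pair.

    target = 1 means text A should score higher than text B,
    target = -1 means the opposite. The margin value is illustrative.
    """
    return max(0.0, -target * (score_a - score_b) + margin)


def toy_score(text):
    """Hypothetical stand-in for the shared scorer: mean word length."""
    words = text.split()
    return sum(len(w) for w in words) / len(words)


# Siamese setup: the same scorer (shared weights) is applied to both texts.
hard = "Photosynthesis converts electromagnetic radiation into chemical energy."
easy = "Plants use light to make food."
s_hard, s_easy = toy_score(hard), toy_score(easy)

# Assumed convention for this sketch: the harder text should score higher.
loss_correct = margin_ranking_loss(s_hard, s_easy, target=1)
loss_flipped = margin_ranking_loss(s_easy, s_hard, target=1)
print(loss_correct, loss_flipped)
```

When the pair is already ordered correctly by more than the margin, the loss is zero and the pair contributes no gradient. When the order is wrong, the loss grows with the size of the violation.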
+## Model Training
+The model training implementation can be found in the [Readability Experiments repo](https://gitlab.wikimedia.org/repos/research/readability-experiments).
+## Usage example
+```python
+import torch
+import torch.nn as nn
+from huggingface_hub import PyTorchModelHubMixin
+from transformers import AutoModel, AutoTokenizer
+# Define the model: a pretrained encoder followed by dropout and a linear scoring head.
+BASE_MODEL = "Peltarion/xlm-roberta-longformer-base-4096"
+class ReadabilityModel(nn.Module, PyTorchModelHubMixin):
+    def __init__(self, model_name=BASE_MODEL):
+        super().__init__()
+        self.model = AutoModel.from_pretrained(model_name)
+        self.drop = nn.Dropout(p=0.2)
+        self.fc = nn.Linear(768, 1)  # 768 = hidden size of the base encoder
+    def forward(self, ids, mask):
+        out = self.model(input_ids=ids, attention_mask=mask,
+                         output_hidden_states=False)
+        out = self.drop(out[1])  # out[1] is the pooled output
+        outputs = self.fc(out)  # shape: (batch_size, 1)
+        return outputs
+# Load the model:
+model = ReadabilityModel.from_pretrained("trokhymovych/TRank_readability")
+# Load the tokenizer:
+tokenizer = AutoTokenizer.from_pretrained("trokhymovych/TRank_readability")
+# Set the model to evaluation mode
+model.eval()
+# Example input text
+input_text = "This is an example sentence to evaluate readability."
+# Tokenize the input text (this example truncates to 512 tokens)
+inputs = tokenizer(
+    input_text,
+    max_length=512,
+    truncation=True,
+    padding='max_length',
+    return_tensors='pt'
+)
+ids = inputs['input_ids']
+mask = inputs['attention_mask']
+# Make prediction
+with torch.no_grad():
+    outputs = model(ids, mask)
+    readability_score = outputs.item()
+# Print the input text and the readability score
+print(f"Input Text: {input_text}")
+print(f"Readability Score: {readability_score}")
+```
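To compare several texts, the single-text scoring above can be wrapped in a small ranking helper. The sketch below is illustrative only: `rank_texts` and `score_fn` are hypothetical names, and `fake_score` is a dummy scorer so the snippet runs without downloading the model. In practice `score_fn` would tokenize a text and return `model(ids, mask).item()` as in the usage example. This card does not state which end of the score scale corresponds to harder text, so it is worth checking the direction on texts of known difficulty before relying on the order.

```python
from typing import Callable, List, Tuple


def rank_texts(texts: List[str],
               score_fn: Callable[[str], float],
               descending: bool = True) -> List[Tuple[str, float]]:
    """Score each text once and return (text, score) pairs sorted by score."""
    scored = [(text, score_fn(text)) for text in texts]
    return sorted(scored, key=lambda pair: pair[1], reverse=descending)


def fake_score(text: str) -> float:
    """Dummy scorer for demonstration only: number of words in the text."""
    return float(len(text.split()))


texts = [
    "Cats sleep a lot.",
    "The mitochondrion is the organelle responsible for cellular respiration.",
    "Rain falls from clouds in the sky.",
]
ranked = rank_texts(texts, fake_score)
for text, score in ranked:
    print(f"{score:6.2f}  {text}")
```

Passing the scorer in as a function keeps the ranking logic independent of the model, so the same helper works with batched or cached scoring.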
+## Citation
+Preprint:
+```
+@misc{trokhymovych2024openmultilingualscoringreadability,
+      title={An Open Multilingual System for Scoring Readability of Wikipedia},
+      author={Mykola Trokhymovych and Indira Sen and Martin Gerlach},
+      year={2024},
+      eprint={2406.01835},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2406.01835},
+}
+```