PleIAs/Segmentext

**Estienne** is a text-segmentation model built on DeBERTa.
In contrast with most text-segmentation approaches, Estienne is based on token classification: editorial structures are identified in much the same way as entities in named-entity recognition.
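To make the token-classification framing concrete, here is a minimal sketch of how per-token labels can be merged into contiguous segments, much as NER spans are aggregated. The label names (`title`, `paragraph`, `footnote`), the sample tokens, and the helper `group_segments` are illustrative assumptions, not Estienne's actual label set or post-processing code.

```python
from itertools import groupby

# Hypothetical per-token predictions, as (token, label) pairs such as a
# token-classification model might emit. The labels are assumptions made
# for illustration, not Estienne's real label set.
predictions = [
    ("Chapter", "title"),
    ("One", "title"),
    ("It", "paragraph"),
    ("was", "paragraph"),
    ("late", "paragraph"),
    (".", "paragraph"),
    ("1", "footnote"),
    ("See", "footnote"),
    ("above", "footnote"),
]


def group_segments(token_labels):
    """Merge consecutive tokens that share a label into one segment."""
    segments = []
    for label, run in groupby(token_labels, key=lambda pair: pair[1]):
        # Join with spaces for simplicity; a real tokenizer would need
        # proper detokenization to restore the original text.
        text = " ".join(token for token, _ in run)
        segments.append({"label": label, "text": text})
    return segments


segments = group_segments(predictions)
for segment in segments:
    print(f"{segment['label']}: {segment['text']}")
```

Grouping by runs of identical labels is the simplest possible aggregation; a scheme with explicit begin/inside tags would be needed to separate two adjacent segments of the same type.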
Estienne was trained on 2,000 examples of manually annotated text, excerpted at random from three very large datasets collected by Pleias: Common Corpus (cultural heritage texts in the public domain), Marianne-OpenData (French and English administrative documents) and OpenScientificPile (scientific publications under free licenses, indexed on OpenAlex). Given the diversity of this corpus, Estienne should work on a wide range of document formats in European languages.
Estienne supports the following segmentations:
The model is named after the humanist Henri Estienne, who introduced many text-segmentation practices still in use in scholarly editions today.