Create README.md
README.md
ADDED
@@ -0,0 +1,9 @@
**Estienne** is a text-segmentation model built on DeBERTa.
In contrast with most text-segmentation approaches, Estienne is based on token classification: editorial structures are identified in the same way as entities in named-entity recognition.
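To illustrate the token-classification framing, here is a minimal sketch of how per-token labels can be merged into segments, much as NER spans are assembled. It does not call the model. The helper `group_segments` and the labels `title` and `paragraph` are hypothetical and chosen for illustration; they are not Estienne's actual label set or API.

```python
from itertools import groupby


def group_segments(tokens, labels):
    """Merge consecutive tokens that share a label into (label, text) segments."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    segments = []
    # groupby yields one run per stretch of identical consecutive labels.
    for label, run in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        text = " ".join(token for token, _ in run)
        segments.append((label, text))
    return segments


# Hypothetical per-token predictions, for illustration only.
tokens = ["Chapter", "One", "It", "was", "a", "dark", "night", "."]
labels = ["title", "title"] + ["paragraph"] * 6

segments = group_segments(tokens, labels)
for label, text in segments:
    print(f"{label}: {text}")
```

This simple grouping merges two adjacent segments of the same type into one. NER-style schemes usually avoid that with begin/inside prefixes on the labels; whether Estienne uses such prefixes is not stated here.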
Estienne was trained on 2,000 manually annotated text samples, excerpted at random from three very large datasets collected by Pleias: Common Corpus (cultural heritage texts in the public domain), Marianne-OpenData (French/English administrative documents) and OpenScientificPile (scientific publications under free licenses, indexed on OpenAlex). Given the diversity of the corpus, Estienne should work on diverse document formats in European languages.
Estienne supports the following segmentations:
The model is named after the humanist Henri Estienne, who introduced many text-segmentation practices still in use in scholarly editing today.