Update README.md

ItLit800 is a spaCy model for Named Entity Recognition (NER) trained on 50 Italian novels from the XIX and early XX centuries. The learning scheme can be classified as "distant supervision" because the training data are labelled automatically by rules. The training set consists of 95,000 sentences extracted from those novels. The model has been trained to recognize 26 types of entities (LABELS) relevant for characterizing the major dimensions of a novel (characters, places, timeframe and the cultural/historical framework).

Although the model has been trained on XIX and early XX century Italian literature, it can also be used to analyse contemporary literary or conversational text. Classical Italian novels contain a huge variety of expressions, not only for honorific titles or religious entities but also for time or natural entities. Keep in mind that the stock spaCy NER module for Italian is very limited (for example, it does not consider "timeframe" entities). On the other hand, most of the novels (with the notable exception of Svevo) do not mention companies and make almost no use of acronyms, so they are much less useful for training a model to identify contemporary political or financial entities. For this reason, we did not remove the original spaCy NER component from the pipeline; we simply placed it after our EntityRuler component. This means that if our EntityRuler fails to recognize a company name, say "Google", the spaCy NER might still classify it as either ORG or MISC.
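The precedence between the two components can be pictured with a small, library-free sketch. It is only a simplified model of the behaviour described above (rule-based spans are kept, the statistical NER fills in whatever is left), not spaCy's internal implementation; the function name `merge_with_fallback` and the example spans are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def merge_with_fallback(rule_spans: List[Span], fallback_spans: List[Span]) -> List[Span]:
    """Keep every rule-based span; add a fallback span only if it overlaps none of them."""
    merged = list(rule_spans)
    for start, end, label in fallback_spans:
        overlaps = any(start < r_end and r_start < end for r_start, r_end, _ in rule_spans)
        if not overlaps:
            merged.append((start, end, label))
    return sorted(merged)


text = "Il conte Andrea Sperelli lavora per Google"

# Hypothetical output of the rule-based step (EntityRuler).
rule_spans = [(3, 24, "HON")]
# Hypothetical output of the residual statistical NER.
fallback_spans = [(9, 24, "PER"), (36, 42, "ORG")]

result = merge_with_fallback(rule_spans, fallback_spans)
for start, end, label in result:
    print(label, repr(text[start:end]))
```

In this toy example the PER guess for "Andrea Sperelli" is dropped because the rule-based HON span already covers it, while "Google", which no rule matched, keeps the ORG label from the fallback.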

PER, LOC, ORG and MISC are the original labels of the spaCy NER module. We keep them as a sort of "residual" classifier for tokens that represent a named entity but have not been captured by our rules. In general, the name of each label reflects the "rule" used to discover it. Below we describe the most important ones. If you are interested in more details, please write to the author. Keep in mind, however, that the labelling is a work in progress and the names or the aggregation criteria could change in future releases.

HON: the character is introduced by an honorific title (e.g. "il conte Andrea Sperelli", Il Piacere, G. d'Annunzio);
ANTRO: the character is an animal (e.g. "il Pesce-cane", Pinocchio, Collodi) or a fictional figure (e.g. "la Fata", Pinocchio, Collodi);
NAME: the character is mentioned only by name;
POI: a point of interest, i.e. a location introduced by a specific toponymic formula (e.g. Via Roma or piazza Cavour);
GEO: a geographical entity (e.g. il fiume Po, il lago di Garda);
DIST: a combination of GPE, POI or GEO entities that expresses the distance between them (e.g. impiegò 7 ore di treno da Milano a Roma);
TIME: a temporal reference to a time unit shorter than 1 day (e.g. alle ore 8:30);
DATE: a temporal reference to a time unit greater than or equal to 1 day (e.g. il 25 aprile, gli anni '80);
CHR: a time span shorter than 1 day (e.g. in 5 minuti);
QTM: a time span longer than 1 day (e.g. ci volle una settimana);
DATECEL: the date of a celebration (e.g. Natale, la festa della Repubblica, la notte di S. Lorenzo);
DATERNG: a time span given as a range (e.g. regnò dal 1801 al 1819);
XORG: an organization that is not a legal entity in the way a company is (e.g. i gesuiti);
KING: characters identified through a regal title or through Roman numerals (e.g. papa Giovanni, re Luigi, Luigi XIV);
CULT: works of art identified through specific rules (e.g. il capolavoro di Dante) and n-grams identified as belonging to our cultural KB (1,000 items);
MONEY: monetary amounts (e.g. 5 luigi d'oro, 10 lire);
QNT: quantities expressed in physical units of measure (e.g. 1 ettaro, 10 chili).
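To give a feel for how a label can mirror the rule that produced it, here is a minimal sketch of rule-based labelling for HON and POI using plain regular expressions. The trigger lists, the patterns and the function `label_sentence` are hypothetical illustrations, not the rules actually used to build the model.

```python
import re

# Hypothetical trigger words; the real rule set is not published in this card.
HONORIFICS = ["conte", "contessa", "marchese", "signor", "signora", "don", "donna"]
POI_TRIGGERS = ["via", "piazza", "corso", "vicolo"]

# One or more capitalized words, e.g. "Andrea Sperelli".
NAME_SEQ = r"[A-Z]\w+(?:\s+[A-Z]\w+)*"

hon_alternatives = "|".join(HONORIFICS)
poi_alternatives = "|".join(POI_TRIGGERS)

# (?i:...) makes only the trigger word case-insensitive, so "Via" and "via" both match
# while the following name must still start with a capital letter.
PATTERNS = [
    ("HON", re.compile(rf"\b(?i:{hon_alternatives})\s+{NAME_SEQ}")),
    ("POI", re.compile(rf"\b(?i:{poi_alternatives})\s+{NAME_SEQ}")),
]


def label_sentence(sentence):
    """Return (start, end, label) character spans found by the toy rules."""
    spans = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(sentence):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


sentence = "Il conte Andrea Sperelli abitava in piazza Cavour."
spans = label_sentence(sentence)
found = [(sentence[start:end], label) for start, end, label in spans]
print(found)
```

Pairs of a sentence and its automatically produced spans are the kind of material a distant-supervision setup can then use as training examples.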

Some final words about pre-processing. A novel is made up of different types of narrative sequences (dialogues, narrations, diaries ...), and the typographical delimiters are most of the time "undirected" (for example, the long dash is used both to open and to close a dialogue), so we developed our own algorithms to sentencize the text. The text of each novel was sentencized with these algorithms, and the individual sentences were then fed to the spaCy model, first to extract the rule-based entities and then to build the training dataset. We did not remove stopwords.
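The difficulty with "undirected" delimiters can be illustrated with a naive sketch: since the long dash does not say whether it opens or closes a dialogue, the splitter has to track that state itself. The code below simply assumes that dashes strictly alternate between opening and closing; it is not the authors' algorithm, which is not described here, and the function names are hypothetical.

```python
import re
from typing import List, Tuple

DASH = "\u2014"  # long dash used as an undirected dialogue delimiter


def split_by_dash(paragraph: str) -> List[Tuple[str, str]]:
    """Label the chunks between dashes alternately as narration and dialogue."""
    chunks = []
    in_dialogue = False  # text before the first dash is treated as narration
    for piece in paragraph.split(DASH):
        piece = piece.strip()
        if piece:
            chunks.append(("dialogue" if in_dialogue else "narration", piece))
        in_dialogue = not in_dialogue  # every dash flips the state
    return chunks


def sentencize(paragraph: str) -> List[Tuple[str, str]]:
    """Split each chunk into sentences on terminal punctuation, keeping its kind."""
    sentences = []
    for kind, chunk in split_by_dash(paragraph):
        for sentence in re.split(r"(?<=[.!?])\s+", chunk):
            if sentence:
                sentences.append((kind, sentence))
    return sentences


paragraph = f"{DASH} Dove vai? {DASH} chiese la Fata. {DASH} A casa."
sentences = sentencize(paragraph)
for kind, sentence in sentences:
    print(kind, "|", sentence)
```

A real novel breaks the strict-alternation assumption often (a dialogue that ends with the paragraph, for instance, has no closing dash), which is one reason a custom segmentation step is needed.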

Files changed (1)

README.md +1 -0

@@ -2,6 +2,7 @@
 tags:
 - spacy
 - token-classification
+- ner
 language:
 - it
 model-index: