lirondos committed on
Commit
0c57318
·
1 Parent(s): 3fc6b81

Update README.md

Files changed (1)
  1. README.md +40 -0
README.md CHANGED
@@ -58,6 +58,46 @@ The model uses BIO encoding to account for multitoken borrowings.
58
  - [Observatory of anglicism usage in the Spanish press](https://observatoriolazaro.es/)
59
  - [pylazaro Python library](https://pylazaro.readthedocs.io/)
60
 
61
+
62
+ ## Metrics (on the test set)
63
+ The following table summarizes the results obtained by this model on the test set of the [COALAS](https://github.com/lirondos/coalas/) corpus.
64
+
65
+ | LABEL | Precision | Recall | F1 |
66
+ |:-------|-----:|-----:|---------:|
67
+ | ALL |85.03 |81.32 | 83.13 |
68
+ | ENG | 85.25 | 83.94 | 84.59 |
69
+ | OTHER | 55.56 | 10.87 | 18.18 |
70
+
71
+
72
+
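A quick sanity check on the table added above: F1 is the harmonic mean of precision and recall, so the last column can be recomputed from the other two. The sketch below does only that arithmetic on the published, rounded values; the helper name `f1_score` is illustrative and not part of the model's evaluation code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the Metrics table.
reported = {
    "ALL": (85.03, 81.32, 83.13),
    "ENG": (85.25, 83.94, 84.59),
    "OTHER": (55.56, 10.87, 18.18),
}

for label, (precision, recall, f1) in reported.items():
    recomputed = f1_score(precision, recall)
    print(f"{label}: reported {f1:.2f}, recomputed {recomputed:.2f}")
```

The recomputed values agree with the reported column to two decimals, which suggests the three rows are internally consistent.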
73
+ ## Dataset
74
+ This model was trained on [COALAS](https://github.com/lirondos/coalas/), a corpus of Spanish newswire annotated with unassimilated lexical borrowings. The corpus contains roughly 370,000 tokens drawn from a variety of written media in European Spanish. The test set was designed to be as difficult as possible: it covers sources and dates not seen in the training set, includes a high proportion of out-of-vocabulary (OOV) borrowings (92% of the borrowings in the test set are OOV), and is very borrowing-dense (20 borrowings per 1,000 tokens).
75
+
76
+ |Set | Tokens | ENG | OTHER | Unique |
77
+ |:-------|-----:|-----:|---------:|---------:|
78
+ |Training |231,126 |1,493 | 28 |380 |
79
+ |Development |82,578 |306 |49 |316|
80
+ |Test |58,997 |1,239 |46 |987|
81
+ |**Total** |372,701 |3,038 |123 |1,683 |
82
+
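The split sizes in the table above can be cross-checked with a few lines of arithmetic. The sketch below sums the rows and derives a borrowing density per 1,000 tokens for each split. It assumes the ENG and OTHER columns count borrowing spans, which the table itself does not state, and the helper name `density_per_1000` is illustrative.

```python
# Rows copied from the Dataset table: (tokens, ENG, OTHER).
splits = {
    "Training": (231_126, 1_493, 28),
    "Development": (82_578, 306, 49),
    "Test": (58_997, 1_239, 46),
}


def density_per_1000(tokens: int, eng: int, other: int) -> float:
    """Borrowings per 1,000 tokens, assuming ENG and OTHER are span counts."""
    return 1000 * (eng + other) / tokens


total_tokens = sum(tokens for tokens, _, _ in splits.values())
total_eng = sum(eng for _, eng, _ in splits.values())
total_other = sum(other for _, _, other in splits.values())
print(f"Totals: {total_tokens:,} tokens, {total_eng:,} ENG, {total_other:,} OTHER")

for name, (tokens, eng, other) in splits.items():
    print(f"{name}: {density_per_1000(tokens, eng, other):.1f} borrowings per 1,000 tokens")
```

Under that assumption the sums match the **Total** row, and the test split comes out at about 22 borrowings per 1,000 tokens, in line with the approximate figure quoted in the paragraph and several times denser than the training split.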
83
+ ## More info
84
+ More information about the dataset, model experimentation and error analysis can be found in the paper: *[Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling](https://aclanthology.org/2022.acl-long.268/)*.
85
+
86
+ ## How to use
87
+ ```python
88
+ from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
89
+
90
+ tokenizer = AutoTokenizer.from_pretrained("lirondos/anglicisms-spanish-beto")
91
+ model = AutoModelForTokenClassification.from_pretrained("lirondos/anglicisms-spanish-beto")
92
+ nlp = pipeline("ner", model=model, tokenizer=tokenizer)
93
+
94
+ example = "Buscamos data scientist para proyecto de machine learning."
95
+
96
+ borrowings = nlp(example)
97
+ print(borrowings)
98
+
99
+ ```
100
+
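Because the model uses BIO encoding, the `ner` pipeline in the snippet above returns one prediction per token rather than one per borrowing. A minimal sketch of grouping those predictions into multitoken spans follows. It assumes the label names are `B-ENG`, `I-ENG`, `B-OTHER` and `I-OTHER` (inferred from the BIO scheme and the ENG/OTHER classes, not confirmed by the README), that each prediction carries the pipeline's usual `entity`, `index`, `start` and `end` keys, and that word-internal continuation pieces are tagged `I-`. The function name `group_bio_spans` is hypothetical, and the sample predictions are hand-written for illustration, not real model output.

```python
def group_bio_spans(text, predictions):
    """Merge token-level BIO predictions into (span_text, label) pairs."""
    spans = []
    current = None
    for pred in predictions:
        prefix, _, label = pred["entity"].partition("-")
        # An I- tag extends the open span only if the label matches
        # and the token directly follows the previous one.
        extends_current = (
            current is not None
            and prefix == "I"
            and label == current["label"]
            and pred["index"] == current["last_index"] + 1
        )
        if extends_current:
            current["end"] = pred["end"]
            current["last_index"] = pred["index"]
        else:
            if current is not None:
                spans.append(current)
            current = {
                "label": label,
                "start": pred["start"],
                "end": pred["end"],
                "last_index": pred["index"],
            }
    if current is not None:
        spans.append(current)
    return [(text[span["start"]:span["end"]], span["label"]) for span in spans]


example = "Buscamos data scientist para proyecto de machine learning."

# Hand-written predictions in the pipeline's output shape (illustrative only).
sample_predictions = [
    {"entity": "B-ENG", "index": 2, "start": 9, "end": 13},
    {"entity": "I-ENG", "index": 3, "start": 14, "end": 23},
    {"entity": "B-ENG", "index": 7, "start": 41, "end": 48},
    {"entity": "I-ENG", "index": 8, "start": 49, "end": 57},
]

print(group_bio_spans(example, sample_predictions))
```

The `transformers` pipeline also accepts an `aggregation_strategy` argument that performs a similar grouping internally; the manual version here just makes the BIO logic explicit.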
101
  ## Citation
102
 
103
  <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->