Update README.md
README.md
@@ -58,6 +58,46 @@ The model uses BIO encoding to account for multitoken borrowings.
- [Observatory of anglicism usage in the Spanish press](https://observatoriolazaro.es/)
- [pylazaro Python library](https://pylazaro.readthedocs.io/)

## Metrics (on the test set)
The following table summarizes the results obtained by this model on the test set of the [COALAS](https://github.com/lirondos/coalas/) corpus.

| LABEL | Precision | Recall | F1 |
|:-------|-----:|-----:|---------:|
| ALL | 85.03 | 81.32 | 83.13 |
| ENG | 85.25 | 83.94 | 84.59 |
| OTHER | 55.56 | 10.87 | 18.18 |

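The F1 column is the harmonic mean of precision and recall. As a quick sanity check, the sketch below recomputes it from the precision and recall values reported in the table. The helper name `f1_score` is illustrative and is not taken from the model's own evaluation code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as percentages)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the metrics table.
reported = {
    "ALL": (85.03, 81.32),
    "ENG": (85.25, 83.94),
    "OTHER": (55.56, 10.87),
}

for label, (p, r) in reported.items():
    print(f"{label}: F1 = {f1_score(p, r):.2f}")
```

The recomputed values agree with the F1 column. The low OTHER score is driven almost entirely by its recall of 10.87.
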
## Dataset
This model was trained on [COALAS](https://github.com/lirondos/coalas/), a corpus of Spanish newswire annotated with unassimilated lexical borrowings. The corpus contains roughly 370,000 tokens drawn from a range of written media in European Spanish. The test set was designed to be as difficult as possible: it covers sources and dates not seen in the training set, contains a high proportion of out-of-vocabulary (OOV) words (92% of the borrowings in the test set are OOV), and is very borrowing-dense (20 borrowings per 1,000 tokens).

| Set | Tokens | ENG | OTHER | Unique |
|:-------|-----:|-----:|---------:|---------:|
| Training | 231,126 | 1,493 | 28 | 380 |
| Development | 82,578 | 306 | 49 | 316 |
| Test | 58,997 | 1,239 | 46 | 987 |
| **Total** | 372,701 | 3,038 | 123 | 1,683 |

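The borrowing density mentioned above can be recomputed from the table. This is a minimal sketch, assuming density means ENG plus OTHER borrowings per 1,000 tokens; the `density_per_1000` helper and the `splits` dictionary are illustrative names, not part of the corpus tooling.

```python
# Counts copied from the dataset table: (tokens, ENG borrowings, OTHER borrowings).
splits = {
    "Training": (231_126, 1_493, 28),
    "Development": (82_578, 306, 49),
    "Test": (58_997, 1_239, 46),
}


def density_per_1000(tokens: int, eng: int, other: int) -> float:
    """Borrowings (ENG + OTHER) per 1,000 tokens."""
    return 1000 * (eng + other) / tokens


for name, (tokens, eng, other) in splits.items():
    density = density_per_1000(tokens, eng, other)
    print(f"{name}: {density:.1f} borrowings per 1,000 tokens")

# The per-split token counts should add up to the Total row.
total_tokens = sum(tokens for tokens, _, _ in splits.values())
print(f"Total tokens: {total_tokens:,}")
```

Under this assumption, the test split comes out noticeably denser than the training and development splits, which is consistent with the description of the test set as borrowing-dense.
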
## More info
More information about the dataset, model experimentation and error analysis can be found in the paper: *[Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling](https://aclanthology.org/2022.acl-long.268/)*.
## How to use
```python
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# The tokenizer and the model must come from the same checkpoint.
# The original snippet loaded the tokenizer from "lirondos/anglicisms-spanish-mbert"
# and the model from "lirondos/anglicisms-spanish-beto"; the first name is assumed here.
# Use the "-beto" name for both if this card documents the BETO model.
checkpoint = "lirondos/anglicisms-spanish-mbert"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint)
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

example = "Buscamos data scientist para proyecto de machine learning."

# Each prediction is a dict with the token, its predicted tag and a confidence score.
borrowings = nlp(example)
print(borrowings)
```
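Because the model uses BIO encoding, a multitoken borrowing such as *machine learning* is tagged as one `B-` token followed by `I-` tokens. The sketch below shows one way to merge such tags into spans. It works on plain `(token, tag)` pairs rather than on the pipeline output, the tags in `tagged` are hand-written for illustration and are not real model predictions, and the tag strings `B-ENG` and `I-ENG` and the function `merge_bio` are assumptions; check the model's `id2label` config for the actual label set.

```python
from typing import List, Optional, Tuple


def merge_bio(tagged: List[Tuple[str, str]]) -> List[Tuple[str, Optional[str]]]:
    """Merge (token, BIO tag) pairs into (span text, label) pairs."""
    spans = []
    tokens: List[str] = []
    label = None
    # A trailing "O" sentinel closes any span still open at the end of the input.
    for token, tag in list(tagged) + [("", "O")]:
        if tag.startswith("I-") and tag[2:] == label:
            # Continuation of the currently open span.
            tokens.append(token)
            continue
        if tokens:
            # Any other tag closes the open span.
            spans.append((" ".join(tokens), label))
        if tag.startswith("B-"):
            tokens, label = [token], tag[2:]
        else:
            # "O", or an I- tag with no matching open span, is dropped.
            tokens, label = [], None
    return spans


# Hand-written tags for the example sentence (illustrative only).
tagged = [
    ("Buscamos", "O"), ("data", "B-ENG"), ("scientist", "I-ENG"),
    ("para", "O"), ("proyecto", "O"), ("de", "O"),
    ("machine", "B-ENG"), ("learning", "I-ENG"), (".", "O"),
]
print(merge_bio(tagged))
```

In practice the `transformers` token-classification pipeline can do a similar grouping itself through its `aggregation_strategy` argument (for example `aggregation_strategy="simple"`), which also takes care of rejoining subword pieces.
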
## Citation
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->