Dan Velasco
committed on
Commit · 0ee58ce
1 Parent(s): e3329ba
Create README.md
README.md ADDED
@@ -0,0 +1,46 @@
---
language: tl
tags:
- roberta
- tagalog
- filipino
license: cc-by-sa-4.0
inference: false
---

# RoBERTa Tagalog Base (finetuned on COHFIE)
We finetuned [RoBERTa Tagalog Base](https://huggingface.co/jcblaise/roberta-tagalog-base) on a subset of the Corpus of Historical Filipino and Philippine English (COHFIE), which contains pure Filipino and code-switching (i.e., Tagalog-English) sentences. All model details, training setups, and corpus details can be found in this paper: [Automatic WordNet Construction using Word Sense Induction through Sentence Embeddings](https://arxiv.org/abs/2204.03251).

## Training Data
Sorry, the corpus is not publicly available yet. Stay tuned!

## Intended uses & limitations
This model adapts the pre-trained RoBERTa Tagalog Base model to our target corpus, COHFIE, so that its sentence embeddings can be used to cluster sentences. You can also finetune it on downstream NLP tasks in Filipino. It may not be safe for production use because we did not examine it for biases. Please use it with caution.

## How to use
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the encoder (no task-specific head) from the Hub
tokenizer = AutoTokenizer.from_pretrained("danjohnvelasco/roberta-tagalog-base-cohfie-v1")
model = AutoModel.from_pretrained("danjohnvelasco/roberta-tagalog-base-cohfie-v1")

# Tokenize the input and return PyTorch tensors
text = "Replace me with any text in Filipino."
encoded_input = tokenizer(text, return_tensors='pt')

# output.last_hidden_state holds one feature vector per token
output = model(**encoded_input)
```
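
The `output` above contains one feature vector per token, while clustering sentences usually needs a single vector per sentence. The following is a minimal sketch of one common option, mean pooling that ignores padding tokens. It is an assumption for illustration and not necessarily the pooling used in the paper. The helper names `mean_pool` and `embed_sentences` are hypothetical.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sentence, ignoring padding positions."""
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over the hidden dimension
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Clamp to avoid division by zero for a row that is entirely padding
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts


def embed_sentences(texts, model_name="danjohnvelasco/roberta-tagalog-base-cohfie-v1"):
    """Hypothetical helper: return one embedding per input sentence."""
    # Imported here so mean_pool stays usable without transformers installed
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout for deterministic features
    encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        output = model(**encoded)
    return mean_pool(output.last_hidden_state, encoded["attention_mask"])
```

Calling `embed_sentences(["Magandang umaga.", "Good morning po."])` would return a tensor of shape `(2, hidden_size)`, which can then be passed to a clustering algorithm of your choice.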

## BibTeX entry and citation info
If you use this model, please cite our work:

```
@misc{https://doi.org/10.48550/arxiv.2204.03251,
  doi = {10.48550/ARXIV.2204.03251},
  url = {https://arxiv.org/abs/2204.03251},
  author = {Velasco, Dan John and Alba, Axel and Pelagio, Trisha Gail and Ramirez, Bryce Anthony and Cruz, Jan Christian Blaise and Cheng, Charibeth},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Automatic WordNet Construction using Word Sense Induction through Sentence Embeddings},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
```