fathan committed
Commit 106f023 · 1 Parent(s): a48feed

Update README.md

Files changed (1): README.md +53 -21
README.md CHANGED
@@ -1,34 +1,66 @@
---
tags:
- generated_from_trainer
- metrics:
- - accuracy
model-index:
- name: code_mixed_ijeroberta
  results: []
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

- # code_mixed_ijeroberta
-
- This model is a fine-tuned version of [](https://huggingface.co/) on an unknown dataset.
- It achieves the following results on the evaluation set:
- - Loss: 3.1174
- - Accuracy: 0.4740
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed

## Training procedure

@@ -52,4 +84,4 @@ The following hyperparameters were used during training:
- Transformers 4.26.0
- Pytorch 1.12.0+cu102
- Datasets 2.9.0
- - Tokenizers 0.12.1
 
---
tags:
- generated_from_trainer
model-index:
- name: code_mixed_ijeroberta
  results: []
+ language:
+ - id
+ - jv
+ - en
+ pipeline_tag: fill-mask
+ widget:
+ - text: biasane nek arep [MASK] file bs pake software ini
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

+ # Code-mixed IJERoBERTa
+
+ Code-mixed IJERoBERTa is a pre-trained masked language model for code-mixed Indonesian-Javanese-English tweet data.
+ It is based on the [BERT](https://arxiv.org/abs/1810.04805) architecture and trained using
+ Hugging Face's [Transformers](https://huggingface.co/transformers) library.
+
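+ Since the card sets `pipeline_tag: fill-mask`, the model can be queried with the Transformers
+ fill-mask pipeline. A minimal sketch (the repo id `fathan/code_mixed_ijeroberta` is an
+ assumption, not confirmed by this card):
+
+ ```python
+ from transformers import pipeline
+
+ # Hypothetical repo id -- replace with the model's actual path on the Hub.
+ fill_mask = pipeline("fill-mask", model="fathan/code_mixed_ijeroberta")
+
+ # The widget example above, roughly: "usually if you want to
+ # [MASK] a file you can use this software".
+ for pred in fill_mask("biasane nek arep [MASK] file bs pake software ini"):
+     print(pred["token_str"], round(pred["score"], 4))
+ ```
+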
+ ## Pre-training Data
+ The Twitter data was collected from January 2022 to January 2023 using 8,698 random keyword phrases.
+ To make sure the retrieved data are code-mixed, we used keyword phrases that mix Indonesian, Javanese, and English words.
+ A few examples of the keyword phrases (with English glosses):
+ - travelling terus ("always travelling")
+ - proud koncoku ("proud of my friends")
+ - great kalian semua ("great, all of you")
+ - chattingane ilang ("the chats are gone")
+ - baru aja launching ("just launched")
+
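+ The collection code is not published with this card. Purely as an illustration, a keyword-phrase
+ search against the Twitter API v2 full-archive endpoint might look like the sketch below
+ (`tweepy`, the bearer token, and the query form are assumptions; full-archive search requires
+ academic access):
+
+ ```python
+ import tweepy
+
+ client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")
+
+ # One of the 8,698 keyword phrases; "in" is Twitter's language code for Indonesian.
+ query = '"proud koncoku" lang:in'
+ for page in tweepy.Paginator(client.search_all_tweets, query=query,
+                              start_time="2022-01-01T00:00:00Z",
+                              end_time="2023-01-31T00:00:00Z",
+                              max_results=500):
+     for tweet in page.data or []:
+         print(tweet.text)
+ ```
+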
+ We acquired 40,788,384 raw tweets and applied a first stage of pre-processing:
+ - remove duplicate tweets,
+ - remove tweets with a token length of less than 5,
+ - collapse multiple spaces,
+ - convert emoticons,
+ - convert all tweets to lower case.
+
+ After the first stage, 17,385,773 tweets remained.
+ The second stage consists of the following tasks:
+ - split the tweets into sentences,
+ - remove sentences with a token length of less than 4,
+ - convert `@username` mentions to `@USER`,
+ - convert URLs to `HTTPURL`.
+
+ This leaves 28,121,693 sentences for the training process; a rough sketch of these normalization steps is given below.
+ The pre-training data will not be released to the public due to Twitter's data policy.
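+
+ The exact pre-processing code is not released with this card. The following is a minimal
+ reconstruction of the normalization steps listed above, assuming a simple regex-based
+ implementation (function names are illustrative; emoticon conversion and sentence
+ splitting are omitted):
+
+ ```python
+ import re
+
+ def normalize_tweet(text: str) -> str:
+     """Normalization steps described in the two pre-processing stages above."""
+     text = text.lower()                              # convert to lower case
+     text = re.sub(r"\s+", " ", text).strip()         # collapse multiple spaces
+     text = re.sub(r"@\w+", "@USER", text)            # '@username' -> '@USER'
+     text = re.sub(r"https?://\S+", "HTTPURL", text)  # URL -> HTTPURL
+     return text
+
+ def keep_sentence(sentence: str, min_tokens: int = 4) -> bool:
+     """Second stage: drop sentences with a token length of less than 4."""
+     return len(sentence.split()) >= min_tokens
+
+ print(normalize_tweet("Cek @someone  https://example.com dulu"))
+ # -> "cek @USER HTTPURL dulu"
+ ```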
+
+ ## Model
+ | Model name              | Architecture | Size of training data | Size of validation data |
+ |-------------------------|--------------|-----------------------|-------------------------|
+ | `code-mixed-ijeroberta` | BERT         | 2.24 GB of text       | 249 MB of text          |
+
+ ## Evaluation Results
+ We trained the model for 3 epochs (296,598 steps in total), which took 12 days.
+ The following results were obtained from training:
+
+ | Train loss | Eval. loss | Eval. perplexity |
+ |------------|------------|------------------|
+ | 3.586      | 3.1174     | 22.5867          |
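+
+ As expected for a masked language model evaluated with cross-entropy loss, the eval
+ perplexity is the exponential of the eval loss: exp(3.1174) ≈ 22.59.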

## Training procedure

…

- Transformers 4.26.0
- Pytorch 1.12.0+cu102
- Datasets 2.9.0
+ - Tokenizers 0.12.1