## Training data

2.5 billion tweets with 56 billion subwords in 66 languages (as identified in Twitter metadata). The tweets were collected from the 1% public Twitter stream between January 2016 and December 2021. See the [Bernice pretrain dataset](https://huggingface.co/datasets/jhu-clsp/bernice-pretrain-data) for details.

## Training procedure

RoBERTa pre-training (i.e., masked language modeling) with a BERT-base architecture.

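Masked language modeling hides a random subset of input tokens and trains the model to recover them. A minimal sketch of RoBERTa-style dynamic masking, assuming the standard 15% selection rate and the 80/10/10 replacement rule (the mask-token id and vocabulary size here are illustrative, not Bernice's actual values):

```python
import random

MASK_ID = 4        # illustrative id for the <mask> token
VOCAB_SIZE = 1000  # illustrative vocabulary size

def dynamic_mask(token_ids, mask_prob=0.15, rng=random):
    """Return (inputs, labels) for one masked-LM training step.

    Each token is selected with probability `mask_prob`. Of the selected
    tokens, 80% become the mask token, 10% a random token, and 10% stay
    unchanged. Labels hold the original id at selected positions and -100
    elsewhere (the value cross-entropy losses commonly ignore).
    """
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            labels.append(tok)
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK_ID)                 # mask
            elif r < 0.9:
                inputs.append(rng.randrange(VOCAB_SIZE))  # random token
            else:
                inputs.append(tok)                     # keep as-is
        else:
            labels.append(-100)
            inputs.append(tok)
    return inputs, labels

inputs, labels = dynamic_mask(list(range(100, 120)))
```

"Dynamic" here means the mask is resampled every time a sequence is seen (as in RoBERTa), rather than fixed once at preprocessing time as in the original BERT.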
## Evaluation results

We evaluated Bernice on three Twitter benchmarks: [TweetEval](https://aclanthology.org/2020.findings-emnlp.148/), [Unified Multilingual Sentiment Analysis Benchmark (UMSAB)](https://aclanthology.org/2022.lrec-1.27/), and [Multilingual Hate Speech](https://link.springer.com/chapter/10.1007/978-3-030-67670-4_26). Summary results are shown below; see the paper appendix for details.

| | **Bernice** | **BERTweet** | **XLM-R** | **XLM-T** | **TwHIN-BERT-MLM** | **TwHIN-BERT** |
|---------|-------------|--------------|-----------|-----------|--------------------|----------------|
| TweetEval | 64.80 | **67.90** | 57.60 | 64.40 | 64.80 | 63.10 |
| UMSAB | **70.34** | - | 67.71 | 66.74 | 68.10 | 67.53 |
| Hate Speech | **76.20** | - | 74.54 | 73.31 | 73.41 | 74.32 |

# How to use

You can use this model for tweet representation. To use it with the Hugging Face PyTorch interface:

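A minimal sketch of extracting tweet embeddings with Transformers. The Hub model id `jhu-clsp/bernice` and the masked mean pooling over `last_hidden_state` are assumptions here, not prescribed by the card; any pooling over the encoder states yields a usable representation:

```python
# Sketch: tweet embeddings via mean pooling over the encoder's hidden states.
# Assumption: the model is published as "jhu-clsp/bernice" on the HF Hub.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/bernice")
model = AutoModel.from_pretrained("jhu-clsp/bernice")

tweets = ["This new model is great!", "Los datos son fantásticos."]
batch = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden_size)

# Average over real (non-padding) tokens only, using the attention mask.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, hidden_size)
```

The attention-mask weighting matters: with padded batches, a plain `.mean(dim=1)` would average padding vectors into short tweets' embeddings.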
# Limitations and bias

**Presence of Hate Speech:** As with all social media data, there is spam and hate speech. We cleaned our data by filtering for tweet length, but the possibility of spam remains. Hate speech is difficult to detect, especially across languages and cultures; thus, we leave its removal for future work.

**Low-resource Language Evaluation:** Even with language sampling during training, Bernice is not exposed to the same variety of examples in low-resource languages as in high-resource languages like English and Spanish. It is unclear whether enough Twitter data exists in some of these languages, such as Tibetan and Telugu, to ever match performance on high-resource languages. Only models that generalize more efficiently can pave the way for better performance across the wide variety of languages in this low-resource category.

See the paper for a more detailed discussion.

## BibTeX entry and citation info

```