Create README.md
# Lb_mBERT

Lb_mBERT is a BERT-like language model for the Luxembourgish language.

We used the weights of the multilingual BERT (mBERT) language model as a starting point and continued pre-training it on the masked language modeling (MLM) task, using the same corpus that we used for our LuxemBERT model (https://huggingface.co/lothritz/LuxemBERT).
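
For readers unfamiliar with the MLM objective mentioned above, the following is a minimal sketch of how training inputs are typically corrupted. It assumes the standard BERT recipe (select 15% of positions; of those, 80% become `[MASK]`, 10% become a random token, 10% stay unchanged); this README does not state the exact settings used for Lb_mBERT. The names `mask_tokens` and `MASK_TOKEN` are hypothetical, and real pre-training operates on WordPiece IDs rather than whitespace-split words.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical constant for this sketch


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Corrupt a token list for MLM using the standard BERT 80/10/10 rule.

    Returns (masked_tokens, labels). labels[i] is the original token at
    positions the model must predict, and None everywhere else.
    The 15% / 80-10-10 values are assumed defaults, not confirmed settings.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN  # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


# Toy example on a whitespace-tokenized Luxembourgish greeting.
tokens = "Moien , wéi geet et ?".split()
vocab = sorted(set(tokens))
masked, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(masked)
print(labels)
```

During continued pre-training, the model receives the corrupted sequence and is trained to recover the original token at every position where the label is set.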

Lb_mBERT achieved higher performance on some downstream tasks than both the original LuxemBERT and DA BERT (https://huggingface.co/iolariu/DA_BERT), another Luxembourgish BERT model.

To learn more about our work or the pre-training corpus, or to use our models or datasets, please consult and cite the following papers:

```
@inproceedings{lothritz-etal-2022-luxembert,
    title = "{L}uxem{BERT}: Simple and Practical Data Augmentation in Language Model Pre-Training for {L}uxembourgish",
    author = "Lothritz, Cedric and
      Lebichot, Bertrand and
      Allix, Kevin and
      Veiber, Lisa and
      Bissyande, Tegawende and
      Klein, Jacques and
      Boytsov, Andrey and
      Lefebvre, Cl{\'e}ment and
      Goujon, Anne",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.543",
    pages = "5080--5089",
    abstract = "Pre-trained Language Models such as BERT have become ubiquitous in NLP where they have achieved state-of-the-art performance in most NLP tasks. While these models are readily available for English and other widely spoken languages, they remain scarce for low-resource languages such as Luxembourgish. In this paper, we present LuxemBERT, a BERT model for the Luxembourgish language that we create using the following approach: we augment the pre-training dataset by considering text data from a closely related language that we partially translate using a simple and straightforward method. We are then able to produce the LuxemBERT model, which we show to be effective for various NLP tasks: it outperforms a simple baseline built with the available Luxembourgish text data as well the multilingual mBERT model, which is currently the only option for transformer-based language models in Luxembourgish. Furthermore, we present datasets for various downstream NLP tasks that we created for this study and will make available to researchers on request.",
}
```

```
@inproceedings{lothritz2023comparing,
    title={Comparing Pre-Training Schemes for Luxembourgish BERT Models},
    author={Lothritz, Cedric and Ezzini, Saad and Purschke, Christoph and Bissyande, Tegawend{\'e} Fran{\c{c}}ois D Assise and Klein, Jacques and Olariu, Isabella and Boytsov, Andrey and Lefebvre, Clement and Goujon, Anne},
    booktitle={Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023)},
    year={2023}
}
```