---
license: mit
tags:
- generated_from_trainer
model-index:
- name: afro-xlmr-large-75L
  results: []
language:
- en
- am
- ar
- so
- sw
- pt
- af
- fr
- zu
- mg
- ha
- sn
- arz
- ny
- ig
- xh
- yo
- st
- rw
- tn
- ti
- ts
- om
- run
- nso
- ee
- ln
- tw
- pcm
- gaa
- loz
- lg
- guw
- bem
- efi
- lue
- lua
- toi
- ve
- tum
- tll
- iso
- kqn
- zne
- umb
- mos
- tiv
- lu
- ff
- kwy
- bci
- rnd
- luo
- wal
- ss
- lun
- wo
- nyk
- kj
- ki
- fon
---

# afro-xlmr-large-75L

AfroXLMR-large-75L was created by MLM adaptation of the XLM-R-large model on 75 languages widely spoken in Africa, including 4 high-resource languages.
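
### Example usage
A minimal usage sketch with the Hugging Face Transformers fill-mask pipeline. It assumes the model is published under the `Davlan/afro-xlmr-large-75L` repository ID implied by the model name above; the Swahili example sentence is only illustrative.

```python
from transformers import pipeline

# Repository ID assumed from the model name in this card; adjust if the model
# is hosted under a different namespace.
unmasker = pipeline("fill-mask", model="Davlan/afro-xlmr-large-75L")

# XLM-R-based models use <mask> as the mask token.
# Swahili: "The capital of Kenya is <mask>."
print(unmasker("Mji mkuu wa Kenya ni <mask>."))
```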

### Pre-training corpus
A mix of mC4, Wikipedia and OPUS data.

### Languages

There are 75 languages available:
- English (eng)
- Amharic (amh)
- Arabic (ara)
- Somali (som)
- Kiswahili (swa)
- Portuguese (por)
- Afrikaans (afr)
- French (fra)
- isiZulu (zul)
- Malagasy (mlg)
- Hausa (hau)
- chiShona (sna)
- Egyptian Arabic (arz)
- Chichewa (nya)
- Igbo (ibo)
- isiXhosa (xho)
- Yorùbá (yor)
- Sesotho (sot)
- Kinyarwanda (kin)
- Setswana (tsn)
- Tigrinya (tir)
- Tsonga (tso)
- Oromo (orm)
- Rundi (run)
- Northern Sotho (nso)
- Ewe (ewe)
- Lingala (lin)
- Twi (twi)
- Nigerian Pidgin (pcm)
- Ga (gaa)
- Lozi (loz)
- Luganda (lug)
- Gun (guw)
- Bemba (bem)
- Efik (efi)
- Luvale (lue)
- Luba-Lulua (lua)
- Tonga (toi)
- Tshivenḓa (ven)
- Tumbuka (tum)
- Tetela (tll)
- Isoko (iso)
- Kaonde (kqn)
- Zande (zne)
- Umbundu (umb)
- Mossi (mos)
- Tiv (tiv)
- Luba-Katanga (lub)
- Fula (fuv)
- San Salvador Kongo (kwy)
- Baoulé (bci)
- Ruund (rnd)
- Luo (luo)
- Wolaitta (wal)
- Swazi (ssw)
- Lunda (lun)
- Wolof (wol)
- Nyaneka (nyk)
- Kwanyama (kua)
- Kikuyu (kik)
- Fon (fon)
- Bambara (bam)
- Chokwe (cjk)
- Dinka (dik)
- Dyula (dyu)
- Kabyle (kab)
- Kamba (kam)
- Kabiyè (kbp)
- Kanuri (knc)
- Kimbundu (kmb)
- Kikongo (kon)
- Nuer (nus)
- Sango (sag)
- Tamasheq (taq)
- Tamazight (tzm)

### Acknowledgment
We would like to thank Google Cloud for giving us access to a TPU v3-8 through free cloud credits. The model was trained with Flax and then converted to PyTorch.
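
As an illustration of the Flax-to-PyTorch conversion mentioned above, the sketch below uses the `from_flax=True` option of Hugging Face Transformers; the checkpoint paths are hypothetical placeholders, not the actual training directories.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Hypothetical local directory holding the Flax checkpoint produced by training.
flax_dir = "./afro-xlmr-large-75L-flax"

# from_flax=True loads the Flax weights and converts them to PyTorch in memory
# (requires both flax and torch to be installed).
model = AutoModelForMaskedLM.from_pretrained(flax_dir, from_flax=True)
tokenizer = AutoTokenizer.from_pretrained(flax_dir)

# save_pretrained then writes standard PyTorch weights next to the tokenizer files.
model.save_pretrained("./afro-xlmr-large-75L-pytorch")
tokenizer.save_pretrained("./afro-xlmr-large-75L-pytorch")
```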

### BibTeX entry and citation info
```
@inproceedings{alabi-etal-2022-adapting,
    title = "Adapting Pre-trained Language Models to {A}frican Languages via Multilingual Adaptive Fine-Tuning",
    author = "Alabi, Jesujoba O. and
      Adelani, David Ifeoluwa and
      Mosbach, Marius and
      Klakow, Dietrich",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.382",
    pages = "4336--4349",
    abstract = "Multilingual pre-trained language models (PLMs) have demonstrated impressive performance on several downstream tasks for both high-resourced and low-resourced languages. However, there is still a large performance drop for languages unseen during pre-training, especially African languages. One of the most effective approaches to adapt to a new language is language adaptive fine-tuning (LAFT) {---} fine-tuning a multilingual PLM on monolingual texts of a language using the pre-training objective. However, adapting to target language individually takes large disk space and limits the cross-lingual transfer abilities of the resulting models because they have been specialized for a single language. In this paper, we perform multilingual adaptive fine-tuning on 17 most-resourced African languages and three other high-resource languages widely spoken on the African continent to encourage cross-lingual transfer learning. To further specialize the multilingual PLM, we removed vocabulary tokens from the embedding layer that corresponds to non-African writing scripts before MAFT, thus reducing the model size by around 50{\%}. Our evaluation on two multilingual PLMs (AfriBERTa and XLM-R) and three NLP tasks (NER, news topic classification, and sentiment classification) shows that our approach is competitive to applying LAFT on individual languages while requiring significantly less disk space. Additionally, we show that our adapted PLM also improves the zero-shot cross-lingual transfer abilities of parameter efficient fine-tuning methods.",
}
```