Nicolas-BZRD committed
Commit 1f1da61 · verified · 1 Parent(s): 97b29d0

Update README.md

Files changed (1):
  1. README.md +11 -2
README.md CHANGED
@@ -1,6 +1,6 @@
 ---
 title: README
-emoji: 🦀
+emoji: 🌍
 colorFrom: green
 colorTo: blue
 sdk: static
@@ -14,4 +14,13 @@ Check out the full technical report for all the info! <br>
 Check out the training library to express your creativity!
 
 ## Abstract
 General-purpose multilingual vector representations, used in retrieval, regression, and classification, are traditionally obtained from bidirectional encoder models. Despite their wide applicability, encoders have recently been overshadowed by advances in generative decoder-only models. However, many innovations driving this progress are not inherently tied to decoders. In this paper, we revisit the development of multilingual encoders through the lens of these advances and introduce EuroBERT, a family of multilingual encoders covering European and widely spoken global languages. Our models outperform existing alternatives across a diverse range of tasks, spanning multilingual capabilities, mathematics, and coding, while natively supporting sequences of up to 8,192 tokens. We also examine the design decisions behind EuroBERT, offering insights into our dataset composition and training pipeline. We publicly release the EuroBERT models, including intermediate training checkpoints, together with our training framework.
+
+## Contributors
+
+This project was made possible through a collaboration between the MICS laboratory at CentraleSupélec, Diabolocom, Artefact, and Unbabel, with technological support from AMD and CINES. We also acknowledge the support of the French government through the France 2030 program, as part of the ArGiMi project and the DataIA Institute, whose contributions facilitated the completion of this work.
+
+Finally, we thank the entire EuroBERT team, without whom this would not have been possible:
+Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Duarte M. Alves, André Martins, Ayoub Hammal, Caio Corro, Celine Hudelot, Emmanuel Malherbe, Etienne Malaboeuf, Fanny Jourdan, Gabriel Hautreux, João Alves, Kevin El-Haddad, Manuel Faysse, Maxime Peyrard, Nuno Miguel Guerreiro, Ricardo Rei, Pierre Colombo
+
+Diabolocom, Artefact, MICS, CentraleSupélec, Université Paris-Saclay, Instituto Superior Técnico & Universidade de Lisboa (Lisbon ELLIS Unit), Instituto de Telecomunicações, Unbabel, Université Paris-Saclay, CNRS, LISN, INSA Rennes, IRISA, CINES, IRT Saint Exupéry, Illuin Technology, Université Grenoble Alpes, Grenoble INP, LIG, Equall, ISIA Lab
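The abstract added in this commit says the EuroBERT checkpoints are publicly released, so here is a minimal sketch of how one might pull multilingual sentence embeddings from them with the Hugging Face `transformers` library. The checkpoint name `EuroBERT/EuroBERT-210m` and the mean-pooling step are illustrative assumptions, not part of this commit; consult the released model cards for the actual checkpoint names and the recommended pooling.

```python
# Minimal sketch: sentence embeddings from a EuroBERT checkpoint.
# Assumption: the checkpoint name below is illustrative; see the Hub
# model cards for the released names and recommended usage.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "EuroBERT/EuroBERT-210m"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

texts = [
    "General-purpose multilingual vector representations.",
    "Représentations vectorielles multilingues à usage général.",
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, dim)

# Mean-pool over non-padding tokens to get one fixed-size vector per text.
mask = batch["attention_mask"].unsqueeze(-1).type_as(hidden)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, hidden_dim)
```

Since the abstract notes that the models natively support sequences of up to 8,192 tokens, the `truncation=True` flag above only matters for inputs longer than that.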