Luxembourgish adaptation of Alibaba-NLP/gte-multilingual-base
This is a sentence-transformers model finetuned from Alibaba-NLP/gte-multilingual-base and further adapted to support Historical and Contemporary Luxembourgish. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for (cross-lingual) semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
This model is specialised in cross-lingual semantic search to and from Historical/Contemporary Luxembourgish. It is particularly useful for libraries and archives that want to perform semantic search and longitudinal studies within their collections.
This is an Alibaba-NLP/gte-multilingual-base model that was further adapted by Michail et al. (2025).
Limitations
We also release a model that performs better (by 18 percentage points) on ParaLUX. If finding monolingual exact matches within adversarial collections is of utmost importance, please use histlux-paraphrase-multilingual-mpnet-base-v2.
Model Description
- Model Type: GTE-Multilingual-Base
- Base model: Alibaba-NLP/gte-multilingual-base
- Maximum Sequence Length: 8192 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
- Training Dataset:
- Parallel LB-FR, LB-EN, and LB-DE sentences (Historical and Modern)
Usage (Sentence-Transformers)
Using this model is easy once you have sentence-transformers installed:
pip install -U sentence-transformers
Then you can use the model like this:
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# trust_remote_code=True is needed to load the custom GTE model code
model = SentenceTransformer('impresso-project/histlux-gte-multilingual-base', trust_remote_code=True)
embeddings = model.encode(sentences)
print(embeddings)
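For cross-lingual semantic search, you can score a Luxembourgish query against candidates in other languages with cosine similarity. The sketch below is illustrative: the example sentences are invented and `util.cos_sim` is just one way to compute the scores.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('impresso-project/histlux-gte-multilingual-base', trust_remote_code=True)

# Invented example: a Luxembourgish query against French and English candidates
query = "D'Stad Lëtzebuerg huet eng nei Bibliothéik opgemaach."
candidates = [
    "La ville de Luxembourg a ouvert une nouvelle bibliothèque.",
    "The weather in Luxembourg was rainy last week.",
]

query_emb = model.encode(query, convert_to_tensor=True)
cand_emb = model.encode(candidates, convert_to_tensor=True)

# Cosine similarity between the query and each candidate (higher = more similar)
print(util.cos_sim(query_emb, cand_emb))
```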
Evaluation Results
Metrics
Historical Bitext Mining (Accuracy, see introducing paper):
- LB -> FR: 96.8
- FR -> LB: 96.9
- LB -> EN: 97.2
- EN -> LB: 97.2
- LB -> DE: 98.0
- DE -> LB: 91.8

Contemporary LB (Accuracy):
- ParaLUX: 62.82
- SIB-200 (LB): 62.16
Training Details
Training Dataset
The parallel sentence data mix is as follows:
impresso-project/HistLuxAlign:
- LB-FR (x20,000)
- LB-EN (x20,000)
- LB-DE (x20,000)
fredxlpy/LuxAlign:
- LB-FR (x40,000)
- LB-EN (x20,000)
Total: 120,000 sentence pairs, trained in mixed batches of size 8.
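As an illustration, such a mix could be assembled with the Hugging Face datasets library roughly as sketched below; the configuration and split names are assumptions here, so check the dataset cards of impresso-project/HistLuxAlign and fredxlpy/LuxAlign for the actual layout.

```python
from datasets import load_dataset, concatenate_datasets

# Hypothetical configuration/split names; the real ones may differ.
histlux_lb_fr = load_dataset("impresso-project/HistLuxAlign", "lb-fr", split="train")
luxalign_lb_fr = load_dataset("fredxlpy/LuxAlign", "lb-fr", split="train")

# Mix the parallel pairs into one training set (assumes compatible columns)
train_pairs = concatenate_datasets([histlux_lb_fr, luxalign_lb_fr]).shuffle(seed=42)
print(train_pairs[0])
```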
Contrastive Training
The model was trained with the parameters:
**Loss**: `sentence_transformers.losses.MultipleNegativesRankingLoss` with parameters `{'scale': 20.0, 'similarity_fct': 'cos_sim'}`
Parameters of the fit() method: `{"epochs": 1, "evaluation_steps": 520, "max_grad_norm": 1, "optimizer_class": "torch.optim.AdamW", "optimizer_params": {"lr": 2e-05}, "scheduler": "WarmupLinear"}`
Citation
BibTeX
Adapting Multilingual Embedding Models to Historical Luxembourgish (introducing paper)
@misc{michail2025adaptingmultilingualembeddingmodels,
title={Adapting Multilingual Embedding Models to Historical Luxembourgish},
author={Andrianos Michail and Corina Julia Raclé and Juri Opitz and Simon Clematide},
year={2025},
eprint={2502.07938},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.07938},
}
Original Multilingual GTE Model
@inproceedings{zhang2024mgte,
title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track},
pages={1393--1412},
year={2024}
}