tags:
- sentence-transformers
- sentence-similarity
- dataset_size:40000
- loss:MSELoss
- multilingual
base_model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
widget:
- source_sentence: Who is filming along?
sentences:
- Wién filmt mat?
- >-
Weider huet den Tatarescu drop higewisen, datt Rumänien durch seng
krichsbedélegong op de 6eite vun den allie'erten 110.000 mann verluer
hätt.
- Brambilla 130.08.03 St.
- source_sentence: 'Four potential scenarios could still play out: Jean Asselborn.'
sentences:
- >-
Dann ass nach eng Antenne hei um Kierchbierg virgesi Richtung RTL Gebai,
do gëtt jo een ganz neie Wunnquartier gebaut.
- >-
D'bedélegong un de wählen wir ganz stärk gewiéscht a munche ge'genden
wor re eso'gucr me' we' 90 prozent.
- Jean Asselborn gesäit 4 Méiglechkeeten, wéi et kéint virugoen.
- source_sentence: >-
Non-profit organisation Passerell, which provides legal council to
refugees in Luxembourg, announced that it has to make four employees
redundant in August due to a lack of funding.
sentences:
- Oetringen nach Remich....8.20» 215»
- >-
D'ASBL Passerell, déi sech ëm d'Berodung vu Refugiéeën a Saache
Rechtsfroe këmmert, wäert am August mussen hir véier fix Salariéen
entloossen.
- D'Regierung huet allerdéngs "just" 180.041 Doudeger verzeechent.
- source_sentence: This regulation was temporarily lifted during the Covid pandemic.
sentences:
- Six Jours vu New-York si fir d’équipe Girgetti — Debacco
- Dës Reegelung gouf wärend der Covid-Pandemie ausgesat.
- ING-Marathon ouni gréisser Tëschefäll ofgelaf - 18 Leit hospitaliséiert.
- source_sentence: The cross-border workers should also receive more wages.
sentences:
- D'grenzarbechetr missten och me' lo'n kre'en.
- >-
De Néckel: Firun! Dât ass jo ailes, wèll 't get dach neischt un der
Bréck gemâcht!
- >-
D'Grande-Duchesse Josephine Charlotte an hir Ministeren hunn d'Land
verlooss, et war den Optakt vun der Zäit am Exil.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
model-index:
- name: >-
SentenceTransformer based on
sentence-transformers/paraphrase-multilingual-mpnet-base-v2
results:
- task:
type: contemporary-lb
name: Contemporary-lb
dataset:
name: Contemporary-lb
type: contemporary-lb
metrics:
- type: accuracy
value: 0.594
name: SIB-200(LB) accuracy
- type: accuracy
value: 0.805
name: ParaLUX accuracy
- task:
type: bitext-mining
name: LBHistoricalBitextMining
dataset:
name: LBHistoricalBitextMining
type: lb-en
metrics:
- type: mean_accuracy
value: 0.8932
name: LB<->FR accuracy
- type: mean_accuracy
value: 0.8955
name: LB<->EN accuracy
- type: mean_accuracy
value: 0.9144
name: LB<->DE accuracy
license: agpl-3.0
datasets:
- impresso-project/HistLuxAlign
- fredxlpy/LuxAlign
language:
- lb
Luxembourgish adaptation of sentence-transformers/paraphrase-multilingual-mpnet-base-v2
This is a sentence-transformers model finetuned from sentence-transformers/paraphrase-multilingual-mpnet-base-v2 and further adapted to support Historical and Contemporary Luxembourgish. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for (cross-lingual) semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
This model is specialised for cross-lingual semantic search to and from Historical/Contemporary Luxembourgish. It is particularly useful for libraries and archives that want to perform semantic search and longitudinal studies within their collections.
This is a paraphrase-multilingual-mpnet-base-v2 model that was further adapted by Michail et al. (2025).
Limitations
This model only supports inputs of up to 128 subtokens.
We also release a model that performs better (by 7.5 pp) on Historical Bitext Mining and natively supports long contexts (8192 subtokens). For most use cases we recommend using histlux-gte-multilingual-base.
However, this model performs considerably better (by 18 pp) on the adversarial paraphrase discrimination task ParaLUX.
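Because of the 128-subtoken limit, longer passages must be split before encoding. A minimal chunking sketch is shown below; it uses whitespace tokens as a rough proxy for subtokens (the exact count comes from the model's tokenizer), and `chunk_text` is an illustrative helper, not part of the library:

```python
def chunk_text(text: str, max_tokens: int = 128, overlap: int = 32) -> list[str]:
    """Split text into overlapping windows of at most max_tokens
    whitespace tokens (a rough proxy for the model's subtokens)."""
    words = text.split()
    if len(words) <= max_tokens:
        return [" ".join(words)]
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks

long_text = " ".join(f"w{i}" for i in range(300))
print(len(chunk_text(long_text)))  # 3 overlapping windows
```

Each chunk can then be embedded separately, keeping per-chunk embeddings for retrieval or mean-pooling them into a single document vector.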
Model Description
- Model Type: Sentence Transformer
- Base model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
- Maximum Sequence Length: 128 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
- Training Dataset:
- LB-EN (Historical, Modern)
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("impresso-project/histlux-paraphrase-multilingual-mpnet-base-v2")

# Run inference
sentences = [
    'The cross-border workers should also receive more wages.',
    "D'grenzarbechetr missten och me' lo'n kre'en.",
    "De Néckel: Firun! Dât ass jo ailes, wèll 't get dach neischt un der Bréck gemâcht!",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
Evaluation
Metrics
(See the introducing paper.)

Historical Bitext Mining (accuracy):
- LB -> FR: 88.6
- FR -> LB: 90.0
- LB -> EN: 88.7
- EN -> LB: 90.4
- LB -> DE: 91.1
- DE -> LB: 91.8

Contemporary LB (accuracy):
- ParaLUX: 80.5
- SIB-200(LB): 59.4
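The bitext-mining scores measure, for each source sentence, whether its nearest neighbour among the target-language embeddings is the gold translation. The accuracy computation can be sketched on precomputed embeddings as follows (toy vectors below, not the model's real output):

```python
import numpy as np

def bitext_mining_accuracy(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """Fraction of source rows whose most cosine-similar target row
    is the one at the same index (the gold aligned sentence)."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    nearest = (src @ tgt.T).argmax(axis=1)
    return float((nearest == np.arange(len(src))).mean())

# Toy aligned "LB" and "EN" embeddings; row i of each array is a translation pair
lb = np.array([[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]])
en = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
print(bitext_mining_accuracy(lb, en))  # 1.0 — every nearest neighbour is the gold pair
```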
Training Details
Training Dataset
LB-EN (Historical, Modern)
- Dataset: lb-en (mixed)
- Size: 40,000 training samples
- Columns: english, luxembourgish, and label (the teacher's EN embeddings)
- Approximate statistics based on the first 1000 samples:

|  | english | luxembourgish | label |
|---|---|---|---|
| type | string | string | list |
| min | 4 tokens | 5 tokens | - |
| mean | 25.32 tokens | 36.91 tokens | - |
| max | 128 tokens | 128 tokens | - |
| size | - | - | 768 elements |
- Samples:

| english | luxembourgish | label |
|---|---|---|
| A lesson for the next year | Eng le’er fir dat anert joer | [0.08891881257295609, 0.20895496010780334, -0.10672671347856522, -0.03302554786205292, 0.049002278596162796, ...] |
| On Easter, the Maquisards' northern section organizes their big spring ball in Willy Pintsch's hall at the station. | Op O'schteren organisieren d'Maquisard'eiii section Nord, hire gro'sse fre'joersbal am sali Willy Pintsch op der gare. | [-0.08668982982635498, -0.06969941407442093, -0.0036096556577831507, 0.1605304628610611, -0.041704729199409485, ...] |
| The happiness, the peace is long gone now, | V ergângen ass nu läng dat gléck, de' fréd, | [0.07229219377040863, 0.3288629353046417, -0.012548360042273998, 0.06720984727144241, -0.02617395855486393, ...] |
- Loss: MSELoss
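The MSELoss objective follows multilingual knowledge distillation (Reimers & Gurevych, 2020): the student model is pushed to reproduce the teacher's English embedding (the label column) for the Luxembourgish side of each pair. The core objective reduces to a mean squared error over embedding dimensions, sketched here with toy vectors rather than real embeddings:

```python
import numpy as np

def mse_distillation_loss(student_emb: np.ndarray, teacher_emb: np.ndarray) -> float:
    """Mean squared error between student and teacher embeddings,
    averaged over all dimensions of all sentences in the batch."""
    return float(np.mean((student_emb - teacher_emb) ** 2))

teacher = np.array([[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]])  # teacher EN embeddings (the "label" column)
student = np.array([[0.1, 0.2, 0.3], [0.2, -0.1, 0.4]])  # student output for the LB sentences
print(mse_distillation_loss(student, teacher))  # small positive loss from the single mismatched value
```

Driving this loss to zero makes the student's Luxembourgish embeddings land where the teacher places the English translations, which is what enables the cross-lingual search described above.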
Non-Default Hyperparameters
- learning_rate: 2e-05
- num_train_epochs: 5
- warmup_ratio: 0.1
- bf16: True
- Rest are default
Framework Versions
- Python: 3.11.11
- Sentence Transformers: 3.4.1
- Transformers: 4.49.0
- PyTorch: 2.6.0
- Accelerate: 1.4.0
- Datasets: 3.3.2
- Tokenizers: 0.21.0
Citation
BibTeX
Adapting Multilingual Embedding Models to Historical Luxembourgish (introducing paper)
@misc{michail2025adaptingmultilingualembeddingmodels,
title={Adapting Multilingual Embedding Models to Historical Luxembourgish},
author={Andrianos Michail and Corina Julia Raclé and Juri Opitz and Simon Clematide},
year={2025},
eprint={2502.07938},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.07938},
}
Multilingual Knowledge Distillation
@inproceedings{reimers-2020-multilingual-sentence-bert,
title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2020",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/2004.09813",
}