Luxembourgish adaptation of sentence-transformers/paraphrase-multilingual-mpnet-base-v2

This is a sentence-transformers model finetuned from sentence-transformers/paraphrase-multilingual-mpnet-base-v2 and further adapted to support Historical and Contemporary Luxembourgish. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for (cross-lingual) semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

This model is specialised for cross-lingual semantic search to and from Historical/Contemporary Luxembourgish. It is particularly useful for libraries and archives that want to perform semantic search and longitudinal studies within their collections.

This is a paraphrase-multilingual-mpnet-base-v2 model that was further adapted by Michail et al. (2025).

Limitations

This model only supports inputs of up to 128 subtokens; longer inputs are truncated.
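If longer texts must be handled with this model anyway, one common workaround is to split the text into overlapping windows, embed each window, and pool the results. A rough chunking sketch using whitespace tokens as a crude stand-in for subtokens (the model's tokenizer usually produces more pieces per word, so the window is kept well under 128):

```python
def chunk_text(text, max_tokens=100, overlap=20):
    """Split text into overlapping windows of roughly max_tokens words.

    Whitespace tokens are only a proxy for subtokens; real subtoken counts
    come from the model's tokenizer, hence the headroom below 128.
    """
    words = text.split()
    if len(words) <= max_tokens:
        return [text]
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks
```

Each chunk can then be passed to `model.encode` and the resulting embeddings averaged, at some cost in precision compared with a natively long-context model.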

We also release a model that performs better (by 7.5 pp) on Historical Bitext Mining and natively supports long context (8192 subtokens). For most use cases we recommend using histlux-gte-multilingual-base.

However, this model exhibits superior performance (by 18pp) on the adversarial paraphrase discrimination task ParaLUX.
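Adversarial paraphrase discrimination in the style of ParaLUX reduces to picking, for an anchor sentence, the candidate whose embedding is most similar to the anchor's. A toy numpy sketch of that decision rule, using made-up 3-d vectors in place of the model's 768-d embeddings:

```python
import numpy as np

def pick_paraphrase(anchor_emb, candidate_embs):
    """Return the index of the candidate closest to the anchor under
    cosine similarity (argmax over normalized dot products)."""
    a = anchor_emb / np.linalg.norm(anchor_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    return int(np.argmax(c @ a))

# Toy 3-d vectors standing in for real 768-d model embeddings
anchor = np.array([1.0, 0.2, 0.0])
candidates = np.array([
    [0.9, 0.3, 0.1],  # true paraphrase: almost parallel to the anchor
    [0.1, 1.0, 0.8],  # adversarial distractor: far from the anchor
])
print(pick_paraphrase(anchor, candidates))  # 0
```

The reported accuracy is the fraction of anchors for which the true paraphrase wins this argmax over the adversarial distractors.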

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("impresso-project/histlux-paraphrase-multilingual-mpnet-base-v2")
# Run inference
sentences = [
    'The cross-border workers should also receive more wages.',
    "D'grenzarbechetr missten och me' lo'n kre'en.",
    "De Néckel: Firun! Dât ass jo ailes, wèll 't get dach neischt un der Bréck gemâcht!",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
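Under the hood, `model.similarity` scores the embeddings with the library's configured similarity function, which is cosine similarity by default. A minimal numpy equivalent of that default, for illustration:

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity: L2-normalize rows, then dot products."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T

emb = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
sims = cosine_similarity_matrix(emb)
print(sims.shape)  # (3, 3)
```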

Evaluation

Metrics

Historical Bitext Mining (Accuracy; see the introducing paper):

  • LB -> FR: 88.6
  • FR -> LB: 90.0
  • LB -> EN: 88.7
  • EN -> LB: 90.4
  • LB -> DE: 91.1
  • DE -> LB: 91.8

Contemporary LB (Accuracy):

  • ParaLUX: 80.5
  • SIB-200 (LB): 59.4
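The bitext mining scores above come from matching each sentence to its nearest neighbour in the other language and checking the match against the gold translation. A toy numpy sketch of that evaluation, assuming the embedding matrices are row-aligned so the gold pairing is the identity:

```python
import numpy as np

def bitext_mining_accuracy(src_embs, tgt_embs):
    """Match each source row to its most similar target row (cosine) and
    score against the gold alignment, assumed here to be the identity
    (row i of src_embs translates row i of tgt_embs)."""
    s = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    t = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    predictions = np.argmax(s @ t.T, axis=1)
    return float(np.mean(predictions == np.arange(len(src_embs))))
```

In the real evaluation the rows would be model embeddings of, e.g., Luxembourgish sentences and their French translations.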

Training Details

Training Dataset

LB-EN (Historical, Modern)

  • Dataset: lb-en (mixed)
  • Size: 40,000 training samples
  • Columns: english, luxembourgish, and label (teacher's en embeddings)
  • Approximate statistics based on the first 1000 samples:
    • english: string; min 4, mean 25.32, max 128 tokens
    • luxembourgish: string; min 5, mean 36.91, max 128 tokens
    • label: list of 768 elements
  • Samples:
    • english: A lesson for the next year
      luxembourgish: Eng le’er fir dat anert joer
      label: [0.08891881257295609, 0.20895496010780334, -0.10672671347856522, -0.03302554786205292, 0.049002278596162796, ...]
    • english: On Easter, the Maquisards' northern section organizes their big spring ball in Willy Pintsch's hall at the station.
      luxembourgish: Op O'schteren organisieren d'Maquisard'eiii section Nord, hire gro'sse fre'joersbal am sali Willy Pintsch op der gare.
      label: [-0.08668982982635498, -0.06969941407442093, -0.0036096556577831507, 0.1605304628610611, -0.041704729199409485, ...]
    • english: The happiness, the peace is long gone now,
      luxembourgish: V ergângen ass nu läng dat gléck, de' fréd,
      label: [0.07229219377040863, 0.3288629353046417, -0.012548360042273998, 0.06720984727144241, -0.02617395855486393, ...]
  • Loss: MSELoss
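The `label` column holds the teacher's English embeddings, and MSELoss trains the student to reproduce them for both the English sentence and its Luxembourgish translation, following the multilingual knowledge distillation recipe cited below. A toy numpy sketch of the objective (made-up vectors, not real embeddings):

```python
import numpy as np

def distillation_mse(student_en, student_lb, teacher_en):
    """MSE pulling the student's embeddings of the English sentence and of
    its Luxembourgish translation toward the frozen teacher's English
    embedding."""
    return float(np.mean((student_en - teacher_en) ** 2)
                 + np.mean((student_lb - teacher_en) ** 2))

teacher = np.array([0.1, -0.2, 0.3])
print(distillation_mse(teacher, teacher, teacher))  # 0.0
```

Minimizing this loss aligns the Luxembourgish vector space with the teacher's English space, which is what enables cross-lingual search.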

Non-Default Hyperparameters

  • learning_rate: 2e-05
  • num_train_epochs: 5
  • warmup_ratio: 0.1
  • bf16: True
  • All remaining hyperparameters use their default values
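These settings map onto `SentenceTransformerTrainingArguments` from Sentence Transformers 3.x roughly as follows (`output_dir` is a placeholder; batch size and other unreported settings are left at their defaults):

```python
from sentence_transformers import SentenceTransformerTrainingArguments

# Values from the card; output_dir is a placeholder, not reported here
args = SentenceTransformerTrainingArguments(
    output_dir="histlux-mpnet-output",
    learning_rate=2e-5,
    num_train_epochs=5,
    warmup_ratio=0.1,
    bf16=True,
)
```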

Framework Versions

  • Python: 3.11.11
  • Sentence Transformers: 3.4.1
  • Transformers: 4.49.0
  • PyTorch: 2.6.0
  • Accelerate: 1.4.0
  • Datasets: 3.3.2
  • Tokenizers: 0.21.0

Citation

BibTeX

Adapting Multilingual Embedding Models to Historical Luxembourgish (introducing paper)

@misc{michail2025adaptingmultilingualembeddingmodels,
      title={Adapting Multilingual Embedding Models to Historical Luxembourgish}, 
      author={Andrianos Michail and Corina Julia Raclé and Juri Opitz and Simon Clematide},
      year={2025},
      eprint={2502.07938},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.07938}, 
}

Multilingual Knowledge Distillation

@inproceedings{reimers-2020-multilingual-sentence-bert,
    title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2004.09813",
}
Model size: 278M parameters (F32, Safetensors)
