Luxembourgish adaptation of sentence-transformers/paraphrase-multilingual-mpnet-base-v2

This is a sentence-transformers model finetuned from sentence-transformers/paraphrase-multilingual-mpnet-base-v2 and further adapted to support Historical and Contemporary Luxembourgish. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for (cross-lingual) semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

This model is specialised for cross-lingual semantic search to and from Historical/Contemporary Luxembourgish. It is particularly useful for libraries and archives that want to perform semantic search and longitudinal studies within their collections.

This is a paraphrase-multilingual-mpnet-base-v2 model that was further adapted by Michail et al. (2025).

Limitations

This model only supports inputs of up to 128 subtokens; longer inputs are truncated.
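If longer texts must be handled with this model anyway, one common workaround is to split the text into overlapping windows, embed each window, and pool the results. A rough chunking sketch using whitespace tokens as a crude stand-in for subtokens (the model's tokenizer usually produces more pieces per word, so the window is kept well under 128):

```python
def chunk_text(text, max_tokens=100, overlap=20):
    """Split text into overlapping windows of roughly max_tokens words.

    Whitespace tokens are only a proxy for subtokens; real subtoken counts
    come from the model's tokenizer, hence the headroom below 128.
    """
    words = text.split()
    if len(words) <= max_tokens:
        return [text]
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks
```

Each chunk can then be passed to `model.encode` and the resulting embeddings averaged, at some cost in precision compared with a natively long-context model.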

We also release a model that performs better (by 7.5 pp) on Historical Bitext Mining and natively supports long context (8192 subtokens). For most use cases we recommend using histlux-gte-multilingual-base.

However, this model exhibits superior performance (by 18pp) on the adversarial paraphrase discrimination task ParaLUX.
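Adversarial paraphrase discrimination in the style of ParaLUX reduces to picking, for an anchor sentence, the candidate whose embedding is most similar to the anchor's. A toy numpy sketch of that decision rule, using made-up 3-d vectors in place of the model's 768-d embeddings:

```python
import numpy as np

def pick_paraphrase(anchor_emb, candidate_embs):
    """Return the index of the candidate closest to the anchor under
    cosine similarity (argmax over normalized dot products)."""
    a = anchor_emb / np.linalg.norm(anchor_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    return int(np.argmax(c @ a))

# Toy 3-d vectors standing in for real 768-d model embeddings
anchor = np.array([1.0, 0.2, 0.0])
candidates = np.array([
    [0.9, 0.3, 0.1],  # true paraphrase: almost parallel to the anchor
    [0.1, 1.0, 0.8],  # adversarial distractor: far from the anchor
])
print(pick_paraphrase(anchor, candidates))  # 0
```

The reported accuracy is the fraction of anchors for which the true paraphrase wins this argmax over the adversarial distractors.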

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("impresso-project/histlux-paraphrase-multilingual-mpnet-base-v2")
# Run inference
sentences = [
    'The cross-border workers should also receive more wages.',
    "D'grenzarbechetr missten och me' lo'n kre'en.",
    "De Néckel: Firun! Dât ass jo ailes, wèll 't get dach neischt un der Bréck gemâcht!",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
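Under the hood, `model.similarity` scores the embeddings with the library's configured similarity function, which is cosine similarity by default. A minimal numpy equivalent of that default, for illustration:

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity: L2-normalize rows, then dot products."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T

emb = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
sims = cosine_similarity_matrix(emb)
print(sims.shape)  # (3, 3)
```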

Evaluation

Metrics

Historical Bitext Mining (Accuracy; see the introducing paper):

  • LB -> FR: 88.6
  • FR -> LB: 90.0
  • LB -> EN: 88.7
  • EN -> LB: 90.4
  • LB -> DE: 91.1
  • DE -> LB: 91.8

Contemporary LB (Accuracy):

  • ParaLUX: 80.5
  • SIB-200 (LB): 59.4
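The bitext mining scores above come from matching each sentence to its nearest neighbour in the other language and checking the match against the gold translation. A toy numpy sketch of that evaluation, assuming the embedding matrices are row-aligned so the gold pairing is the identity:

```python
import numpy as np

def bitext_mining_accuracy(src_embs, tgt_embs):
    """Match each source row to its most similar target row (cosine) and
    score against the gold alignment, assumed here to be the identity
    (row i of src_embs translates row i of tgt_embs)."""
    s = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    t = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    predictions = np.argmax(s @ t.T, axis=1)
    return float(np.mean(predictions == np.arange(len(src_embs))))
```

In the real evaluation the rows would be model embeddings of, e.g., Luxembourgish sentences and their French translations.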

Training Details

Training Dataset

LB-EN (Historical, Modern)

  • Dataset: lb-en (mixed)
  • Size: 40,000 training samples
  • Columns: english, luxembourgish, and label (teacher's en embeddings)
  • Approximate statistics based on the first 1000 samples:
    • english: string; min 4, mean 25.32, max 128 tokens
    • luxembourgish: string; min 5, mean 36.91, max 128 tokens
    • label: list of 768 elements
  • Samples:
    • english: A lesson for the next year
      luxembourgish: Eng le’er fir dat anert joer
      label: [0.08891881257295609, 0.20895496010780334, -0.10672671347856522, -0.03302554786205292, 0.049002278596162796, ...]
    • english: On Easter, the Maquisards' northern section organizes their big spring ball in Willy Pintsch's hall at the station.
      luxembourgish: Op O'schteren organisieren d'Maquisard'eiii section Nord, hire gro'sse fre'joersbal am sali Willy Pintsch op der gare.
      label: [-0.08668982982635498, -0.06969941407442093, -0.0036096556577831507, 0.1605304628610611, -0.041704729199409485, ...]
    • english: The happiness, the peace is long gone now,
      luxembourgish: V ergângen ass nu läng dat gléck, de' fréd,
      label: [0.07229219377040863, 0.3288629353046417, -0.012548360042273998, 0.06720984727144241, -0.02617395855486393, ...]
  • Loss: MSELoss
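The `label` column holds the teacher's English embeddings, and MSELoss trains the student to reproduce them for both the English sentence and its Luxembourgish translation, following the multilingual knowledge distillation recipe cited below. A toy numpy sketch of the objective (made-up vectors, not real embeddings):

```python
import numpy as np

def distillation_mse(student_en, student_lb, teacher_en):
    """MSE pulling the student's embeddings of the English sentence and of
    its Luxembourgish translation toward the frozen teacher's English
    embedding."""
    return float(np.mean((student_en - teacher_en) ** 2)
                 + np.mean((student_lb - teacher_en) ** 2))

teacher = np.array([0.1, -0.2, 0.3])
print(distillation_mse(teacher, teacher, teacher))  # 0.0
```

Minimizing this loss aligns the Luxembourgish vector space with the teacher's English space, which is what enables cross-lingual search.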

Non-Default Hyperparameters

  • learning_rate: 2e-05
  • num_train_epochs: 5
  • warmup_ratio: 0.1
  • bf16: True
  • All remaining hyperparameters use their default values
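These settings map onto `SentenceTransformerTrainingArguments` from Sentence Transformers 3.x roughly as follows (`output_dir` is a placeholder; batch size and other unreported settings are left at their defaults):

```python
from sentence_transformers import SentenceTransformerTrainingArguments

# Values from the card; output_dir is a placeholder, not reported here
args = SentenceTransformerTrainingArguments(
    output_dir="histlux-mpnet-output",
    learning_rate=2e-5,
    num_train_epochs=5,
    warmup_ratio=0.1,
    bf16=True,
)
```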

Framework Versions

  • Python: 3.11.11
  • Sentence Transformers: 3.4.1
  • Transformers: 4.49.0
  • PyTorch: 2.6.0
  • Accelerate: 1.4.0
  • Datasets: 3.3.2
  • Tokenizers: 0.21.0

Citation

BibTeX

Adapting Multilingual Embedding Models to Historical Luxembourgish (introducing paper)

@misc{michail2025adaptingmultilingualembeddingmodels,
      title={Adapting Multilingual Embedding Models to Historical Luxembourgish}, 
      author={Andrianos Michail and Corina Julia Raclé and Juri Opitz and Simon Clematide},
      year={2025},
      eprint={2502.07938},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.07938}, 
}

Multilingual Knowledge Distillation

@inproceedings{reimers-2020-multilingual-sentence-bert,
    title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2004.09813",
}
Model size: 278M parameters (F32, Safetensors)
