MMLW-retrieval-roberta-large-v2

MMLW (muszę mieć lepszą wiadomość, Polish for "I must have a better message") are neural text encoders for Polish. The second version is based on the same foundational model (polish-roberta-large-v2), but the training process incorporated modern LLM-based English retrievers and rerankers, which led to improved results. This model is optimized for information retrieval tasks. It transforms queries and passages into 1024-dimensional vectors. The model was developed using a two-step procedure:

  • In the first step, it was initialized with the Polish RoBERTa checkpoint and then trained with a multilingual knowledge distillation method on a diverse corpus of 20 million Polish-English text pairs. We utilised stella_en_1.5B_v5 as the teacher model for distillation.
  • The second step involved fine-tuning the model with a contrastive loss on a dataset consisting of over 4 million queries. Positive and negative passages for each query were selected with the help of the BAAI/bge-reranker-v2.5-gemma2-lightweight reranker.
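The first, distillation step can be sketched as follows. This is a minimal toy illustration of the multilingual knowledge distillation objective, not the actual training script: the random tensors and the linear layer stand in for the real teacher (stella_en_1.5B_v5) and student (Polish RoBERTa) encoders.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real encoders, so the objective itself is easy to see.
torch.manual_seed(0)
dim = 8
teacher_emb_en = torch.randn(4, dim)   # frozen teacher embeddings of English texts
student = nn.Linear(dim, dim)          # stand-in for the trainable student encoder
inputs_en = torch.randn(4, dim)        # stand-in features of the English texts
inputs_pl = torch.randn(4, dim)        # stand-in features of their Polish translations

loss_fn = nn.MSELoss()

def distill_loss():
    # The student must match the teacher on BOTH sides of each EN-PL pair,
    # which aligns Polish sentences with the teacher's English vector space.
    return loss_fn(student(inputs_en), teacher_emb_en) + \
           loss_fn(student(inputs_pl), teacher_emb_en)

opt = torch.optim.Adam(student.parameters(), lr=1e-2)
initial = distill_loss().item()
for _ in range(200):
    opt.zero_grad()
    loss = distill_loss()
    loss.backward()
    opt.step()
final = distill_loss().item()
print(initial, final)  # the distillation loss decreases during training
```

In the real setup the same idea is applied with full transformer encoders and mean-pooled sentence embeddings; the point is only that one MSE target (the teacher's English embedding) supervises both languages.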

Usage (Sentence-Transformers)

The model supports both information retrieval and semantic textual similarity. For retrieval, queries should be prefixed with "[query]: "; passages require no prefix. For symmetric tasks such as semantic similarity, both texts should be prefixed with "[sts]: ".

Please note that the model uses a custom implementation, so you should pass the trust_remote_code=True argument when loading it. It is also recommended to use Flash Attention 2, which can be enabled via the attn_implementation argument. You can use the model with sentence-transformers like this:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer(
    "sdadas/mmlw-retrieval-roberta-large-v2",
    trust_remote_code=True,
    device="cuda",
    model_kwargs={"attn_implementation": "flash_attention_2", "trust_remote_code": True}
)
# Flash-Attention works only in 16-bit mode, so we need to cast the model to float16 or bfloat16
model.bfloat16()

# Retrieval example
query_prefix = "[query]: "
queries = [query_prefix + "Jak dożyć 100 lat?"]  # "How to live to 100?"
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",  # "You have to eat healthily and do sports."
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",  # "You have to drink alcohol, party and drive fast cars."
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."  # off-topic sentence about the Sunday trading ban
]
queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)
best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
print(answers[best_answer])

# Semantic similarity example
sim_prefix = "[sts]: "
sentences = [
    sim_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",  # "You have to eat healthily and do sports."
    sim_prefix + "Warto jest prowadzić zdrowy tryb życia, uwzględniający aktywność fizyczną i dietę.",  # paraphrase of the first sentence
    sim_prefix + "One should eat healthy and engage in sports.",  # English translation of the first sentence
    sim_prefix + "Zakupy potwierdzasz PINem, który bezpiecznie ustalisz podczas aktywacji."  # unrelated sentence about confirming purchases with a PIN
]
emb = model.encode(sentences, convert_to_tensor=True, show_progress_bar=False)
print(cos_sim(emb, emb))

Evaluation Results

The model achieves an NDCG@10 score of 60.71 on the Polish Information Retrieval Benchmark. See the PIRB Leaderboard for detailed results.
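For reference, NDCG@10 can be computed for your own rankings with a short helper. This is a generic sketch of the metric, not the PIRB evaluation code; for binary relevance labels the linear-gain and exponential-gain variants of the formula coincide.

```python
import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain over the top-k results, in ranked order.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    # Normalize by the DCG of the ideal (descending-relevance) ordering.
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy example: binary relevance labels in the order the model ranked the passages.
score = ndcg_at_k([1, 0, 1, 0, 0])
print(score)  # ~0.9197: relevant items ranked 1st and 3rd instead of 1st and 2nd
```

Benchmark scores like 60.71 are this value averaged over all test queries (and scaled to 0-100).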

Citation

@inproceedings{dadas2024pirb,
  title={PIRB: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods},
  author={Dadas, Slawomir and Pere{\l}kiewicz, Micha{\l} and Po{\'s}wiata, Rafa{\l}},
  booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  pages={12761--12774},
  year={2024}
}