
Paraphrase Data

This page is currently work-in-progress and will be extended in the future

In our paper Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation we showed that paraphrase datasets together with MultipleNegativesRankingLoss are a powerful combination for learning sentence embedding models.

You can find more information on how this loss can be used here: NLI - MultipleNegativesRankingLoss.

In this folder, we collect different datasets and scripts to train using paraphrase data.
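As a minimal sketch of this combination (the sentence pairs below are toy examples, not taken from the actual datasets, and the hyperparameters are illustrative only), a model can be trained with MultipleNegativesRankingLoss like this:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Build a sentence embedding model: transformer + mean pooling
word_embedding_model = models.Transformer("distilroberta-base", max_seq_length=128)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Toy paraphrase pairs; in practice these come from the datasets listed below
train_examples = [
    InputExample(texts=["A man is eating food.", "A man eats something."]),
    InputExample(texts=["The kids are playing outdoors.", "Children play in the garden."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# For each (anchor, positive) pair, the other positives in the batch act as negatives
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```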

Datasets

You can find a list of paraphrase datasets suitable for training here: sbert.net/datasets/paraphrases.

See the respective linked source website for the dataset license.

All datasets have one sample per line, with the individual sentences separated by a tab (\t). Some datasets (like AllNLI) have three sentences per line: an anchor, a positive, and a hard negative.
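For illustration, here is a small helper (the function name is ours, not part of this repository) that reads such a tab-separated file into InputExample objects, accepting both two- and three-sentence lines:

```python
import gzip
from sentence_transformers import InputExample

def load_paraphrase_file(filepath):
    """Read a tab-separated paraphrase file into InputExample objects.

    Each line holds one sample; sentences are separated by a tab.
    Lines may contain two sentences (anchor, positive) or three
    (anchor, positive, hard negative), as in AllNLI.
    """
    examples = []
    open_fn = gzip.open if filepath.endswith(".gz") else open
    with open_fn(filepath, "rt", encoding="utf8") as fIn:
        for line in fIn:
            texts = line.strip().split("\t")
            if len(texts) >= 2:
                examples.append(InputExample(texts=texts))
    return examples
```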

For each dataset, we measure the performance on the STSb development set after 2k training steps with a distilroberta-base model and a batch size of 256.

Note: We find that the STSb dataset is a suboptimal dataset to evaluate the quality of sentence embedding models. It consists mainly of rather simple sentences, it does not require any domain-specific knowledge, and the included sentences are of rather high quality compared to noisy, user-written content. Please do not infer from the above numbers how the approaches will perform on your domain-specific dataset.
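For reference, an STSb development evaluation of this kind can be run with the EmbeddingSimilarityEvaluator. This is only a sketch: it assumes the stsbenchmark.tsv.gz file (available via sbert.net/datasets) has already been downloaded and that `model` is the SentenceTransformer being evaluated.

```python
import csv
import gzip
from sentence_transformers import InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Read the STSb development split; gold scores are normalized from 0..5 to 0..1
dev_samples = []
with gzip.open("stsbenchmark.tsv.gz", "rt", encoding="utf8") as fIn:
    reader = csv.DictReader(fIn, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["split"] == "dev":
            label = float(row["score"]) / 5.0
            dev_samples.append(InputExample(texts=[row["sentence1"], row["sentence2"]], label=label))

dev_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name="sts-dev")
dev_evaluator(model)  # computes the similarity correlation on the dev split
```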

Training

See training.py for the training script.

The training script allows loading one or multiple files. We construct batches by sampling examples from the respective dataset. So far, examples are not mixed between the datasets, i.e., a batch consists only of examples from a single dataset.

As the datasets differ considerably in size, we perform temperature-controlled sampling from the datasets: smaller datasets are up-sampled, while larger datasets are down-sampled. This allows effective training with a mix of very large and small datasets.
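The exact sampling logic lives in training.py; the following is only a sketch of the idea, with hypothetical helper names and a temperature value chosen purely for illustration:

```python
import random

def dataset_sampling_probs(dataset_sizes, temperature=2.0):
    """Temperature-controlled sampling weights over datasets.

    temperature=1.0 gives sampling proportional to dataset size;
    larger values flatten the distribution, up-sampling small
    datasets and down-sampling large ones.
    """
    scaled = [size ** (1.0 / temperature) for size in dataset_sizes]
    total = sum(scaled)
    return [s / total for s in scaled]

def batch_iterator(datasets, batch_size, temperature=2.0):
    """Yield batches where each batch contains examples from a single dataset."""
    probs = dataset_sampling_probs([len(d) for d in datasets], temperature)
    while True:
        dataset = random.choices(datasets, weights=probs, k=1)[0]
        yield random.sample(dataset, k=min(batch_size, len(dataset)))
```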

Pre-Trained Models

Have a look at pre-trained models to view all models that were trained on these paraphrase datasets.

  • paraphrase-MiniLM-L12-v2 - Trained on the following datasets: AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions, flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits
  • paraphrase-distilroberta-base-v2 - Trained on the following datasets: AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions, flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits
  • paraphrase-distilroberta-base-v1 - Trained on the following datasets: AllNLI, sentence-compression, SimpleWiki, altlex, quora_duplicates, wiki-atomic-edits, wiki-split
  • paraphrase-xlm-r-multilingual-v1 - Multilingual version of paraphrase-distilroberta-base-v1, trained on parallel data for 50+ languages. (Teacher: paraphrase-distilroberta-base-v1, Student: xlm-r-base)
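
Any of these models can be loaded by name and used directly for encoding; a minimal usage example:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-MiniLM-L12-v2")
embeddings = model.encode([
    "This framework generates embeddings for each input sentence.",
    "Sentences are passed as a list of strings.",
])
print(embeddings.shape)  # one embedding vector per input sentence
```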

Work in Progress

Training with this data is currently work-in-progress. Things that will be added in the near future:

  • More datasets: Are you aware of more suitable training datasets? Let me know: [email protected]
  • Optimized batching: Currently, each batch is drawn from a single dataset. Future work might also include batches that are sampled across datasets.
  • Optimized loss function: Currently, the same parameters of MultipleNegativesRankingLoss are used for all datasets. Future work includes testing whether the datasets benefit from individual loss functions.
  • Pre-trained models: Once all datasets are collected, we will train and release respective models.