# MLM
Masked language modeling (MLM) is the process with which BERT was pre-trained. It has been shown that continuing MLM on your own data can improve performance (see [Don't Stop Pretraining: Adapt Language Models to Domains and Tasks](https://arxiv.org/abs/2004.10964)). In our [TSDAE paper](https://arxiv.org/abs/2104.06979) we also show that MLM is a powerful pre-training strategy for learning sentence embeddings. This is especially the case when you work on a specialized domain.

**Note:** Running MLM alone will not yield good sentence embeddings. But you can first tune your favorite transformer model with MLM on your domain-specific data, and then fine-tune it with the labeled data you have or with other datasets like [NLI](../../training/nli/README.md), [Paraphrases](../../training/paraphrases/README.md), or [STS](../../training/sts/README.md).
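
Once you have an MLM-tuned checkpoint, you can wrap it in a `SentenceTransformer` model for the subsequent fine-tuning step. The snippet below is a minimal sketch, not part of this example's scripts; the path `output/mlm-distilbert` is an assumed output directory for your MLM run:

```python
from sentence_transformers import SentenceTransformer, models

# Load the MLM-tuned transformer (the path is an assumption; point it at your MLM output directory)
word_embedding_model = models.Transformer("output/mlm-distilbert", max_seq_length=256)

# Mean pooling over the token embeddings yields a fixed-size sentence embedding
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
# `model` can now be fine-tuned with labeled data, e.g. using the NLI, Paraphrases, or STS training examples
```
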
## Running MLM
The **[train_mlm.py](train_mlm.py)** script provides an easy option to run MLM on your data. You run it like this:
```bash
python train_mlm.py distilbert-base-uncased path/train.txt
```
You can also provide an optional dev dataset:
```bash
python train_mlm.py distilbert-base-uncased path/train.txt path/dev.txt
```
Each line in `train.txt` / `dev.txt` is interpreted as one input for the transformer network, i.e., as one sentence or paragraph.
For more information on how to run MLM with Hugging Face Transformers, see the [Language model training examples](https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling).
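
For illustration, here is a minimal MLM training sketch built directly on Hugging Face Transformers and the `datasets` library. It is not the contents of `train_mlm.py`; the model name, file paths, and hyperparameters are placeholders:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "distilbert-base-uncased"  # placeholder; use any masked-LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Each line of train.txt becomes one training example (one sentence or paragraph)
dataset = load_dataset("text", data_files={"train": "path/train.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True,
    remove_columns=["text"],
)

# Randomly mask 15% of the tokens; the model is trained to reconstruct them
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="output/mlm-distilbert",  # assumed output directory
    num_train_epochs=1,
    per_device_train_batch_size=16,
    save_steps=1000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    data_collator=data_collator,
)
trainer.train()

trainer.save_model("output/mlm-distilbert")
tokenizer.save_pretrained("output/mlm-distilbert")
```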