# DistilBERT

This folder contains the original code used to train DistilBERT as well as examples showcasing how to use DistilBERT.
**2019, September 19th - Update:** We fixed bugs in the code and released an updated version of the weights trained with a modification of the distillation loss. DistilBERT now reaches 97% of `BERT-base`'s performance on GLUE and an 86.9 F1 score on the SQuAD v1.1 dev set (compared to 88.5 for `BERT-base`). We will publish a formal write-up of our approach in the near future!
## What is DistilBERT
DistilBERT stands for Distilled-BERT. DistilBERT is a small, fast, cheap and light Transformer model based on the BERT architecture. It has 40% fewer parameters than `bert-base-uncased` and runs 60% faster, while preserving 97% of BERT's performance as measured on the GLUE language understanding benchmark. DistilBERT is trained using knowledge distillation, a technique to compress a large model (the teacher) into a smaller model (the student). By distilling BERT, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DistilBERT is thus an interesting option for putting large-scale pretrained Transformer models into production.
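To make the distillation idea concrete, here is a minimal sketch of a soft-target distillation loss: the student is trained to match the teacher's output distribution, softened by a temperature. This is only an illustration of the general technique, not the exact loss used to train the released weights, and the temperature value is an arbitrary example.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target cross-entropy between the teacher and student distributions.

    The temperature softens both distributions; its value here is illustrative,
    not the one used for the released weights.
    """
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # KL divergence between teacher and student, scaled by t**2 so that
    # gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (t ** 2)
```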
For more information on DistilBERT, please refer to our [detailed blog post](https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-distilbert-a-distilled-version-of-bert-8cf3380435b5). *Please note that we will publish a formal write-up with updated and more complete results in the near future (September 19th).*
Here are the updated results on the dev sets of GLUE:
| Model | Macro-score | CoLA | MNLI | MRPC | QNLI | QQP | RTE | SST-2 | STS-B | WNLI |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| BERT-base | **77.6** | 48.9 | 84.3 | 88.6 | 89.3 | 89.5 | 71.3 | 91.7 | 91.2 | 43.7 |
| DistilBERT | **75.2** | 49.1 | 81.8 | 90.2 | 87.0 | 89.2 | 62.9 | 92.7 | 90.7 | 44.4 |
## Setup
This part of the library has only been tested with Python 3.6+. There are a few specific dependencies to install before launching a distillation; you can install them with the command `pip install -r requirements.txt`.
**Important note:** The training scripts have been updated to support PyTorch v1.2.0 (there are breaking changes compared to v1.1.0). Note that there is a small internal bug in the current version of PyTorch available on pip that causes a memory leak in our training/distillation. It has recently been fixed and will likely be integrated into the next release. For the moment, we recommend [compiling PyTorch from source](https://github.com/pytorch/pytorch#from-source). Please refer to [issue 1179](https://github.com/huggingface/pytorch-transformers/issues/1179) for more details.
## How to use DistilBERT
PyTorch-Transformers includes two pre-trained DistilBERT models, currently only provided for English (we are investigating the possibility of training and releasing a multilingual version of DistilBERT):
- `distilbert-base-uncased`: A DistilBERT English language model pretrained on the same data used to pretrain BERT (a concatenation of the Toronto Book Corpus and full English Wikipedia), using distillation with the supervision of the `bert-base-uncased` version of BERT. The model has 6 layers, a hidden dimension of 768 and 12 heads, for a total of 66M parameters.
- `distilbert-base-uncased-distilled-squad`: A version of `distilbert-base-uncased` finetuned using (a second step of) knowledge distillation on SQuAD v1.1. This model reaches an F1 score of 86.9 on the dev set (for comparison, the `bert-base-uncased` version of BERT reaches an 88.5 F1 score).
Using DistilBERT is very similar to using BERT. DistilBERT shares the same tokenizer as BERT's `bert-base-uncased`; we simply expose it under the `DistilBertTokenizer` name to keep naming consistent across the library's models.
```python
import torch
from pytorch_transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')

input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # batch of size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-states are the first element of the output tuple
```
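The SQuAD-distilled checkpoint can be used for extractive question answering in the same way. The sketch below assumes a `DistilBertForQuestionAnswering` head following the library's naming conventions, returning start and end logits as the first two elements of the output tuple; the question and context strings are just examples.

```python
import torch
from pytorch_transformers import DistilBertTokenizer, DistilBertForQuestionAnswering

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased-distilled-squad')

question, context = "Who created DistilBERT?", "DistilBERT was created by Hugging Face."
# Encode the question and the context as a single sequence pair.
input_ids = torch.tensor(tokenizer.encode(question, context, add_special_tokens=True)).unsqueeze(0)

start_logits, end_logits = model(input_ids)[:2]
# Take the most likely start and end positions and decode the corresponding span.
start, end = start_logits.argmax(dim=-1).item(), end_logits.argmax(dim=-1).item()
answer = tokenizer.decode(input_ids[0, start:end + 1].tolist())
```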
## How to train DistilBERT

In the following, we will explain how you can train your own compressed model.
### A. Preparing the data

The weights we release were trained on a concatenation of the Toronto Book Corpus and English Wikipedia (the same training data as the English version of BERT).
To avoid processing the data several times, we do it once and for all before training. From now on, we will suppose that you have a text file `dump.txt` which contains one sequence per line (a sequence being composed of one or several coherent sentences).

First, we will binarize the data, i.e. tokenize the data and convert each token into an index in our model's vocabulary.
```bash
python scripts/binarized_data.py \
    --file_path data/dump.txt \
    --bert_tokenizer bert-base-uncased \
    --dump_file data/binarized_text
```
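Conceptually, binarization simply tokenizes every line of the dump and stores the resulting lists of vocabulary indices. Here is a simplified sketch of the idea; the exact behavior and output format of `scripts/binarized_data.py` may differ, and the pickle layout below is an assumption.

```python
import pickle
from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

sequences = []
with open('data/dump.txt', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if line:
            # Convert each sequence into a list of vocabulary indices.
            sequences.append(tokenizer.encode(line))

# Assumed output layout: a pickled list of token-id lists.
with open('data/binarized_text.bert-base-uncased.pickle', 'wb') as f:
    pickle.dump(sequences, f)
```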
Our implementation of the masked language modeling loss follows [XLM](https://github.com/facebookresearch/XLM)'s and smooths the masking probability with a factor that puts more emphasis on rare words. We therefore count the occurrences of each token in the data:
```bash
python scripts/token_counts.py \
    --data_file data/binarized_text.bert-base-uncased.pickle \
    --token_counts_dump data/token_counts.bert-base-uncased.pickle
```
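These counts are later turned into masking weights that favor rare tokens. The sketch below only illustrates the idea; the exponent value and the file layouts are illustrative assumptions, not the values used for the released training.

```python
import pickle
from collections import Counter

# Count the occurrences of every token id in the binarized data
# (assuming the pickled list-of-lists layout sketched above).
with open('data/binarized_text.bert-base-uncased.pickle', 'rb') as f:
    sequences = pickle.load(f)
counts = Counter(token for seq in sequences for token in seq)

with open('data/token_counts.bert-base-uncased.pickle', 'wb') as f:
    pickle.dump(counts, f)

# During training, masking weights put more emphasis on rare tokens,
# e.g. proportional to count ** -alpha (alpha is illustrative here).
alpha = 0.7
masking_weights = {token: count ** -alpha for token, count in counts.items()}
```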
### B. Training

Training with distillation is really simple once you have pre-processed the data:
```bash
python train.py \
    --dump_path serialization_dir/my_first_training \
    --data_file data/binarized_text.bert-base-uncased.pickle \
    --token_counts data/token_counts.bert-base-uncased.pickle \
    --force # overwrites the `dump_path` if it already exists.
```
By default, this will launch training on a single GPU (even if more are available on the cluster). Other parameters are available on the command line; please look in `train.py` or run `python train.py --help` to list them.
We highly encourage you to use distributed training for training DistilBERT, as the training corpus is quite large. Here's an example that runs distributed training on a single node with 4 GPUs:
```bash
export NODE_RANK=0
export N_NODES=1
export N_GPU_NODE=4
export WORLD_SIZE=4
export MASTER_PORT=<AN_OPEN_PORT>
export MASTER_ADDR=<I.P.>

pkill -f 'python -u train.py'

python -m torch.distributed.launch \
    --nproc_per_node=$N_GPU_NODE \
    --nnodes=$N_NODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    train.py \
        --force \
        --n_gpu $WORLD_SIZE \
        --data_file data/binarized_text.bert-base-uncased.pickle \
        --token_counts data/token_counts.bert-base-uncased.pickle \
        --dump_path serialization_dir/my_first_distillation
```
**Tip:** Starting the distillation from a good initialization of the model weights is crucial to reach decent performance. In our experiments, we initialized our model from a few layers of the teacher (BERT) itself! Please refer to `scripts/extract_for_distil.py` to create a valid initialization checkpoint, and use the `--from_pretrained_weights` and `--from_pretrained_config` arguments to use this initialization for the distilled training!
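For illustration, extracting such an initialization from the teacher roughly amounts to copying the embeddings and a subset of the teacher's Transformer layers into a new checkpoint. The sketch below is only an approximation of what `scripts/extract_for_distil.py` does; the layer selection, the parameter naming and the output path are assumptions.

```python
import torch
from pytorch_transformers import BertForMaskedLM

# Load the teacher and keep the parameters of a subset of its layers.
teacher = BertForMaskedLM.from_pretrained('bert-base-uncased')
state_dict = teacher.state_dict()

layers_to_keep = [0, 2, 4, 6, 8, 10]  # illustrative choice of 6 out of 12 layers
student_state_dict = {}
for name, param in state_dict.items():
    if '.layer.' in name:
        layer_idx = int(name.split('.layer.')[1].split('.')[0])
        if layer_idx in layers_to_keep:
            # Renumber the kept layers consecutively for the 6-layer student.
            new_idx = layers_to_keep.index(layer_idx)
            new_name = name.replace(f'.layer.{layer_idx}.', f'.layer.{new_idx}.')
            student_state_dict[new_name] = param
    else:
        # Embeddings, LM head, etc. are copied as-is.
        student_state_dict[name] = param

# Example output path; pass it to `--from_pretrained_weights`.
torch.save(student_state_dict, 'serialization_dir/bert_init_checkpoint.pth')
```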
Happy distillation!