#### Fine-tuning BERT on SQuAD1.0 with relative position embeddings

The following examples show how to fine-tune BERT models with different relative position embeddings. The BERT model
`bert-base-uncased` was pretrained with the default absolute position embeddings. We provide the following pretrained
models, which were pre-trained on the same training data (BooksCorpus and English Wikipedia) as the original BERT
model but with different relative position embeddings (a short loading sketch follows the list).

* `zhiheng-huang/bert-base-uncased-embedding-relative-key`, trained from scratch with the relative position embedding
proposed by Shaw et al., [Self-Attention with Relative Position Representations](https://arxiv.org/abs/1803.02155)
* `zhiheng-huang/bert-base-uncased-embedding-relative-key-query`, trained from scratch with relative position embedding
method 4 in Huang et al., [Improve Transformer Models with Better Relative Position Embeddings](https://arxiv.org/abs/2009.13658)
* `zhiheng-huang/bert-large-uncased-whole-word-masking-embedding-relative-key-query`, fine-tuned from
`bert-large-uncased-whole-word-masking` for 3 additional epochs with relative position embedding method 4 in Huang et
al., [Improve Transformer Models with Better Relative Position Embeddings](https://arxiv.org/abs/2009.13658)
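
These checkpoints load with the standard `transformers` auto classes; the variants are distinguished by the
`position_embedding_type` field of the model config (`"absolute"`, `"relative_key"`, or `"relative_key_query"`).
A minimal loading sketch, assuming the checkpoint configs carry that field:

```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

# Load one of the relative-position-embedding checkpoints listed above.
model_name = "zhiheng-huang/bert-base-uncased-embedding-relative-key-query"

config = AutoConfig.from_pretrained(model_name)
# Expected to print "relative_key_query" for this checkpoint.
print(config.position_embedding_type)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, config=config)
```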

##### Base models fine-tuning

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
    --model_name_or_path zhiheng-huang/bert-base-uncased-embedding-relative-key-query \
    --dataset_name squad \
    --do_train \
    --do_eval \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 512 \
    --doc_stride 128 \
    --output_dir relative_squad \
    --per_device_eval_batch_size=60 \
    --per_device_train_batch_size=6
```

Training with the above command leads to the following results. Relative to the default `bert-base-uncased` baseline,
this boosts the f1 score from 88.52 to 90.54.

```bash
'exact': 83.6802270577105, 'f1': 90.54772098174814
```

Changing `max_seq_length` from 512 to 384 in the above command leads to an f1 score of 90.34. Replacing the model
`zhiheng-huang/bert-base-uncased-embedding-relative-key-query` with
`zhiheng-huang/bert-base-uncased-embedding-relative-key` leads to an f1 score of 89.51. Training on a single GPU
instead of 8 GPUs leads to an f1 score of 90.71.
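
Once fine-tuning finishes, the checkpoint written to `relative_squad` (the `--output_dir` above) can be tried with the
question-answering pipeline. A minimal sketch, assuming the script saved both model and tokenizer to that directory;
the question and context strings are only for illustration:

```python
from transformers import pipeline

# Point the pipeline at the output_dir produced by the fine-tuning command above.
qa = pipeline("question-answering", model="relative_squad", tokenizer="relative_squad")

prediction = qa(
    question="What do water droplets collide with inside a cloud to form precipitation?",
    context="Precipitation forms as smaller droplets coalesce via collision with other "
            "rain drops or ice crystals within a cloud.",
)
print(prediction)  # {'score': ..., 'start': ..., 'end': ..., 'answer': ...}
```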

##### Large models fine-tuning

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
    --model_name_or_path zhiheng-huang/bert-large-uncased-whole-word-masking-embedding-relative-key-query \
    --dataset_name squad \
    --do_train \
    --do_eval \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 512 \
    --doc_stride 128 \
    --output_dir relative_squad \
    --per_gpu_eval_batch_size=6 \
    --per_gpu_train_batch_size=2 \
    --gradient_accumulation_steps 3
```
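
Note that `--per_gpu_train_batch_size=2` combined with `--gradient_accumulation_steps 3` on 8 GPUs gives an effective
train batch size of 2 × 3 × 8 = 48, the same as the 6 × 8 = 48 used for the base models above.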

Training with the above command leads to an f1 score of 93.52, which is slightly better than the f1 score of 93.15 for
`bert-large-uncased-whole-word-masking`.

#### Distributed training

Here is an example using distributed training on 8 V100 GPUs and the BERT Whole Word Masking uncased model to reach an
F1 > 93 on SQuAD1.1:

```bash
python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
    --model_name_or_path bert-large-uncased-whole-word-masking \
    --dataset_name squad \
    --do_train \
    --do_eval \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir ./examples/models/wwm_uncased_finetuned_squad/ \
    --per_device_eval_batch_size=3 \
    --per_device_train_batch_size=3
```

Training with the previously defined hyper-parameters yields the following results:

```bash
f1 = 93.15
exact_match = 86.91
```

This fine-tuned model is available as a checkpoint under the reference
[`bert-large-uncased-whole-word-masking-finetuned-squad`](https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad).
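
The published checkpoint can also be queried without the pipeline helper by decoding the span between the most
probable start and end logits. A minimal sketch; the question and context strings are only for illustration:

```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name)

question = "What does BERT stand for?"
context = "BERT stands for Bidirectional Encoder Representations from Transformers."

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Pick the most likely start/end token positions and decode that span as the answer.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
answer = tokenizer.decode(inputs["input_ids"][0, start : end + 1])
print(answer)
```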

#### Results

A larger batch size may improve performance at the cost of more memory.
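If GPU memory is the limiting factor, `--gradient_accumulation_steps` (as in the large-model command above) is a way
to raise the effective batch size without increasing per-device memory use.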

##### Results for SQuAD1.0 with the previously defined hyper-parameters:

```python
{
    "exact": 85.45884578997162,
    "f1": 92.5974600601065,
    "total": 10570,
    "HasAns_exact": 85.45884578997162,
    "HasAns_f1": 92.59746006010651,
    "HasAns_total": 10570
}
```

##### Results for SQuAD2.0 with the previously defined hyper-parameters:

```python
{
    "exact": 80.4177545691906,
    "f1": 84.07154997729623,
    "total": 11873,
    "HasAns_exact": 76.73751686909581,
    "HasAns_f1": 84.05558584352873,
    "HasAns_total": 5928,
    "NoAns_exact": 84.0874684608915,
    "NoAns_f1": 84.0874684608915,
    "NoAns_total": 5945
}
```