---
license: apache-2.0
tags:
- generated_from_trainer
datasets:
- Graphcore/wikipedia-bert-128
- Graphcore/wikipedia-bert-512
model-index:
- name: Graphcore/bert-base-uncased
results: []
---
# Graphcore/bert-base-uncased
This model is a pre-trained BERT-Base trained in two phases on the [Graphcore/wikipedia-bert-128](https://huggingface.co/datasets/Graphcore/wikipedia-bert-128) and [Graphcore/wikipedia-bert-512](https://huggingface.co/datasets/Graphcore/wikipedia-bert-512) datasets.
It was trained on a Graphcore IPU-POD16 using [`optimum-graphcore`](https://github.com/huggingface/optimum-graphcore).
Graphcore and Hugging Face are working together to make training of Transformer models on IPUs fast and easy. Learn more about how to take advantage of the power of Graphcore IPUs to train Transformers models at [hf.co/hardware/graphcore](https://huggingface.co/hardware/graphcore).
## Model description
BERT (Bidirectional Encoder Representations from Transformers) is a Transformer model designed to pretrain bidirectional representations from unlabeled text. It enables easy and fast fine-tuning for downstream tasks such as sequence classification, named entity recognition, question answering, multiple choice and masked language modeling.
It was pretrained with two objectives: masked language modeling (MLM) and next sentence prediction (NSP). A traditional language model reads words one after another in a single direction, whereas MLM hides some input tokens and asks the model to predict them from both the left and the right context, which lets BERT learn a bidirectional representation. In addition to MLM, NSP is used to jointly pretrain text-pair representations.
Because the pre-trained representations transfer across tasks, BERT reduces the engineering effort needed to build task-specific architectures, and it achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks.
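To make the MLM objective concrete, here is a minimal sketch of how input tokens can be corrupted before prediction. The 15% selection rate and the 80/10/10 split between `[MASK]`, a random token, and the unchanged token follow the original BERT paper and are assumptions here; this card does not state the values used for this checkpoint. `mask_tokens` and the toy vocabulary are hypothetical names for illustration only.

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["cat", "dog", "tree", "river", "blue"]  # hypothetical toy vocabulary


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (corrupted, labels); labels[i] is None where no prediction is required."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            # This position is selected: the model must predict the original token.
            labels.append(tok)
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)                   # assumed 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.choice(TOY_VOCAB))  # assumed 10%: random token
            else:
                corrupted.append(tok)                    # assumed 10%: keep unchanged
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels


sentence = "the quick brown fox jumps over the lazy dog".split()
corrupted, labels = mask_tokens(sentence, mask_prob=0.15, seed=0)
print(corrupted)
print(labels)
```

Because a selected position can be predicted from context on both sides, the loss on these positions trains a bidirectional representation rather than a left-to-right one.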
## Training and evaluation data
The model was trained on the following Wikipedia datasets:
- [Graphcore/wikipedia-bert-128](https://huggingface.co/datasets/Graphcore/wikipedia-bert-128)
- [Graphcore/wikipedia-bert-512](https://huggingface.co/datasets/Graphcore/wikipedia-bert-512)
## Fine-tuning with these weights
These weights can be used in either `transformers` or [`optimum-graphcore`](https://github.com/huggingface/optimum-graphcore).
For example, to fine-tune on the GLUE task SST-2 with `optimum-graphcore`, run:
```
export TOKENIZERS_PARALLELISM=true
python examples/text-classification/run_glue.py \
--model_name_or_path Graphcore/bert-base-uncased \
--ipu_config_name Graphcore/bert-base-ipu \
--task_name sst2 \
--do_train \
--do_eval \
--max_seq_length 128 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 32 \
--pod_type pod4 \
--learning_rate 2e-5 \
--lr_scheduler_type linear \
--warmup_ratio 0.25 \
--num_train_epochs 3 \
--seed 1984 \
--save_steps -1 \
--dataloader_num_workers 64 \
--dataloader_drop_last \
--overwrite_output_dir \
--output_dir /tmp/sst2
```
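With plain `transformers`, loading the weights might look like the sketch below. It assumes the checkpoint is published under this card's model-index name, `Graphcore/bert-base-uncased`, and it reuses the `bert-base-uncased` tokenizer named in the pre-training commands. `load_for_sst2` is a hypothetical helper; the import is deferred so that nothing is downloaded until the function is called.

```python
MODEL_ID = "Graphcore/bert-base-uncased"  # assumed Hub id, taken from this card's model-index name
TOKENIZER_ID = "bert-base-uncased"        # tokenizer named in the pre-training commands


def load_for_sst2(num_labels: int = 2):
    """Hypothetical helper: load the tokenizer and a sequence-classification head on these weights."""
    # Deferred import: requires the `transformers` package and network access when called.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_ID)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=num_labels)
    return tokenizer, model


# Usage (downloads files from the Hub, so it is left commented out):
# tokenizer, model = load_for_sst2()
# inputs = tokenizer("a gripping, well-acted film", truncation=True,
#                    max_length=128, return_tensors="pt")
# logits = model(**inputs).logits
```

The classification head is newly initialised in this setup, so the model still needs fine-tuning before its predictions are meaningful.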
## Training procedure
The model was trained with the MLM and NSP pre-training scheme from [Large Batch Optimization for Deep Learning: Training BERT in 76 minutes](https://arxiv.org/abs/1904.00962).
Training ran on a Graphcore IPU-POD16 using [`optimum-graphcore`](https://github.com/huggingface/optimum-graphcore).
It used the IPUConfig [Graphcore/bert-base-ipu](https://huggingface.co/Graphcore/bert-base-ipu/).
Command lines:
Phase 1:
```
python examples/language-modeling/run_pretraining.py \
--config_name bert-base-uncased \
--tokenizer_name bert-base-uncased \
--ipu_config_name Graphcore/bert-base-ipu \
--dataset_name Graphcore/wikipedia-bert-128 \
--do_train \
--logging_steps 5 \
--max_seq_length 128 \
--max_steps 10500 \
--is_already_preprocessed \
--dataloader_num_workers 64 \
--dataloader_mode async_rebatched \
--lamb \
--lamb_no_bias_correction \
--per_device_train_batch_size 32 \
--gradient_accumulation_steps 512 \
--learning_rate 0.006 \
--lr_scheduler_type linear \
--loss_scaling 16384 \
--weight_decay 0.01 \
--warmup_ratio 0.28 \
--save_steps 100 \
--config_overrides "layer_norm_eps=0.001" \
--ipu_config_overrides "device_iterations=1" \
--output_dir output-pretrain-bert-base-phase1
```
Phase 2:
```
python examples/language-modeling/run_pretraining.py \
--config_name bert-base-uncased \
--tokenizer_name bert-base-uncased \
--ipu_config_name Graphcore/bert-base-ipu \
--dataset_name Graphcore/wikipedia-bert-512 \
--model_name_or_path ./output-pretrain-bert-base-phase1 \
--do_train \
--logging_steps 5 \
--max_seq_length 512 \
--max_steps 2038 \
--is_already_preprocessed \
--dataloader_num_workers 128 \
--dataloader_mode async_rebatched \
--lamb \
--lamb_no_bias_correction \
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 512 \
--learning_rate 0.002828 \
--lr_scheduler_type linear \
--loss_scaling 128.0 \
--weight_decay 0.01 \
--warmup_ratio 0.128 \
--config_overrides "layer_norm_eps=0.001" \
--ipu_config_overrides "device_iterations=1,embedding_serialization_factor=2,matmul_proportion=0.22" \
--output_dir output-pretrain-bert-base-phase2
```
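Both phases pass `--lr_scheduler_type linear` together with a `--warmup_ratio`. The sketch below shows what those flags imply, assuming the usual shape of a linear schedule (linear warmup from zero to the peak learning rate, then linear decay to zero) and assuming the warmup step count is the ratio times `max_steps`, rounded to the nearest step. The exact rounding used by the training script is not stated in this card, and `warmup_steps` and `linear_lr` are hypothetical helper names.

```python
def warmup_steps(total_steps: int, warmup_ratio: float) -> int:
    """Number of warmup steps, assuming ratio * total rounded to the nearest step."""
    return round(total_steps * warmup_ratio)


def linear_lr(step: int, peak_lr: float, total_steps: int, n_warmup: int) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < n_warmup:
        return peak_lr * step / max(1, n_warmup)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - n_warmup)


# Values taken from the phase 1 and phase 2 command lines.
phases = {
    "phase 1": dict(peak_lr=0.006, total_steps=10500, warmup_ratio=0.28),
    "phase 2": dict(peak_lr=0.002828, total_steps=2038, warmup_ratio=0.128),
}

for name, p in phases.items():
    n = warmup_steps(p["total_steps"], p["warmup_ratio"])
    mid = linear_lr(n // 2, p["peak_lr"], p["total_steps"], n)
    print(f"{name}: {n} warmup steps, lr halfway through warmup = {mid:.6f}")
```

Under these assumptions, phase 1 warms up for 2940 of its 10500 steps and phase 2 for 261 of its 2038 steps.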
### Training hyperparameters
The following hyperparameters were used during phase 1 training:
- learning_rate: 0.006
- train_batch_size: 32
- eval_batch_size: 8
- seed: 42
- distributed_type: IPU
- gradient_accumulation_steps: 512
- total_train_batch_size: 65536
- total_eval_batch_size: 128
- optimizer: LAMB
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.28
- training_steps: 10500
- training precision: Mixed Precision
The following hyperparameters were used during phase 2 training:
- learning_rate: 0.002828
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: IPU
- gradient_accumulation_steps: 512
- total_train_batch_size: 16384
- total_eval_batch_size: 128
- optimizer: LAMB
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.128
- training_steps: 2038
- training precision: Mixed Precision
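The `total_train_batch_size` values above can be reproduced from the per-device batch size and the gradient accumulation steps. The sketch below assumes the global batch is the product of per-device batch size, gradient accumulation steps, replication factor and device iterations. The replication factor of 4 is an assumption inferred from the listed totals (it is not stated in this card), `device_iterations=1` comes from the `--ipu_config_overrides` flag in the commands, and `global_batch_size` is a hypothetical helper.

```python
def global_batch_size(per_device: int, grad_accum: int,
                      replication: int, device_iterations: int = 1) -> int:
    """Samples consumed per optimizer step under the assumed product formula."""
    return per_device * grad_accum * replication * device_iterations


REPLICATION_FACTOR = 4  # assumption: inferred from the totals listed above, not stated in the card

phase1 = global_batch_size(per_device=32, grad_accum=512, replication=REPLICATION_FACTOR)
phase2 = global_batch_size(per_device=8, grad_accum=512, replication=REPLICATION_FACTOR)

print(phase1)  # 65536, matches total_train_batch_size for phase 1
print(phase2)  # 16384, matches total_train_batch_size for phase 2

# Sequences processed over each phase (global batch size * training_steps).
print(phase1 * 10500)  # 688128000 sequences of length 128
print(phase2 * 2038)   # 33390592 sequences of length 512
```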
### Framework versions
- Transformers 4.17.0.dev0
- Pytorch 1.10.0+cpu
- Datasets 1.18.3.dev0
- Tokenizers 0.10.3