---
tags:
- generated_from_trainer
datasets:
- Graphcore/wikipedia-bert-128
- Graphcore/wikipedia-bert-512
model-index:
- name: Graphcore/bert-base-uncased
  results: []
---

# Graphcore/bert-base-uncased

This is a BERT-Base model pre-trained in two phases on the [Graphcore/wikipedia-bert-128](https://huggingface.co/datasets/Graphcore/wikipedia-bert-128) and [Graphcore/wikipedia-bert-512](https://huggingface.co/datasets/Graphcore/wikipedia-bert-512) datasets.

It was trained on a Graphcore IPU-POD16 using [`optimum-graphcore`](https://github.com/huggingface/optimum-graphcore).
Graphcore and Hugging Face are working together to make training of Transformer models on IPUs fast and easy. Learn more about how to take advantage of the power of Graphcore IPUs to train Transformers models at [hf.co/hardware/graphcore](https://huggingface.co/hardware/graphcore).

## Model description

Pre-trained BERT Base model trained on Wikipedia data.


## Training and evaluation data

Trained on Wikipedia datasets:
- [Graphcore/wikipedia-bert-128](https://huggingface.co/datasets/Graphcore/wikipedia-bert-128)
- [Graphcore/wikipedia-bert-512](https://huggingface.co/datasets/Graphcore/wikipedia-bert-512)


## Training procedure

Trained with the MLM and NSP pre-training scheme from [Large Batch Optimization for Deep Learning: Training BERT in 76 minutes](https://arxiv.org/abs/1904.00962).
Trained on a Graphcore IPU-POD16 using [`optimum-graphcore`](https://github.com/huggingface/optimum-graphcore).

It was trained with the IPUConfig [Graphcore/bert-base-ipu](https://huggingface.co/Graphcore/bert-base-ipu/).

Command lines:

Phase 1:
```
python examples/language-modeling/run_pretraining.py \
  --config_name bert-base-uncased \
  --tokenizer_name bert-base-uncased \
  --ipu_config_name Graphcore/bert-base-ipu \
  --dataset_name Graphcore/wikipedia-bert-128 \
  --do_train \
  --logging_steps 5 \
  --max_seq_length 128 \
  --max_steps 10500 \
  --is_already_preprocessed \
  --dataloader_num_workers 64 \
  --dataloader_mode async_rebatched \
  --lamb \
  --lamb_no_bias_correction \
  --per_device_train_batch_size 32 \
  --gradient_accumulation_steps 512 \
  --learning_rate 0.006 \
  --lr_scheduler_type linear \
  --loss_scaling 16384 \
  --weight_decay 0.01 \
  --warmup_ratio 0.28 \
  --save_steps 100 \
  --config_overrides "layer_norm_eps=0.001" \
  --ipu_config_overrides "device_iterations=1" \
  --output_dir output-pretrain-bert-base-phase1
```

Phase 2:
```
python examples/language-modeling/run_pretraining.py \
  --config_name bert-base-uncased \
  --tokenizer_name bert-base-uncased \
  --ipu_config_name Graphcore/bert-base-ipu \
  --dataset_name Graphcore/wikipedia-bert-512 \
  --model_name_or_path ./output-pretrain-bert-base-phase1 \
  --do_train \
  --logging_steps 5 \
  --max_seq_length 512 \
  --max_steps 2038 \
  --is_already_preprocessed \
  --dataloader_num_workers 128 \
  --dataloader_mode async_rebatched \
  --lamb \
  --lamb_no_bias_correction \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 512 \
  --learning_rate 0.002828 \
  --lr_scheduler_type linear \
  --loss_scaling 128.0 \
  --weight_decay 0.01 \
  --warmup_ratio 0.128 \
  --config_overrides "layer_norm_eps=0.001" \
  --ipu_config_overrides "device_iterations=1,embedding_serialization_factor=2,matmul_proportion=0.22" \
  --output_dir output-pretrain-bert-base-phase2
```

### Training hyperparameters

The following hyperparameters were used during phase 1 training:
- learning_rate: 0.006
- train_batch_size: 32
- eval_batch_size: 8
- seed: 42
- distributed_type: IPU
- gradient_accumulation_steps: 512
- total_train_batch_size: 65536
- total_eval_batch_size: 128
- optimizer: LAMB
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.28
- training_steps: 10500
- training precision: Mixed Precision
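
To make the `linear` scheduler with `lr_scheduler_warmup_ratio` concrete, here is a minimal sketch of a linear warmup followed by a linear decay to zero, evaluated with the phase 1 values. The function name `linear_schedule_lr` is hypothetical, and rounding the warmup length up with `math.ceil` is an assumption; the training scripts may round differently.

```python
import math


def linear_schedule_lr(step, peak_lr, total_steps, warmup_ratio):
    """Learning rate at `step` for linear warmup then linear decay to zero."""
    # Assumption: warmup length is the ratio times the step budget, rounded up.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp linearly from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from the peak down to 0 at the final step.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Phase 1 values taken from the hyperparameter list above.
PHASE1 = dict(peak_lr=0.006, total_steps=10500, warmup_ratio=0.28)

for step in (0, 1000, 3000, 6000, 10500):
    print(step, linear_schedule_lr(step, **PHASE1))
```

With a warmup ratio of 0.28, roughly the first 2,940 of the 10,500 steps are spent ramping up, which is a comparatively long warmup for the large batch size used here.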

The following hyperparameters were used during phase 2 training:
- learning_rate: 0.002828
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: IPU
- gradient_accumulation_steps: 512
- total_train_batch_size: 16384
- total_eval_batch_size: 128
- optimizer: LAMB
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.128
- training_steps: 2038
- training precision: Mixed Precision
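
The reported `total_train_batch_size` values can be checked against the per-device batch size and gradient accumulation steps. The sketch below divides those two out and shows that a factor of 4 remains in both phases. Reading that factor as the number of data-parallel replicas is an inference, not something this card states, and the helper name `implied_parallel_factor` is hypothetical.

```python
def implied_parallel_factor(total_batch, micro_batch, grad_accum_steps):
    """Factor left after dividing the total batch by micro-batch and accumulation."""
    per_replica = micro_batch * grad_accum_steps
    if total_batch % per_replica != 0:
        raise ValueError("total batch is not a multiple of micro_batch * grad_accum_steps")
    return total_batch // per_replica


# Values copied from the two hyperparameter lists above.
PHASES = {
    "phase 1": dict(total_batch=65536, micro_batch=32, grad_accum_steps=512, steps=10500),
    "phase 2": dict(total_batch=16384, micro_batch=8, grad_accum_steps=512, steps=2038),
}

for name, p in PHASES.items():
    factor = implied_parallel_factor(p["total_batch"], p["micro_batch"], p["grad_accum_steps"])
    # Sequences consumed over the whole phase: one total batch per training step.
    sequences_seen = p["steps"] * p["total_batch"]
    print(f"{name}: implied factor = {factor}, sequences seen = {sequences_seen:,}")
```

By the same arithmetic, phase 1 processes 688,128,000 sequences of length 128 and phase 2 processes 33,390,592 sequences of length 512.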


### Framework versions

- Transformers 4.17.0.dev0
- PyTorch 1.10.0+cpu
- Datasets 1.18.3.dev0
- Tokenizers 0.10.3

## Fine-tuning with these weights

These weights can be used in either `transformers` or [`optimum-graphcore`](https://github.com/huggingface/optimum-graphcore).

For example, to fine-tune on the GLUE task SST2 with `optimum-graphcore` you can do:

```
export TOKENIZERS_PARALLELISM=true
python examples/text-classification/run_glue.py \
  --model_name_or_path Graphcore/bert-base-uncased \
  --ipu_config_name Graphcore/bert-base-ipu \
  --task_name sst2 \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 1 \
  --per_device_eval_batch_size 4 \
  --gradient_accumulation_steps 32 \
  --pod_type pod4 \
  --learning_rate 2e-5 \
  --lr_scheduler_type linear \
  --warmup_ratio 0.25 \
  --num_train_epochs 3 \
  --seed 1984 \
  --save_steps -1 \
  --dataloader_num_workers 64 \
  --dataloader_drop_last \
  --overwrite_output_dir \
  --output_dir /tmp/sst2
```
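
For the plain `transformers` route, a minimal sketch of loading the checkpoint for a two-class task such as SST2 might look like the following. It assumes the weights are published under the model id `Graphcore/bert-base-uncased` and reuses the `bert-base-uncased` tokenizer named in the pre-training commands; the helper `load_for_sst2` is hypothetical, and the classification head it creates is freshly initialized and still needs fine-tuning.

```python
MODEL_ID = "Graphcore/bert-base-uncased"  # assumed Hub id for these weights
TOKENIZER_ID = "bert-base-uncased"  # tokenizer used in the pre-training commands


def load_for_sst2(model_id=MODEL_ID, tokenizer_id=TOKENIZER_ID):
    """Return (tokenizer, model) ready for binary sequence classification."""
    # Imported lazily so that defining this helper does not require the library.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
    # num_labels=2 because SST2 is a binary sentiment task; the classifier
    # layer on top of the pre-trained encoder starts from random weights.
    model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
    return tokenizer, model
```

Calling `tokenizer, model = load_for_sst2()` downloads the files on first use, after which the pair can be passed to a standard fine-tuning loop or to `transformers.Trainer`.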