This is shown in Figure 2d of the paper; see below for a sample attention mask:
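As a rough illustration only (not the figure from the paper), the short NumPy sketch below shows one way such a mask could be built: each token attends to a small local window, and a few preselected tokens attend globally. The function name, window size, and global indices are assumptions made for the example.

```python
# A minimal sketch, assuming a sliding-window pattern plus a few global tokens.
import numpy as np

def sample_attention_mask(seq_len: int, window: int, global_tokens=()) -> np.ndarray:
    """Boolean mask where mask[i, j] == True means query i may attend to key j."""
    pos = np.arange(seq_len)
    # Local pattern: each token only sees tokens within `window` positions of itself.
    mask = np.abs(pos[:, None] - pos[None, :]) <= window
    # Global pattern: selected tokens attend to everything and are attended to
    # by everything (the access is symmetric).
    for g in global_tokens:
        mask[g, :] = True
        mask[:, g] = True
    return mask

# Hypothetical values chosen just to make the pattern visible when printed.
mask = sample_attention_mask(seq_len=10, window=1, global_tokens=(0,))
print(mask.astype(int))
```

In this sketch, each row has on the order of `2 * window + 1` allowed positions (plus the global columns) rather than `seq_len`, which is what keeps the cost roughly linear in the input length.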

Using those attention matrices with fewer parameters then allows the model to handle inputs with a longer sequence
length.