This is shown in Figure 2d of the paper; see below for a sample attention mask:

*(figure: sample sparse attention mask — image not reproduced here)*

Using these attention matrices with fewer parameters then allows the model to handle inputs with a much longer sequence length.
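Since the original figure is not reproduced here, the following minimal NumPy sketch builds an illustrative mask of this kind. It combines the three patterns typically composed in sparse-attention papers (global tokens, a sliding window, and random connections); the function name and parameters are hypothetical, and the exact pattern in the paper's Figure 2d may differ.

```python
import numpy as np

def sample_sparse_attention_mask(seq_len=16, window=1, num_global=2,
                                 num_random=2, seed=0):
    """Build a toy sparse attention mask: 1 = attend, 0 = masked out.

    Illustrative only -- combines global, sliding-window, and random
    attention, the usual building blocks of sparse attention patterns.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=np.int8)

    # Sliding-window attention: each token sees its local neighbours.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = 1

    # Global attention: the first `num_global` tokens attend everywhere,
    # and every token attends back to them.
    mask[:num_global, :] = 1
    mask[:, :num_global] = 1

    # Random attention: a few extra random keys per query row.
    for i in range(seq_len):
        cols = rng.choice(seq_len, size=num_random, replace=False)
        mask[i, cols] = 1

    return mask

# Print the mask as a small grid of 0s and 1s.
print(sample_sparse_attention_mask())
```

Because each query row attends to only `O(window + num_global + num_random)` keys instead of all `seq_len` keys, the cost of attention grows roughly linearly rather than quadratically with sequence length, which is what makes longer inputs feasible.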