Also, by stacking attention layers that have a small window, the last layer will have a receptive field of more than just the tokens in the window, allowing the model to build a representation of the whole sentence.
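
Below is a minimal sketch (not library code) of why this happens: each local-attention layer lets a token see only a small neighborhood, but composing the reachability masks of stacked layers shows the effective receptive field growing until it covers the whole sequence. The sequence length, window size, and number of layers are arbitrary values chosen for illustration.

```python
import numpy as np

seq_len = 16   # hypothetical sequence length
window = 3     # each token attends to itself and one neighbor on each side

# Boolean mask of a single local-attention layer: True where attention is allowed.
idx = np.arange(seq_len)
local_mask = np.abs(idx[:, None] - idx[None, :]) <= window // 2

# Compose the mask with itself to track which tokens a position can
# (indirectly) gather information from after stacking several layers.
reach = np.eye(seq_len, dtype=bool)
for layer in range(1, 9):
    reach = (local_mask.astype(int) @ reach.astype(int)) > 0
    middle = seq_len // 2
    print(f"after layer {layer}: token {middle} sees {reach[middle].sum()} / {seq_len} tokens")
```

With a window of 3, each additional layer extends the receptive field by one token on each side, so after enough layers a middle token indirectly sees the entire 16-token sequence even though every individual layer only ever attends within its small window.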