Attention is only computed within a local window, and the window boundaries are shifted between attention layers so that tokens near the edge of one window can exchange information with tokens in neighboring windows over successive layers, giving the model a wider effective receptive field without the full quadratic cost.
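
Below is a minimal sketch of this idea, not the implementation of any specific model. The function name `local_window_attention` and the `window_size`/`shift` parameters are illustrative assumptions, and the shift is done with a simple roll (real shifted-window models such as Swin additionally mask attention for positions that wrap around the sequence edge).

```python
# A minimal sketch of local windowed attention with a per-layer window shift.
# Assumes seq_len is a multiple of window_size; masking of wrapped positions
# after the shift is omitted for brevity.
import torch
import torch.nn.functional as F

def local_window_attention(q, k, v, window_size, shift=0):
    """Attention computed only inside non-overlapping local windows.

    q, k, v: (batch, seq_len, dim). `shift` rolls the sequence before
    windowing so successive layers see different window boundaries.
    """
    batch, seq_len, dim = q.shape
    if shift:
        # Shift the sequence so window boundaries differ from the previous
        # layer, letting information cross the old window edges.
        q, k, v = (torch.roll(t, shifts=-shift, dims=1) for t in (q, k, v))

    # Split the sequence into seq_len // window_size local windows.
    def windows(t):
        return t.view(batch, seq_len // window_size, window_size, dim)

    qw, kw, vw = windows(q), windows(k), windows(v)

    # Standard scaled dot-product attention, restricted to each window.
    scores = qw @ kw.transpose(-2, -1) / dim ** 0.5
    out = F.softmax(scores, dim=-1) @ vw
    out = out.reshape(batch, seq_len, dim)

    if shift:
        # Undo the shift so outputs align with the original token positions.
        out = torch.roll(out, shifts=shift, dims=1)
    return out

# Usage: alternate unshifted and shifted windows across consecutive layers.
x = torch.randn(1, 8, 16)
h = local_window_attention(x, x, x, window_size=4, shift=0)  # layer 1
h = local_window_attention(h, h, h, window_size=4, shift=2)  # layer 2
```

Alternating the shift between layers is what connects neighboring windows: a token that sits at a window boundary in one layer ends up in the middle of a window in the next, so its information can propagate further with each layer.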