This involves repeatedly
sliding the context window so that the model has more context when making each prediction.

This is a closer approximation to the true decomposition of the sequence probability and will typically yield a more
favorable score. The downside is that it requires a separate forward pass for each token in the corpus. A good
practical compromise is to employ a strided sliding window, moving the context by larger strides rather than sliding by
1 token at a time.
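
As a rough illustration, here is a minimal sketch of a strided sliding-window evaluation loop using a Hugging Face causal language model. The model name, stride value, and evaluation text are placeholder assumptions, not prescriptions; setting `stride = 1` recovers the fully-sliding (one-token-at-a-time) evaluation described above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "gpt2"  # assumed model; any causal LM should work the same way
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

text = "..."  # your evaluation corpus
encodings = tokenizer(text, return_tensors="pt")

max_length = model.config.n_positions  # context window size (1024 for GPT-2)
stride = 512                           # how far the window advances each step
seq_len = encodings.input_ids.size(1)

nlls = []
prev_end = 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end  # only score tokens not already scored by the previous window
    input_ids = encodings.input_ids[:, begin:end].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # mask context-only tokens out of the loss

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)
        # outputs.loss is the mean NLL over the unmasked targets; rescaling by
        # trg_len turns it back into (approximately) a summed NLL. This is a
        # slight approximation because the labels are shifted by one internally.
        nlls.append(outputs.loss * trg_len)

    prev_end = end
    if end == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).sum() / prev_end)
print(f"Perplexity: {ppl.item():.2f}")
```

With a larger stride the model sees less context for the first tokens of each window but needs far fewer forward passes; shrinking the stride trades compute for a tighter approximation to the true sequence probability.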