This involves repeatedly sliding the context window so that the model has more context when making each prediction. This is a closer approximation to the true decomposition of the sequence probability and will typically yield a more favorable score. The downside is that it requires a separate forward pass for each token in the corpus. A good practical compromise is to employ a strided sliding window, moving the context by larger strides rather than sliding by one token at a time.
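The strided window bookkeeping can be sketched as follows. This is a minimal illustration, not any library's API: the hypothetical helper `strided_windows` only enumerates the window spans, and the model forward pass is omitted. In a real evaluation, each span `(begin, end, target_start)` would be one forward pass over tokens `[begin, end)`, with the tokens before `target_start` serving only as context (masked out of the loss) so that every token in the corpus is scored exactly once.

```python
def strided_windows(seq_len, max_context, stride):
    """Yield (begin, end, target_start) spans for strided evaluation.

    Tokens in [target_start, end) are scored in that window; tokens in
    [begin, target_start) provide context only. With stride == max_context
    this reduces to disjoint chunks; with stride == 1 it is the full
    (one-token-at-a-time) sliding window.
    """
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_context, seq_len)
        yield begin, end, prev_end
        prev_end = end
        if end == seq_len:
            break

# Toy example: a 10-token sequence, 4-token context, stride of 2.
spans = list(strided_windows(seq_len=10, max_context=4, stride=2))
# Every token is scored exactly once across the windows.
scored = [t for begin, end, target_start in spans
          for t in range(target_start, end)]
assert scored == list(range(10))
```

Larger strides mean fewer forward passes (roughly `seq_len / stride` instead of `seq_len`), at the cost of some tokens being predicted with less preceding context than the model's maximum.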