File size: 480 Bytes
57bdca5 |
1 2 3 4 5 6 7 |
This involves repeatedly sliding the context window so that the model has more context when making each prediction. This is a closer approximation to the true decomposition of the sequence probability and will typically yield a more favorable score. The downside is that it requires a separate forward pass for each token in the corpus. A good practical compromise is to employ a strided sliding window, moving the context by larger strides rather than sliding by 1 token a time. |