File size: 398 Bytes
57bdca5
 
 
 
 
1
2
3
4
5
The smaller the stride, the more context the model will have in making each prediction,
and the better the reported perplexity will typically be.
When we run the above with stride = 1024, i.e. no overlap, the resulting PPL is 19.44, which is about the same
as the 19.93 reported in the GPT-2 paper. By using stride = 512 and thereby employing our striding window
strategy, this jumps down to 16.45.