The smaller the stride, the more context the model will have in making each prediction,
and the better the reported perplexity will typically be.
When we run the above with stride = 1024, i.e. no overlap, the resulting PPL is 19.44, which is about the same
as the 19.93 reported in the GPT-2 paper. By using stride = 512 and thereby employing our striding window
strategy, this drops to 16.45.
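
To make the effect of the stride concrete, the following is a minimal sketch of the window arithmetic only; it loads no model and computes no perplexity. The helper name `strided_windows` and the example sequence length of 2048 tokens are assumptions made for illustration, and the window size of 1024 is inferred from the statement that stride = 1024 means no overlap.

```python
def strided_windows(seq_len, max_length, stride):
    """Return (begin, end, trg_len) for each evaluation window.

    begin/end delimit the tokens fed to the model; trg_len is the number
    of tokens at the end of the window that are actually scored. The
    remaining tokens in the window serve only as context.
    Hypothetical helper for illustration, not a library function.
    """
    windows = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        trg_len = end - prev_end  # only tokens not scored by an earlier window
        windows.append((begin, end, trg_len))
        prev_end = end
        if end == seq_len:
            break
    return windows


# Compare the two settings discussed above on an assumed 2048-token text.
for stride in (1024, 512):
    print(f"stride = {stride}")
    for begin, end, trg_len in strided_windows(2048, 1024, stride):
        context = (end - begin) - trg_len  # tokens seen but not scored
        print(f"  tokens [{begin}, {end}): score {trg_len}, context-only {context}")
```

Under these assumptions, stride = 1024 gives every window zero context-only tokens, so the first token scored in each window is predicted with no preceding text. With stride = 512, every window after the first carries 512 context-only tokens, which is consistent with the lower perplexity reported above.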