The smaller the stride, the more context the model will have in making each prediction,
and the better the reported perplexity will typically be.
When we run the above with stride = 1024, i.e. with no overlap between windows, the resulting PPL is 19.44, which is
close to the 19.93 reported in the GPT-2 paper. By using stride = 512 and thereby employing our striding window
strategy, this drops to 16.45.
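
To make the effect of the stride concrete, here is a minimal sketch of how the evaluation windows could be laid out. It is not the exact code referred to above: the function name `strided_windows`, the sequence length of 2500 tokens, and the window size of 1024 are assumptions chosen for illustration. No model is involved; the sketch only shows which tokens each window would score and how many earlier tokens serve as context.

```python
def strided_windows(seq_len, max_length, stride):
    """Return (begin, end, n_scored) for each evaluation window.

    Hypothetical helper for illustration: each window covers tokens
    [begin, end), but only the last n_scored tokens count toward the loss.
    """
    if stride <= 0 or stride > max_length:
        raise ValueError("stride must be between 1 and max_length")
    windows = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        n_scored = end - prev_end  # tokens not already scored by an earlier window
        windows.append((begin, end, n_scored))
        prev_end = end
        if end == seq_len:
            break
    return windows


# Assumed sizes: 2500 tokens of text, a 1024-token model window.
for stride in (1024, 512):
    for begin, end, n_scored in strided_windows(2500, 1024, stride):
        context = (end - begin) - n_scored  # tokens seen but not scored
        print(f"stride={stride}: tokens [{begin}, {end}) scored={n_scored} context={context}")
```

With stride = 1024 every window scores all of its tokens and carries no context over from the previous window. With stride = 512 each window after the first scores only its newest tokens and conditions on the 512 tokens before them, which is why the reported perplexity is typically lower.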