lr = 2e-6; ~2.5 million tokens of Python instruct data, roughly ~7k tokens per sample (300 samples total). One epoch of distillation against the 70B teacher's logprobs, top-k = 200.
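For reference, a minimal sketch of what a top-k logprob distillation loss like this could look like in PyTorch, assuming the 70B teacher's top-200 logprobs and their token indices were precomputed per position and the student matches them with a forward KL restricted to those tokens. The function and tensor names (`topk_distill_loss`, `student_logits`, `teacher_topk_logprobs`, `teacher_topk_ids`) are hypothetical placeholders, not the actual training code.

```python
# Sketch only: top-k logprob distillation loss (assumed setup, not the original code).
import torch
import torch.nn.functional as F

def topk_distill_loss(student_logits: torch.Tensor,
                      teacher_topk_logprobs: torch.Tensor,
                      teacher_topk_ids: torch.Tensor) -> torch.Tensor:
    """
    student_logits:        (batch, seq, vocab)  raw student logits
    teacher_topk_logprobs: (batch, seq, k=200)  teacher log-probs of its top-k tokens
    teacher_topk_ids:      (batch, seq, k=200)  vocab indices of those top-k tokens
    """
    # Student log-probs over the full vocab, then gather the entries at the
    # teacher's top-k token ids so both distributions cover the same support.
    student_logprobs = F.log_softmax(student_logits, dim=-1)
    student_topk = torch.gather(student_logprobs, dim=-1, index=teacher_topk_ids)

    # Renormalize the teacher's truncated top-k mass into a proper distribution
    # over the k retained tokens (exp, then divide by the retained mass).
    teacher_probs = teacher_topk_logprobs.exp()
    teacher_probs = teacher_probs / teacher_probs.sum(dim=-1, keepdim=True)

    # Forward KL over the top-k support: sum p_T * (log p_T - log p_S), averaged
    # over batch and sequence positions.
    kl = (teacher_probs * (torch.log(teacher_probs + 1e-9) - student_topk)).sum(dim=-1)
    return kl.mean()

if __name__ == "__main__":
    # Toy shapes for a quick check; real samples in the run above are ~7k tokens.
    B, T, V, K = 1, 128, 32000, 200
    student_logits = torch.randn(B, T, V)
    teacher_topk_ids = torch.randint(0, V, (B, T, K))
    teacher_topk_logprobs = torch.log_softmax(torch.randn(B, T, K), dim=-1)
    print(topk_distill_loss(student_logits, teacher_topk_logprobs, teacher_topk_ids).item())
```

Storing only the top-200 teacher logprobs per position keeps the cached teacher outputs small relative to the full vocab while preserving nearly all of the probability mass the student needs to match.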