Running pytorch 2.4.1+cu121
running with DDP: False, device: cuda, world size: 1
total desired batch size: 524288
=> calculated gradient accumulation steps: 64
/scratch/user/mtseng/llm.c/myenv/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
DataLoader: total number of tokens: 10,255,324,043 across 103 files
DataLoader: total number of tokens: 100,000,000 across 1 files
num decayed parameter tensors: 169, with 328,238,080 parameters
num non-decayed parameter tensors: 145, with 123,904 parameters
W1012 17:53:40.071000 47645404906688 torch/fx/experimental/symbolic_shapes.py:4449] [0/0] xindex is not in var_ranges, defaulting to unknown range.
val loss: 11.029037
saving model checkpoint to ./results/gpt2-350M-gqa/step_0.pth
W1012 17:55:17.561000 47645404906688 torch/fx/experimental/symbolic_shapes.py:4449] [0/1] xindex is not in var_ranges, defaulting to unknown range.
step 1/76294 | train loss 11.031350 | norm 3.8118 | lr 1.20e-04 | (114544.47 ms | 4577 tok/s) step 2/76294 | train loss 10.597936 | norm 4.3013 | lr 1.22e-04 | (9681.10 ms | 54156 tok/s) step 3/76294 | train loss 9.891030 | norm 3.3333 | lr 1.23e-04 | (9587.41 ms | 54685 tok/s) step 4/76294 | train loss 9.517614 | norm 2.6683 | lr 1.25e-04 | (9618.32 ms | 54509 tok/s) step 5/76294 | train loss 9.316612 | norm 2.2820 | lr 1.26e-04 | (9648.45 ms | 54339 tok/s) step 6/76294 | train loss 9.117299 | norm 1.9910 | lr 1.28e-04 | (9693.20 ms | 54088 tok/s) step 7/76294 | train loss 8.907764 | norm 1.6779 | lr 1.29e-04 | (9720.04 ms | 53939 tok/s) step 8/76294 | train loss 8.812067 | norm 1.4087 | lr 1.31e-04 | (9751.54 ms | 53765 tok/s) step 9/76294 | train loss 8.638920 | norm 1.4555 | lr 1.32e-04 | (9760.78 ms | 53714 tok/s) step 10/76294 | train loss 8.519670 | norm 1.5738 | lr 1.34e-04 | (9789.42 ms | 53557 tok/s) step 11/76294 | train loss 8.348865 | norm 1.4550 | lr 1.35e-04 | (9829.76 ms | 53337 tok/s) step 12/76294 | train loss 8.251865 | norm 1.2213 | lr 1.37e-04 | (9811.94 ms | 53434 tok/s) step 13/76294 | train loss 8.110162 | norm 1.1533 | lr 1.38e-04 | (10003.56 ms | 52410 tok/s) step 14/76294 | train loss 8.006787 | norm 1.1335 | lr 1.40e-04 | (9929.03 ms | 52804 tok/s) step 15/76294 | train loss 7.891543 | norm 1.1158 | lr 1.41e-04 | (9839.16 ms | 53286 tok/s) step 16/76294 | train loss 7.789068 | norm 1.0054 | lr 1.43e-04 | (9840.61 ms | 53278 tok/s) step 17/76294 | train loss 7.679129 | norm 0.9306 | lr 1.44e-04 | (9852.67 ms | 53213 tok/s) step 18/76294 | train loss 7.570749 | norm 1.0001 | lr 1.46e-04 | (9886.24 ms | 53032 tok/s) step 19/76294 | train loss 7.482420 | norm 0.8326 | lr 1.47e-04 | (9890.08 ms | 53011 tok/s) step 20/76294 | train loss 7.477214 | norm 0.7132 | lr 1.49e-04 | (9859.09 ms | 53178 tok/s) step 21/76294 | train loss 7.416646 | norm 0.7399 | lr 1.50e-04 | (9860.24 ms | 53172 tok/s) step 22/76294 | train loss 7.313285 | norm 0.6967 | 
lr 1.52e-04 | (9864.21 ms | 53151 tok/s) step 23/76294 | train loss 7.258123 | norm 0.6545 | lr 1.53e-04 | (9884.85 ms | 53040 tok/s) step 24/76294 | train loss 7.193429 | norm 0.5893 | lr 1.55e-04 | (9882.19 ms | 53054 tok/s) step 25/76294 | train loss 7.187572 | norm 0.6809 | lr 1.56e-04 | (9878.00 ms | 53076 tok/s) step 26/76294 | train loss 7.170836 | norm 0.4974 | lr 1.58e-04 | (9884.07 ms | 53044 tok/s) step 27/76294 | train loss 7.054507 | norm 0.6223 | lr 1.59e-04 | (9941.34 ms | 52738 tok/s) step 28/76294 | train loss 7.080179 | norm 0.5357 | lr 1.61e-04 | (9880.31 ms | 53064 tok/s) step 29/76294 | train loss 7.044565 | norm 0.6210 | lr 1.62e-04 | (9909.51 ms | 52908 tok/s) step 30/76294 | train loss 6.897587 | norm 0.5369 | lr 1.64e-04 | (9940.90 ms | 52740 tok/s) step 31/76294 | train loss 6.951240 | norm 0.6245 | lr 1.65e-04 | (9893.26 ms | 52994 tok/s) step 32/76294 | train loss 6.888274 | norm 0.4759 | lr 1.67e-04 | (9894.02 ms | 52990 tok/s) step 33/76294 | train loss 6.865813 | norm 0.5774 | lr 1.68e-04 | (9908.65 ms | 52912 tok/s) step 34/76294 | train loss 6.913774 | norm 0.4918 | lr 1.70e-04 | (9877.35 ms | 53080 tok/s) step 35/76294 | train loss 6.837761 | norm 0.6476 | lr 1.71e-04 | (9882.77 ms | 53051 tok/s) step 36/76294 | train loss 6.861959 | norm 0.4677 | lr 1.73e-04 | (9880.68 ms | 53062 tok/s) step 37/76294 | train loss 6.818452 | norm 0.5707 | lr 1.74e-04 | (9883.61 ms | 53046 tok/s) step 38/76294 | train loss 6.755625 | norm 0.5401 | lr 1.76e-04 | (9888.09 ms | 53022 tok/s) step 39/76294 | train loss 6.753144 | norm 0.7771 | lr 1.77e-04 | (9877.93 ms | 53077 tok/s) step 40/76294 | train loss 6.852108 | norm 0.6865 | lr 1.79e-04 | (9885.81 ms | 53034 tok/s) step 41/76294 | train loss 6.752762 | norm 0.9123 | lr 1.80e-04 | (9921.49 ms | 52844 tok/s) step 42/76294 | train loss 6.761864 | norm 0.9527 | lr 1.82e-04 | (9883.01 ms | 53049 tok/s) step 43/76294 | train loss 6.675234 | norm 0.6426 | lr 1.83e-04 | (9887.66 ms | 53024 tok/s) step 
44/76294 | train loss 6.755922 | norm 0.6367 | lr 1.85e-04 | (9885.15 ms | 53038 tok/s) step 45/76294 | train loss 6.618926 | norm 0.7229 | lr 1.86e-04 | (9886.29 ms | 53032 tok/s) step 46/76294 | train loss 6.690403 | norm 0.5264 | lr 1.88e-04 | (9886.67 ms | 53030 tok/s) step 47/76294 | train loss 6.635604 | norm 0.7723 | lr 1.89e-04 | (9930.45 ms | 52796 tok/s) step 48/76294 | train loss 6.745103 | norm 0.5616 | lr 1.91e-04 | (9895.88 ms | 52980 tok/s) step 49/76294 | train loss 6.609724 | norm 0.6023 | lr 1.93e-04 | (9897.50 ms | 52972 tok/s) step 50/76294 | train loss 6.560349 | norm 0.5601 | lr 1.94e-04 | (9886.62 ms | 53030 tok/s) step 51/76294 | train loss 6.597063 | norm 0.5528 | lr 1.96e-04 | (9892.43 ms | 52999 tok/s) step 52/76294 | train loss 6.546556 | norm 0.5094 | lr 1.97e-04 | (9885.37 ms | 53037 tok/s) step 53/76294 | train loss 6.505567 | norm 0.6041 | lr 1.99e-04 | (9890.92 ms | 53007 tok/s) step 54/76294 | train loss 6.606180 | norm 0.4423 | lr 2.00e-04 | (9887.78 ms | 53024 tok/s) step 55/76294 | train loss 6.441758 | norm 0.4441 | lr 2.02e-04 | (9912.69 ms | 52891 tok/s) step 56/76294 | train loss 6.496853 | norm 0.4766 | lr 2.03e-04 | (9886.74 ms | 53029 tok/s) step 57/76294 | train loss 6.494996 | norm 0.3764 | lr 2.05e-04 | (9890.03 ms | 53012 tok/s) step 58/76294 | train loss 6.397686 | norm 0.4401 | lr 2.06e-04 | (9888.44 ms | 53020 tok/s) step 59/76294 | train loss 6.474599 | norm 0.4306 | lr 2.08e-04 | (9896.14 ms | 52979 tok/s) step 60/76294 | train loss 6.386726 | norm 0.4627 | lr 2.09e-04 | (9896.10 ms | 52979 tok/s) step 61/76294 | train loss 6.424536 | norm 0.4080 | lr 2.11e-04 | (9887.78 ms | 53024 tok/s) step 62/76294 | train loss 6.388732 | norm 0.3802 | lr 2.12e-04 | (9895.78 ms | 52981 tok/s) step 63/76294 | train loss 6.358540 | norm 0.3750 | lr 2.14e-04 | (9882.95 ms | 53050 tok/s) step 64/76294 | train loss 6.479663 | norm 0.3818 | lr 2.15e-04 | (9941.41 ms | 52738 tok/s) step 65/76294 | train loss 6.347420 | norm 0.5555 | 
lr 2.17e-04 | (9886.05 ms | 53033 tok/s) step 66/76294 | train loss 6.326612 | norm 0.4008 | lr 2.18e-04 | (9965.01 ms | 52613 tok/s) step 67/76294 | train loss 6.362984 | norm 0.3635 | lr 2.20e-04 | (9886.52 ms | 53031 tok/s) step 68/76294 | train loss 6.302211 | norm 0.3508 | lr 2.21e-04 | (9891.12 ms | 53006 tok/s) step 69/76294 | train loss 6.408684 | norm 0.4066 | lr 2.23e-04 | (9886.15 ms | 53033 tok/s) step 70/76294 | train loss 6.343723 | norm 0.3672 | lr 2.24e-04 | (9889.66 ms | 53014 tok/s) step 71/76294 | train loss 6.273463 | norm 0.3520 | lr 2.26e-04 | (9887.45 ms | 53026 tok/s) step 72/76294 | train loss 6.340117 | norm 0.3636 | lr 2.27e-04 | (9943.11 ms | 52729 tok/s) step 73/76294 | train loss 6.265882 | norm 0.3907 | lr 2.29e-04 | (9892.00 ms | 53001 tok/s) step 74/76294 | train loss 6.358171 | norm 0.3616 | lr 2.30e-04 | (9893.26 ms | 52994 tok/s) step 75/76294 | train loss 6.158112 | norm 0.3729 | lr 2.32e-04 | (9939.28 ms | 52749 tok/s) step 76/76294 | train loss 6.355137 | norm 0.4164 | lr 2.33e-04 | (9886.20 ms | 53032 tok/s) step 77/76294 | train loss 6.296048 | norm 0.5095 | lr 2.35e-04 | (9882.77 ms | 53051 tok/s) step 78/76294 | train loss 6.240786 | norm 0.4648 | lr 2.36e-04 | (9902.82 ms | 52943 tok/s) step 79/76294 | train loss 6.338998 | norm 0.4123 | lr 2.38e-04 | (9883.39 ms | 53047 tok/s) step 80/76294 | train loss 6.195395 | norm 0.4113 | lr 2.39e-04 | (9891.23 ms | 53005 tok/s) step 81/76294 | train loss 6.308849 | norm 0.4094 | lr 2.41e-04 | (9883.61 ms | 53046 tok/s) step 82/76294 | train loss 6.245669 | norm 0.3551 | lr 2.42e-04 | (9891.72 ms | 53003 tok/s) step 83/76294 | train loss 6.186605 | norm 0.4220 | lr 2.44e-04 | (9965.98 ms | 52608 tok/s) step 84/76294 | train loss 6.304008 | norm 0.3883 | lr 2.45e-04 | (9884.70 ms | 53040 tok/s) step 85/76294 | train loss 6.168880 | norm 0.3925 | lr 2.47e-04 | (9883.95 ms | 53044 tok/s) step 86/76294 | train loss 6.179297 | norm 0.3976 | lr 2.48e-04 | (9890.25 ms | 53011 tok/s) step 
87/76294 | train loss 6.257281 | norm 0.4543 | lr 2.50e-04 | (9959.79 ms | 52640 tok/s) step 88/76294 | train loss 6.211866 | norm 0.4054 | lr 2.51e-04 | (9884.89 ms | 53039 tok/s) step 89/76294 | train loss 6.238784 | norm 0.3945 | lr 2.53e-04 | (9889.02 ms | 53017 tok/s) step 90/76294 | train loss 6.151757 | norm 0.3987 | lr 2.54e-04 | (9889.25 ms | 53016 tok/s) step 91/76294 | train loss 6.255118 | norm 0.3919 | lr 2.56e-04 | (9894.49 ms | 52988 tok/s) step 92/76294 | train loss 6.147254 | norm 0.4379 | lr 2.57e-04 | (9889.48 ms | 53015 tok/s) step 93/76294 | train loss 6.119381 | norm 0.5478 | lr 2.59e-04 | (9889.81 ms | 53013 tok/s) step 94/76294 | train loss 6.216216 | norm 0.5630 | lr 2.60e-04 | (9888.73 ms | 53019 tok/s) step 95/76294 | train loss 6.215128 | norm 0.6377 | lr 2.62e-04 | (10904.31 ms | 48081 tok/s) step 96/76294 | train loss 6.255719 | norm 0.6143 | lr 2.63e-04 | (9885.03 ms | 53039 tok/s) step 97/76294 | train loss 6.218292 | norm 0.4129 | lr 2.65e-04 | (9909.87 ms | 52906 tok/s) step 98/76294 | train loss 6.112579 | norm 0.5260 | lr 2.67e-04 | (9884.12 ms | 53043 tok/s) step 99/76294 | train loss 6.103196 | norm 0.4195 | lr 2.68e-04 | (10468.49 ms | 50082 tok/s) step 100/76294 | train loss 6.063644 | norm 0.4409 | lr 2.70e-04 | (9922.06 ms | 52841 tok/s) step 101/76294 | train loss 6.170124 | norm 0.4022 | lr 2.71e-04 | (9911.04 ms | 52899 tok/s) step 102/76294 | train loss 6.154124 | norm 0.4021 | lr 2.73e-04 | (9890.27 ms | 53010 tok/s) step 103/76294 | train loss 6.345792 | norm 0.6770 | lr 2.74e-04 | (9882.58 ms | 53052 tok/s) step 104/76294 | train loss 6.174712 | norm 0.4859 | lr 2.76e-04 | (9892.84 ms | 52997 tok/s) step 105/76294 | train loss 6.099867 | norm 0.4599 | lr 2.77e-04 | (9884.13 ms | 53043 tok/s) step 106/76294 | train loss 6.127089 | norm 0.3811 | lr 2.79e-04 | (9895.95 ms | 52980 tok/s) step 107/76294 | train loss 6.091408 | norm 0.5422 | lr 2.80e-04 | (9887.71 ms | 53024 tok/s) step 108/76294 | train loss 6.156108 | 
norm 0.3946 | lr 2.82e-04 | (9901.45 ms | 52951 tok/s) step 109/76294 | train loss 6.124790 | norm 0.4732 | lr 2.83e-04 | (9888.20 ms | 53022 tok/s) step 110/76294 | train loss 6.055641 | norm 0.4179 | lr 2.85e-04 | (9922.08 ms | 52841 tok/s) step 111/76294 | train loss 6.151158 | norm 0.3969 | lr 2.86e-04 | (9891.17 ms | 53006 tok/s) step 112/76294 | train loss 6.086026 | norm 0.4463 | lr 2.88e-04 | (9942.56 ms | 52732 tok/s) step 113/76294 | train loss 6.019742 | norm 0.3611 | lr 2.89e-04 | (9891.97 ms | 53001 tok/s) step 114/76294 | train loss 6.026054 | norm 0.4632 | lr 2.91e-04 | (9892.65 ms | 52998 tok/s) step 115/76294 | train loss 6.094489 | norm 0.4175 | lr 2.92e-04 | (9891.04 ms | 53006 tok/s) step 116/76294 | train loss 6.086138 | norm 0.5341 | lr 2.94e-04 | (9968.74 ms | 52593 tok/s) step 117/76294 | train loss 6.052069 | norm 0.4274 | lr 2.95e-04 | (9900.05 ms | 52958 tok/s) step 118/76294 | train loss 6.099169 | norm 0.4167 | lr 2.97e-04 | (9899.65 ms | 52960 tok/s) step 119/76294 | train loss 6.021327 | norm 0.4196 | lr 2.98e-04 | (9933.78 ms | 52778 tok/s) step 120/76294 | train loss 6.099752 | norm 0.4040 | lr 3.00e-04 | (9931.45 ms | 52791 tok/s) step 121/76294 | train loss 6.005539 | norm 0.5189 | lr 3.01e-04 | (9898.10 ms | 52969 tok/s) step 122/76294 | train loss 6.080661 | norm 0.4599 | lr 3.03e-04 | (9901.00 ms | 52953 tok/s) step 123/76294 | train loss 5.888552 | norm 0.6675 | lr 3.04e-04 | (9895.00 ms | 52985 tok/s) step 124/76294 | train loss 6.051828 | norm 0.7425 | lr 3.06e-04 | (9940.57 ms | 52742 tok/s) step 125/76294 | train loss 6.077074 | norm 0.7958 | lr 3.07e-04 | (9900.53 ms | 52956 tok/s) step 126/76294 | train loss 5.975741 | norm 0.5957 | lr 3.09e-04 | (9892.90 ms | 52996 tok/s) step 127/76294 | train loss 6.009570 | norm 0.5133 | lr 3.10e-04 | (9926.34 ms | 52818 tok/s) step 128/76294 | train loss 6.059709 | norm 0.4438 | lr 3.12e-04 | (9898.95 ms | 52964 tok/s) step 129/76294 | train loss 6.002483 | norm 0.4746 | lr 3.13e-04 
| (9888.55 ms | 53020 tok/s) step 130/76294 | train loss 5.954606 | norm 0.6445 | lr 3.15e-04 | (9902.19 ms | 52947 tok/s) step 131/76294 | train loss 6.070031 | norm 0.4875 | lr 3.16e-04 | (9889.66 ms | 53014 tok/s) step 132/76294 | train loss 6.023100 | norm 0.6016 | lr 3.18e-04 | (9902.06 ms | 52947 tok/s) step 133/76294 | train loss 5.992368 | norm 0.3850 | lr 3.19e-04 | (9893.60 ms | 52993 tok/s) step 134/76294 | train loss 6.021454 | norm 0.4999 | lr 3.21e-04 | (9904.05 ms | 52937 tok/s) step 135/76294 | train loss 5.956980 | norm 0.3871 | lr 3.22e-04 | (9887.06 ms | 53028 tok/s) step 136/76294 | train loss 5.992865 | norm 0.3561 | lr 3.24e-04 | (9959.20 ms | 52644 tok/s) step 137/76294 | train loss 5.993916 | norm 0.3594 | lr 3.25e-04 | (9894.55 ms | 52988 tok/s) step 138/76294 | train loss 5.988059 | norm 0.4261 | lr 3.27e-04 | (9902.73 ms | 52944 tok/s) step 139/76294 | train loss 5.967088 | norm 0.4212 | lr 3.28e-04 | (9911.82 ms | 52895 tok/s) step 140/76294 | train loss 5.945770 | norm 0.3812 | lr 3.30e-04 | (9898.60 ms | 52966 tok/s) step 141/76294 | train loss 6.023530 | norm 0.3888 | lr 3.31e-04 | (9891.37 ms | 53005 tok/s) step 142/76294 | train loss 5.937634 | norm 0.5785 | lr 3.33e-04 | (9898.05 ms | 52969 tok/s) step 143/76294 | train loss 6.023452 | norm 0.4956 | lr 3.34e-04 | (9890.92 ms | 53007 tok/s) step 144/76294 | train loss 5.967951 | norm 0.4360 | lr 3.36e-04 | (9900.34 ms | 52957 tok/s) step 145/76294 | train loss 5.965269 | norm 0.5727 | lr 3.38e-04 | (9891.36 ms | 53005 tok/s) step 146/76294 | train loss 5.945799 | norm 0.4650 | lr 3.39e-04 | (9906.15 ms | 52926 tok/s) step 147/76294 | train loss 5.981552 | norm 0.5382 | lr 3.41e-04 | (9890.35 ms | 53010 tok/s) step 148/76294 | train loss 5.884224 | norm 0.5167 | lr 3.42e-04 | (9898.89 ms | 52964 tok/s) step 149/76294 | train loss 5.953542 | norm 0.4377 | lr 3.44e-04 | (9890.67 ms | 53008 tok/s) step 150/76294 | train loss 5.962864 | norm 0.5363 | lr 3.45e-04 | (9896.64 ms | 52976 
tok/s) step 151/76294 | train loss 5.881218 | norm 0.4965 | lr 3.47e-04 | (9908.22 ms | 52914 tok/s) step 152/76294 | train loss 5.959088 | norm 0.4419 | lr 3.48e-04 | (9898.48 ms | 52967 tok/s) step 153/76294 | train loss 5.967602 | norm 0.4753 | lr 3.50e-04 | (9902.13 ms | 52947 tok/s) step 154/76294 | train loss 5.923328 | norm 0.5577 | lr 3.51e-04 | (9891.92 ms | 53002 tok/s) step 155/76294 | train loss 5.952086 | norm 0.7653 | lr 3.53e-04 | (9890.31 ms | 53010 tok/s) step 156/76294 | train loss 5.872672 | norm 0.5661 | lr 3.54e-04 | (10022.11 ms | 52313 tok/s) step 157/76294 | train loss 5.931106 | norm 0.4952 | lr 3.56e-04 | (9897.36 ms | 52973 tok/s) step 158/76294 | train loss 5.874266 | norm 0.4663 | lr 3.57e-04 | (9972.36 ms | 52574 tok/s) step 159/76294 | train loss 5.855693 | norm 0.5206 | lr 3.59e-04 | (9887.87 ms | 53023 tok/s) step 160/76294 | train loss 5.898845 | norm 0.5875 | lr 3.60e-04 | (9895.80 ms | 52981 tok/s) step 161/76294 | train loss 5.833582 | norm 0.6938 | lr 3.62e-04 | (9887.27 ms | 53027 tok/s) step 162/76294 | train loss 5.929477 | norm 0.4333 | lr 3.63e-04 | (9895.60 ms | 52982 tok/s) step 163/76294 | train loss 5.872556 | norm 0.4280 | lr 3.65e-04 | (9884.74 ms | 53040 tok/s) step 164/76294 | train loss 5.919539 | norm 0.4177 | lr 3.66e-04 | (9947.50 ms | 52706 tok/s) step 165/76294 | train loss 5.793824 | norm 0.4776 | lr 3.68e-04 | (9882.17 ms | 53054 tok/s) step 166/76294 | train loss 5.868997 | norm 0.4996 | lr 3.69e-04 | (9891.71 ms | 53003 tok/s) step 167/76294 | train loss 5.848558 | norm 0.4364 | lr 3.71e-04 | (9902.91 ms | 52943 tok/s) step 168/76294 | train loss 5.836934 | norm 0.4198 | lr 3.72e-04 | (9926.43 ms | 52817 tok/s) step 169/76294 | train loss 5.789634 | norm 0.3927 | lr 3.74e-04 | (9884.03 ms | 53044 tok/s) step 170/76294 | train loss 5.758573 | norm 0.4743 | lr 3.75e-04 | (9892.49 ms | 52999 tok/s) step 171/76294 | train loss 5.866843 | norm 0.4846 | lr 3.77e-04 | (9882.87 ms | 53050 tok/s) step 172/76294 | 
train loss 5.766506 | norm 0.5037 | lr 3.78e-04 | (9887.29 ms | 53026 tok/s) step 173/76294 | train loss 5.900252 | norm 0.5307 | lr 3.80e-04 | (9948.00 ms | 52703 tok/s) step 174/76294 | train loss 5.846866 | norm 0.5591 | lr 3.81e-04 | (9929.84 ms | 52799 tok/s) step 175/76294 | train loss 5.831402 | norm 0.6387 | lr 3.83e-04 | (9917.53 ms | 52865 tok/s) step 176/76294 | train loss 5.866314 | norm 0.4857 | lr 3.84e-04 | (9886.21 ms | 53032 tok/s) step 177/76294 | train loss 5.788192 | norm 0.5442 | lr 3.86e-04 | (9878.46 ms | 53074 tok/s) step 178/76294 | train loss 5.798162 | norm 0.4258 | lr 3.87e-04 | (9904.58 ms | 52934 tok/s) step 179/76294 | train loss 5.800677 | norm 0.4824 | lr 3.89e-04 | (9885.14 ms | 53038 tok/s) step 180/76294 | train loss 5.823168 | norm 0.4347 | lr 3.90e-04 | (9890.69 ms | 53008 tok/s) step 181/76294 | train loss 5.881947 | norm 0.4446 | lr 3.92e-04 | (9883.27 ms | 53048 tok/s) step 182/76294 | train loss 5.824960 | norm 0.4742 | lr 3.93e-04 | (10103.76 ms | 51890 tok/s) step 183/76294 | train loss 5.819134 | norm 0.4179 | lr 3.95e-04 | (9880.89 ms | 53061 tok/s) step 184/76294 | train loss 5.882401 | norm 0.4983 | lr 3.96e-04 | (9888.53 ms | 53020 tok/s) step 185/76294 | train loss 5.784586 | norm 0.6953 | lr 3.98e-04 | (9882.49 ms | 53052 tok/s) step 186/76294 | train loss 5.770123 | norm 0.5016 | lr 3.99e-04 | (9893.00 ms | 52996 tok/s) step 187/76294 | train loss 5.861338 | norm 0.5879 | lr 4.01e-04 | (9887.11 ms | 53027 tok/s) step 188/76294 | train loss 5.797780 | norm 0.6465 | lr 4.02e-04 | (9915.50 ms | 52876 tok/s) step 189/76294 | train loss 5.790572 | norm 0.6376 | lr 4.04e-04 | (9885.28 ms | 53037 tok/s) step 190/76294 | train loss 5.740734 | norm 0.7888 | lr 4.05e-04 | (9894.32 ms | 52989 tok/s) step 191/76294 | train loss 5.814236 | norm 0.8417 | lr 4.07e-04 | (10120.42 ms | 51805 tok/s) step 192/76294 | train loss 5.828202 | norm 1.1318 | lr 4.09e-04 | (9926.24 ms | 52818 tok/s) step 193/76294 | train loss 5.805613 | 
norm 3.1726 | lr 4.10e-04 | (10938.00 ms | 47933 tok/s) step 194/76294 | train loss 5.809988 | norm 0.7280 | lr 4.12e-04 | (9882.83 ms | 53050 tok/s) step 195/76294 | train loss 5.957567 | norm 4.5282 | lr 4.13e-04 | (9886.18 ms | 53032 tok/s) step 196/76294 | train loss 5.865098 | norm 1.8157 | lr 4.15e-04 | (9882.76 ms | 53051 tok/s) step 197/76294 | train loss 5.832108 | norm 1.7216 | lr 4.16e-04 | (9882.15 ms | 53054 tok/s) step 198/76294 | train loss 5.750165 | norm 0.8428 | lr 4.18e-04 | (9891.02 ms | 53006 tok/s) step 199/76294 | train loss 5.773762 | norm 0.6674 | lr 4.19e-04 | (9927.98 ms | 52809 tok/s) step 200/76294 | train loss 5.834831 | norm 1.0765 | lr 4.21e-04 | (9890.45 ms | 53010 tok/s) step 201/76294 | train loss 5.845541 | norm 0.6391 | lr 4.22e-04 | (9889.23 ms | 53016 tok/s) step 202/76294 | train loss 5.799948 | norm 0.9984 | lr 4.24e-04 | (9884.90 ms | 53039 tok/s) step 203/76294 | train loss 5.789663 | norm 0.6381 | lr 4.25e-04 | (9931.32 ms | 52791 tok/s) step 204/76294 | train loss 5.770714 | norm 0.5509 | lr 4.27e-04 | (9892.11 ms | 53001 tok/s) step 205/76294 | train loss 5.752384 | norm 0.5612 | lr 4.28e-04 | (9885.06 ms | 53038 tok/s) step 206/76294 | train loss 5.811839 | norm 0.5698 | lr 4.30e-04 | (9897.19 ms | 52973 tok/s) step 207/76294 | train loss 5.712717 | norm 0.5509 | lr 4.31e-04 | (9892.35 ms | 52999 tok/s) step 208/76294 | train loss 5.756553 | norm 1.1475 | lr 4.33e-04 | (9989.69 ms | 52483 tok/s) step 209/76294 | train loss 5.692771 | norm 1.3849 | lr 4.34e-04 | (9885.90 ms | 53034 tok/s) step 210/76294 | train loss 5.708157 | norm 0.4278 | lr 4.36e-04 | (9900.17 ms | 52957 tok/s) step 211/76294 | train loss 5.716491 | norm 0.7725 | lr 4.37e-04 | (9888.76 ms | 53019 tok/s) step 212/76294 | train loss 5.683296 | norm 0.3572 | lr 4.39e-04 | (9918.07 ms | 52862 tok/s) step 213/76294 | train loss 5.747744 | norm 0.5288 | lr 4.40e-04 | (9885.75 ms | 53035 tok/s) step 214/76294 | train loss 5.706313 | norm 0.4072 | lr 
4.42e-04 | (9944.89 ms | 52719 tok/s) step 215/76294 | train loss 5.706160 | norm 0.5230 | lr 4.43e-04 | (9888.72 ms | 53019 tok/s) step 216/76294 | train loss 5.719422 | norm 0.8995 | lr 4.45e-04 | (9887.90 ms | 53023 tok/s) step 217/76294 | train loss 5.739998 | norm 0.6765 | lr 4.46e-04 | (9891.91 ms | 53002 tok/s) step 218/76294 | train loss 5.722473 | norm 0.4153 | lr 4.48e-04 | (9931.42 ms | 52791 tok/s) step 219/76294 | train loss 5.691540 | norm 0.4097 | lr 4.49e-04 | (9891.07 ms | 53006 tok/s) step 220/76294 | train loss 5.668976 | norm 0.4766 | lr 4.51e-04 | (9898.66 ms | 52966 tok/s) step 221/76294 | train loss 5.708479 | norm 0.4136 | lr 4.52e-04 | (9888.04 ms | 53022 tok/s) step 222/76294 | train loss 5.610696 | norm 0.3825 | lr 4.54e-04 | (9896.44 ms | 52977 tok/s) step 223/76294 | train loss 5.625279 | norm 0.3509 | lr 4.55e-04 | (9890.29 ms | 53010 tok/s) step 224/76294 | train loss 5.667222 | norm 0.3597 | lr 4.57e-04 | (9893.60 ms | 52993 tok/s) step 225/76294 | train loss 5.704672 | norm 0.3591 | lr 4.58e-04 | (9957.64 ms | 52652 tok/s) step 226/76294 | train loss 5.653193 | norm 1.1165 | lr 4.60e-04 | (9897.03 ms | 52974 tok/s) step 227/76294 | train loss 5.671149 | norm 0.6444 | lr 4.61e-04 | (9886.49 ms | 53031 tok/s) step 228/76294 | train loss 5.643335 | norm 0.7121 | lr 4.63e-04 | (9895.87 ms | 52980 tok/s) step 229/76294 | train loss 5.693441 | norm 1.1419 | lr 4.64e-04 | (9926.18 ms | 52819 tok/s) step 230/76294 | train loss 5.651615 | norm 0.8508 | lr 4.66e-04 | (9889.53 ms | 53014 tok/s) step 231/76294 | train loss 5.683752 | norm 1.4938 | lr 4.67e-04 | (9892.76 ms | 52997 tok/s) step 232/76294 | train loss 5.665679 | norm 0.5989 | lr 4.69e-04 | (9895.31 ms | 52983 tok/s) step 233/76294 | train loss 5.659912 | norm 1.0149 | lr 4.70e-04 | (9891.88 ms | 53002 tok/s) step 234/76294 | train loss 5.765800 | norm 2.6806 | lr 4.72e-04 | (9894.57 ms | 52987 tok/s) step 235/76294 | train loss 5.741571 | norm 1.0869 | lr 4.73e-04 | (9893.83 ms | 
52991 tok/s) step 236/76294 | train loss 5.684859 | norm 1.0114 | lr 4.75e-04 | (9894.50 ms | 52988 tok/s) step 237/76294 | train loss 5.716286 | norm 1.2187 | lr 4.76e-04 | (9929.06 ms | 52803 tok/s) step 238/76294 | train loss 5.638370 | norm 0.6659 | lr 4.78e-04 | (9894.99 ms | 52985 tok/s) step 239/76294 | train loss 5.677870 | norm 0.6695 | lr 4.79e-04 | (9888.60 ms | 53019 tok/s) step 240/76294 | train loss 5.625371 | norm 0.6766 | lr 4.81e-04 | (9892.42 ms | 52999 tok/s) step 241/76294 | train loss 5.646983 | norm 0.8071 | lr 4.83e-04 | (9948.67 ms | 52699 tok/s) step 242/76294 | train loss 5.601237 | norm 1.7007 | lr 4.84e-04 | (9913.58 ms | 52886 tok/s) step 243/76294 | train loss 5.622427 | norm 0.4920 | lr 4.86e-04 | (9890.03 ms | 53012 tok/s) step 244/76294 | train loss 5.663560 | norm 0.7289 | lr 4.87e-04 | (9895.59 ms | 52982 tok/s) step 245/76294 | train loss 5.690288 | norm 1.3484 | lr 4.89e-04 | (9893.48 ms | 52993 tok/s) step 246/76294 | train loss 5.693542 | norm 2.1505 | lr 4.90e-04 | (9891.06 ms | 53006 tok/s) step 247/76294 | train loss 5.598453 | norm 0.5288 | lr 4.92e-04 | (9936.30 ms | 52765 tok/s) step 248/76294 | train loss 5.636040 | norm 0.6862 | lr 4.93e-04 | (9907.63 ms | 52918 tok/s) step 249/76294 | train loss 5.639123 | norm 1.1686 | lr 4.95e-04 | (9934.76 ms | 52773 tok/s) step 250/76294 | train loss 5.699427 | norm 4.1658 | lr 4.96e-04 | (9890.08 ms | 53012 tok/s) val loss: 5.736021 saving model checkpoint to ./results/gpt2-350M-gqa/step_250.pth step 251/76294 | train loss 5.702993 | norm 1.2677 | lr 4.98e-04 | (9960.95 ms | 52634 tok/s) step 252/76294 | train loss 5.664093 | norm 1.4498 | lr 4.99e-04 | (9885.74 ms | 53035 tok/s) step 253/76294 | train loss 5.641922 | norm 0.7030 | lr 5.01e-04 | (9893.69 ms | 52992 tok/s) step 254/76294 | train loss 5.694129 | norm 0.5884 | lr 5.02e-04 | (9881.55 ms | 53057 tok/s) step 255/76294 | train loss 5.596355 | norm 1.4088 | lr 5.04e-04 | (9891.62 ms | 53003 tok/s) step 256/76294 | train 
loss 5.687199 | norm 0.6834 | lr 5.05e-04 | (9885.85 ms | 53034 tok/s) step 257/76294 | train loss 5.574026 | norm 0.5434 | lr 5.07e-04 | (9895.82 ms | 52981 tok/s) step 258/76294 | train loss 5.600068 | norm 0.6888 | lr 5.08e-04 | (9899.50 ms | 52961 tok/s) step 259/76294 | train loss 5.559476 | norm 0.6606 | lr 5.10e-04 | (9909.34 ms | 52908 tok/s) step 260/76294 | train loss 5.656444 | norm 0.7079 | lr 5.11e-04 | (9897.38 ms | 52972 tok/s) step 261/76294 | train loss 5.515496 | norm 0.4496 | lr 5.13e-04 | (9896.59 ms | 52977 tok/s) step 262/76294 | train loss 5.649503 | norm 0.6857 | lr 5.14e-04 | (9895.57 ms | 52982 tok/s) step 263/76294 | train loss 5.607790 | norm 0.5485 | lr 5.16e-04 | (9896.22 ms | 52979 tok/s) step 264/76294 | train loss 5.597355 | norm 0.6050 | lr 5.17e-04 | (9894.66 ms | 52987 tok/s) step 265/76294 | train loss 5.569529 | norm 0.4861 | lr 5.19e-04 | (11138.21 ms | 47071 tok/s) step 266/76294 | train loss 5.509722 | norm 0.6113 | lr 5.20e-04 | (9883.92 ms | 53045 tok/s) step 267/76294 | train loss 5.599534 | norm 0.6270 | lr 5.22e-04 | (9896.20 ms | 52979 tok/s) step 268/76294 | train loss 5.518114 | norm 0.6260 | lr 5.23e-04 | (9884.39 ms | 53042 tok/s) step 269/76294 | train loss 5.593895 | norm 0.8509 | lr 5.25e-04 | (9895.93 ms | 52980 tok/s) step 270/76294 | train loss 5.478763 | norm 0.7531 | lr 5.26e-04 | (9892.43 ms | 52999 tok/s) step 271/76294 | train loss 5.560463 | norm 1.0415 | lr 5.28e-04 | (9903.30 ms | 52941 tok/s) step 272/76294 | train loss 5.509972 | norm 0.7681 | lr 5.29e-04 | (9896.94 ms | 52975 tok/s) step 273/76294 | train loss 5.539974 | norm 0.9079 | lr 5.31e-04 | (9931.83 ms | 52789 tok/s) step 274/76294 | train loss 5.560831 | norm 1.8704 | lr 5.32e-04 | (9891.62 ms | 53003 tok/s) step 275/76294 | train loss 5.420214 | norm 1.0264 | lr 5.34e-04 | (9902.31 ms | 52946 tok/s) step 276/76294 | train loss 5.538644 | norm 0.5665 | lr 5.35e-04 | (9890.10 ms | 53011 tok/s) step 277/76294 | train loss 5.535426 | norm 
0.8273 | lr 5.37e-04 | (9904.54 ms | 52934 tok/s) step 278/76294 | train loss 5.498051 | norm 0.4117 | lr 5.38e-04 | (9892.17 ms | 53000 tok/s) step 279/76294 | train loss 5.530401 | norm 0.5660 | lr 5.40e-04 | (9900.01 ms | 52958 tok/s) step 280/76294 | train loss 5.535640 | norm 0.4735 | lr 5.41e-04 | (9890.94 ms | 53007 tok/s) step 281/76294 | train loss 5.495443 | norm 0.4519 | lr 5.43e-04 | (9921.27 ms | 52845 tok/s) step 282/76294 | train loss 5.469006 | norm 0.3619 | lr 5.44e-04 | (9895.06 ms | 52985 tok/s) step 283/76294 | train loss 5.518887 | norm 0.4509 | lr 5.46e-04 | (9946.52 ms | 52711 tok/s) step 284/76294 | train loss 5.517770 | norm 0.3988 | lr 5.47e-04 | (9900.29 ms | 52957 tok/s) step 285/76294 | train loss 5.438155 | norm 0.3991 | lr 5.49e-04 | (9903.37 ms | 52940 tok/s) step 286/76294 | train loss 5.483918 | norm 0.6168 | lr 5.50e-04 | (9889.56 ms | 53014 tok/s) step 287/76294 | train loss 5.505394 | norm 1.0634 | lr 5.52e-04 | (9904.39 ms | 52935 tok/s) step 288/76294 | train loss 5.484044 | norm 0.8172 | lr 5.54e-04 | (9891.19 ms | 53006 tok/s) step 289/76294 | train loss 5.478149 | norm 0.9498 | lr 5.55e-04 | (9894.55 ms | 52988 tok/s) step 290/76294 | train loss 5.529213 | norm 1.4283 | lr 5.57e-04 | (11075.68 ms | 47337 tok/s) step 291/76294 | train loss 5.512658 | norm 1.0033 | lr 5.58e-04 | (9918.47 ms | 52860 tok/s) step 292/76294 | train loss 5.493767 | norm 0.8783 | lr 5.60e-04 | (9883.76 ms | 53045 tok/s) step 293/76294 | train loss 5.504838 | norm 1.0658 | lr 5.61e-04 | (9894.31 ms | 52989 tok/s) step 294/76294 | train loss 5.440929 | norm 1.3559 | lr 5.63e-04 | (9938.93 ms | 52751 tok/s) step 295/76294 | train loss 5.425115 | norm 0.6018 | lr 5.64e-04 | (9890.54 ms | 53009 tok/s) step 296/76294 | train loss 5.482959 | norm 0.5882 | lr 5.66e-04 | (9897.99 ms | 52969 tok/s) step 297/76294 | train loss 5.472940 | norm 1.0881 | lr 5.67e-04 | (9894.71 ms | 52987 tok/s) step 298/76294 | train loss 5.470275 | norm 1.0266 | lr 5.69e-04 | 
(9895.74 ms | 52981 tok/s) step 299/76294 | train loss 5.430500 | norm 0.7322 | lr 5.70e-04 | (9891.91 ms | 53002 tok/s) step 300/76294 | train loss 5.550170 | norm 1.2555 | lr 5.72e-04 | (9888.38 ms | 53021 tok/s) step 301/76294 | train loss 5.462029 | norm 1.1070 | lr 5.73e-04 | (9957.47 ms | 52653 tok/s) step 302/76294 | train loss 5.543270 | norm 0.7537 | lr 5.75e-04 | (9925.52 ms | 52822 tok/s) step 303/76294 | train loss 5.523931 | norm 1.5733 | lr 5.76e-04 | (9906.78 ms | 52922 tok/s) step 304/76294 | train loss 5.464600 | norm 0.4949 | lr 5.78e-04 | (9930.18 ms | 52797 tok/s) step 305/76294 | train loss 5.465177 | norm 0.6412 | lr 5.79e-04 | (9895.27 ms | 52984 tok/s) step 306/76294 | train loss 5.450580 | norm 0.7030 | lr 5.81e-04 | (9918.21 ms | 52861 tok/s) step 307/76294 | train loss 5.478374 | norm 1.4162 | lr 5.82e-04 | (9893.52 ms | 52993 tok/s) step 308/76294 | train loss 5.554897 | norm 0.7788 | lr 5.84e-04 | (9899.19 ms | 52963 tok/s) step 309/76294 | train loss 5.411901 | norm 0.7456 | lr 5.85e-04 | (9897.76 ms | 52970 tok/s) step 310/76294 | train loss 5.516634 | norm 0.8435 | lr 5.87e-04 | (9897.60 ms | 52971 tok/s) step 311/76294 | train loss 5.454666 | norm 0.6868 | lr 5.88e-04 | (9903.90 ms | 52938 tok/s) step 312/76294 | train loss 5.401124 | norm 0.7201 | lr 5.90e-04 | (9956.84 ms | 52656 tok/s) step 313/76294 | train loss 5.393017 | norm 0.4896 | lr 5.91e-04 | (9901.35 ms | 52951 tok/s) step 314/76294 | train loss 5.438046 | norm 0.5904 | lr 5.93e-04 | (9895.73 ms | 52981 tok/s) step 315/76294 | train loss 5.399494 | norm 0.5009 | lr 5.94e-04 | (9931.93 ms | 52788 tok/s) step 316/76294 | train loss 5.404799 | norm 0.4081 | lr 5.96e-04 | (9892.05 ms | 53001 tok/s) step 317/76294 | train loss 5.351253 | norm 0.3884 | lr 5.97e-04 | (9904.10 ms | 52936 tok/s) step 318/76294 | train loss 5.385037 | norm 0.4100 | lr 5.99e-04 | (9901.80 ms | 52949 tok/s) step 319/76294 | train loss 5.423574 | norm 0.4408 | lr 6.00e-04 | (9929.64 ms | 52800 
tok/s) step 320/76294 | train loss 5.420057 | norm 0.4577 | lr 6.02e-04 | (9887.14 ms | 53027 tok/s) step 321/76294 | train loss 5.496399 | norm 0.7632 | lr 6.03e-04 | (9918.91 ms | 52857 tok/s) step 322/76294 | train loss 5.376628 | norm 0.6963 | lr 6.05e-04 | (9888.89 ms | 53018 tok/s) step 323/76294 | train loss 5.425930 | norm 0.5599 | lr 6.06e-04 | (9897.97 ms | 52969 tok/s) step 324/76294 | train loss 5.341594 | norm 0.5586 | lr 6.08e-04 | (9892.39 ms | 52999 tok/s) step 325/76294 | train loss 5.432466 | norm 0.4977 | lr 6.09e-04 | (9900.84 ms | 52954 tok/s) step 326/76294 | train loss 5.371561 | norm 0.4924 | lr 6.11e-04 | (9894.39 ms | 52988 tok/s) step 327/76294 | train loss 5.350919 | norm 0.4574 | lr 6.12e-04 | (9896.18 ms | 52979 tok/s) step 328/76294 | train loss 5.383198 | norm 0.4660 | lr 6.14e-04 | (9890.03 ms | 53012 tok/s) step 329/76294 | train loss 5.361810 | norm 0.3688 | lr 6.15e-04 | (9890.04 ms | 53012 tok/s) step 330/76294 | train loss 5.331781 | norm 0.4436 | lr 6.17e-04 | (9895.75 ms | 52981 tok/s) step 331/76294 | train loss 5.323027 | norm 0.4293 | lr 6.18e-04 | (9888.85 ms | 53018 tok/s) step 332/76294 | train loss 5.333634 | norm 0.5041 | lr 6.20e-04 | (9890.20 ms | 53011 tok/s) step 333/76294 | train loss 5.364352 | norm 0.3361 | lr 6.21e-04 | (9890.31 ms | 53010 tok/s) step 334/76294 | train loss 5.313067 | norm 0.7748 | lr 6.23e-04 | (9893.57 ms | 52993 tok/s) step 335/76294 | train loss 5.325735 | norm 0.4523 | lr 6.25e-04 | (9891.08 ms | 53006 tok/s) step 336/76294 | train loss 5.261936 | norm 0.3987 | lr 6.26e-04 | (9898.37 ms | 52967 tok/s) step 337/76294 | train loss 5.306444 | norm 0.3737 | lr 6.28e-04 | (9892.67 ms | 52998 tok/s) step 338/76294 | train loss 5.279874 | norm 0.4431 | lr 6.29e-04 | (9894.24 ms | 52989 tok/s) step 339/76294 | train loss 5.340438 | norm 0.5776 | lr 6.31e-04 | (9889.18 ms | 53016 tok/s) step 340/76294 | train loss 5.340178 | norm 0.8685 | lr 6.32e-04 | (9894.99 ms | 52985 tok/s) step 341/76294 | 
train loss 5.312431 | norm 1.1782 | lr 6.34e-04 | (9892.09 ms | 53001 tok/s) step 342/76294 | train loss 5.371229 | norm 1.2327 | lr 6.35e-04 | (9939.88 ms | 52746 tok/s) step 343/76294 | train loss 5.528363 | norm 3.4209 | lr 6.37e-04 | (9888.60 ms | 53019 tok/s) step 344/76294 | train loss 5.447314 | norm 1.4833 | lr 6.38e-04 | (9884.97 ms | 53039 tok/s) step 345/76294 | train loss 5.524874 | norm 5.0366 | lr 6.40e-04 | (9899.15 ms | 52963 tok/s) step 346/76294 | train loss 5.376451 | norm 0.8340 | lr 6.41e-04 | (9890.69 ms | 53008 tok/s) step 347/76294 | train loss 5.409552 | norm 1.2437 | lr 6.43e-04 | (10890.47 ms | 48142 tok/s) step 348/76294 | train loss 5.390554 | norm 1.4612 | lr 6.44e-04 | (9911.62 ms | 52896 tok/s) step 349/76294 | train loss 5.367055 | norm 0.6994 | lr 6.46e-04 | (9918.37 ms | 52860 tok/s) step 350/76294 | train loss 5.355744 | norm 0.8527 | lr 6.47e-04 | (9941.01 ms | 52740 tok/s) step 351/76294 | train loss 5.358567 | norm 1.0044 | lr 6.49e-04 | (9882.43 ms | 53053 tok/s) step 352/76294 | train loss 5.388524 | norm 0.8232 | lr 6.50e-04 | (9934.46 ms | 52775 tok/s) step 353/76294 | train loss 5.367613 | norm 1.1788 | lr 6.52e-04 | (9963.99 ms | 52618 tok/s) step 354/76294 | train loss 5.409075 | norm 0.7200 | lr 6.53e-04 | (9884.55 ms | 53041 tok/s) step 355/76294 | train loss 5.324286 | norm 0.7140 | lr 6.55e-04 | (9884.27 ms | 53043 tok/s) step 356/76294 | train loss 5.382836 | norm 0.6974 | lr 6.56e-04 | (9903.71 ms | 52939 tok/s) step 357/76294 | train loss 5.352641 | norm 0.6553 | lr 6.58e-04 | (9886.68 ms | 53030 tok/s) step 358/76294 | train loss 5.391264 | norm 0.6085 | lr 6.59e-04 | (9887.13 ms | 53027 tok/s) step 359/76294 | train loss 5.260673 | norm 0.5611 | lr 6.61e-04 | (9881.56 ms | 53057 tok/s) step 360/76294 | train loss 5.325659 | norm 0.6727 | lr 6.62e-04 | (9890.66 ms | 53008 tok/s) step 361/76294 | train loss 5.270515 | norm 0.7484 | lr 6.64e-04 | (9882.68 ms | 53051 tok/s) step 362/76294 | train loss 5.264636 | 
norm 1.0741 | lr 6.65e-04 | (9880.85 ms | 53061 tok/s) step 363/76294 | train loss 5.265118 | norm 0.4256 | lr 6.67e-04 | (9880.01 ms | 53066 tok/s) step 364/76294 | train loss 5.282086 | norm 0.7236 | lr 6.68e-04 | (9890.21 ms | 53011 tok/s) step 365/76294 | train loss 5.244337 | norm 0.5747 | lr 6.70e-04 | (9886.47 ms | 53031 tok/s) step 366/76294 | train loss 5.233576 | norm 0.4875 | lr 6.71e-04 | (9889.44 ms | 53015 tok/s) step 367/76294 | train loss 5.274999 | norm 0.6531 | lr 6.73e-04 | (9878.52 ms | 53074 tok/s) step 368/76294 | train loss 5.340042 | norm 0.9268 | lr 6.74e-04 | (9885.81 ms | 53034 tok/s) step 369/76294 | train loss 5.276465 | norm 0.5559 | lr 6.76e-04 | (9885.59 ms | 53036 tok/s) step 370/76294 | train loss 5.280053 | norm 0.5789 | lr 6.77e-04 | (9887.17 ms | 53027 tok/s) step 371/76294 | train loss 5.243349 | norm 0.4298 | lr 6.79e-04 | (9888.96 ms | 53017 tok/s) step 372/76294 | train loss 5.235163 | norm 0.4236 | lr 6.80e-04 | (9923.13 ms | 52835 tok/s) step 373/76294 | train loss 5.187348 | norm 0.4635 | lr 6.82e-04 | (9886.83 ms | 53029 tok/s) step 374/76294 | train loss 5.242270 | norm 0.6059 | lr 6.83e-04 | (10550.95 ms | 49691 tok/s) step 375/76294 | train loss 5.202226 | norm 0.5885 | lr 6.85e-04 | (9928.29 ms | 52807 tok/s) step 376/76294 | train loss 5.200719 | norm 0.5675 | lr 6.86e-04 | (9998.85 ms | 52435 tok/s) step 377/76294 | train loss 5.166571 | norm 0.5295 | lr 6.88e-04 | (9883.32 ms | 53048 tok/s) step 378/76294 | train loss 5.144824 | norm 0.7215 | lr 6.89e-04 | (9886.85 ms | 53029 tok/s) step 379/76294 | train loss 5.248479 | norm 0.4108 | lr 6.91e-04 | (9884.30 ms | 53043 tok/s) step 380/76294 | train loss 5.155865 | norm 0.4218 | lr 6.92e-04 | (9888.20 ms | 53022 tok/s) step 381/76294 | train loss 5.165354 | norm 0.5687 | lr 6.94e-04 | (9888.05 ms | 53022 tok/s) step 382/76294 | train loss 5.163488 | norm 0.4526 | lr 6.95e-04 | (10855.39 ms | 48298 tok/s) step 383/76294 | train loss 5.135785 | norm 0.5356 | lr 
6.97e-04 | (9875.20 ms | 53091 tok/s) step 384/76294 | train loss 5.160192 | norm 0.8758 | lr 6.99e-04 | (9886.66 ms | 53030 tok/s) step 385/76294 | train loss 5.195888 | norm 1.1053 | lr 7.00e-04 | (9880.71 ms | 53062 tok/s) step 386/76294 | train loss 5.380044 | norm 0.7917 | lr 7.02e-04 | (9933.21 ms | 52781 tok/s) step 387/76294 | train loss 5.163270 | norm 1.1871 | lr 7.03e-04 | (9878.00 ms | 53076 tok/s) step 388/76294 | train loss 5.251794 | norm 1.2581 | lr 7.05e-04 | (11002.66 ms | 47651 tok/s) step 389/76294 | train loss 5.206205 | norm 1.1473 | lr 7.06e-04 | (9879.02 ms | 53071 tok/s) step 390/76294 | train loss 5.224670 | norm 0.8416 | lr 7.08e-04 | (9879.08 ms | 53071 tok/s) step 391/76294 | train loss 5.203274 | norm 0.7386 | lr 7.09e-04 | (9924.59 ms | 52827 tok/s) step 392/76294 | train loss 5.197707 | norm 0.6211 | lr 7.11e-04 | (11601.06 ms | 45193 tok/s) step 393/76294 | train loss 5.212644 | norm 0.7005 | lr 7.12e-04 | (11070.42 ms | 47359 tok/s) step 394/76294 | train loss 5.209659 | norm 0.6683 | lr 7.14e-04 | (9917.50 ms | 52865 tok/s) step 395/76294 | train loss 5.174769 | norm 0.4567 | lr 7.15e-04 | (9868.93 ms | 53125 tok/s) step 396/76294 | train loss 5.140568 | norm 0.6911 | lr 7.17e-04 | (9880.37 ms | 53064 tok/s) step 397/76294 | train loss 5.145677 | norm 0.3547 | lr 7.18e-04 | (9936.42 ms | 52764 tok/s) step 398/76294 | train loss 5.186993 | norm 0.4398 | lr 7.20e-04 | (9882.12 ms | 53054 tok/s) step 399/76294 | train loss 5.152946 | norm 0.4343 | lr 7.21e-04 | (9904.54 ms | 52934 tok/s) step 400/76294 | train loss 5.144923 | norm 0.4481 | lr 7.23e-04 | (9920.41 ms | 52849 tok/s) step 401/76294 | train loss 5.074814 | norm 0.4113 | lr 7.24e-04 | (9882.47 ms | 53052 tok/s) step 402/76294 | train loss 5.065199 | norm 0.5263 | lr 7.26e-04 | (9887.62 ms | 53025 tok/s) step 403/76294 | train loss 5.150239 | norm 0.5446 | lr 7.27e-04 | (9886.42 ms | 53031 tok/s) step 404/76294 | train loss 5.057960 | norm 0.7646 | lr 7.29e-04 | (9922.55 ms 
| 52838 tok/s) step 405/76294 | train loss 5.190615 | norm 2.9858 | lr 7.30e-04 | (9893.41 ms | 52994 tok/s) step 406/76294 | train loss 5.161133 | norm 0.6076 | lr 7.32e-04 | (9881.34 ms | 53058 tok/s) step 407/76294 | train loss 5.153310 | norm 1.4693 | lr 7.33e-04 | (9885.98 ms | 53033 tok/s) step 408/76294 | train loss 5.136389 | norm 0.6792 | lr 7.35e-04 | (9892.92 ms | 52996 tok/s) step 409/76294 | train loss 5.149155 | norm 1.1592 | lr 7.36e-04 | (10027.82 ms | 52283 tok/s) step 410/76294 | train loss 5.156420 | norm 1.0229 | lr 7.38e-04 | (9910.10 ms | 52904 tok/s) step 411/76294 | train loss 5.117841 | norm 0.9009 | lr 7.39e-04 | (9894.25 ms | 52989 tok/s) step 412/76294 | train loss 5.150931 | norm 0.8478 | lr 7.41e-04 | (9891.67 ms | 53003 tok/s) step 413/76294 | train loss 5.200789 | norm 1.8219 | lr 7.42e-04 | (9890.64 ms | 53008 tok/s) step 414/76294 | train loss 5.141452 | norm 0.6193 | lr 7.44e-04 | (9888.37 ms | 53021 tok/s) step 415/76294 | train loss 5.141849 | norm 0.6052 | lr 7.45e-04 | (9894.03 ms | 52990 tok/s) step 416/76294 | train loss 5.124654 | norm 0.5259 | lr 7.47e-04 | (9943.97 ms | 52724 tok/s) step 417/76294 | train loss 5.121129 | norm 0.4117 | lr 7.48e-04 | (9893.90 ms | 52991 tok/s) step 418/76294 | train loss 5.112175 | norm 0.8666 | lr 7.50e-04 | (9888.89 ms | 53018 tok/s) step 419/76294 | train loss 5.202899 | norm 1.0481 | lr 7.51e-04 | (9901.48 ms | 52950 tok/s) step 420/76294 | train loss 5.148122 | norm 1.5630 | lr 7.53e-04 | (9923.12 ms | 52835 tok/s) step 421/76294 | train loss 5.184059 | norm 2.4938 | lr 7.54e-04 | (9891.14 ms | 53006 tok/s) step 422/76294 | train loss 5.150636 | norm 0.6684 | lr 7.56e-04 | (9897.07 ms | 52974 tok/s) step 423/76294 | train loss 5.178630 | norm 0.6914 | lr 7.57e-04 | (9892.85 ms | 52997 tok/s) step 424/76294 | train loss 5.112668 | norm 0.7338 | lr 7.59e-04 | (9891.89 ms | 53002 tok/s) step 425/76294 | train loss 5.112743 | norm 0.9651 | lr 7.60e-04 | (9891.16 ms | 53006 tok/s) step 
426/76294 | train loss 5.090844 | norm 0.7424 | lr 7.62e-04 | (9893.87 ms | 52991 tok/s) step 427/76294 | train loss 5.167052 | norm 0.8423 | lr 7.63e-04 | (9889.32 ms | 53016 tok/s) step 428/76294 | train loss 5.102296 | norm 0.6640 | lr 7.65e-04 | (9932.64 ms | 52784 tok/s) step 429/76294 | train loss 5.113709 | norm 0.8401 | lr 7.66e-04 | (9895.27 ms | 52984 tok/s) step 430/76294 | train loss 5.102810 | norm 0.6994 | lr 7.68e-04 | (9914.60 ms | 52880 tok/s) step 431/76294 | train loss 5.102957 | norm 0.7054 | lr 7.70e-04 | (9899.38 ms | 52962 tok/s) step 432/76294 | train loss 5.051329 | norm 0.6404 | lr 7.71e-04 | (9896.44 ms | 52977 tok/s) step 433/76294 | train loss 5.072172 | norm 0.7387 | lr 7.73e-04 | (9893.61 ms | 52993 tok/s) step 434/76294 | train loss 5.097017 | norm 0.3341 | lr 7.74e-04 | (9892.24 ms | 53000 tok/s) step 435/76294 | train loss 5.044148 | norm 0.4564 | lr 7.76e-04 | (9892.62 ms | 52998 tok/s) step 436/76294 | train loss 5.083607 | norm 0.4617 | lr 7.77e-04 | (9894.82 ms | 52986 tok/s) step 437/76294 | train loss 5.026398 | norm 0.4417 | lr 7.79e-04 | (9892.38 ms | 52999 tok/s) step 438/76294 | train loss 5.067292 | norm 0.4784 | lr 7.80e-04 | (9898.18 ms | 52968 tok/s) step 439/76294 | train loss 5.043212 | norm 0.4647 | lr 7.82e-04 | (9900.65 ms | 52955 tok/s) step 440/76294 | train loss 5.032260 | norm 0.3293 | lr 7.83e-04 | (9892.72 ms | 52997 tok/s) step 441/76294 | train loss 4.995745 | norm 0.3893 | lr 7.85e-04 | (9897.42 ms | 52972 tok/s) step 442/76294 | train loss 5.019684 | norm 0.3173 | lr 7.86e-04 | (9950.58 ms | 52689 tok/s) step 443/76294 | train loss 4.951891 | norm 0.4258 | lr 7.88e-04 | (9903.29 ms | 52941 tok/s) step 444/76294 | train loss 5.008308 | norm 0.3285 | lr 7.89e-04 | (9890.59 ms | 53009 tok/s) step 445/76294 | train loss 4.987628 | norm 0.2846 | lr 7.91e-04 | (9903.89 ms | 52938 tok/s) step 446/76294 | train loss 4.938637 | norm 0.2998 | lr 7.92e-04 | (9943.39 ms | 52727 tok/s) step 447/76294 | train loss 
5.027228 | norm 0.3181 | lr 7.94e-04 | (9915.03 ms | 52878 tok/s) step 448/76294 | train loss 5.011009 | norm 0.3107 | lr 7.95e-04 | (9916.48 ms | 52870 tok/s) step 449/76294 | train loss 5.067533 | norm 0.5283 | lr 7.97e-04 | (9892.32 ms | 52999 tok/s) step 450/76294 | train loss 4.988284 | norm 0.4446 | lr 7.98e-04 | (9901.57 ms | 52950 tok/s) step 451/76294 | train loss 4.988476 | norm 0.5955 | lr 8.00e-04 | (9895.14 ms | 52984 tok/s) step 452/76294 | train loss 5.058425 | norm 0.6252 | lr 8.01e-04 | (9896.08 ms | 52979 tok/s) step 453/76294 | train loss 4.999166 | norm 0.6465 | lr 8.03e-04 | (9955.68 ms | 52662 tok/s) step 454/76294 | train loss 5.018733 | norm 0.8293 | lr 8.04e-04 | (9890.88 ms | 53007 tok/s) step 455/76294 | train loss 5.007386 | norm 0.5384 | lr 8.06e-04 | (9940.18 ms | 52744 tok/s) step 456/76294 | train loss 4.982219 | norm 0.6230 | lr 8.07e-04 | (9890.38 ms | 53010 tok/s) step 457/76294 | train loss 4.958410 | norm 0.7796 | lr 8.09e-04 | (9894.12 ms | 52990 tok/s) step 458/76294 | train loss 4.938695 | norm 0.6813 | lr 8.10e-04 | (9933.88 ms | 52778 tok/s) step 459/76294 | train loss 5.007226 | norm 1.2566 | lr 8.12e-04 | (9890.84 ms | 53007 tok/s) step 460/76294 | train loss 5.063214 | norm 0.5571 | lr 8.13e-04 | (9894.14 ms | 52990 tok/s) step 461/76294 | train loss 4.982633 | norm 0.7839 | lr 8.15e-04 | (9893.36 ms | 52994 tok/s) step 462/76294 | train loss 5.048808 | norm 0.4106 | lr 8.16e-04 | (9892.55 ms | 52998 tok/s) step 463/76294 | train loss 4.971037 | norm 0.4131 | lr 8.18e-04 | (9893.07 ms | 52995 tok/s) step 464/76294 | train loss 4.980223 | norm 0.5704 | lr 8.19e-04 | (9925.25 ms | 52824 tok/s) step 465/76294 | train loss 4.908988 | norm 0.2993 | lr 8.21e-04 | (9914.08 ms | 52883 tok/s) step 466/76294 | train loss 4.880504 | norm 0.3300 | lr 8.22e-04 | (9894.79 ms | 52986 tok/s) step 467/76294 | train loss 4.989271 | norm 0.3983 | lr 8.24e-04 | (9893.04 ms | 52996 tok/s) step 468/76294 | train loss 4.926760 | norm 0.6860 | 
lr 8.25e-04 | (9896.00 ms | 52980 tok/s) step 469/76294 | train loss 4.913934 | norm 0.5498 | lr 8.27e-04 | (9894.49 ms | 52988 tok/s) step 470/76294 | train loss 5.001529 | norm 0.5198 | lr 8.28e-04 | (9897.39 ms | 52972 tok/s) step 471/76294 | train loss 5.012350 | norm 0.4983 | lr 8.30e-04 | (9894.68 ms | 52987 tok/s) step 472/76294 | train loss 4.903589 | norm 0.4358 | lr 8.31e-04 | (9895.81 ms | 52981 tok/s) step 473/76294 | train loss 4.919457 | norm 0.4012 | lr 8.33e-04 | (9893.89 ms | 52991 tok/s) step 474/76294 | train loss 4.908997 | norm 0.3947 | lr 8.34e-04 | (9897.05 ms | 52974 tok/s) step 475/76294 | train loss 4.958866 | norm 0.3297 | lr 8.36e-04 | (9892.43 ms | 52999 tok/s) step 476/76294 | train loss 5.042771 | norm 0.3662 | lr 8.37e-04 | (9892.11 ms | 53001 tok/s) step 477/76294 | train loss 4.955619 | norm 0.3636 | lr 8.39e-04 | (9891.33 ms | 53005 tok/s) step 478/76294 | train loss 4.867921 | norm 0.3679 | lr 8.41e-04 | (9900.79 ms | 52954 tok/s) step 479/76294 | train loss 4.880584 | norm 0.2904 | lr 8.42e-04 | (9958.44 ms | 52648 tok/s) step 480/76294 | train loss 4.921338 | norm 0.4321 | lr 8.44e-04 | (9885.83 ms | 53034 tok/s) step 481/76294 | train loss 4.891218 | norm 0.4622 | lr 8.45e-04 | (9887.12 ms | 53027 tok/s) step 482/76294 | train loss 4.872432 | norm 1.1371 | lr 8.47e-04 | (9916.54 ms | 52870 tok/s) step 483/76294 | train loss 4.899729 | norm 0.4258 | lr 8.48e-04 | (9892.33 ms | 52999 tok/s) step 484/76294 | train loss 4.872415 | norm 0.3535 | lr 8.50e-04 | (9888.78 ms | 53018 tok/s) step 485/76294 | train loss 4.881004 | norm 0.7676 | lr 8.51e-04 | (10986.51 ms | 47721 tok/s) step 486/76294 | train loss 4.894490 | norm 0.4720 | lr 8.53e-04 | (9877.99 ms | 53076 tok/s) step 487/76294 | train loss 4.892941 | norm 0.4674 | lr 8.54e-04 | (9920.62 ms | 52848 tok/s) step 488/76294 | train loss 4.837861 | norm 0.6296 | lr 8.56e-04 | (9886.23 ms | 53032 tok/s) step 489/76294 | train loss 4.911885 | norm 1.1080 | lr 8.57e-04 | (9908.13 
ms | 52915 tok/s) step 490/76294 | train loss 5.008920 | norm 1.8103 | lr 8.59e-04 | (9924.48 ms | 52828 tok/s) step 491/76294 | train loss 4.981424 | norm 1.2358 | lr 8.60e-04 | (9904.59 ms | 52934 tok/s) step 492/76294 | train loss 5.086230 | norm 1.6473 | lr 8.62e-04 | (9903.10 ms | 52942 tok/s) step 493/76294 | train loss 5.493223 | norm 21.1895 | lr 8.63e-04 | (9957.66 ms | 52652 tok/s) step 494/76294 | train loss 5.275486 | norm 4.1374 | lr 8.65e-04 | (9895.72 ms | 52981 tok/s) step 495/76294 | train loss 5.111360 | norm 1.1324 | lr 8.66e-04 | (9900.23 ms | 52957 tok/s) step 496/76294 | train loss 5.059646 | norm 1.1455 | lr 8.68e-04 | (9935.15 ms | 52771 tok/s) step 497/76294 | train loss 5.072196 | norm 0.9151 | lr 8.69e-04 | (9893.90 ms | 52991 tok/s) step 498/76294 | train loss 5.142521 | norm 1.5368 | lr 8.71e-04 | (9913.94 ms | 52884 tok/s) step 499/76294 | train loss 5.111035 | norm 0.6061 | lr 8.72e-04 | (9894.61 ms | 52987 tok/s) step 500/76294 | train loss 5.037500 | norm 0.7067 | lr 8.74e-04 | (9898.45 ms | 52967 tok/s) val loss: 5.061464 saving model checkpoint to ./results/gpt2-350M-gqa/step_500.pth step 501/76294 | train loss 5.059775 | norm 0.7370 | lr 8.75e-04 | (9981.80 ms | 52524 tok/s) step 502/76294 | train loss 4.985019 | norm 0.8311 | lr 8.77e-04 | (9883.21 ms | 53048 tok/s) step 503/76294 | train loss 5.046012 | norm 0.6470 | lr 8.78e-04 | (9883.67 ms | 53046 tok/s) step 504/76294 | train loss 5.000539 | norm 0.7821 | lr 8.80e-04 | (9923.00 ms | 52836 tok/s) step 505/76294 | train loss 4.956366 | norm 0.5645 | lr 8.81e-04 | (9891.40 ms | 53004 tok/s) step 506/76294 | train loss 4.928059 | norm 0.5028 | lr 8.83e-04 | (9920.14 ms | 52851 tok/s) step 507/76294 | train loss 5.000540 | norm 0.4070 | lr 8.84e-04 | (9966.26 ms | 52606 tok/s) step 508/76294 | train loss 4.926644 | norm 0.3678 | lr 8.86e-04 | (9890.31 ms | 53010 tok/s) step 509/76294 | train loss 4.891923 | norm 0.3861 | lr 8.87e-04 | (9904.63 ms | 52934 tok/s) step 510/76294 | 
train loss 4.929175 | norm 0.3712 | lr 8.89e-04 | (9895.70 ms | 52981 tok/s) step 511/76294 | train loss 4.890589 | norm 0.3554 | lr 8.90e-04 | (9908.32 ms | 52914 tok/s) step 512/76294 | train loss 4.909049 | norm 0.4076 | lr 8.92e-04 | (9897.51 ms | 52972 tok/s) step 513/76294 | train loss 4.979857 | norm 0.4656 | lr 8.93e-04 | (9906.28 ms | 52925 tok/s) step 514/76294 | train loss 4.830269 | norm 0.3934 | lr 8.95e-04 | (9892.61 ms | 52998 tok/s) step 515/76294 | train loss 4.795240 | norm 0.3861 | lr 8.96e-04 | (9916.96 ms | 52868 tok/s) step 516/76294 | train loss 4.809076 | norm 0.3075 | lr 8.98e-04 | (9897.66 ms | 52971 tok/s) step 517/76294 | train loss 4.816547 | norm 0.3311 | lr 8.99e-04 | (9972.53 ms | 52573 tok/s) step 518/76294 | train loss 4.748245 | norm 0.3421 | lr 9.01e-04 | (9894.99 ms | 52985 tok/s) step 519/76294 | train loss 4.840396 | norm 0.3703 | lr 9.02e-04 | (9886.85 ms | 53029 tok/s) step 520/76294 | train loss 4.884612 | norm 0.5147 | lr 9.04e-04 | (9949.57 ms | 52695 tok/s) step 521/76294 | train loss 4.853326 | norm 0.5278 | lr 9.05e-04 | (9890.72 ms | 53008 tok/s) step 522/76294 | train loss 4.817846 | norm 0.4507 | lr 9.07e-04 | (9896.91 ms | 52975 tok/s) step 523/76294 | train loss 4.846192 | norm 0.4805 | lr 9.08e-04 | (9963.41 ms | 52621 tok/s) step 524/76294 | train loss 4.815675 | norm 0.3439 | lr 9.10e-04 | (9895.17 ms | 52984 tok/s) step 525/76294 | train loss 4.792274 | norm 0.4072 | lr 9.11e-04 | (9965.69 ms | 52609 tok/s) step 526/76294 | train loss 4.885030 | norm 0.3950 | lr 9.13e-04 | (9900.24 ms | 52957 tok/s) step 527/76294 | train loss 4.803636 | norm 0.3195 | lr 9.15e-04 | (9988.84 ms | 52487 tok/s) step 528/76294 | train loss 4.846376 | norm 0.3019 | lr 9.16e-04 | (9898.70 ms | 52965 tok/s) step 529/76294 | train loss 4.822646 | norm 0.3268 | lr 9.18e-04 | (9937.16 ms | 52760 tok/s) step 530/76294 | train loss 4.772926 | norm 0.3071 | lr 9.19e-04 | (9897.04 ms | 52974 tok/s) step 531/76294 | train loss 4.773027 | 
norm 0.2650 | lr 9.21e-04 | (9907.29 ms | 52919 tok/s) step 532/76294 | train loss 4.775497 | norm 0.2511 | lr 9.22e-04 | (9902.72 ms | 52944 tok/s) step 533/76294 | train loss 4.821372 | norm 0.2512 | lr 9.24e-04 | (9910.52 ms | 52902 tok/s) step 534/76294 | train loss 4.782161 | norm 0.2916 | lr 9.25e-04 | (9930.79 ms | 52794 tok/s) step 535/76294 | train loss 4.757273 | norm 0.2474 | lr 9.27e-04 | (9896.35 ms | 52978 tok/s) step 536/76294 | train loss 4.810995 | norm 0.2806 | lr 9.28e-04 | (9928.33 ms | 52807 tok/s) step 537/76294 | train loss 4.769211 | norm 0.2814 | lr 9.30e-04 | (9971.49 ms | 52579 tok/s) step 538/76294 | train loss 4.714281 | norm 0.3335 | lr 9.31e-04 | (9929.04 ms | 52803 tok/s) step 539/76294 | train loss 4.782691 | norm 0.4495 | lr 9.33e-04 | (9894.62 ms | 52987 tok/s) step 540/76294 | train loss 4.833931 | norm 0.4572 | lr 9.34e-04 | (9895.45 ms | 52983 tok/s) step 541/76294 | train loss 4.759755 | norm 0.4307 | lr 9.36e-04 | (9900.22 ms | 52957 tok/s) step 542/76294 | train loss 4.729202 | norm 0.4601 | lr 9.37e-04 | (9886.41 ms | 53031 tok/s) step 543/76294 | train loss 4.897824 | norm 0.4522 | lr 9.39e-04 | (9897.91 ms | 52970 tok/s) step 544/76294 | train loss 4.771846 | norm 0.4316 | lr 9.40e-04 | (9886.58 ms | 53030 tok/s) step 545/76294 | train loss 4.806242 | norm 0.4158 | lr 9.42e-04 | (9911.30 ms | 52898 tok/s) step 546/76294 | train loss 4.761252 | norm 0.3959 | lr 9.43e-04 | (9966.78 ms | 52604 tok/s) step 547/76294 | train loss 4.775460 | norm 0.3829 | lr 9.45e-04 | (9884.60 ms | 53041 tok/s) step 548/76294 | train loss 4.751615 | norm 0.3339 | lr 9.46e-04 | (9882.40 ms | 53053 tok/s) step 549/76294 | train loss 4.734097 | norm 0.3874 | lr 9.48e-04 | (9889.62 ms | 53014 tok/s) step 550/76294 | train loss 4.781749 | norm 0.4626 | lr 9.49e-04 | (9920.29 ms | 52850 tok/s) step 551/76294 | train loss 4.758152 | norm 0.4455 | lr 9.51e-04 | (9888.81 ms | 53018 tok/s) step 552/76294 | train loss 4.716816 | norm 0.3774 | lr 9.52e-04 
| (9972.24 ms | 52575 tok/s) step 553/76294 | train loss 4.761244 | norm 0.3164 | lr 9.54e-04 | (9882.66 ms | 53051 tok/s) step 554/76294 | train loss 4.699126 | norm 0.3006 | lr 9.55e-04 | (9890.00 ms | 53012 tok/s) step 555/76294 | train loss 4.772230 | norm 0.3401 | lr 9.57e-04 | (9881.59 ms | 53057 tok/s) step 556/76294 | train loss 4.758654 | norm 0.3555 | lr 9.58e-04 | (9908.59 ms | 52912 tok/s) step 557/76294 | train loss 4.753751 | norm 0.3286 | lr 9.60e-04 | (9885.19 ms | 53038 tok/s) step 558/76294 | train loss 4.685586 | norm 0.3109 | lr 9.61e-04 | (9890.98 ms | 53007 tok/s) step 559/76294 | train loss 4.656309 | norm 0.2974 | lr 9.63e-04 | (9887.13 ms | 53027 tok/s) step 560/76294 | train loss 4.699413 | norm 0.3172 | lr 9.64e-04 | (9893.56 ms | 52993 tok/s) step 561/76294 | train loss 4.652070 | norm 0.2902 | lr 9.66e-04 | (9888.76 ms | 53019 tok/s) step 562/76294 | train loss 4.646441 | norm 0.2778 | lr 9.67e-04 | (9888.60 ms | 53019 tok/s) step 563/76294 | train loss 4.657440 | norm 0.2874 | lr 9.69e-04 | (9886.31 ms | 53032 tok/s) step 564/76294 | train loss 4.653122 | norm 0.2701 | lr 9.70e-04 | (9893.96 ms | 52991 tok/s) step 565/76294 | train loss 4.646162 | norm 0.3971 | lr 9.72e-04 | (9890.46 ms | 53009 tok/s) step 566/76294 | train loss 4.641915 | norm 0.2517 | lr 9.73e-04 | (9896.03 ms | 52980 tok/s) step 567/76294 | train loss 4.685186 | norm 0.2649 | lr 9.75e-04 | (9889.15 ms | 53017 tok/s) step 568/76294 | train loss 4.656653 | norm 0.2871 | lr 9.76e-04 | (9895.29 ms | 52984 tok/s) step 569/76294 | train loss 4.660697 | norm 0.3542 | lr 9.78e-04 | (9960.28 ms | 52638 tok/s) step 570/76294 | train loss 4.790763 | norm 0.4821 | lr 9.79e-04 | (9885.08 ms | 53038 tok/s) step 571/76294 | train loss 4.791978 | norm 1.0829 | lr 9.81e-04 | (9974.75 ms | 52562 tok/s) step 572/76294 | train loss 4.802705 | norm 1.1887 | lr 9.82e-04 | (9883.21 ms | 53048 tok/s) step 573/76294 | train loss 4.758778 | norm 0.6557 | lr 9.84e-04 | (10792.41 ms | 48579 
tok/s) step 574/76294 | train loss 4.903246 | norm 2.0811 | lr 9.86e-04 | (9884.16 ms | 53043 tok/s) step 575/76294 | train loss 4.828483 | norm 1.4437 | lr 9.87e-04 | (9882.51 ms | 53052 tok/s) step 576/76294 | train loss 4.976301 | norm 1.9449 | lr 9.89e-04 | (9886.62 ms | 53030 tok/s) step 577/76294 | train loss 4.870887 | norm 1.5682 | lr 9.90e-04 | (9927.32 ms | 52813 tok/s) step 578/76294 | train loss 4.917866 | norm 1.4923 | lr 9.92e-04 | (9884.92 ms | 53039 tok/s) step 579/76294 | train loss 4.779038 | norm 0.6493 | lr 9.93e-04 | (9899.55 ms | 52961 tok/s) step 580/76294 | train loss 4.865939 | norm 1.1605 | lr 9.95e-04 | (9903.55 ms | 52939 tok/s) step 581/76294 | train loss 4.956699 | norm 1.7096 | lr 9.96e-04 | (9892.27 ms | 53000 tok/s) step 582/76294 | train loss 4.819366 | norm 1.1107 | lr 9.98e-04 | (11306.83 ms | 46369 tok/s) step 583/76294 | train loss 4.818801 | norm 0.6058 | lr 9.99e-04 | (10758.28 ms | 48733 tok/s) step 584/76294 | train loss 4.778215 | norm 0.6454 | lr 1.00e-03 | (9881.26 ms | 53059 tok/s) step 585/76294 | train loss 4.814414 | norm 0.8907 | lr 1.00e-03 | (9954.53 ms | 52668 tok/s) step 586/76294 | train loss 4.795233 | norm 0.7747 | lr 1.00e-03 | (9882.93 ms | 53050 tok/s) step 587/76294 | train loss 4.763772 | norm 0.6962 | lr 1.01e-03 | (9885.54 ms | 53036 tok/s) step 588/76294 | train loss 4.802669 | norm 0.6460 | lr 1.01e-03 | (9894.39 ms | 52988 tok/s) step 589/76294 | train loss 4.759731 | norm 0.5257 | lr 1.01e-03 | (9982.95 ms | 52518 tok/s) step 590/76294 | train loss 4.815632 | norm 0.4566 | lr 1.01e-03 | (9893.58 ms | 52993 tok/s) step 591/76294 | train loss 4.655960 | norm 0.3738 | lr 1.01e-03 | (9903.32 ms | 52941 tok/s) step 592/76294 | train loss 4.770168 | norm 0.4182 | lr 1.01e-03 | (9896.89 ms | 52975 tok/s) step 593/76294 | train loss 4.758412 | norm 0.4724 | lr 1.01e-03 | (9910.09 ms | 52904 tok/s) step 594/76294 | train loss 4.670585 | norm 0.4460 | lr 1.02e-03 | (9892.12 ms | 53001 tok/s) step 595/76294 | 
train loss 4.808486 | norm 0.5888 | lr 1.02e-03 | (9898.80 ms | 52965 tok/s) step 596/76294 | train loss 4.718916 | norm 0.6058 | lr 1.02e-03 | (9899.31 ms | 52962 tok/s) step 597/76294 | train loss 4.705032 | norm 0.4764 | lr 1.02e-03 | (9906.16 ms | 52925 tok/s) step 598/76294 | train loss 4.658396 | norm 0.3908 | lr 1.02e-03 | (9922.23 ms | 52840 tok/s) step 599/76294 | train loss 4.874396 | norm 0.4017 | lr 1.02e-03 | (9905.13 ms | 52931 tok/s) step 600/76294 | train loss 4.688007 | norm 0.5450 | lr 1.02e-03 | (9895.03 ms | 52985 tok/s) step 601/76294 | train loss 4.651491 | norm 0.5578 | lr 1.03e-03 | (9894.06 ms | 52990 tok/s) step 602/76294 | train loss 4.707658 | norm 0.3295 | lr 1.03e-03 | (9898.04 ms | 52969 tok/s) step 603/76294 | train loss 4.597477 | norm 0.3251 | lr 1.03e-03 | (9894.73 ms | 52987 tok/s) step 604/76294 | train loss 4.720360 | norm 0.2819 | lr 1.03e-03 | (9892.76 ms | 52997 tok/s) step 605/76294 | train loss 4.581365 | norm 0.4523 | lr 1.03e-03 | (9893.74 ms | 52992 tok/s) step 606/76294 | train loss 4.656949 | norm 0.3229 | lr 1.03e-03 | (9894.89 ms | 52986 tok/s) step 607/76294 | train loss 4.611527 | norm 0.2445 | lr 1.04e-03 | (9899.26 ms | 52962 tok/s) step 608/76294 | train loss 4.577824 | norm 0.3139 | lr 1.04e-03 | (9930.05 ms | 52798 tok/s) step 609/76294 | train loss 4.637552 | norm 0.3570 | lr 1.04e-03 | (9894.71 ms | 52987 tok/s) step 610/76294 | train loss 4.596532 | norm 0.2712 | lr 1.04e-03 | (9906.50 ms | 52924 tok/s) step 611/76294 | train loss 4.578384 | norm 0.3059 | lr 1.04e-03 | (9896.63 ms | 52976 tok/s) step 612/76294 | train loss 4.526636 | norm 0.3031 | lr 1.04e-03 | (9891.06 ms | 53006 tok/s) step 613/76294 | train loss 4.592541 | norm 0.3261 | lr 1.04e-03 | (9900.01 ms | 52958 tok/s) step 614/76294 | train loss 4.592650 | norm 0.3411 | lr 1.05e-03 | (9888.29 ms | 53021 tok/s) step 615/76294 | train loss 4.574864 | norm 0.4527 | lr 1.05e-03 | (9920.50 ms | 52849 tok/s) step 616/76294 | train loss 4.606156 | 
norm 0.5541 | lr 1.05e-03 | (9893.84 ms | 52991 tok/s) step 617/76294 | train loss 4.577943 | norm 0.5224 | lr 1.05e-03 | (9930.85 ms | 52794 tok/s) step 618/76294 | train loss 4.630161 | norm 0.3732 | lr 1.05e-03 | (9893.03 ms | 52996 tok/s) step 619/76294 | train loss 4.600977 | norm 2.0113 | lr 1.05e-03 | (9898.09 ms | 52969 tok/s) step 620/76294 | train loss 4.683922 | norm 0.4686 | lr 1.05e-03 | (9953.85 ms | 52672 tok/s) step 621/76294 | train loss 4.672027 | norm 0.4454 | lr 1.06e-03 | (9892.14 ms | 53000 tok/s) step 622/76294 | train loss 4.572186 | norm 0.8166 | lr 1.06e-03 | (9889.57 ms | 53014 tok/s) step 623/76294 | train loss 4.658771 | norm 0.4088 | lr 1.06e-03 | (9901.96 ms | 52948 tok/s) step 624/76294 | train loss 4.501819 | norm 0.3122 | lr 1.06e-03 | (9953.75 ms | 52672 tok/s) step 625/76294 | train loss 4.619207 | norm 0.3184 | lr 1.06e-03 | (9907.96 ms | 52916 tok/s) step 626/76294 | train loss 4.548374 | norm 0.3376 | lr 1.06e-03 | (9894.99 ms | 52985 tok/s) step 627/76294 | train loss 4.598216 | norm 0.3564 | lr 1.07e-03 | (9984.84 ms | 52508 tok/s) step 628/76294 | train loss 4.524783 | norm 0.3343 | lr 1.07e-03 | (9891.06 ms | 53006 tok/s) step 629/76294 | train loss 4.538423 | norm 0.3596 | lr 1.07e-03 | (9897.93 ms | 52969 tok/s) step 630/76294 | train loss 4.571936 | norm 0.3146 | lr 1.07e-03 | (9889.98 ms | 53012 tok/s) step 631/76294 | train loss 4.514371 | norm 0.3870 | lr 1.07e-03 | (9903.18 ms | 52941 tok/s) step 632/76294 | train loss 4.618529 | norm 0.6678 | lr 1.07e-03 | (9893.85 ms | 52991 tok/s) step 633/76294 | train loss 4.475997 | norm 0.3829 | lr 1.07e-03 | (9936.75 ms | 52763 tok/s) step 634/76294 | train loss 4.605440 | norm 0.5992 | lr 1.08e-03 | (9900.03 ms | 52958 tok/s) step 635/76294 | train loss 4.517220 | norm 0.5235 | lr 1.08e-03 | (9894.13 ms | 52990 tok/s) step 636/76294 | train loss 4.565156 | norm 0.3757 | lr 1.08e-03 | (9895.47 ms | 52983 tok/s) step 637/76294 | train loss 4.550333 | norm 0.5621 | lr 1.08e-03 
| (9931.86 ms | 52789 tok/s) step 638/76294 | train loss 4.501504 | norm 0.4386 | lr 1.08e-03 | (9893.96 ms | 52991 tok/s) step 639/76294 | train loss 4.520545 | norm 0.7961 | lr 1.08e-03 | (9901.48 ms | 52950 tok/s) step 640/76294 | train loss 4.496151 | norm 0.5333 | lr 1.09e-03 | (9890.76 ms | 53008 tok/s) step 641/76294 | train loss 4.577823 | norm 0.8558 | lr 1.09e-03 | (9894.47 ms | 52988 tok/s) step 642/76294 | train loss 4.546202 | norm 0.8762 | lr 1.09e-03 | (9890.71 ms | 53008 tok/s) step 643/76294 | train loss 4.571847 | norm 0.5542 | lr 1.09e-03 | (9900.11 ms | 52958 tok/s) step 644/76294 | train loss 4.549397 | norm 0.4689 | lr 1.09e-03 | (9891.77 ms | 53002 tok/s) step 645/76294 | train loss 4.578952 | norm 0.2959 | lr 1.09e-03 | (9956.83 ms | 52656 tok/s) step 646/76294 | train loss 4.429479 | norm 0.2812 | lr 1.09e-03 | (9892.73 ms | 52997 tok/s) step 647/76294 | train loss 4.500968 | norm 6.0771 | lr 1.10e-03 | (9902.55 ms | 52945 tok/s) step 648/76294 | train loss 4.621703 | norm 4.3254 | lr 1.10e-03 | (9894.15 ms | 52990 tok/s) step 649/76294 | train loss 4.523185 | norm 0.5507 | lr 1.10e-03 | (9903.39 ms | 52940 tok/s) step 650/76294 | train loss 4.501370 | norm 0.4912 | lr 1.10e-03 | (9893.22 ms | 52995 tok/s) step 651/76294 | train loss 4.519279 | norm 0.4925 | lr 1.10e-03 | (9901.40 ms | 52951 tok/s) step 652/76294 | train loss 4.594001 | norm 0.6813 | lr 1.10e-03 | (9894.03 ms | 52990 tok/s) step 653/76294 | train loss 4.587995 | norm 0.4791 | lr 1.10e-03 | (9972.52 ms | 52573 tok/s) step 654/76294 | train loss 4.540799 | norm 0.5362 | lr 1.11e-03 | (9900.19 ms | 52957 tok/s) step 655/76294 | train loss 4.502909 | norm 0.7415 | lr 1.11e-03 | (9898.34 ms | 52967 tok/s) step 656/76294 | train loss 4.625202 | norm 1.0179 | lr 1.11e-03 | (9930.31 ms | 52797 tok/s) step 657/76294 | train loss 4.451262 | norm 1.0028 | lr 1.11e-03 | (9895.42 ms | 52983 tok/s) step 658/76294 | train loss 4.478811 | norm 0.8583 | lr 1.11e-03 | (9899.32 ms | 52962 
tok/s) step 659/76294 | train loss 4.545234 | norm 0.9812 | lr 1.11e-03 | (9967.01 ms | 52602 tok/s) step 660/76294 | train loss 4.532812 | norm 0.9230 | lr 1.12e-03 | (9892.62 ms | 52998 tok/s) step 661/76294 | train loss 4.574505 | norm 0.8896 | lr 1.12e-03 | (9893.42 ms | 52994 tok/s) step 662/76294 | train loss 4.545694 | norm 1.7693 | lr 1.12e-03 | (9963.45 ms | 52621 tok/s) step 663/76294 | train loss 4.590130 | norm 1.3299 | lr 1.12e-03 | (9918.52 ms | 52859 tok/s) step 664/76294 | train loss 4.499274 | norm 1.3823 | lr 1.12e-03 | (9952.92 ms | 52677 tok/s) step 665/76294 | train loss 4.571806 | norm 0.6508 | lr 1.12e-03 | (9899.89 ms | 52959 tok/s) step 666/76294 | train loss 4.456483 | norm 1.8632 | lr 1.12e-03 | (9898.73 ms | 52965 tok/s) step 667/76294 | train loss 4.551179 | norm 0.5397 | lr 1.13e-03 | (9936.79 ms | 52762 tok/s) step 668/76294 | train loss 4.546321 | norm 0.4944 | lr 1.13e-03 | (9893.44 ms | 52993 tok/s) step 669/76294 | train loss 4.537047 | norm 0.5151 | lr 1.13e-03 | (9904.58 ms | 52934 tok/s) step 670/76294 | train loss 4.634519 | norm 0.5722 | lr 1.13e-03 | (9894.62 ms | 52987 tok/s) step 671/76294 | train loss 4.527796 | norm 0.4325 | lr 1.13e-03 | (9901.17 ms | 52952 tok/s) step 672/76294 | train loss 4.494952 | norm 0.6147 | lr 1.13e-03 | (9897.24 ms | 52973 tok/s) step 673/76294 | train loss 4.515805 | norm 6.9318 | lr 1.14e-03 | (9919.12 ms | 52856 tok/s) step 674/76294 | train loss 4.578739 | norm 0.9390 | lr 1.14e-03 | (9941.32 ms | 52738 tok/s) step 675/76294 | train loss 4.625749 | norm 0.9940 | lr 1.14e-03 | (9906.87 ms | 52922 tok/s) step 676/76294 | train loss 4.673625 | norm 1.4298 | lr 1.14e-03 | (9916.73 ms | 52869 tok/s) step 677/76294 | train loss 4.597513 | norm 1.8965 | lr 1.14e-03 | (9897.47 ms | 52972 tok/s) step 678/76294 | train loss 4.661639 | norm 1.9505 | lr 1.14e-03 | (9901.14 ms | 52952 tok/s) step 679/76294 | train loss 4.636376 | norm 0.9805 | lr 1.14e-03 | (9899.57 ms | 52961 tok/s) step 680/76294 | 
train loss 4.661345 | norm 1.0898 | lr 1.15e-03 | (9900.16 ms | 52958 tok/s) step 681/76294 | train loss 4.637335 | norm 1.6030 | lr 1.15e-03 | (11362.29 ms | 46143 tok/s) step 682/76294 | train loss 4.626709 | norm 0.7260 | lr 1.15e-03 | (9883.45 ms | 53047 tok/s) step 683/76294 | train loss 4.606590 | norm 0.5583 | lr 1.15e-03 | (9904.47 ms | 52934 tok/s) step 684/76294 | train loss 4.523669 | norm 0.4524 | lr 1.15e-03 | (9888.86 ms | 53018 tok/s) step 685/76294 | train loss 4.571009 | norm 0.4210 | lr 1.15e-03 | (9901.30 ms | 52951 tok/s) step 686/76294 | train loss 4.443433 | norm 0.3856 | lr 1.15e-03 | (9911.98 ms | 52894 tok/s) step 687/76294 | train loss 4.503326 | norm 0.3771 | lr 1.16e-03 | (9903.67 ms | 52939 tok/s) step 688/76294 | train loss 4.476548 | norm 0.3477 | lr 1.16e-03 | (9900.37 ms | 52956 tok/s) step 689/76294 | train loss 4.548041 | norm 0.2894 | lr 1.16e-03 | (9905.26 ms | 52930 tok/s) step 690/76294 | train loss 4.475660 | norm 0.2943 | lr 1.16e-03 | (9901.09 ms | 52953 tok/s) step 691/76294 | train loss 4.456817 | norm 0.3001 | lr 1.16e-03 | (9905.33 ms | 52930 tok/s) step 692/76294 | train loss 4.424180 | norm 0.2912 | lr 1.16e-03 | (9914.49 ms | 52881 tok/s) step 693/76294 | train loss 4.456784 | norm 0.3057 | lr 1.17e-03 | (9903.02 ms | 52942 tok/s) step 694/76294 | train loss 4.392532 | norm 0.3657 | lr 1.17e-03 | (9907.48 ms | 52918 tok/s) step 695/76294 | train loss 4.506681 | norm 0.4643 | lr 1.17e-03 | (9904.55 ms | 52934 tok/s) step 696/76294 | train loss 4.403083 | norm 0.5398 | lr 1.17e-03 | (9893.92 ms | 52991 tok/s) step 697/76294 | train loss 4.475311 | norm 0.5005 | lr 1.17e-03 | (9926.25 ms | 52818 tok/s) step 698/76294 | train loss 4.445308 | norm 0.4685 | lr 1.17e-03 | (9906.39 ms | 52924 tok/s) step 699/76294 | train loss 4.450196 | norm 0.4263 | lr 1.17e-03 | (9942.86 ms | 52730 tok/s) step 700/76294 | train loss 4.391058 | norm 0.4656 | lr 1.18e-03 | (9908.98 ms | 52910 tok/s) step 701/76294 | train loss 4.473367 | 
norm 0.3093 | lr 1.18e-03 | (9901.93 ms | 52948 tok/s) step 702/76294 | train loss 4.440017 | norm 0.2845 | lr 1.18e-03 | (9903.32 ms | 52941 tok/s) step 703/76294 | train loss 4.414283 | norm 0.2790 | lr 1.18e-03 | (9897.77 ms | 52970 tok/s) step 704/76294 | train loss 4.443832 | norm 0.3147 | lr 1.18e-03 | (9909.32 ms | 52909 tok/s) step 705/76294 | train loss 4.432413 | norm 0.2972 | lr 1.18e-03 | (9968.53 ms | 52594 tok/s) step 706/76294 | train loss 4.429738 | norm 0.2627 | lr 1.18e-03 | (9945.00 ms | 52719 tok/s) step 707/76294 | train loss 4.395833 | norm 0.2583 | lr 1.19e-03 | (9895.87 ms | 52980 tok/s) step 708/76294 | train loss 4.370791 | norm 0.2768 | lr 1.19e-03 | (9890.36 ms | 53010 tok/s) step 709/76294 | train loss 4.381947 | norm 0.2207 | lr 1.19e-03 | (9898.68 ms | 52965 tok/s) step 710/76294 | train loss 4.403737 | norm 0.2630 | lr 1.19e-03 | (9932.82 ms | 52783 tok/s) step 711/76294 | train loss 4.464942 | norm 0.2661 | lr 1.19e-03 | (9899.65 ms | 52960 tok/s) step 712/76294 | train loss 4.349000 | norm 0.2826 | lr 1.19e-03 | (9899.33 ms | 52962 tok/s) step 713/76294 | train loss 4.383976 | norm 0.2966 | lr 1.20e-03 | (9896.10 ms | 52979 tok/s) step 714/76294 | train loss 4.369535 | norm 0.2748 | lr 1.20e-03 | (9951.42 ms | 52685 tok/s) step 715/76294 | train loss 4.379096 | norm 0.2923 | lr 1.20e-03 | (9895.85 ms | 52981 tok/s) step 716/76294 | train loss 4.367143 | norm 0.2875 | lr 1.20e-03 | (9976.19 ms | 52554 tok/s) step 717/76294 | train loss 4.366198 | norm 0.3053 | lr 1.20e-03 | (9894.74 ms | 52987 tok/s) step 718/76294 | train loss 4.341451 | norm 0.3772 | lr 1.20e-03 | (9895.67 ms | 52982 tok/s) step 719/76294 | train loss 4.371810 | norm 0.4291 | lr 1.20e-03 | (9933.05 ms | 52782 tok/s) step 720/76294 | train loss 4.283920 | norm 0.3377 | lr 1.20e-03 | (9889.37 ms | 53015 tok/s) step 721/76294 | train loss 4.419213 | norm 0.2751 | lr 1.20e-03 | (9899.40 ms | 52962 tok/s) step 722/76294 | train loss 4.333280 | norm 0.3027 | lr 1.20e-03 
| (9935.22 ms | 52771 tok/s) step 723/76294 | train loss 4.408252 | norm 0.2856 | lr 1.20e-03 | (9895.13 ms | 52984 tok/s) step 724/76294 | train loss 4.350908 | norm 0.2990 | lr 1.20e-03 | (9901.83 ms | 52949 tok/s) step 725/76294 | train loss 4.316400 | norm 0.3576 | lr 1.20e-03 | (9891.03 ms | 53006 tok/s) step 726/76294 | train loss 4.325500 | norm 0.3570 | lr 1.20e-03 | (9895.78 ms | 52981 tok/s) step 727/76294 | train loss 4.312237 | norm 0.3839 | lr 1.20e-03 | (9896.96 ms | 52975 tok/s) step 728/76294 | train loss 4.338177 | norm 0.3095 | lr 1.20e-03 | (9899.30 ms | 52962 tok/s) step 729/76294 | train loss 4.386132 | norm 0.3386 | lr 1.20e-03 | (9895.66 ms | 52982 tok/s) step 730/76294 | train loss 4.383256 | norm 0.3392 | lr 1.20e-03 | (9888.45 ms | 53020 tok/s) step 731/76294 | train loss 4.314255 | norm 0.2816 | lr 1.20e-03 | (9908.47 ms | 52913 tok/s) step 732/76294 | train loss 4.381884 | norm 0.2839 | lr 1.20e-03 | (9930.41 ms | 52796 tok/s) step 733/76294 | train loss 4.290462 | norm 0.2764 | lr 1.20e-03 | (9886.21 ms | 53032 tok/s) step 734/76294 | train loss 4.378951 | norm 0.2849 | lr 1.20e-03 | (9922.33 ms | 52839 tok/s) step 735/76294 | train loss 4.253671 | norm 0.2803 | lr 1.20e-03 | (9890.92 ms | 53007 tok/s) step 736/76294 | train loss 4.313334 | norm 0.2599 | lr 1.20e-03 | (9895.56 ms | 52982 tok/s) step 737/76294 | train loss 4.375547 | norm 0.2944 | lr 1.20e-03 | (9913.31 ms | 52887 tok/s) step 738/76294 | train loss 4.305025 | norm 0.3464 | lr 1.20e-03 | (9891.10 ms | 53006 tok/s) step 739/76294 | train loss 4.344303 | norm 0.3361 | lr 1.20e-03 | (9886.63 ms | 53030 tok/s) step 740/76294 | train loss 4.337162 | norm 0.3170 | lr 1.20e-03 | (9893.11 ms | 52995 tok/s) step 741/76294 | train loss 4.377285 | norm 0.3308 | lr 1.20e-03 | (9886.68 ms | 53030 tok/s) step 742/76294 | train loss 4.273127 | norm 0.2804 | lr 1.20e-03 | (9891.56 ms | 53004 tok/s) step 743/76294 | train loss 4.433258 | norm 0.2699 | lr 1.20e-03 | (9888.53 ms | 53020 
tok/s) step 744/76294 | train loss 4.267642 | norm 0.3157 | lr 1.20e-03 | (9893.19 ms | 52995 tok/s) step 745/76294 | train loss 4.353263 | norm 0.3174 | lr 1.20e-03 | (9886.59 ms | 53030 tok/s) step 746/76294 | train loss 4.301746 | norm 0.5383 | lr 1.20e-03 | (9894.65 ms | 52987 tok/s) step 747/76294 | train loss 4.332289 | norm 0.3867 | lr 1.20e-03 | (9884.40 ms | 53042 tok/s) step 748/76294 | train loss 4.264432 | norm 0.3583 | lr 1.20e-03 | (9882.58 ms | 53052 tok/s) step 749/76294 | train loss 4.322043 | norm 0.3295 | lr 1.20e-03 | (9903.01 ms | 52942 tok/s) step 750/76294 | train loss 4.279765 | norm 0.2903 | lr 1.20e-03 | (9932.18 ms | 52787 tok/s) val loss: 4.337692 saving model checkpoint to ./results/gpt2-350M-gqa/step_750.pth step 751/76294 | train loss 4.304524 | norm 0.2748 | lr 1.20e-03 | (9949.21 ms | 52696 tok/s) step 752/76294 | train loss 4.297956 | norm 0.3390 | lr 1.20e-03 | (9888.49 ms | 53020 tok/s) step 753/76294 | train loss 4.300531 | norm 0.2716 | lr 1.20e-03 | (9886.65 ms | 53030 tok/s) step 754/76294 | train loss 4.320193 | norm 0.5645 | lr 1.20e-03 | (9945.99 ms | 52713 tok/s) step 755/76294 | train loss 4.269460 | norm 0.3719 | lr 1.20e-03 | (9885.86 ms | 53034 tok/s) step 756/76294 | train loss 4.295651 | norm 0.6318 | lr 1.20e-03 | (9971.84 ms | 52577 tok/s) step 757/76294 | train loss 4.335117 | norm 0.7449 | lr 1.20e-03 | (9887.99 ms | 53023 tok/s) step 758/76294 | train loss 4.434615 | norm 0.6826 | lr 1.20e-03 | (9891.37 ms | 53005 tok/s) step 759/76294 | train loss 4.354695 | norm 0.6773 | lr 1.20e-03 | (9893.18 ms | 52995 tok/s) step 760/76294 | train loss 4.353889 | norm 0.4848 | lr 1.20e-03 | (9935.95 ms | 52767 tok/s) step 761/76294 | train loss 4.331406 | norm 0.8156 | lr 1.20e-03 | (9895.16 ms | 52984 tok/s) step 762/76294 | train loss 4.390905 | norm 0.4651 | lr 1.20e-03 | (9904.20 ms | 52936 tok/s) step 763/76294 | train loss 4.391034 | norm 0.4246 | lr 1.20e-03 | (10203.78 ms | 51382 tok/s) step 764/76294 | train loss 
4.375451 | norm 1.1573 | lr 1.20e-03 | (9928.41 ms | 52807 tok/s) step 765/76294 | train loss 4.354039 | norm 0.3603 | lr 1.20e-03 | (9897.84 ms | 52970 tok/s) step 766/76294 | train loss 4.432147 | norm 0.3450 | lr 1.20e-03 | (10443.21 ms | 50204 tok/s) step 767/76294 | train loss 4.341081 | norm 0.4136 | lr 1.20e-03 | (9986.21 ms | 52501 tok/s) step 768/76294 | train loss 4.417610 | norm 0.4121 | lr 1.20e-03 | (9898.73 ms | 52965 tok/s) step 769/76294 | train loss 4.326787 | norm 0.4256 | lr 1.20e-03 | (9902.34 ms | 52946 tok/s) step 770/76294 | train loss 4.371364 | norm 0.4113 | lr 1.20e-03 | (9941.04 ms | 52740 tok/s) step 771/76294 | train loss 4.363015 | norm 0.3825 | lr 1.20e-03 | (9901.15 ms | 52952 tok/s) step 772/76294 | train loss 4.366879 | norm 0.3749 | lr 1.20e-03 | (9901.43 ms | 52951 tok/s) step 773/76294 | train loss 4.384025 | norm 0.3490 | lr 1.20e-03 | (9922.42 ms | 52839 tok/s) step 774/76294 | train loss 4.299470 | norm 0.3711 | lr 1.20e-03 | (9905.02 ms | 52932 tok/s) step 775/76294 | train loss 4.347232 | norm 0.3369 | lr 1.20e-03 | (9899.23 ms | 52963 tok/s) step 776/76294 | train loss 4.337865 | norm 0.3464 | lr 1.20e-03 | (9905.04 ms | 52931 tok/s) step 777/76294 | train loss 4.320539 | norm 0.3171 | lr 1.20e-03 | (9899.03 ms | 52964 tok/s) step 778/76294 | train loss 4.295093 | norm 0.2618 | lr 1.20e-03 | (11148.14 ms | 47029 tok/s) step 779/76294 | train loss 4.308771 | norm 0.2873 | lr 1.20e-03 | (9889.48 ms | 53015 tok/s) step 780/76294 | train loss 4.304818 | norm 0.2801 | lr 1.20e-03 | (9901.61 ms | 52950 tok/s) step 781/76294 | train loss 4.279483 | norm 0.2654 | lr 1.20e-03 | (9946.53 ms | 52711 tok/s) step 782/76294 | train loss 4.296213 | norm 0.3442 | lr 1.20e-03 | (9899.62 ms | 52960 tok/s) step 783/76294 | train loss 4.282892 | norm 0.2936 | lr 1.20e-03 | (9906.79 ms | 52922 tok/s) step 784/76294 | train loss 4.346374 | norm 0.5776 | lr 1.20e-03 | (13788.97 ms | 38022 tok/s) step 785/76294 | train loss 4.356223 | norm 0.3642 
| lr 1.20e-03 | (11503.62 ms | 45576 tok/s) step 786/76294 | train loss 4.448181 | norm 0.4027 | lr 1.20e-03 | (9886.66 ms | 53030 tok/s) step 787/76294 | train loss 4.299244 | norm 0.3843 | lr 1.20e-03 | (9884.70 ms | 53040 tok/s) step 788/76294 | train loss 4.415982 | norm 0.3868 | lr 1.20e-03 | (9878.68 ms | 53073 tok/s) step 789/76294 | train loss 4.302353 | norm 0.3459 | lr 1.20e-03 | (9974.92 ms | 52561 tok/s) step 790/76294 | train loss 4.304962 | norm 0.3272 | lr 1.20e-03 | (9897.62 ms | 52971 tok/s) step 791/76294 | train loss 4.297873 | norm 0.2933 | lr 1.20e-03 | (9904.42 ms | 52935 tok/s) step 792/76294 | train loss 4.244191 | norm 0.2852 | lr 1.20e-03 | (9893.03 ms | 52996 tok/s) step 793/76294 | train loss 4.317328 | norm 0.3069 | lr 1.20e-03 | (9985.59 ms | 52504 tok/s) step 794/76294 | train loss 4.260611 | norm 0.2707 | lr 1.20e-03 | (9911.52 ms | 52897 tok/s) step 795/76294 | train loss 4.313509 | norm 0.2624 | lr 1.20e-03 | (9971.29 ms | 52580 tok/s) step 796/76294 | train loss 4.218153 | norm 0.3162 | lr 1.20e-03 | (9967.52 ms | 52600 tok/s) step 797/76294 | train loss 4.261139 | norm 0.3245 | lr 1.20e-03 | (9973.56 ms | 52568 tok/s) step 798/76294 | train loss 4.264166 | norm 0.2744 | lr 1.20e-03 | (9948.98 ms | 52698 tok/s) step 799/76294 | train loss 4.263732 | norm 0.2509 | lr 1.20e-03 | (9902.91 ms | 52943 tok/s) step 800/76294 | train loss 4.265678 | norm 0.2462 | lr 1.20e-03 | (9901.40 ms | 52951 tok/s) step 801/76294 | train loss 4.248132 | norm 0.2115 | lr 1.20e-03 | (9915.04 ms | 52878 tok/s) step 802/76294 | train loss 4.283677 | norm 0.2417 | lr 1.20e-03 | (9901.21 ms | 52952 tok/s) step 803/76294 | train loss 4.236804 | norm 0.2425 | lr 1.20e-03 | (9930.32 ms | 52797 tok/s) step 804/76294 | train loss 4.247781 | norm 0.2326 | lr 1.20e-03 | (9895.40 ms | 52983 tok/s) step 805/76294 | train loss 4.196436 | norm 0.3438 | lr 1.20e-03 | (9898.17 ms | 52968 tok/s) step 806/76294 | train loss 4.234895 | norm 0.3068 | lr 1.20e-03 | (9897.48 
ms | 52972 tok/s) step 807/76294 | train loss 4.246832 | norm 0.2513 | lr 1.20e-03 | (9928.40 ms | 52807 tok/s) step 808/76294 | train loss 4.263801 | norm 0.2827 | lr 1.20e-03 | (9893.53 ms | 52993 tok/s) step 809/76294 | train loss 4.232164 | norm 0.3230 | lr 1.20e-03 | (9895.45 ms | 52983 tok/s) step 810/76294 | train loss 4.247456 | norm 0.2903 | lr 1.20e-03 | (9896.71 ms | 52976 tok/s) step 811/76294 | train loss 4.235573 | norm 0.2664 | lr 1.20e-03 | (9894.91 ms | 52986 tok/s) step 812/76294 | train loss 4.238935 | norm 0.2670 | lr 1.20e-03 | (9898.51 ms | 52966 tok/s) step 813/76294 | train loss 4.277710 | norm 0.2382 | lr 1.20e-03 | (9894.01 ms | 52990 tok/s) step 814/76294 | train loss 4.264829 | norm 0.2129 | lr 1.20e-03 | (9898.81 ms | 52965 tok/s) step 815/76294 | train loss 4.291865 | norm 0.2020 | lr 1.20e-03 | (9894.64 ms | 52987 tok/s) step 816/76294 | train loss 4.170051 | norm 0.2290 | lr 1.20e-03 | (9896.69 ms | 52976 tok/s) step 817/76294 | train loss 4.173640 | norm 0.2600 | lr 1.20e-03 | (9895.54 ms | 52982 tok/s) step 818/76294 | train loss 4.315606 | norm 0.2419 | lr 1.20e-03 | (9897.06 ms | 52974 tok/s) step 819/76294 | train loss 4.238867 | norm 0.2938 | lr 1.20e-03 | (9894.24 ms | 52989 tok/s) step 820/76294 | train loss 4.196881 | norm 0.3180 | lr 1.20e-03 | (9900.66 ms | 52955 tok/s) step 821/76294 | train loss 4.256514 | norm 0.3930 | lr 1.20e-03 | (9895.18 ms | 52984 tok/s) step 822/76294 | train loss 4.290561 | norm 0.6561 | lr 1.20e-03 | (9898.48 ms | 52967 tok/s) step 823/76294 | train loss 4.285634 | norm 0.5994 | lr 1.20e-03 | (9894.07 ms | 52990 tok/s) step 824/76294 | train loss 4.374467 | norm 0.5320 | lr 1.20e-03 | (9890.68 ms | 53008 tok/s) step 825/76294 | train loss 4.304547 | norm 0.5827 | lr 1.20e-03 | (9889.07 ms | 53017 tok/s) step 826/76294 | train loss 4.306759 | norm 0.4870 | lr 1.20e-03 | (9897.02 ms | 52974 tok/s) step 827/76294 | train loss 4.263971 | norm 0.3718 | lr 1.20e-03 | (9896.10 ms | 52979 tok/s) step 
828/76294 | train loss 4.283217 | norm 0.3790 | lr 1.20e-03 | (10067.96 ms | 52075 tok/s) step 829/76294 | train loss 4.316199 | norm 0.3108 | lr 1.20e-03 | (9890.66 ms | 53008 tok/s) step 830/76294 | train loss 4.253509 | norm 0.2999 | lr 1.20e-03 | (9932.66 ms | 52784 tok/s) step 831/76294 | train loss 4.277315 | norm 0.2939 | lr 1.20e-03 | (9958.25 ms | 52649 tok/s) step 832/76294 | train loss 4.308788 | norm 0.3180 | lr 1.20e-03 | (9891.33 ms | 53005 tok/s) step 833/76294 | train loss 4.298710 | norm 0.2857 | lr 1.20e-03 | (9964.04 ms | 52618 tok/s) step 834/76294 | train loss 4.301363 | norm 0.2694 | lr 1.20e-03 | (9891.26 ms | 53005 tok/s) step 835/76294 | train loss 4.255923 | norm 0.3036 | lr 1.20e-03 | (9901.95 ms | 52948 tok/s) step 836/76294 | train loss 4.260345 | norm 0.2667 | lr 1.20e-03 | (9957.43 ms | 52653 tok/s) step 837/76294 | train loss 4.177586 | norm 0.2408 | lr 1.20e-03 | (9888.26 ms | 53021 tok/s) step 838/76294 | train loss 4.184165 | norm 0.2547 | lr 1.20e-03 | (9898.43 ms | 52967 tok/s) step 839/76294 | train loss 4.203111 | norm 1.2864 | lr 1.20e-03 | (9896.35 ms | 52978 tok/s) step 840/76294 | train loss 4.202215 | norm 0.2057 | lr 1.20e-03 | (9932.02 ms | 52788 tok/s) step 841/76294 | train loss 4.222151 | norm 0.2409 | lr 1.20e-03 | (9899.11 ms | 52963 tok/s) step 842/76294 | train loss 4.219214 | norm 0.2349 | lr 1.20e-03 | (9933.56 ms | 52779 tok/s) step 843/76294 | train loss 4.294374 | norm 0.2489 | lr 1.20e-03 | (9891.97 ms | 53001 tok/s) step 844/76294 | train loss 4.230586 | norm 0.2884 | lr 1.20e-03 | (9980.52 ms | 52531 tok/s) step 845/76294 | train loss 4.165182 | norm 1.6485 | lr 1.20e-03 | (10140.96 ms | 51700 tok/s) step 846/76294 | train loss 4.284963 | norm 0.2625 | lr 1.20e-03 | (9927.12 ms | 52814 tok/s) step 847/76294 | train loss 4.393237 | norm 0.3129 | lr 1.20e-03 | (9893.62 ms | 52993 tok/s) step 848/76294 | train loss 4.190279 | norm 0.3934 | lr 1.20e-03 | (9916.50 ms | 52870 tok/s) step 849/76294 | train loss 
4.241995 | norm 1.3628 | lr 1.20e-03 | (9896.02 ms | 52980 tok/s) step 850/76294 | train loss 4.235052 | norm 1.0002 | lr 1.20e-03 | (9905.39 ms | 52930 tok/s) step 851/76294 | train loss 4.293146 | norm 1.4231 | lr 1.20e-03 | (9956.08 ms | 52660 tok/s) step 852/76294 | train loss 4.222185 | norm 0.4062 | lr 1.20e-03 | (9898.59 ms | 52966 tok/s) step 853/76294 | train loss 4.248789 | norm 2.2404 | lr 1.20e-03 | (9906.30 ms | 52925 tok/s) step 854/76294 | train loss 4.225044 | norm 0.5013 | lr 1.20e-03 | (9948.52 ms | 52700 tok/s) step 855/76294 | train loss 4.253801 | norm 1.1100 | lr 1.20e-03 | (9896.92 ms | 52975 tok/s) step 856/76294 | train loss 4.313910 | norm 0.9182 | lr 1.20e-03 | (9899.13 ms | 52963 tok/s) step 857/76294 | train loss 4.195289 | norm 0.6692 | lr 1.20e-03 | (9896.84 ms | 52975 tok/s) step 858/76294 | train loss 4.239798 | norm 0.6232 | lr 1.20e-03 | (9918.08 ms | 52862 tok/s) step 859/76294 | train loss 4.258479 | norm 0.4741 | lr 1.20e-03 | (9897.34 ms | 52973 tok/s) step 860/76294 | train loss 4.221509 | norm 0.7199 | lr 1.20e-03 | (9898.22 ms | 52968 tok/s) step 861/76294 | train loss 4.231713 | norm 0.5425 | lr 1.20e-03 | (9898.12 ms | 52968 tok/s) step 862/76294 | train loss 4.195694 | norm 0.4458 | lr 1.20e-03 | (9951.14 ms | 52686 tok/s) step 863/76294 | train loss 4.368414 | norm 0.3942 | lr 1.20e-03 | (9898.30 ms | 52967 tok/s) step 864/76294 | train loss 4.222347 | norm 0.4888 | lr 1.20e-03 | (9922.28 ms | 52839 tok/s) step 865/76294 | train loss 4.184788 | norm 0.6073 | lr 1.20e-03 | (9895.23 ms | 52984 tok/s) step 866/76294 | train loss 4.231678 | norm 0.6182 | lr 1.20e-03 | (9900.19 ms | 52957 tok/s) step 867/76294 | train loss 4.175254 | norm 0.4138 | lr 1.20e-03 | (9899.79 ms | 52960 tok/s) step 868/76294 | train loss 4.246089 | norm 0.4018 | lr 1.20e-03 | (9902.48 ms | 52945 tok/s) step 869/76294 | train loss 4.135755 | norm 0.3294 | lr 1.20e-03 | (9948.32 ms | 52701 tok/s) step 870/76294 | train loss 4.228400 | norm 0.3409 | 
lr 1.20e-03 | (9959.10 ms | 52644 tok/s) step 871/76294 | train loss 4.137024 | norm 0.2698 | lr 1.20e-03 | (9898.24 ms | 52968 tok/s) step 872/76294 | train loss 4.208407 | norm 0.2695 | lr 1.20e-03 | (9955.03 ms | 52666 tok/s) step 873/76294 | train loss 4.195927 | norm 0.2405 | lr 1.20e-03 | (9899.37 ms | 52962 tok/s) step 874/76294 | train loss 4.170949 | norm 0.2072 | lr 1.20e-03 | (9900.93 ms | 52953 tok/s) step 875/76294 | train loss 4.127517 | norm 0.2712 | lr 1.20e-03 | (9942.49 ms | 52732 tok/s) step 876/76294 | train loss 4.168240 | norm 1.0116 | lr 1.20e-03 | (11715.36 ms | 44752 tok/s) step 877/76294 | train loss 4.140630 | norm 0.8038 | lr 1.20e-03 | (9899.08 ms | 52963 tok/s) step 878/76294 | train loss 4.215783 | norm 0.6042 | lr 1.20e-03 | (9897.01 ms | 52974 tok/s) step 879/76294 | train loss 4.271192 | norm 0.4559 | lr 1.20e-03 | (9889.36 ms | 53015 tok/s) step 880/76294 | train loss 4.147886 | norm 0.3100 | lr 1.20e-03 | (9952.16 ms | 52681 tok/s) step 881/76294 | train loss 4.171493 | norm 0.3493 | lr 1.20e-03 | (9902.92 ms | 52943 tok/s) step 882/76294 | train loss 4.191820 | norm 0.3120 | lr 1.20e-03 | (9904.19 ms | 52936 tok/s) step 883/76294 | train loss 4.161695 | norm 0.3080 | lr 1.20e-03 | (9939.67 ms | 52747 tok/s) step 884/76294 | train loss 4.129570 | norm 0.2959 | lr 1.20e-03 | (9896.91 ms | 52975 tok/s) step 885/76294 | train loss 4.168167 | norm 0.2306 | lr 1.20e-03 | (9936.93 ms | 52762 tok/s) step 886/76294 | train loss 4.220807 | norm 0.2489 | lr 1.20e-03 | (9900.37 ms | 52956 tok/s) step 887/76294 | train loss 4.157622 | norm 0.2530 | lr 1.20e-03 | (9941.95 ms | 52735 tok/s) step 888/76294 | train loss 4.196012 | norm 0.2277 | lr 1.20e-03 | (9899.74 ms | 52960 tok/s) step 889/76294 | train loss 4.168231 | norm 0.2569 | lr 1.20e-03 | (9909.26 ms | 52909 tok/s) step 890/76294 | train loss 4.107166 | norm 0.2179 | lr 1.20e-03 | (9961.95 ms | 52629 tok/s) step 891/76294 | train loss 4.162041 | norm 0.2913 | lr 1.20e-03 | (9900.74 
ms | 52954 tok/s) step 892/76294 | train loss 4.176039 | norm 0.2554 | lr 1.20e-03 | (9964.95 ms | 52613 tok/s) step 893/76294 | train loss 4.153203 | norm 0.2379 | lr 1.20e-03 | (9905.07 ms | 52931 tok/s) step 894/76294 | train loss 4.376465 | norm 0.2511 | lr 1.20e-03 | (9904.61 ms | 52934 tok/s) step 895/76294 | train loss 4.174424 | norm 0.3259 | lr 1.20e-03 | (9943.42 ms | 52727 tok/s) step 896/76294 | train loss 4.171519 | norm 0.3364 | lr 1.20e-03 | (9903.12 ms | 52942 tok/s) step 897/76294 | train loss 4.163944 | norm 0.2806 | lr 1.20e-03 | (9909.04 ms | 52910 tok/s) step 898/76294 | train loss 4.095512 | norm 0.2618 | lr 1.20e-03 | (9896.71 ms | 52976 tok/s) step 899/76294 | train loss 4.082944 | norm 0.2731 | lr 1.20e-03 | (9921.04 ms | 52846 tok/s) step 900/76294 | train loss 4.188754 | norm 0.3156 | lr 1.20e-03 | (9901.22 ms | 52952 tok/s) step 901/76294 | train loss 4.247375 | norm 0.2839 | lr 1.20e-03 | (9898.28 ms | 52968 tok/s) step 902/76294 | train loss 4.126247 | norm 0.2514 | lr 1.20e-03 | (9906.12 ms | 52926 tok/s) step 903/76294 | train loss 4.229450 | norm 0.2621 | lr 1.20e-03 | (9901.24 ms | 52952 tok/s) step 904/76294 | train loss 4.163995 | norm 0.2520 | lr 1.20e-03 | (9906.73 ms | 52922 tok/s) step 905/76294 | train loss 4.120062 | norm 0.2393 | lr 1.20e-03 | (9900.87 ms | 52954 tok/s) step 906/76294 | train loss 4.087691 | norm 0.2250 | lr 1.20e-03 | (9905.67 ms | 52928 tok/s) step 907/76294 | train loss 4.178839 | norm 0.2740 | lr 1.20e-03 | (9899.04 ms | 52964 tok/s) step 908/76294 | train loss 4.112104 | norm 0.2753 | lr 1.20e-03 | (9907.71 ms | 52917 tok/s) step 909/76294 | train loss 4.073190 | norm 0.1808 | lr 1.20e-03 | (9899.13 ms | 52963 tok/s) step 910/76294 | train loss 4.167888 | norm 0.2253 | lr 1.20e-03 | (9898.02 ms | 52969 tok/s) step 911/76294 | train loss 4.164840 | norm 0.2149 | lr 1.20e-03 | (9905.27 ms | 52930 tok/s) step 912/76294 | train loss 4.135846 | norm 0.2337 | lr 1.20e-03 | (9895.49 ms | 52982 tok/s) step 
913/76294 | train loss 4.116427 | norm 0.2397 | lr 1.20e-03 | (9982.78 ms | 52519 tok/s) step 914/76294 | train loss 4.122346 | norm 0.2433 | lr 1.20e-03 | (9941.80 ms | 52736 tok/s) step 915/76294 | train loss 4.122529 | norm 0.2461 | lr 1.20e-03 | (9910.90 ms | 52900 tok/s) step 916/76294 | train loss 4.074738 | norm 0.2639 | lr 1.20e-03 | (9978.28 ms | 52543 tok/s) step 917/76294 | train loss 4.143919 | norm 0.2497 | lr 1.20e-03 | (9901.84 ms | 52949 tok/s) step 918/76294 | train loss 4.102147 | norm 0.2122 | lr 1.20e-03 | (9902.21 ms | 52947 tok/s) step 919/76294 | train loss 4.316836 | norm 0.2705 | lr 1.20e-03 | (9898.89 ms | 52964 tok/s) step 920/76294 | train loss 4.155794 | norm 0.3503 | lr 1.20e-03 | (9895.76 ms | 52981 tok/s) step 921/76294 | train loss 4.119522 | norm 0.2866 | lr 1.20e-03 | (9994.35 ms | 52458 tok/s) step 922/76294 | train loss 4.137739 | norm 0.2962 | lr 1.20e-03 | (9901.70 ms | 52949 tok/s) step 923/76294 | train loss 4.078828 | norm 0.3882 | lr 1.20e-03 | (9970.01 ms | 52587 tok/s) step 924/76294 | train loss 4.099474 | norm 0.2385 | lr 1.20e-03 | (9933.15 ms | 52782 tok/s) step 925/76294 | train loss 4.063240 | norm 0.2077 | lr 1.20e-03 | (10055.85 ms | 52138 tok/s) step 926/76294 | train loss 4.105053 | norm 0.2045 | lr 1.20e-03 | (9906.88 ms | 52922 tok/s) step 927/76294 | train loss 4.091989 | norm 0.2250 | lr 1.20e-03 | (9943.06 ms | 52729 tok/s) step 928/76294 | train loss 4.174023 | norm 0.2232 | lr 1.20e-03 | (9931.78 ms | 52789 tok/s) step 929/76294 | train loss 4.144410 | norm 0.2050 | lr 1.20e-03 | (9944.37 ms | 52722 tok/s) step 930/76294 | train loss 4.067896 | norm 0.2063 | lr 1.20e-03 | (9902.62 ms | 52944 tok/s) step 931/76294 | train loss 4.113733 | norm 0.2306 | lr 1.20e-03 | (9938.19 ms | 52755 tok/s) step 932/76294 | train loss 4.108129 | norm 0.2795 | lr 1.20e-03 | (9900.71 ms | 52955 tok/s) step 933/76294 | train loss 4.067140 | norm 0.2487 | lr 1.20e-03 | (9909.33 ms | 52909 tok/s) step 934/76294 | train loss 
4.084460 | norm 0.2396 | lr 1.20e-03 | (9953.97 ms | 52671 tok/s) step 935/76294 | train loss 4.090564 | norm 0.2405 | lr 1.20e-03 | (9905.58 ms | 52929 tok/s) step 936/76294 | train loss 4.120667 | norm 0.2235 | lr 1.20e-03 | (9908.64 ms | 52912 tok/s) step 937/76294 | train loss 4.179167 | norm 0.2257 | lr 1.20e-03 | (9908.43 ms | 52913 tok/s) step 938/76294 | train loss 4.130530 | norm 0.2808 | lr 1.20e-03 | (10096.81 ms | 51926 tok/s) step 939/76294 | train loss 4.131753 | norm 0.2993 | lr 1.20e-03 | (9906.09 ms | 52926 tok/s) step 940/76294 | train loss 4.093685 | norm 0.2807 | lr 1.20e-03 | (9944.69 ms | 52720 tok/s) step 941/76294 | train loss 4.122718 | norm 0.2766 | lr 1.20e-03 | (9947.92 ms | 52703 tok/s) step 942/76294 | train loss 4.052570 | norm 0.2531 | lr 1.20e-03 | (9904.13 ms | 52936 tok/s) step 943/76294 | train loss 4.108890 | norm 0.2704 | lr 1.20e-03 | (9906.43 ms | 52924 tok/s) step 944/76294 | train loss 4.122762 | norm 0.2702 | lr 1.20e-03 | (9925.67 ms | 52821 tok/s) step 945/76294 | train loss 4.119603 | norm 0.2500 | lr 1.20e-03 | (9902.00 ms | 52948 tok/s) step 946/76294 | train loss 4.071261 | norm 0.5667 | lr 1.20e-03 | (9908.55 ms | 52913 tok/s) step 947/76294 | train loss 4.041117 | norm 0.2629 | lr 1.20e-03 | (9905.85 ms | 52927 tok/s) step 948/76294 | train loss 4.246560 | norm 2.3441 | lr 1.20e-03 | (9902.78 ms | 52944 tok/s) step 949/76294 | train loss 4.120604 | norm 0.3454 | lr 1.20e-03 | (9902.21 ms | 52947 tok/s) step 950/76294 | train loss 4.100206 | norm 0.6091 | lr 1.20e-03 | (9908.70 ms | 52912 tok/s) step 951/76294 | train loss 4.119717 | norm 0.4132 | lr 1.20e-03 | (9905.40 ms | 52930 tok/s) step 952/76294 | train loss 4.126725 | norm 0.9154 | lr 1.20e-03 | (9959.66 ms | 52641 tok/s) step 953/76294 | train loss 4.118961 | norm 0.4639 | lr 1.20e-03 | (9908.02 ms | 52916 tok/s) step 954/76294 | train loss 4.182833 | norm 0.3736 | lr 1.20e-03 | (10703.26 ms | 48984 tok/s) step 955/76294 | train loss 4.194678 | norm 0.4709 
| lr 1.20e-03 | (9894.47 ms | 52988 tok/s) step 956/76294 | train loss 4.084741 | norm 0.3299 | lr 1.20e-03 | (9906.67 ms | 52923 tok/s) step 957/76294 | train loss 4.093500 | norm 0.3783 | lr 1.20e-03 | (9971.40 ms | 52579 tok/s) step 958/76294 | train loss 4.121212 | norm 0.3509 | lr 1.20e-03 | (9892.63 ms | 52998 tok/s) step 959/76294 | train loss 4.197959 | norm 0.2587 | lr 1.20e-03 | (9979.26 ms | 52538 tok/s) step 960/76294 | train loss 4.114579 | norm 0.2493 | lr 1.20e-03 | (9900.01 ms | 52958 tok/s) step 961/76294 | train loss 4.187855 | norm 0.2190 | lr 1.20e-03 | (9898.63 ms | 52966 tok/s) step 962/76294 | train loss 4.159155 | norm 0.2424 | lr 1.20e-03 | (9901.04 ms | 52953 tok/s) step 963/76294 | train loss 4.091851 | norm 0.2398 | lr 1.20e-03 | (9949.34 ms | 52696 tok/s) step 964/76294 | train loss 4.118753 | norm 0.2533 | lr 1.20e-03 | (9902.32 ms | 52946 tok/s) step 965/76294 | train loss 4.084396 | norm 0.2565 | lr 1.20e-03 | (9917.57 ms | 52865 tok/s) step 966/76294 | train loss 4.058550 | norm 0.2485 | lr 1.20e-03 | (9900.07 ms | 52958 tok/s) step 967/76294 | train loss 4.064351 | norm 0.2829 | lr 1.20e-03 | (9921.67 ms | 52843 tok/s) step 968/76294 | train loss 4.099771 | norm 0.2421 | lr 1.20e-03 | (9975.46 ms | 52558 tok/s) step 969/76294 | train loss 4.113860 | norm 0.2215 | lr 1.20e-03 | (9917.77 ms | 52863 tok/s) step 970/76294 | train loss 4.064780 | norm 0.2304 | lr 1.20e-03 | (9928.73 ms | 52805 tok/s) step 971/76294 | train loss 4.117024 | norm 0.2855 | lr 1.20e-03 | (9901.06 ms | 52953 tok/s) step 972/76294 | train loss 4.103640 | norm 0.2385 | lr 1.20e-03 | (9969.50 ms | 52589 tok/s) step 973/76294 | train loss 4.041280 | norm 0.2279 | lr 1.20e-03 | (10065.16 ms | 52089 tok/s) step 974/76294 | train loss 4.104295 | norm 0.2074 | lr 1.20e-03 | (11734.28 ms | 44680 tok/s) step 975/76294 | train loss 4.034379 | norm 0.2310 | lr 1.20e-03 | (9891.06 ms | 53006 tok/s) step 976/76294 | train loss 4.128356 | norm 0.3309 | lr 1.20e-03 | 
(9895.08 ms | 52985 tok/s) step 977/76294 | train loss 4.070185 | norm 0.2279 | lr 1.20e-03 | (9899.44 ms | 52961 tok/s) step 978/76294 | train loss 4.089427 | norm 0.2107 | lr 1.20e-03 | (9901.22 ms | 52952 tok/s) step 979/76294 | train loss 4.098774 | norm 0.2119 | lr 1.20e-03 | (9900.94 ms | 52953 tok/s) step 980/76294 | train loss 4.075529 | norm 0.2193 | lr 1.20e-03 | (9903.27 ms | 52941 tok/s) step 981/76294 | train loss 4.052826 | norm 0.2054 | lr 1.20e-03 | (9903.51 ms | 52940 tok/s) step 982/76294 | train loss 4.050352 | norm 0.2380 | lr 1.20e-03 | (9987.27 ms | 52496 tok/s) step 983/76294 | train loss 4.052063 | norm 0.2613 | lr 1.20e-03 | (9981.36 ms | 52527 tok/s) step 984/76294 | train loss 4.077664 | norm 0.2631 | lr 1.20e-03 | (9907.49 ms | 52918 tok/s) step 985/76294 | train loss 4.068653 | norm 0.2769 | lr 1.20e-03 | (9911.38 ms | 52898 tok/s) step 986/76294 | train loss 4.089540 | norm 0.2840 | lr 1.20e-03 | (9952.63 ms | 52678 tok/s) step 987/76294 | train loss 4.064081 | norm 0.2546 | lr 1.20e-03 | (9900.17 ms | 52957 tok/s) step 988/76294 | train loss 4.139880 | norm 0.2322 | lr 1.20e-03 | (9905.61 ms | 52928 tok/s) step 989/76294 | train loss 4.081170 | norm 0.2366 | lr 1.20e-03 | (9902.97 ms | 52943 tok/s) step 990/76294 | train loss 4.108212 | norm 0.2114 | lr 1.20e-03 | (9929.62 ms | 52800 tok/s) step 991/76294 | train loss 4.060761 | norm 1.4845 | lr 1.20e-03 | (9902.76 ms | 52944 tok/s) step 992/76294 | train loss 3.966929 | norm 0.2122 | lr 1.20e-03 | (9909.97 ms | 52905 tok/s) step 993/76294 | train loss 4.075574 | norm 0.2443 | lr 1.20e-03 | (9900.39 ms | 52956 tok/s) step 994/76294 | train loss 4.040949 | norm 0.2148 | lr 1.20e-03 | (9902.75 ms | 52944 tok/s) step 995/76294 | train loss 4.039788 | norm 0.2179 | lr 1.20e-03 | (9900.54 ms | 52956 tok/s) step 996/76294 | train loss 4.046026 | norm 0.3182 | lr 1.20e-03 | (9903.35 ms | 52940 tok/s) step 997/76294 | train loss 4.082095 | norm 0.4106 | lr 1.20e-03 | (9899.39 ms | 52962 
tok/s) step 998/76294 | train loss 3.995374 | norm 0.3498 | lr 1.20e-03 | (9906.87 ms | 52922 tok/s) step 999/76294 | train loss 4.082946 | norm 0.6561 | lr 1.20e-03 | (9904.37 ms | 52935 tok/s) step 1000/76294 | train loss 4.085847 | norm 0.2735 | lr 1.20e-03 | (9900.40 ms | 52956 tok/s) val loss: 4.055618 saving model checkpoint to ./results/gpt2-350M-gqa/step_1000.pth step 1001/76294 | train loss 4.052648 | norm 0.2578 | lr 1.20e-03 | (9978.74 ms | 52540 tok/s) step 1002/76294 | train loss 4.095875 | norm 0.3679 | lr 1.20e-03 | (9883.05 ms | 53049 tok/s) step 1003/76294 | train loss 4.091320 | norm 0.2848 | lr 1.20e-03 | (10009.12 ms | 52381 tok/s) step 1004/76294 | train loss 4.067641 | norm 0.2168 | lr 1.20e-03 | (9881.96 ms | 53055 tok/s) step 1005/76294 | train loss 3.995989 | norm 0.1986 | lr 1.20e-03 | (9922.27 ms | 52840 tok/s) step 1006/76294 | train loss 4.082467 | norm 0.2328 | lr 1.20e-03 | (10109.90 ms | 51859 tok/s) step 1007/76294 | train loss 4.101764 | norm 0.2352 | lr 1.20e-03 | (9908.65 ms | 52912 tok/s) step 1008/76294 | train loss 4.049859 | norm 0.7407 | lr 1.20e-03 | (9965.06 ms | 52613 tok/s) step 1009/76294 | train loss 4.050117 | norm 0.2452 | lr 1.20e-03 | (9910.41 ms | 52903 tok/s) step 1010/76294 | train loss 4.097829 | norm 0.3785 | lr 1.20e-03 | (9898.26 ms | 52968 tok/s) step 1011/76294 | train loss 4.010635 | norm 0.3468 | lr 1.20e-03 | (9900.29 ms | 52957 tok/s) step 1012/76294 | train loss 4.139482 | norm 0.2984 | lr 1.20e-03 | (9902.79 ms | 52943 tok/s) step 1013/76294 | train loss 4.101614 | norm 0.3602 | lr 1.20e-03 | (9928.22 ms | 52808 tok/s) step 1014/76294 | train loss 4.032573 | norm 0.4225 | lr 1.20e-03 | (9904.17 ms | 52936 tok/s) step 1015/76294 | train loss 4.050520 | norm 0.4316 | lr 1.20e-03 | (9907.07 ms | 52921 tok/s) step 1016/76294 | train loss 4.049220 | norm 0.3665 | lr 1.20e-03 | (9969.27 ms | 52590 tok/s) step 1017/76294 | train loss 4.075304 | norm 0.3187 | lr 1.20e-03 | (9970.04 ms | 52586 tok/s) step 
1018/76294 | train loss 4.015409 | norm 0.2354 | lr 1.20e-03 | (9899.15 ms | 52963 tok/s) step 1019/76294 | train loss 4.089792 | norm 0.2146 | lr 1.20e-03 | (9919.86 ms | 52852 tok/s) step 1020/76294 | train loss 4.032980 | norm 0.4240 | lr 1.20e-03 | (9894.23 ms | 52989 tok/s) step 1021/76294 | train loss 3.996435 | norm 0.2758 | lr 1.20e-03 | (9898.13 ms | 52968 tok/s) step 1022/76294 | train loss 4.146430 | norm 0.2064 | lr 1.20e-03 | (9963.87 ms | 52619 tok/s) step 1023/76294 | train loss 4.016929 | norm 0.2449 | lr 1.20e-03 | (9903.57 ms | 52939 tok/s) step 1024/76294 | train loss 4.037046 | norm 0.2542 | lr 1.20e-03 | (9897.44 ms | 52972 tok/s) step 1025/76294 | train loss 4.135566 | norm 0.2919 | lr 1.20e-03 | (9928.41 ms | 52807 tok/s) step 1026/76294 | train loss 4.045193 | norm 0.3258 | lr 1.20e-03 | (9900.06 ms | 52958 tok/s) step 1027/76294 | train loss 4.013672 | norm 0.3063 | lr 1.20e-03 | (9904.08 ms | 52937 tok/s) step 1028/76294 | train loss 4.104789 | norm 0.3471 | lr 1.20e-03 | (9896.86 ms | 52975 tok/s) step 1029/76294 | train loss 4.047657 | norm 0.3037 | lr 1.20e-03 | (9910.07 ms | 52905 tok/s) step 1030/76294 | train loss 4.024135 | norm 0.2903 | lr 1.20e-03 | (9899.04 ms | 52964 tok/s) step 1031/76294 | train loss 4.070183 | norm 0.2215 | lr 1.20e-03 | (9967.48 ms | 52600 tok/s) step 1032/76294 | train loss 4.017454 | norm 0.2073 | lr 1.20e-03 | (9897.19 ms | 52973 tok/s) step 1033/76294 | train loss 4.098360 | norm 0.2079 | lr 1.20e-03 | (9967.97 ms | 52597 tok/s) step 1034/76294 | train loss 4.028294 | norm 0.2218 | lr 1.20e-03 | (9900.25 ms | 52957 tok/s) step 1035/76294 | train loss 4.042871 | norm 0.2134 | lr 1.20e-03 | (9902.14 ms | 52947 tok/s) step 1036/76294 | train loss 4.032766 | norm 0.1972 | lr 1.20e-03 | (9937.54 ms | 52758 tok/s) step 1037/76294 | train loss 4.034849 | norm 0.1964 | lr 1.20e-03 | (9903.35 ms | 52940 tok/s) step 1038/76294 | train loss 4.009943 | norm 0.1880 | lr 1.20e-03 | (9901.16 ms | 52952 tok/s) step 
1039/76294 | train loss 4.033597 | norm 0.1974 | lr 1.20e-03 | (9902.59 ms | 52945 tok/s) step 1040/76294 | train loss 4.257720 | norm 0.2138 | lr 1.20e-03 | (9905.78 ms | 52927 tok/s) step 1041/76294 | train loss 4.007493 | norm 0.3034 | lr 1.20e-03 | (9895.82 ms | 52981 tok/s) step 1042/76294 | train loss 4.023927 | norm 0.3083 | lr 1.20e-03 | (9906.82 ms | 52922 tok/s) step 1043/76294 | train loss 4.048942 | norm 0.2346 | lr 1.20e-03 | (9899.82 ms | 52959 tok/s) step 1044/76294 | train loss 4.036765 | norm 0.2422 | lr 1.20e-03 | (9917.68 ms | 52864 tok/s) step 1045/76294 | train loss 4.013047 | norm 0.2222 | lr 1.20e-03 | (9895.11 ms | 52985 tok/s) step 1046/76294 | train loss 4.031313 | norm 0.1920 | lr 1.20e-03 | (9900.52 ms | 52956 tok/s) step 1047/76294 | train loss 4.003044 | norm 0.2136 | lr 1.20e-03 | (9934.41 ms | 52775 tok/s) step 1048/76294 | train loss 4.001449 | norm 0.2096 | lr 1.20e-03 | (9895.72 ms | 52981 tok/s) step 1049/76294 | train loss 3.998891 | norm 0.2032 | lr 1.20e-03 | (9935.18 ms | 52771 tok/s) step 1050/76294 | train loss 4.061696 | norm 0.2165 | lr 1.20e-03 | (9898.77 ms | 52965 tok/s) step 1051/76294 | train loss 4.049259 | norm 0.2207 | lr 1.20e-03 | (9902.75 ms | 52944 tok/s) step 1052/76294 | train loss 3.983507 | norm 0.2175 | lr 1.20e-03 | (9896.38 ms | 52978 tok/s) step 1053/76294 | train loss 4.005523 | norm 0.1967 | lr 1.20e-03 | (9906.05 ms | 52926 tok/s) step 1054/76294 | train loss 4.073493 | norm 0.2028 | lr 1.20e-03 | (9898.16 ms | 52968 tok/s) step 1055/76294 | train loss 3.996950 | norm 0.2196 | lr 1.20e-03 | (9901.58 ms | 52950 tok/s) step 1056/76294 | train loss 4.054158 | norm 0.2476 | lr 1.20e-03 | (9895.45 ms | 52983 tok/s) step 1057/76294 | train loss 4.005864 | norm 0.2373 | lr 1.20e-03 | (9906.20 ms | 52925 tok/s) step 1058/76294 | train loss 3.914799 | norm 0.2107 | lr 1.20e-03 | (9910.78 ms | 52901 tok/s) step 1059/76294 | train loss 3.997279 | norm 0.1883 | lr 1.20e-03 | (9903.23 ms | 52941 tok/s) step 
1060/76294 | train loss 3.972054 | norm 0.2020 | lr 1.20e-03 | (9899.41 ms | 52962 tok/s) step 1061/76294 | train loss 3.980600 | norm 0.2002 | lr 1.20e-03 | (9943.14 ms | 52729 tok/s) step 1062/76294 | train loss 4.030481 | norm 0.2281 | lr 1.20e-03 | (9896.56 ms | 52977 tok/s) step 1063/76294 | train loss 4.029502 | norm 0.2213 | lr 1.20e-03 | (9903.92 ms | 52937 tok/s) step 1064/76294 | train loss 3.982551 | norm 0.2130 | lr 1.20e-03 | (9895.74 ms | 52981 tok/s) step 1065/76294 | train loss 4.000574 | norm 0.2288 | lr 1.20e-03 | (9905.78 ms | 52927 tok/s) step 1066/76294 | train loss 3.997035 | norm 0.2384 | lr 1.20e-03 | (9894.99 ms | 52985 tok/s) step 1067/76294 | train loss 3.970244 | norm 0.2625 | lr 1.20e-03 | (9900.37 ms | 52956 tok/s) step 1068/76294 | train loss 4.017632 | norm 0.2641 | lr 1.20e-03 | (9902.20 ms | 52947 tok/s) step 1069/76294 | train loss 3.975777 | norm 0.2382 | lr 1.20e-03 | (9902.43 ms | 52945 tok/s) step 1070/76294 | train loss 3.976436 | norm 0.2275 | lr 1.20e-03 | (9900.00 ms | 52958 tok/s) step 1071/76294 | train loss 4.008833 | norm 0.2192 | lr 1.20e-03 | (9955.23 ms | 52665 tok/s) step 1072/76294 | train loss 4.012187 | norm 0.2220 | lr 1.20e-03 | (10932.49 ms | 47957 tok/s) step 1073/76294 | train loss 4.031314 | norm 0.2277 | lr 1.20e-03 | (9964.27 ms | 52617 tok/s) step 1074/76294 | train loss 3.972491 | norm 0.2476 | lr 1.20e-03 | (9891.87 ms | 53002 tok/s) step 1075/76294 | train loss 3.946759 | norm 0.2474 | lr 1.20e-03 | (9889.76 ms | 53013 tok/s) step 1076/76294 | train loss 4.011775 | norm 0.2010 | lr 1.20e-03 | (9962.84 ms | 52624 tok/s) step 1077/76294 | train loss 3.991869 | norm 0.1864 | lr 1.20e-03 | (9900.43 ms | 52956 tok/s) step 1078/76294 | train loss 3.974617 | norm 0.1941 | lr 1.20e-03 | (9896.80 ms | 52976 tok/s) step 1079/76294 | train loss 4.029789 | norm 0.1690 | lr 1.20e-03 | (9975.99 ms | 52555 tok/s) step 1080/76294 | train loss 4.019081 | norm 0.1872 | lr 1.20e-03 | (9894.70 ms | 52987 tok/s) step 
1081/76294 | train loss 3.972981 | norm 0.2315 | lr 1.20e-03 | (9905.42 ms | 52929 tok/s) step 1082/76294 | train loss 3.988390 | norm 0.2379 | lr 1.20e-03 | (9895.66 ms | 52982 tok/s) step 1083/76294 | train loss 3.999988 | norm 0.2231 | lr 1.20e-03 | (9902.64 ms | 52944 tok/s) step 1084/76294 | train loss 3.965547 | norm 0.1998 | lr 1.20e-03 | (9930.94 ms | 52793 tok/s) step 1085/76294 | train loss 3.959779 | norm 0.1984 | lr 1.20e-03 | (9903.72 ms | 52938 tok/s) step 1086/76294 | train loss 3.989268 | norm 0.2034 | lr 1.20e-03 | (9939.73 ms | 52747 tok/s) step 1087/76294 | train loss 3.991441 | norm 0.1997 | lr 1.20e-03 | (9897.89 ms | 52970 tok/s) step 1088/76294 | train loss 3.977077 | norm 0.2356 | lr 1.20e-03 | (9891.09 ms | 53006 tok/s) step 1089/76294 | train loss 4.017477 | norm 0.2640 | lr 1.20e-03 | (9902.09 ms | 52947 tok/s) step 1090/76294 | train loss 3.977698 | norm 0.2813 | lr 1.20e-03 | (9926.93 ms | 52815 tok/s) step 1091/76294 | train loss 3.935295 | norm 0.2707 | lr 1.20e-03 | (9898.87 ms | 52964 tok/s) step 1092/76294 | train loss 3.979250 | norm 0.2506 | lr 1.20e-03 | (9902.30 ms | 52946 tok/s) step 1093/76294 | train loss 4.002258 | norm 0.2545 | lr 1.20e-03 | (9899.03 ms | 52964 tok/s) step 1094/76294 | train loss 3.930640 | norm 0.2410 | lr 1.20e-03 | (9899.14 ms | 52963 tok/s) step 1095/76294 | train loss 4.015383 | norm 0.2342 | lr 1.20e-03 | (9894.98 ms | 52985 tok/s) step 1096/76294 | train loss 3.974206 | norm 0.2139 | lr 1.20e-03 | (9894.49 ms | 52988 tok/s) step 1097/76294 | train loss 3.971690 | norm 0.1882 | lr 1.20e-03 | (9932.97 ms | 52783 tok/s) step 1098/76294 | train loss 3.990098 | norm 0.1744 | lr 1.20e-03 | (9892.54 ms | 52998 tok/s) step 1099/76294 | train loss 3.915682 | norm 0.1937 | lr 1.20e-03 | (9898.50 ms | 52966 tok/s) step 1100/76294 | train loss 3.966906 | norm 0.2241 | lr 1.20e-03 | (9890.49 ms | 53009 tok/s) step 1101/76294 | train loss 4.042718 | norm 0.2616 | lr 1.20e-03 | (9901.08 ms | 52953 tok/s) step 
1102/76294 | train loss 4.026160 | norm 0.2587 | lr 1.20e-03 | (9890.47 ms | 53009 tok/s) step 1103/76294 | train loss 4.002672 | norm 0.2119 | lr 1.20e-03 | (9898.75 ms | 52965 tok/s) step 1104/76294 | train loss 3.977163 | norm 0.2194 | lr 1.20e-03 | (9893.70 ms | 52992 tok/s) step 1105/76294 | train loss 3.994586 | norm 0.2660 | lr 1.20e-03 | (9899.42 ms | 52961 tok/s) step 1106/76294 | train loss 4.023111 | norm 0.2372 | lr 1.20e-03 | (9891.94 ms | 53002 tok/s) step 1107/76294 | train loss 3.979019 | norm 0.2190 | lr 1.20e-03 | (9901.15 ms | 52952 tok/s) step 1108/76294 | train loss 4.097618 | norm 0.2350 | lr 1.20e-03 | (9893.53 ms | 52993 tok/s) step 1109/76294 | train loss 4.004039 | norm 0.2276 | lr 1.20e-03 | (9930.09 ms | 52798 tok/s) step 1110/76294 | train loss 4.060171 | norm 0.1975 | lr 1.20e-03 | (9889.49 ms | 53015 tok/s) step 1111/76294 | train loss 3.969128 | norm 0.2315 | lr 1.20e-03 | (9901.20 ms | 52952 tok/s) step 1112/76294 | train loss 3.975182 | norm 0.2329 | lr 1.20e-03 | (9886.46 ms | 53031 tok/s) step 1113/76294 | train loss 3.955173 | norm 0.2580 | lr 1.20e-03 | (9898.01 ms | 52969 tok/s) step 1114/76294 | train loss 3.917829 | norm 0.2865 | lr 1.20e-03 | (9890.50 ms | 53009 tok/s) step 1115/76294 | train loss 3.899243 | norm 0.2347 | lr 1.20e-03 | (9936.88 ms | 52762 tok/s) step 1116/76294 | train loss 3.919793 | norm 0.1964 | lr 1.20e-03 | (9886.97 ms | 53028 tok/s) step 1117/76294 | train loss 4.013430 | norm 0.2091 | lr 1.20e-03 | (9907.79 ms | 52917 tok/s) step 1118/76294 | train loss 4.005270 | norm 0.1809 | lr 1.20e-03 | (9895.56 ms | 52982 tok/s) step 1119/76294 | train loss 3.984442 | norm 0.1791 | lr 1.20e-03 | (9927.81 ms | 52810 tok/s) step 1120/76294 | train loss 4.075595 | norm 0.2470 | lr 1.20e-03 | (9889.02 ms | 53017 tok/s) step 1121/76294 | train loss 4.008470 | norm 0.3155 | lr 1.20e-03 | (9934.42 ms | 52775 tok/s) step 1122/76294 | train loss 3.988784 | norm 0.3065 | lr 1.20e-03 | (9888.08 ms | 53022 tok/s) step 
1123/76294 | train loss 3.958249 | norm 0.2629 | lr 1.20e-03 | (9893.59 ms | 52993 tok/s) step 1124/76294 | train loss 4.057404 | norm 0.2465 | lr 1.20e-03 | (9883.61 ms | 53046 tok/s) step 1125/76294 | train loss 4.005196 | norm 0.2760 | lr 1.20e-03 | (9913.33 ms | 52887 tok/s) step 1126/76294 | train loss 3.924533 | norm 0.2658 | lr 1.20e-03 | (9892.72 ms | 52997 tok/s) step 1127/76294 | train loss 3.937568 | norm 0.2229 | lr 1.20e-03 | (9889.12 ms | 53017 tok/s) step 1128/76294 | train loss 3.945069 | norm 0.3414 | lr 1.20e-03 | (9902.28 ms | 52946 tok/s) step 1129/76294 | train loss 4.020731 | norm 0.2351 | lr 1.20e-03 | (9926.99 ms | 52814 tok/s) step 1130/76294 | train loss 3.980673 | norm 0.2533 | lr 1.20e-03 | (9888.28 ms | 53021 tok/s) step 1131/76294 | train loss 3.950275 | norm 0.2581 | lr 1.20e-03 | (9897.70 ms | 52971 tok/s) step 1132/76294 | train loss 4.009766 | norm 0.2646 | lr 1.20e-03 | (9884.85 ms | 53040 tok/s) step 1133/76294 | train loss 3.962410 | norm 0.2831 | lr 1.20e-03 | (9930.42 ms | 52796 tok/s) step 1134/76294 | train loss 3.957402 | norm 0.3071 | lr 1.20e-03 | (9887.37 ms | 53026 tok/s) step 1135/76294 | train loss 3.976670 | norm 0.2579 | lr 1.20e-03 | (9906.30 ms | 52925 tok/s) step 1136/76294 | train loss 3.995676 | norm 0.2594 | lr 1.20e-03 | (9950.86 ms | 52688 tok/s) step 1137/76294 | train loss 3.997212 | norm 0.2457 | lr 1.20e-03 | (9895.27 ms | 52984 tok/s) step 1138/76294 | train loss 3.960927 | norm 0.2001 | lr 1.20e-03 | (9890.15 ms | 53011 tok/s) step 1139/76294 | train loss 3.998787 | norm 0.2341 | lr 1.20e-03 | (9927.64 ms | 52811 tok/s) step 1140/76294 | train loss 3.935229 | norm 0.2566 | lr 1.20e-03 | (9885.16 ms | 53038 tok/s) step 1141/76294 | train loss 4.043288 | norm 0.2600 | lr 1.20e-03 | (9894.92 ms | 52986 tok/s) step 1142/76294 | train loss 3.968359 | norm 0.2112 | lr 1.20e-03 | (9885.87 ms | 53034 tok/s) step 1143/76294 | train loss 3.961073 | norm 0.2069 | lr 1.20e-03 | (9892.27 ms | 53000 tok/s) step 
1144/76294 | train loss 3.951555 | norm 0.2036 | lr 1.20e-03 | (9885.83 ms | 53034 tok/s) step 1145/76294 | train loss 4.002678 | norm 0.2442 | lr 1.20e-03 | (10154.73 ms | 51630 tok/s) step 1146/76294 | train loss 3.936735 | norm 0.2288 | lr 1.20e-03 | (9877.35 ms | 53080 tok/s) step 1147/76294 | train loss 3.941569 | norm 0.1914 | lr 1.20e-03 | (9919.86 ms | 52852 tok/s) step 1148/76294 | train loss 3.942522 | norm 0.2089 | lr 1.20e-03 | (9957.02 ms | 52655 tok/s) step 1149/76294 | train loss 3.918393 | norm 0.1940 | lr 1.20e-03 | (9896.71 ms | 52976 tok/s) step 1150/76294 | train loss 3.933310 | norm 0.2663 | lr 1.20e-03 | (9886.50 ms | 53031 tok/s) step 1151/76294 | train loss 4.089769 | norm 0.2023 | lr 1.20e-03 | (9926.47 ms | 52817 tok/s) step 1152/76294 | train loss 3.951867 | norm 0.2862 | lr 1.20e-03 | (9888.34 ms | 53021 tok/s) step 1153/76294 | train loss 3.960517 | norm 0.4791 | lr 1.20e-03 | (9904.79 ms | 52933 tok/s) step 1154/76294 | train loss 3.969775 | norm 0.4534 | lr 1.20e-03 | (9885.50 ms | 53036 tok/s) step 1155/76294 | train loss 3.928308 | norm 0.3796 | lr 1.20e-03 | (9890.45 ms | 53010 tok/s) step 1156/76294 | train loss 3.929612 | norm 0.3154 | lr 1.20e-03 | (9889.12 ms | 53017 tok/s) step 1157/76294 | train loss 3.967803 | norm 0.2233 | lr 1.20e-03 | (9897.09 ms | 52974 tok/s) step 1158/76294 | train loss 3.941960 | norm 0.2375 | lr 1.20e-03 | (9889.90 ms | 53012 tok/s) step 1159/76294 | train loss 4.001797 | norm 0.2386 | lr 1.20e-03 | (9887.13 ms | 53027 tok/s) step 1160/76294 | train loss 3.967080 | norm 0.2681 | lr 1.20e-03 | (9882.08 ms | 53054 tok/s) step 1161/76294 | train loss 3.900883 | norm 0.2564 | lr 1.20e-03 | (9927.89 ms | 52810 tok/s) step 1162/76294 | train loss 3.976080 | norm 0.2548 | lr 1.20e-03 | (9881.87 ms | 53056 tok/s) step 1163/76294 | train loss 3.976785 | norm 0.2030 | lr 1.20e-03 | (9915.38 ms | 52876 tok/s) step 1164/76294 | train loss 3.912715 | norm 0.1890 | lr 1.20e-03 | (9882.85 ms | 53050 tok/s) step 
1165/76294 | train loss 3.921818 | norm 0.1778 | lr 1.20e-03 | (9891.21 ms | 53005 tok/s) step 1166/76294 | train loss 3.994471 | norm 0.1957 | lr 1.20e-03 | (9886.31 ms | 53032 tok/s) step 1167/76294 | train loss 3.899913 | norm 0.1826 | lr 1.20e-03 | (9894.94 ms | 52985 tok/s) step 1168/76294 | train loss 3.954595 | norm 0.1834 | lr 1.20e-03 | (9887.11 ms | 53027 tok/s) step 1169/76294 | train loss 3.949440 | norm 0.1874 | lr 1.20e-03 | (11032.22 ms | 47523 tok/s) step 1170/76294 | train loss 3.964851 | norm 0.1900 | lr 1.20e-03 | (9941.81 ms | 52736 tok/s) step 1171/76294 | train loss 3.955096 | norm 0.1959 | lr 1.20e-03 | (9881.89 ms | 53055 tok/s) step 1172/76294 | train loss 3.899327 | norm 0.1727 | lr 1.20e-03 | (10203.18 ms | 51385 tok/s) step 1173/76294 | train loss 3.917436 | norm 0.1914 | lr 1.20e-03 | (9878.88 ms | 53072 tok/s) step 1174/76294 | train loss 3.893622 | norm 0.2151 | lr 1.20e-03 | (9887.35 ms | 53026 tok/s) step 1175/76294 | train loss 3.917904 | norm 0.1962 | lr 1.20e-03 | (9881.07 ms | 53060 tok/s) step 1176/76294 | train loss 3.958697 | norm 0.1914 | lr 1.20e-03 | (14664.98 ms | 35751 tok/s) step 1177/76294 | train loss 3.906049 | norm 0.2125 | lr 1.20e-03 | (9851.21 ms | 53221 tok/s) step 1178/76294 | train loss 3.901184 | norm 0.1804 | lr 1.20e-03 | (9854.49 ms | 53203 tok/s) step 1179/76294 | train loss 3.940475 | norm 0.1932 | lr 1.20e-03 | (9868.46 ms | 53128 tok/s) step 1180/76294 | train loss 3.889697 | norm 0.2043 | lr 1.20e-03 | (9868.37 ms | 53128 tok/s) step 1181/76294 | train loss 3.916226 | norm 0.2133 | lr 1.20e-03 | (9896.11 ms | 52979 tok/s) step 1182/76294 | train loss 3.879707 | norm 0.2112 | lr 1.20e-03 | (9867.30 ms | 53134 tok/s) step 1183/76294 | train loss 3.855659 | norm 0.1905 | lr 1.20e-03 | (9869.18 ms | 53124 tok/s) step 1184/76294 | train loss 3.969837 | norm 0.2461 | lr 1.20e-03 | (9883.14 ms | 53049 tok/s) step 1185/76294 | train loss 3.888341 | norm 0.2531 | lr 1.20e-03 | (9918.63 ms | 52859 tok/s) step 
1186/76294 | train loss 3.920526 | norm 0.2593 | lr 1.20e-03 | (9874.44 ms | 53095 tok/s) step 1187/76294 | train loss 3.934071 | norm 0.2879 | lr 1.20e-03 | (9954.16 ms | 52670 tok/s) step 1188/76294 | train loss 4.124116 | norm 0.2457 | lr 1.20e-03 | (9874.46 ms | 53095 tok/s) step 1189/76294 | train loss 3.942223 | norm 0.2126 | lr 1.20e-03 | (9906.55 ms | 52923 tok/s) step 1190/76294 | train loss 3.936412 | norm 0.2212 | lr 1.20e-03 | (9874.21 ms | 53097 tok/s) step 1191/76294 | train loss 3.871333 | norm 0.1906 | lr 1.20e-03 | (9964.68 ms | 52615 tok/s) step 1192/76294 | train loss 3.937103 | norm 0.2004 | lr 1.20e-03 | (9874.65 ms | 53094 tok/s) step 1193/76294 | train loss 3.943709 | norm 0.1846 | lr 1.20e-03 | (9877.63 ms | 53078 tok/s) step 1194/76294 | train loss 3.929234 | norm 0.2025 | lr 1.20e-03 | (9879.84 ms | 53066 tok/s) step 1195/76294 | train loss 3.907388 | norm 0.1761 | lr 1.20e-03 | (9922.78 ms | 52837 tok/s) step 1196/76294 | train loss 3.961318 | norm 0.1907 | lr 1.20e-03 | (9876.28 ms | 53086 tok/s) step 1197/76294 | train loss 3.926143 | norm 0.1609 | lr 1.20e-03 | (9885.57 ms | 53036 tok/s) step 1198/76294 | train loss 3.915959 | norm 0.1722 | lr 1.20e-03 | (9879.63 ms | 53068 tok/s) step 1199/76294 | train loss 3.934334 | norm 0.1974 | lr 1.20e-03 | (9905.23 ms | 52930 tok/s) step 1200/76294 | train loss 4.038969 | norm 0.2016 | lr 1.20e-03 | (9936.70 ms | 52763 tok/s) step 1201/76294 | train loss 3.953527 | norm 0.1842 | lr 1.20e-03 | (9885.96 ms | 53034 tok/s) step 1202/76294 | train loss 4.007553 | norm 0.1926 | lr 1.20e-03 | (9875.51 ms | 53090 tok/s) step 1203/76294 | train loss 3.899108 | norm 0.1960 | lr 1.20e-03 | (9880.86 ms | 53061 tok/s) step 1204/76294 | train loss 3.880654 | norm 0.2089 | lr 1.20e-03 | (9886.56 ms | 53030 tok/s) step 1205/76294 | train loss 3.982249 | norm 0.2466 | lr 1.20e-03 | (9883.74 ms | 53046 tok/s) step 1206/76294 | train loss 3.925130 | norm 0.3255 | lr 1.20e-03 | (9918.60 ms | 52859 tok/s) step 
1207/76294 | train loss 3.981566 | norm 0.3179 | lr 1.20e-03 | (9880.64 ms | 53062 tok/s) step 1208/76294 | train loss 3.957453 | norm 0.3030 | lr 1.20e-03 | (9902.65 ms | 52944 tok/s) step 1209/76294 | train loss 3.866799 | norm 0.2301 | lr 1.20e-03 | (9880.82 ms | 53061 tok/s) step 1210/76294 | train loss 3.868709 | norm 0.2327 | lr 1.20e-03 | (9883.29 ms | 53048 tok/s) step 1211/76294 | train loss 3.903366 | norm 0.2079 | lr 1.20e-03 | (9880.03 ms | 53065 tok/s) step 1212/76294 | train loss 3.835475 | norm 0.1583 | lr 1.20e-03 | (9905.77 ms | 52928 tok/s) step 1213/76294 | train loss 3.913210 | norm 0.1774 | lr 1.20e-03 | (9883.94 ms | 53044 tok/s) step 1214/76294 | train loss 3.851375 | norm 0.2047 | lr 1.20e-03 | (9880.75 ms | 53062 tok/s) step 1215/76294 | train loss 3.852511 | norm 0.1717 | lr 1.20e-03 | (9878.40 ms | 53074 tok/s) step 1216/76294 | train loss 3.905211 | norm 0.1752 | lr 1.20e-03 | (9883.79 ms | 53045 tok/s) step 1217/76294 | train loss 3.888628 | norm 0.1776 | lr 1.20e-03 | (9884.74 ms | 53040 tok/s) step 1218/76294 | train loss 3.856257 | norm 0.1737 | lr 1.20e-03 | (10004.10 ms | 52407 tok/s) step 1219/76294 | train loss 3.934718 | norm 0.2076 | lr 1.20e-03 | (9885.54 ms | 53036 tok/s) step 1220/76294 | train loss 4.236256 | norm 0.2692 | lr 1.20e-03 | (9874.43 ms | 53096 tok/s) step 1221/76294 | train loss 3.843809 | norm 0.2796 | lr 1.20e-03 | (9892.19 ms | 53000 tok/s) step 1222/76294 | train loss 3.920976 | norm 0.2857 | lr 1.20e-03 | (9880.89 ms | 53061 tok/s) step 1223/76294 | train loss 3.879016 | norm 0.2699 | lr 1.20e-03 | (9877.21 ms | 53081 tok/s) step 1224/76294 | train loss 3.868760 | norm 0.2261 | lr 1.20e-03 | (9920.89 ms | 52847 tok/s) step 1225/76294 | train loss 3.972558 | norm 0.2329 | lr 1.20e-03 | (9942.63 ms | 52731 tok/s) step 1226/76294 | train loss 3.838001 | norm 0.2418 | lr 1.20e-03 | (9879.58 ms | 53068 tok/s) step 1227/76294 | train loss 3.926524 | norm 0.2197 | lr 1.20e-03 | (9880.33 ms | 53064 tok/s) step 
1228/76294 | train loss 3.899535 | norm 0.2152 | lr 1.20e-03 | (9887.56 ms | 53025 tok/s) step 1229/76294 | train loss 3.923322 | norm 0.1873 | lr 1.20e-03 | (9922.42 ms | 52839 tok/s) step 1230/76294 | train loss 3.937703 | norm 0.2123 | lr 1.20e-03 | (9876.18 ms | 53086 tok/s) step 1231/76294 | train loss 3.906904 | norm 0.1725 | lr 1.20e-03 | (9885.41 ms | 53037 tok/s) step 1232/76294 | train loss 3.886337 | norm 0.1832 | lr 1.20e-03 | (9876.35 ms | 53085 tok/s) step 1233/76294 | train loss 3.919107 | norm 0.2005 | lr 1.20e-03 | (9907.59 ms | 52918 tok/s) step 1234/76294 | train loss 3.881389 | norm 0.1920 | lr 1.20e-03 | (9912.69 ms | 52891 tok/s) step 1235/76294 | train loss 3.816264 | norm 0.1918 | lr 1.20e-03 | (9911.76 ms | 52896 tok/s) step 1236/76294 | train loss 3.923410 | norm 0.1941 | lr 1.20e-03 | (9870.88 ms | 53115 tok/s) step 1237/76294 | train loss 3.915523 | norm 0.1979 | lr 1.20e-03 | (9878.01 ms | 53076 tok/s) step 1238/76294 | train loss 3.863029 | norm 1.2614 | lr 1.20e-03 | (9898.42 ms | 52967 tok/s) step 1239/76294 | train loss 3.927936 | norm 0.1837 | lr 1.20e-03 | (9881.92 ms | 53055 tok/s) step 1240/76294 | train loss 3.908605 | norm 0.1943 | lr 1.20e-03 | (9879.69 ms | 53067 tok/s) step 1241/76294 | train loss 3.907056 | norm 0.8680 | lr 1.20e-03 | (9877.65 ms | 53078 tok/s) step 1242/76294 | train loss 3.871254 | norm 0.2339 | lr 1.20e-03 | (9916.08 ms | 52872 tok/s) step 1243/76294 | train loss 3.901205 | norm 1.6047 | lr 1.20e-03 | (9881.38 ms | 53058 tok/s) step 1244/76294 | train loss 4.002335 | norm 0.2813 | lr 1.20e-03 | (9915.64 ms | 52875 tok/s) step 1245/76294 | train loss 3.948601 | norm 2.1120 | lr 1.20e-03 | (9877.53 ms | 53079 tok/s) step 1246/76294 | train loss 3.992801 | norm 0.5670 | lr 1.20e-03 | (9890.62 ms | 53009 tok/s) step 1247/76294 | train loss 4.038055 | norm 0.8852 | lr 1.20e-03 | (9880.83 ms | 53061 tok/s) step 1248/76294 | train loss 3.960085 | norm 0.8221 | lr 1.20e-03 | (9878.78 ms | 53072 tok/s) step 
1249/76294 | train loss 3.994176 | norm 0.4622 | lr 1.20e-03 | (9882.67 ms | 53051 tok/s) step 1250/76294 | train loss 4.015664 | norm 0.5263 | lr 1.20e-03 | (9878.89 ms | 53072 tok/s)
val loss: 3.965333
saving model checkpoint to ./results/gpt2-350M-gqa/step_1250.pth
step 1251/76294 | train loss 3.930990 | norm 1.1587 | lr 1.20e-03 | (9952.61 ms | 52678 tok/s) step 1252/76294 | train loss 3.889324 | norm 1.1754 | lr 1.20e-03 | (9860.77 ms | 53169 tok/s) step 1253/76294 | train loss 4.004651 | norm 0.3642 | lr 1.20e-03 | (9878.83 ms | 53072 tok/s) step 1254/76294 | train loss 3.893843 | norm 0.3925 | lr 1.20e-03 | (9909.61 ms | 52907 tok/s) step 1255/76294 | train loss 3.900715 | norm 0.2412 | lr 1.20e-03 | (9873.99 ms | 53098 tok/s) step 1256/76294 | train loss 3.945537 | norm 0.4930 | lr 1.20e-03 | (9869.47 ms | 53122 tok/s) step 1257/76294 | train loss 3.925431 | norm 0.2448 | lr 1.20e-03 | (9879.33 ms | 53069 tok/s) step 1258/76294 | train loss 3.861231 | norm 0.2400 | lr 1.20e-03 | (9876.63 ms | 53084 tok/s) step 1259/76294 | train loss 3.999739 | norm 0.3183 | lr 1.20e-03 | (9880.79 ms | 53061 tok/s) step 1260/76294 | train loss 3.946161 | norm 0.2118 | lr 1.20e-03 | (9907.92 ms | 52916 tok/s) step 1261/76294 | train loss 3.916492 | norm 0.2428 | lr 1.20e-03 | (9868.75 ms | 53126 tok/s) step 1262/76294 | train loss 3.913091 | norm 0.3489 | lr 1.20e-03 | (9889.99 ms | 53012 tok/s) step 1263/76294 | train loss 3.931676 | norm 0.2107 | lr 1.20e-03 | (9883.14 ms | 53049 tok/s) step 1264/76294 | train loss 3.901568 | norm 0.2183 | lr 1.20e-03 | (9919.60 ms | 52854 tok/s) step 1265/76294 | train loss 3.908413 | norm 0.1901 | lr 1.20e-03 | (9886.63 ms | 53030 tok/s) step 1266/76294 | train loss 3.928078 | norm 0.1947 | lr 1.20e-03 | (9873.24 ms | 53102 tok/s) step 1267/76294 | train loss 3.923220 | norm 0.1857 | lr 1.20e-03 | (11248.94 ms | 46608 tok/s) step 1268/76294 | train loss 3.946515 | norm 0.2108 | lr 1.20e-03 | (9881.41 ms | 53058 tok/s) step 1269/76294 | 
train loss 3.965899 | norm 0.2126 | lr 1.20e-03 | (9894.06 ms | 52990 tok/s) step 1270/76294 | train loss 3.912356 | norm 0.1827 | lr 1.20e-03 | (9867.72 ms | 53132 tok/s) step 1271/76294 | train loss 3.852448 | norm 0.2160 | lr 1.20e-03 | (9871.99 ms | 53109 tok/s) step 1272/76294 | train loss 3.855475 | norm 0.1954 | lr 1.20e-03 | (9873.64 ms | 53100 tok/s) step 1273/76294 | train loss 3.956318 | norm 0.1999 | lr 1.20e-03 | (9876.23 ms | 53086 tok/s) step 1274/76294 | train loss 3.906495 | norm 0.1838 | lr 1.20e-03 | (9882.41 ms | 53053 tok/s) step 1275/76294 | train loss 3.895890 | norm 0.1927 | lr 1.20e-03 | (9881.21 ms | 53059 tok/s) step 1276/76294 | train loss 3.873968 | norm 0.1895 | lr 1.20e-03 | (9886.57 ms | 53030 tok/s) step 1277/76294 | train loss 3.945240 | norm 0.1672 | lr 1.20e-03 | (9888.63 ms | 53019 tok/s) step 1278/76294 | train loss 3.859500 | norm 0.1789 | lr 1.20e-03 | (9912.36 ms | 52892 tok/s) step 1279/76294 | train loss 3.920608 | norm 0.1777 | lr 1.20e-03 | (9902.44 ms | 52945 tok/s) step 1280/76294 | train loss 3.946091 | norm 0.1689 | lr 1.20e-03 | (9872.37 ms | 53107 tok/s) step 1281/76294 | train loss 3.896293 | norm 0.1777 | lr 1.20e-03 | (9878.57 ms | 53073 tok/s) step 1282/76294 | train loss 3.835609 | norm 0.1786 | lr 1.20e-03 | (9917.13 ms | 52867 tok/s) step 1283/76294 | train loss 3.875433 | norm 0.1971 | lr 1.20e-03 | (9879.01 ms | 53071 tok/s) step 1284/76294 | train loss 3.880528 | norm 0.1766 | lr 1.20e-03 | (9883.57 ms | 53046 tok/s) step 1285/76294 | train loss 3.884336 | norm 0.1990 | lr 1.20e-03 | (9884.09 ms | 53044 tok/s) step 1286/76294 | train loss 3.886524 | norm 0.2274 | lr 1.20e-03 | (9886.52 ms | 53031 tok/s) step 1287/76294 | train loss 3.891240 | norm 0.2087 | lr 1.20e-03 | (9882.37 ms | 53053 tok/s) step 1288/76294 | train loss 3.889155 | norm 0.1553 | lr 1.20e-03 | (9878.15 ms | 53076 tok/s) step 1289/76294 | train loss 3.874192 | norm 0.1781 | lr 1.20e-03 | (9888.08 ms | 53022 tok/s) step 1290/76294 | 
train loss 3.822884 | norm 0.1743 | lr 1.20e-03 | (9921.91 ms | 52841 tok/s) step 1291/76294 | train loss 3.872510 | norm 0.1768 | lr 1.20e-03 | (9882.64 ms | 53051 tok/s) step 1292/76294 | train loss 3.822113 | norm 0.2309 | lr 1.20e-03 | (9896.21 ms | 52979 tok/s) step 1293/76294 | train loss 3.818552 | norm 0.3201 | lr 1.20e-03 | (9884.54 ms | 53041 tok/s) step 1294/76294 | train loss 3.805650 | norm 0.3435 | lr 1.20e-03 | (9889.92 ms | 53012 tok/s) step 1295/76294 | train loss 3.912106 | norm 0.2908 | lr 1.20e-03 | (9883.42 ms | 53047 tok/s) step 1296/76294 | train loss 3.858870 | norm 0.3338 | lr 1.20e-03 | (9889.36 ms | 53015 tok/s) step 1297/76294 | train loss 4.018860 | norm 0.3904 | lr 1.20e-03 | (9881.76 ms | 53056 tok/s) step 1298/76294 | train loss 3.858893 | norm 0.2552 | lr 1.20e-03 | (9887.51 ms | 53025 tok/s) step 1299/76294 | train loss 3.948569 | norm 0.2730 | lr 1.20e-03 | (9885.28 ms | 53037 tok/s) step 1300/76294 | train loss 3.875208 | norm 0.2816 | lr 1.20e-03 | (9886.43 ms | 53031 tok/s) step 1301/76294 | train loss 3.865385 | norm 0.2207 | lr 1.20e-03 | (9884.48 ms | 53042 tok/s) step 1302/76294 | train loss 3.880206 | norm 0.2264 | lr 1.20e-03 | (9886.95 ms | 53028 tok/s) step 1303/76294 | train loss 3.924400 | norm 0.2047 | lr 1.20e-03 | (9882.70 ms | 53051 tok/s) step 1304/76294 | train loss 3.818824 | norm 0.1959 | lr 1.20e-03 | (9967.70 ms | 52599 tok/s) step 1305/76294 | train loss 3.883819 | norm 0.1833 | lr 1.20e-03 | (9887.56 ms | 53025 tok/s) step 1306/76294 | train loss 3.825084 | norm 0.1595 | lr 1.20e-03 | (9891.39 ms | 53004 tok/s) step 1307/76294 | train loss 3.865241 | norm 0.1659 | lr 1.20e-03 | (9891.82 ms | 53002 tok/s) step 1308/76294 | train loss 3.847109 | norm 0.1696 | lr 1.20e-03 | (9879.96 ms | 53066 tok/s) step 1309/76294 | train loss 3.813351 | norm 0.1694 | lr 1.20e-03 | (9918.41 ms | 52860 tok/s) step 1310/76294 | train loss 3.961161 | norm 0.1904 | lr 1.20e-03 | (9937.78 ms | 52757 tok/s) step 1311/76294 | 
train loss 3.893307 | norm 0.1859 | lr 1.20e-03 | (9883.24 ms | 53048 tok/s) step 1312/76294 | train loss 3.842531 | norm 0.2004 | lr 1.20e-03 | (9890.41 ms | 53010 tok/s) step 1313/76294 | train loss 3.946657 | norm 0.1774 | lr 1.20e-03 | (9934.96 ms | 52772 tok/s) step 1314/76294 | train loss 3.884065 | norm 0.2003 | lr 1.20e-03 | (9949.95 ms | 52693 tok/s) step 1315/76294 | train loss 3.860893 | norm 0.1857 | lr 1.20e-03 | (9886.20 ms | 53032 tok/s) step 1316/76294 | train loss 3.878424 | norm 0.1792 | lr 1.20e-03 | (9922.04 ms | 52841 tok/s) step 1317/76294 | train loss 3.878184 | norm 0.1835 | lr 1.20e-03 | (9883.08 ms | 53049 tok/s) step 1318/76294 | train loss 3.870133 | norm 0.2070 | lr 1.20e-03 | (9889.26 ms | 53016 tok/s) step 1319/76294 | train loss 3.844994 | norm 0.1936 | lr 1.20e-03 | (9882.37 ms | 53053 tok/s) step 1320/76294 | train loss 3.848995 | norm 0.1780 | lr 1.20e-03 | (9894.10 ms | 52990 tok/s) step 1321/76294 | train loss 3.785976 | norm 0.1727 | lr 1.20e-03 | (9949.07 ms | 52697 tok/s) step 1322/76294 | train loss 3.805316 | norm 0.1863 | lr 1.20e-03 | (9881.42 ms | 53058 tok/s) step 1323/76294 | train loss 3.912506 | norm 0.1585 | lr 1.20e-03 | (9877.90 ms | 53077 tok/s) step 1324/76294 | train loss 3.896961 | norm 0.2016 | lr 1.20e-03 | (9882.28 ms | 53053 tok/s) step 1325/76294 | train loss 3.850451 | norm 0.1759 | lr 1.20e-03 | (9921.25 ms | 52845 tok/s) step 1326/76294 | train loss 3.889207 | norm 0.1850 | lr 1.20e-03 | (9879.27 ms | 53070 tok/s) step 1327/76294 | train loss 3.829057 | norm 0.2097 | lr 1.20e-03 | (9888.22 ms | 53021 tok/s) step 1328/76294 | train loss 3.843618 | norm 0.2341 | lr 1.20e-03 | (9881.97 ms | 53055 tok/s) step 1329/76294 | train loss 3.859084 | norm 0.2804 | lr 1.20e-03 | (9888.96 ms | 53017 tok/s) step 1330/76294 | train loss 3.890699 | norm 0.2624 | lr 1.20e-03 | (9878.33 ms | 53075 tok/s) step 1331/76294 | train loss 4.137775 | norm 0.2249 | lr 1.20e-03 | (9887.30 ms | 53026 tok/s) step 1332/76294 | 
train loss 3.937587 | norm 0.2850 | lr 1.20e-03 | (9911.87 ms | 52895 tok/s) step 1333/76294 | train loss 3.906962 | norm 0.2704 | lr 1.20e-03 | (9887.46 ms | 53026 tok/s) step 1334/76294 | train loss 3.887623 | norm 0.2103 | lr 1.20e-03 | (9885.65 ms | 53035 tok/s) step 1335/76294 | train loss 3.855995 | norm 0.1999 | lr 1.20e-03 | (9884.98 ms | 53039 tok/s) step 1336/76294 | train loss 3.885756 | norm 0.2407 | lr 1.20e-03 | (11219.43 ms | 46730 tok/s) step 1337/76294 | train loss 3.823266 | norm 0.2632 | lr 1.20e-03 | (9869.67 ms | 53121 tok/s) step 1338/76294 | train loss 3.896523 | norm 0.2045 | lr 1.20e-03 | (9896.40 ms | 52978 tok/s) step 1339/76294 | train loss 3.832612 | norm 0.1924 | lr 1.20e-03 | (9863.03 ms | 53157 tok/s) step 1340/76294 | train loss 3.867770 | norm 0.1919 | lr 1.20e-03 | (9868.22 ms | 53129 tok/s) step 1341/76294 | train loss 3.831980 | norm 0.1906 | lr 1.20e-03 | (9865.03 ms | 53146 tok/s) step 1342/76294 | train loss 3.928412 | norm 0.1782 | lr 1.20e-03 | (9876.04 ms | 53087 tok/s) step 1343/76294 | train loss 3.783332 | norm 0.2090 | lr 1.20e-03 | (9873.07 ms | 53103 tok/s) step 1344/76294 | train loss 3.914145 | norm 0.2211 | lr 1.20e-03 | (9881.17 ms | 53059 tok/s) step 1345/76294 | train loss 3.831150 | norm 0.2240 | lr 1.20e-03 | (9870.55 ms | 53116 tok/s) step 1346/76294 | train loss 3.862676 | norm 0.2585 | lr 1.20e-03 | (9880.89 ms | 53061 tok/s) step 1347/76294 | train loss 3.877250 | norm 0.2282 | lr 1.20e-03 | (9873.05 ms | 53103 tok/s) step 1348/76294 | train loss 3.922469 | norm 0.2683 | lr 1.20e-03 | (9884.76 ms | 53040 tok/s) step 1349/76294 | train loss 3.843466 | norm 0.1756 | lr 1.20e-03 | (9918.09 ms | 52862 tok/s) step 1350/76294 | train loss 3.850567 | norm 0.2000 | lr 1.20e-03 | (9889.13 ms | 53017 tok/s) step 1351/76294 | train loss 3.888661 | norm 0.2027 | lr 1.20e-03 | (9899.32 ms | 52962 tok/s) step 1352/76294 | train loss 3.820349 | norm 0.1890 | lr 1.20e-03 | (9915.39 ms | 52876 tok/s) step 1353/76294 | 
train loss 3.902112 | norm 0.2106 | lr 1.20e-03 | (9882.05 ms | 53055 tok/s) step 1354/76294 | train loss 3.846019 | norm 0.1856 | lr 1.20e-03 | (9873.66 ms | 53100 tok/s) step 1355/76294 | train loss 3.874653 | norm 0.1710 | lr 1.20e-03 | (9876.47 ms | 53085 tok/s) step 1356/76294 | train loss 3.880174 | norm 0.1960 | lr 1.20e-03 | (9871.69 ms | 53110 tok/s) step 1357/76294 | train loss 3.855071 | norm 0.2272 | lr 1.20e-03 | (9886.52 ms | 53031 tok/s) step 1358/76294 | train loss 3.865115 | norm 0.2985 | lr 1.20e-03 | (9917.99 ms | 52862 tok/s) step 1359/76294 | train loss 3.901392 | norm 0.2675 | lr 1.20e-03 | (9878.94 ms | 53071 tok/s) step 1360/76294 | train loss 3.777219 | norm 0.2122 | lr 1.20e-03 | (9881.79 ms | 53056 tok/s) step 1361/76294 | train loss 3.877414 | norm 0.2312 | lr 1.20e-03 | (9878.67 ms | 53073 tok/s) step 1362/76294 | train loss 3.934447 | norm 0.2370 | lr 1.20e-03 | (9877.80 ms | 53077 tok/s) step 1363/76294 | train loss 3.860872 | norm 0.1824 | lr 1.20e-03 | (9916.03 ms | 52873 tok/s) step 1364/76294 | train loss 3.830726 | norm 0.1963 | lr 1.20e-03 | (9882.79 ms | 53051 tok/s) step 1365/76294 | train loss 4.129185 | norm 0.2124 | lr 1.20e-03 | (10943.41 ms | 47909 tok/s) step 1366/76294 | train loss 3.881226 | norm 0.2124 | lr 1.20e-03 | (9874.19 ms | 53097 tok/s) step 1367/76294 | train loss 3.899226 | norm 0.1829 | lr 1.20e-03 | (9916.60 ms | 52870 tok/s) step 1368/76294 | train loss 3.835635 | norm 0.2000 | lr 1.20e-03 | (9913.33 ms | 52887 tok/s) step 1369/76294 | train loss 3.797795 | norm 0.1675 | lr 1.20e-03 | (9882.01 ms | 53055 tok/s) step 1370/76294 | train loss 3.880154 | norm 0.2025 | lr 1.20e-03 | (9878.45 ms | 53074 tok/s) step 1371/76294 | train loss 3.868543 | norm 0.1604 | lr 1.20e-03 | (9873.25 ms | 53102 tok/s) step 1372/76294 | train loss 3.833701 | norm 0.1788 | lr 1.20e-03 | (9883.38 ms | 53047 tok/s) step 1373/76294 | train loss 3.830403 | norm 0.1771 | lr 1.20e-03 | (9884.21 ms | 53043 tok/s) step 1374/76294 | 
train loss 3.997214 | norm 0.2118 | lr 1.20e-03 | (9870.65 ms | 53116 tok/s) step 1375/76294 | train loss 4.007371 | norm 0.2306 | lr 1.20e-03 | (9879.92 ms | 53066 tok/s) step 1376/76294 | train loss 3.799201 | norm 0.2386 | lr 1.20e-03 | (9912.32 ms | 52893 tok/s) step 1377/76294 | train loss 3.834146 | norm 0.2304 | lr 1.20e-03 | (9881.10 ms | 53060 tok/s) step 1378/76294 | train loss 3.879413 | norm 0.2067 | lr 1.20e-03 | (9888.19 ms | 53022 tok/s) step 1379/76294 | train loss 3.902731 | norm 0.2422 | lr 1.20e-03 | (9878.27 ms | 53075 tok/s) step 1380/76294 | train loss 3.811094 | norm 0.2055 | lr 1.20e-03 | (9882.96 ms | 53050 tok/s) step 1381/76294 | train loss 3.932925 | norm 0.1861 | lr 1.20e-03 | (9876.73 ms | 53083 tok/s) step 1382/76294 | train loss 3.771623 | norm 0.2250 | lr 1.20e-03 | (9888.42 ms | 53020 tok/s) step 1383/76294 | train loss 3.817107 | norm 0.2041 | lr 1.20e-03 | (9891.45 ms | 53004 tok/s) step 1384/76294 | train loss 3.877017 | norm 0.1920 | lr 1.20e-03 | (9877.21 ms | 53081 tok/s) step 1385/76294 | train loss 3.884466 | norm 0.2211 | lr 1.20e-03 | (9883.62 ms | 53046 tok/s) step 1386/76294 | train loss 3.742816 | norm 0.1891 | lr 1.20e-03 | (9909.19 ms | 52909 tok/s) step 1387/76294 | train loss 3.894621 | norm 0.2124 | lr 1.20e-03 | (9878.92 ms | 53071 tok/s) step 1388/76294 | train loss 3.853564 | norm 0.2025 | lr 1.20e-03 | (9882.64 ms | 53051 tok/s) step 1389/76294 | train loss 3.822863 | norm 0.2056 | lr 1.20e-03 | (9877.86 ms | 53077 tok/s) step 1390/76294 | train loss 3.888058 | norm 0.2275 | lr 1.20e-03 | (9885.07 ms | 53038 tok/s) step 1391/76294 | train loss 3.846259 | norm 0.1868 | lr 1.20e-03 | (9879.55 ms | 53068 tok/s) step 1392/76294 | train loss 3.854594 | norm 0.1745 | lr 1.20e-03 | (9880.54 ms | 53063 tok/s) step 1393/76294 | train loss 3.921234 | norm 0.1794 | lr 1.20e-03 | (11398.16 ms | 45998 tok/s) step 1394/76294 | train loss 3.871873 | norm 0.2051 | lr 1.20e-03 | (9936.75 ms | 52763 tok/s) step 1395/76294 | 
train loss 3.798345 | norm 0.1904 | lr 1.20e-03 | (10599.78 ms | 49462 tok/s) step 1396/76294 | train loss 3.818090 | norm 0.4748 | lr 1.20e-03 | (9860.58 ms | 53170 tok/s) step 1397/76294 | train loss 3.812788 | norm 0.1811 | lr 1.20e-03 | (9872.31 ms | 53107 tok/s) step 1398/76294 | train loss 3.898192 | norm 0.3024 | lr 1.20e-03 | (9867.98 ms | 53130 tok/s) step 1399/76294 | train loss 3.923687 | norm 0.2899 | lr 1.20e-03 | (9883.59 ms | 53046 tok/s) step 1400/76294 | train loss 3.836643 | norm 0.1738 | lr 1.20e-03 | (9871.29 ms | 53112 tok/s) step 1401/76294 | train loss 3.839462 | norm 0.1816 | lr 1.20e-03 | (9874.82 ms | 53093 tok/s) step 1402/76294 | train loss 3.858697 | norm 0.2120 | lr 1.20e-03 | (9874.44 ms | 53095 tok/s) step 1403/76294 | train loss 3.766621 | norm 0.1826 | lr 1.20e-03 | (9917.72 ms | 52864 tok/s) step 1404/76294 | train loss 3.826359 | norm 0.2007 | lr 1.20e-03 | (9911.82 ms | 52895 tok/s) step 1405/76294 | train loss 3.881245 | norm 0.2167 | lr 1.20e-03 | (9875.54 ms | 53090 tok/s) step 1406/76294 | train loss 3.757780 | norm 0.2168 | lr 1.20e-03 | (9900.57 ms | 52955 tok/s) step 1407/76294 | train loss 3.815250 | norm 0.2084 | lr 1.20e-03 | (9874.21 ms | 53097 tok/s) step 1408/76294 | train loss 3.747464 | norm 0.2479 | lr 1.20e-03 | (9882.86 ms | 53050 tok/s) step 1409/76294 | train loss 3.832167 | norm 0.3367 | lr 1.20e-03 | (9875.45 ms | 53090 tok/s) step 1410/76294 | train loss 3.830408 | norm 0.3162 | lr 1.20e-03 | (9876.75 ms | 53083 tok/s) step 1411/76294 | train loss 3.810899 | norm 0.2551 | lr 1.20e-03 | (9923.32 ms | 52834 tok/s) step 1412/76294 | train loss 3.828042 | norm 0.2589 | lr 1.20e-03 | (9874.38 ms | 53096 tok/s) step 1413/76294 | train loss 3.893201 | norm 0.2447 | lr 1.20e-03 | (9886.17 ms | 53032 tok/s) step 1414/76294 | train loss 3.849943 | norm 0.2584 | lr 1.20e-03 | (9872.99 ms | 53103 tok/s) step 1415/76294 | train loss 3.803953 | norm 0.3274 | lr 1.20e-03 | (9907.18 ms | 52920 tok/s) step 1416/76294 | 
train loss 3.780078 | norm 0.1759 | lr 1.20e-03 | (9874.80 ms | 53094 tok/s) step 1417/76294 | train loss 3.864365 | norm 0.2084 | lr 1.20e-03 | (9912.33 ms | 52893 tok/s) step 1418/76294 | train loss 3.838155 | norm 0.1882 | lr 1.20e-03 | (9877.31 ms | 53080 tok/s) step 1419/76294 | train loss 3.852266 | norm 0.1884 | lr 1.20e-03 | (9874.66 ms | 53094 tok/s) step 1420/76294 | train loss 3.851090 | norm 0.1781 | lr 1.20e-03 | (9895.24 ms | 52984 tok/s) step 1421/76294 | train loss 3.815079 | norm 0.1790 | lr 1.20e-03 | (9877.41 ms | 53080 tok/s) step 1422/76294 | train loss 3.826593 | norm 0.2087 | lr 1.20e-03 | (9882.06 ms | 53055 tok/s) step 1423/76294 | train loss 3.857638 | norm 0.2012 | lr 1.20e-03 | (9914.83 ms | 52879 tok/s) step 1424/76294 | train loss 3.813681 | norm 0.2623 | lr 1.20e-03 | (9910.53 ms | 52902 tok/s) step 1425/76294 | train loss 3.858854 | norm 0.2433 | lr 1.20e-03 | (9882.10 ms | 53054 tok/s) step 1426/76294 | train loss 3.876358 | norm 0.1910 | lr 1.20e-03 | (9878.44 ms | 53074 tok/s) step 1427/76294 | train loss 3.805111 | norm 0.1871 | lr 1.20e-03 | (9890.23 ms | 53011 tok/s) step 1428/76294 | train loss 3.879211 | norm 0.1803 | lr 1.20e-03 | (9876.48 ms | 53084 tok/s) step 1429/76294 | train loss 3.882030 | norm 0.1777 | lr 1.20e-03 | (9888.34 ms | 53021 tok/s) step 1430/76294 | train loss 3.808964 | norm 0.1929 | lr 1.20e-03 | (9876.14 ms | 53086 tok/s) step 1431/76294 | train loss 4.100311 | norm 0.2258 | lr 1.20e-03 | (9882.66 ms | 53051 tok/s) step 1432/76294 | train loss 3.841498 | norm 0.2233 | lr 1.20e-03 | (9878.39 ms | 53074 tok/s) step 1433/76294 | train loss 3.835013 | norm 0.1783 | lr 1.20e-03 | (9882.48 ms | 53052 tok/s) step 1434/76294 | train loss 3.822478 | norm 0.1908 | lr 1.20e-03 | (9876.38 ms | 53085 tok/s) step 1435/76294 | train loss 3.797855 | norm 0.2012 | lr 1.20e-03 | (9885.96 ms | 53034 tok/s) step 1436/76294 | train loss 3.758409 | norm 0.1879 | lr 1.20e-03 | (9874.49 ms | 53095 tok/s) step 1437/76294 | 
train loss 3.842960 | norm 0.2262 | lr 1.20e-03 | (9885.86 ms | 53034 tok/s) step 1438/76294 | train loss 3.815589 | norm 0.2165 | lr 1.20e-03 | (9881.46 ms | 53058 tok/s) step 1439/76294 | train loss 3.787805 | norm 0.2249 | lr 1.20e-03 | (9886.48 ms | 53031 tok/s) step 1440/76294 | train loss 3.838743 | norm 0.2135 | lr 1.20e-03 | (9884.45 ms | 53042 tok/s) step 1441/76294 | train loss 3.776255 | norm 0.2172 | lr 1.20e-03 | (9879.48 ms | 53068 tok/s) step 1442/76294 | train loss 3.831294 | norm 0.1847 | lr 1.20e-03 | (9874.52 ms | 53095 tok/s) step 1443/76294 | train loss 3.883035 | norm 0.2391 | lr 1.20e-03 | (9872.51 ms | 53106 tok/s) step 1444/76294 | train loss 3.811395 | norm 0.2658 | lr 1.20e-03 | (9887.49 ms | 53025 tok/s) step 1445/76294 | train loss 3.808286 | norm 0.1950 | lr 1.20e-03 | (9906.36 ms | 52924 tok/s) step 1446/76294 | train loss 3.831315 | norm 0.1902 | lr 1.20e-03 | (9874.60 ms | 53095 tok/s) step 1447/76294 | train loss 3.867385 | norm 0.1885 | lr 1.20e-03 | (9887.91 ms | 53023 tok/s) step 1448/76294 | train loss 3.960783 | norm 0.1745 | lr 1.20e-03 | (9887.12 ms | 53027 tok/s) step 1449/76294 | train loss 3.941455 | norm 0.1627 | lr 1.20e-03 | (9876.30 ms | 53085 tok/s) step 1450/76294 | train loss 3.824399 | norm 0.1756 | lr 1.20e-03 | (9873.86 ms | 53099 tok/s) step 1451/76294 | train loss 3.822202 | norm 0.1761 | lr 1.20e-03 | (9876.74 ms | 53083 tok/s) step 1452/76294 | train loss 3.912216 | norm 0.1945 | lr 1.20e-03 | (9902.06 ms | 52947 tok/s) step 1453/76294 | train loss 3.870018 | norm 0.2351 | lr 1.20e-03 | (9919.73 ms | 52853 tok/s) step 1454/76294 | train loss 3.873248 | norm 0.1912 | lr 1.20e-03 | (9874.82 ms | 53093 tok/s) step 1455/76294 | train loss 3.961653 | norm 0.1864 | lr 1.20e-03 | (9882.72 ms | 53051 tok/s) step 1456/76294 | train loss 3.834816 | norm 0.2332 | lr 1.20e-03 | (9877.19 ms | 53081 tok/s) step 1457/76294 | train loss 3.770792 | norm 0.2164 | lr 1.20e-03 | (9887.33 ms | 53026 tok/s) step 1458/76294 | 
train loss 3.806300 | norm 0.2410 | lr 1.20e-03 | (9876.19 ms | 53086 tok/s) step 1459/76294 | train loss 3.845537 | norm 0.2180 | lr 1.20e-03 | (9909.82 ms | 52906 tok/s) step 1460/76294 | train loss 3.800420 | norm 0.2018 | lr 1.20e-03 | (9896.06 ms | 52979 tok/s) step 1461/76294 | train loss 3.883728 | norm 0.2179 | lr 1.20e-03 | (9917.19 ms | 52867 tok/s) step 1462/76294 | train loss 3.967692 | norm 0.2564 | lr 1.20e-03 | (11096.47 ms | 47248 tok/s) step 1463/76294 | train loss 3.788854 | norm 0.2312 | lr 1.20e-03 | (9887.54 ms | 53025 tok/s) step 1464/76294 | train loss 3.788386 | norm 0.2405 | lr 1.20e-03 | (9873.69 ms | 53100 tok/s) step 1465/76294 | train loss 3.799860 | norm 0.2471 | lr 1.20e-03 | (9895.74 ms | 52981 tok/s) step 1466/76294 | train loss 3.800508 | norm 0.2003 | lr 1.20e-03 | (9876.54 ms | 53084 tok/s) step 1467/76294 | train loss 3.773115 | norm 0.2159 | lr 1.20e-03 | (9874.69 ms | 53094 tok/s) step 1468/76294 | train loss 3.844715 | norm 0.1895 | lr 1.20e-03 | (9881.18 ms | 53059 tok/s) step 1469/76294 | train loss 3.782191 | norm 0.2161 | lr 1.20e-03 | (9880.88 ms | 53061 tok/s) step 1470/76294 | train loss 3.796657 | norm 0.2164 | lr 1.20e-03 | (9882.19 ms | 53054 tok/s) step 1471/76294 | train loss 3.791473 | norm 0.1772 | lr 1.20e-03 | (9887.86 ms | 53023 tok/s) step 1472/76294 | train loss 3.789223 | norm 0.1966 | lr 1.20e-03 | (9909.95 ms | 52905 tok/s) step 1473/76294 | train loss 3.896966 | norm 0.2599 | lr 1.20e-03 | (9882.93 ms | 53050 tok/s) step 1474/76294 | train loss 3.803208 | norm 0.3151 | lr 1.20e-03 | (9874.81 ms | 53093 tok/s) step 1475/76294 | train loss 3.827577 | norm 0.2222 | lr 1.20e-03 | (9882.56 ms | 53052 tok/s) step 1476/76294 | train loss 3.839350 | norm 0.2053 | lr 1.20e-03 | (9878.92 ms | 53071 tok/s) step 1477/76294 | train loss 3.834983 | norm 0.1764 | lr 1.20e-03 | (9900.19 ms | 52957 tok/s) step 1478/76294 | train loss 3.774212 | norm 0.1771 | lr 1.20e-03 | (9879.11 ms | 53070 tok/s) step 1479/76294 | 
train loss 3.850147 | norm 0.1791 | lr 1.20e-03 | (9890.30 ms | 53010 tok/s) step 1480/76294 | train loss 3.781970 | norm 0.1937 | lr 1.20e-03 | (9879.41 ms | 53069 tok/s) step 1481/76294 | train loss 3.807020 | norm 0.1593 | lr 1.20e-03 | (9889.71 ms | 53013 tok/s) step 1482/76294 | train loss 3.811895 | norm 0.1844 | lr 1.20e-03 | (9878.42 ms | 53074 tok/s) step 1483/76294 | train loss 3.886453 | norm 0.1599 | lr 1.20e-03 | (9892.07 ms | 53001 tok/s) step 1484/76294 | train loss 3.823110 | norm 0.1934 | lr 1.20e-03 | (9901.09 ms | 52953 tok/s) step 1485/76294 | train loss 3.803521 | norm 0.1656 | lr 1.20e-03 | (9898.00 ms | 52969 tok/s) step 1486/76294 | train loss 3.797447 | norm 0.1597 | lr 1.20e-03 | (9946.78 ms | 52709 tok/s) step 1487/76294 | train loss 3.773015 | norm 0.1669 | lr 1.20e-03 | (9888.97 ms | 53017 tok/s) step 1488/76294 | train loss 3.806841 | norm 0.1695 | lr 1.20e-03 | (9894.56 ms | 52988 tok/s) step 1489/76294 | train loss 3.789079 | norm 0.1692 | lr 1.20e-03 | (9910.58 ms | 52902 tok/s) step 1490/76294 | train loss 3.814013 | norm 0.1782 | lr 1.20e-03 | (9881.04 ms | 53060 tok/s) step 1491/76294 | train loss 3.785012 | norm 0.1581 | lr 1.20e-03 | (9895.13 ms | 52984 tok/s) step 1492/76294 | train loss 3.794686 | norm 0.1656 | lr 1.20e-03 | (9902.18 ms | 52947 tok/s) step 1493/76294 | train loss 3.785811 | norm 0.1512 | lr 1.20e-03 | (9882.63 ms | 53051 tok/s) step 1494/76294 | train loss 3.797565 | norm 0.1913 | lr 1.20e-03 | (9881.41 ms | 53058 tok/s) step 1495/76294 | train loss 3.828984 | norm 0.1880 | lr 1.20e-03 | (9923.50 ms | 52833 tok/s) step 1496/76294 | train loss 3.779490 | norm 0.1596 | lr 1.20e-03 | (9889.27 ms | 53016 tok/s) step 1497/76294 | train loss 3.783950 | norm 0.1449 | lr 1.20e-03 | (9888.51 ms | 53020 tok/s) step 1498/76294 | train loss 3.780606 | norm 0.1600 | lr 1.20e-03 | (9909.88 ms | 52906 tok/s) step 1499/76294 | train loss 3.820552 | norm 0.1640 | lr 1.20e-03 | (9883.65 ms | 53046 tok/s) step 1500/76294 | 
train loss 3.798853 | norm 0.1646 | lr 1.20e-03 | (9883.88 ms | 53045 tok/s) val loss: 3.773234 saving model checkpoint to ./results/gpt2-350M-gqa/step_1500.pth step 1501/76294 | train loss 3.752485 | norm 0.1630 | lr 1.20e-03 | (9948.45 ms | 52700 tok/s) step 1502/76294 | train loss 3.768423 | norm 0.1608 | lr 1.20e-03 | (10099.00 ms | 51915 tok/s) step 1503/76294 | train loss 3.789902 | norm 0.1724 | lr 1.20e-03 | (9862.41 ms | 53160 tok/s) step 1504/76294 | train loss 3.780989 | norm 0.1706 | lr 1.20e-03 | (9885.74 ms | 53035 tok/s) step 1505/76294 | train loss 3.724563 | norm 0.1588 | lr 1.20e-03 | (9871.06 ms | 53114 tok/s) step 1506/76294 | train loss 3.787302 | norm 0.1493 | lr 1.20e-03 | (9876.79 ms | 53083 tok/s) step 1507/76294 | train loss 3.798365 | norm 0.1832 | lr 1.20e-03 | (9887.98 ms | 53023 tok/s) step 1508/76294 | train loss 3.794203 | norm 0.1876 | lr 1.20e-03 | (9931.17 ms | 52792 tok/s) step 1509/76294 | train loss 3.793348 | norm 0.1722 | lr 1.20e-03 | (10316.31 ms | 50821 tok/s) step 1510/76294 | train loss 3.760715 | norm 0.2124 | lr 1.20e-03 | (9885.08 ms | 53038 tok/s) step 1511/76294 | train loss 3.861529 | norm 0.2985 | lr 1.20e-03 | (9899.59 ms | 52961 tok/s) step 1512/76294 | train loss 3.801772 | norm 0.4412 | lr 1.20e-03 | (9899.45 ms | 52961 tok/s) step 1513/76294 | train loss 3.771823 | norm 0.2657 | lr 1.20e-03 | (9930.69 ms | 52795 tok/s) step 1514/76294 | train loss 3.754772 | norm 0.2983 | lr 1.20e-03 | (9898.04 ms | 52969 tok/s) step 1515/76294 | train loss 3.767621 | norm 0.2276 | lr 1.20e-03 | (9892.43 ms | 52999 tok/s) step 1516/76294 | train loss 3.834042 | norm 0.2154 | lr 1.20e-03 | (9928.53 ms | 52806 tok/s) step 1517/76294 | train loss 3.799674 | norm 0.2757 | lr 1.20e-03 | (9894.61 ms | 52987 tok/s) step 1518/76294 | train loss 3.821509 | norm 0.3243 | lr 1.20e-03 | (9897.11 ms | 52974 tok/s) step 1519/76294 | train loss 3.778651 | norm 0.2957 | lr 1.20e-03 | (9894.51 ms | 52988 tok/s) step 1520/76294 | train loss 
3.806978 | norm 0.1922 | lr 1.20e-03 | (9894.44 ms | 52988 tok/s) step 1521/76294 | train loss 3.772451 | norm 0.2038 | lr 1.20e-03 | (9892.40 ms | 52999 tok/s) step 1522/76294 | train loss 3.797897 | norm 0.1888 | lr 1.20e-03 | (9901.61 ms | 52950 tok/s) step 1523/76294 | train loss 3.761760 | norm 0.1589 | lr 1.20e-03 | (9898.08 ms | 52969 tok/s) step 1524/76294 | train loss 3.782733 | norm 0.1664 | lr 1.20e-03 | (9900.22 ms | 52957 tok/s) step 1525/76294 | train loss 3.775803 | norm 0.1714 | lr 1.20e-03 | (9898.36 ms | 52967 tok/s) step 1526/76294 | train loss 3.754424 | norm 0.1870 | lr 1.20e-03 | (10192.61 ms | 51438 tok/s) step 1527/76294 | train loss 3.742368 | norm 0.1784 | lr 1.20e-03 | (9893.53 ms | 52993 tok/s) step 1528/76294 | train loss 3.758642 | norm 0.1706 | lr 1.20e-03 | (9912.98 ms | 52889 tok/s) step 1529/76294 | train loss 3.795052 | norm 0.2034 | lr 1.20e-03 | (9898.31 ms | 52967 tok/s) step 1530/76294 | train loss 3.772495 | norm 0.1710 | lr 1.20e-03 | (9900.33 ms | 52957 tok/s) step 1531/76294 | train loss 3.819324 | norm 0.1898 | lr 1.20e-03 | (9936.63 ms | 52763 tok/s) step 1532/76294 | train loss 3.875293 | norm 0.1905 | lr 1.20e-03 | (9911.69 ms | 52896 tok/s) step 1533/76294 | train loss 3.827993 | norm 0.1637 | lr 1.19e-03 | (9947.54 ms | 52705 tok/s) step 1534/76294 | train loss 3.789591 | norm 0.1606 | lr 1.19e-03 | (9891.18 ms | 53006 tok/s) step 1535/76294 | train loss 3.816049 | norm 0.1705 | lr 1.19e-03 | (9896.14 ms | 52979 tok/s) step 1536/76294 | train loss 3.691532 | norm 0.1809 | lr 1.19e-03 | (9905.90 ms | 52927 tok/s) step 1537/76294 | train loss 3.872579 | norm 0.1757 | lr 1.19e-03 | (9895.09 ms | 52985 tok/s) step 1538/76294 | train loss 3.724985 | norm 0.1598 | lr 1.19e-03 | (9897.86 ms | 52970 tok/s) step 1539/76294 | train loss 3.748780 | norm 0.1693 | lr 1.19e-03 | (9900.48 ms | 52956 tok/s) step 1540/76294 | train loss 3.766556 | norm 0.1653 | lr 1.19e-03 | (9895.14 ms | 52984 tok/s) step 1541/76294 | train loss 
4.094738 | norm 0.2083 | lr 1.19e-03 | (9897.23 ms | 52973 tok/s) step 1542/76294 | train loss 3.860905 | norm 0.2041 | lr 1.19e-03 | (9968.78 ms | 52593 tok/s) step 1543/76294 | train loss 3.854080 | norm 0.2143 | lr 1.19e-03 | (9898.71 ms | 52965 tok/s) step 1544/76294 | train loss 3.757481 | norm 0.2473 | lr 1.19e-03 | (9966.35 ms | 52606 tok/s) step 1545/76294 | train loss 3.782014 | norm 0.2336 | lr 1.19e-03 | (9913.53 ms | 52886 tok/s) step 1546/76294 | train loss 3.772852 | norm 0.2221 | lr 1.19e-03 | (9936.64 ms | 52763 tok/s) step 1547/76294 | train loss 3.784630 | norm 0.2048 | lr 1.19e-03 | (9901.40 ms | 52951 tok/s) step 1548/76294 | train loss 3.769368 | norm 0.2291 | lr 1.19e-03 | (9909.13 ms | 52910 tok/s) step 1549/76294 | train loss 3.816979 | norm 0.2126 | lr 1.19e-03 | (10009.43 ms | 52379 tok/s) step 1550/76294 | train loss 3.813858 | norm 0.1732 | lr 1.19e-03 | (9894.88 ms | 52986 tok/s) step 1551/76294 | train loss 3.704360 | norm 0.1752 | lr 1.19e-03 | (9924.59 ms | 52827 tok/s) step 1552/76294 | train loss 3.872357 | norm 0.2169 | lr 1.19e-03 | (9895.53 ms | 52982 tok/s) step 1553/76294 | train loss 3.795048 | norm 0.1698 | lr 1.19e-03 | (9904.97 ms | 52932 tok/s) step 1554/76294 | train loss 3.759895 | norm 0.1783 | lr 1.19e-03 | (9895.72 ms | 52981 tok/s) step 1555/76294 | train loss 3.799863 | norm 0.1861 | lr 1.19e-03 | (9908.63 ms | 52912 tok/s) step 1556/76294 | train loss 3.735702 | norm 0.1799 | lr 1.19e-03 | (9906.39 ms | 52924 tok/s) step 1557/76294 | train loss 3.719077 | norm 0.1802 | lr 1.19e-03 | (9905.18 ms | 52931 tok/s) step 1558/76294 | train loss 3.795710 | norm 0.2034 | lr 1.19e-03 | (9894.95 ms | 52985 tok/s) step 1559/76294 | train loss 3.733303 | norm 0.1882 | lr 1.19e-03 | (9900.37 ms | 52956 tok/s) step 1560/76294 | train loss 3.700270 | norm 0.1826 | lr 1.19e-03 | (11102.59 ms | 47222 tok/s) step 1561/76294 | train loss 3.749658 | norm 0.1460 | lr 1.19e-03 | (9890.56 ms | 53009 tok/s) step 1562/76294 | train loss 
3.742455 | norm 0.1761 | lr 1.19e-03 | (9893.16 ms | 52995 tok/s) step 1563/76294 | train loss 3.743986 | norm 0.2563 | lr 1.19e-03 | (9897.66 ms | 52971 tok/s) step 1564/76294 | train loss 3.737530 | norm 0.1633 | lr 1.19e-03 | (9893.92 ms | 52991 tok/s) step 1565/76294 | train loss 3.726773 | norm 0.1724 | lr 1.19e-03 | (9893.83 ms | 52991 tok/s) step 1566/76294 | train loss 3.784764 | norm 0.1718 | lr 1.19e-03 | (9905.67 ms | 52928 tok/s) step 1567/76294 | train loss 3.794612 | norm 0.1754 | lr 1.19e-03 | (11809.22 ms | 44397 tok/s) step 1568/76294 | train loss 3.801531 | norm 0.1792 | lr 1.19e-03 | (12046.94 ms | 43520 tok/s) step 1569/76294 | train loss 3.753130 | norm 0.1758 | lr 1.19e-03 | (10020.17 ms | 52323 tok/s) step 1570/76294 | train loss 3.754518 | norm 0.1982 | lr 1.19e-03 | (9881.30 ms | 53059 tok/s) step 1571/76294 | train loss 3.792614 | norm 0.2034 | lr 1.19e-03 | (9893.46 ms | 52993 tok/s) step 1572/76294 | train loss 3.790141 | norm 0.2123 | lr 1.19e-03 | (9888.42 ms | 53020 tok/s) step 1573/76294 | train loss 3.780081 | norm 0.1997 | lr 1.19e-03 | (9898.08 ms | 52969 tok/s) step 1574/76294 | train loss 3.821483 | norm 0.2187 | lr 1.19e-03 | (9894.37 ms | 52989 tok/s) step 1575/76294 | train loss 3.805775 | norm 0.2011 | lr 1.19e-03 | (9902.39 ms | 52946 tok/s) step 1576/76294 | train loss 3.909914 | norm 0.1982 | lr 1.19e-03 | (9898.11 ms | 52969 tok/s) step 1577/76294 | train loss 3.765585 | norm 0.3636 | lr 1.19e-03 | (9935.17 ms | 52771 tok/s) step 1578/76294 | train loss 3.858778 | norm 0.1890 | lr 1.19e-03 | (9962.73 ms | 52625 tok/s) step 1579/76294 | train loss 3.768807 | norm 0.1997 | lr 1.19e-03 | (9899.75 ms | 52960 tok/s) step 1580/76294 | train loss 3.815329 | norm 0.2866 | lr 1.19e-03 | (9945.72 ms | 52715 tok/s) step 1581/76294 | train loss 3.836810 | norm 0.2542 | lr 1.19e-03 | (9896.62 ms | 52976 tok/s) step 1582/76294 | train loss 3.768024 | norm 0.2505 | lr 1.19e-03 | (9897.96 ms | 52969 tok/s) step 1583/76294 | train loss 
3.733764 | norm 0.2353 | lr 1.19e-03 | (9937.97 ms | 52756 tok/s) step 1584/76294 | train loss 3.752861 | norm 0.2108 | lr 1.19e-03 | (9902.89 ms | 52943 tok/s) step 1585/76294 | train loss 3.741853 | norm 0.1957 | lr 1.19e-03 | (9938.54 ms | 52753 tok/s) step 1586/76294 | train loss 3.805270 | norm 0.1743 | lr 1.19e-03 | (9894.88 ms | 52986 tok/s) step 1587/76294 | train loss 3.754563 | norm 0.1727 | lr 1.19e-03 | (9899.96 ms | 52959 tok/s) step 1588/76294 | train loss 3.741673 | norm 0.1755 | lr 1.19e-03 | (9901.01 ms | 52953 tok/s) step 1589/76294 | train loss 3.718031 | norm 0.1590 | lr 1.19e-03 | (9906.78 ms | 52922 tok/s) step 1590/76294 | train loss 3.743626 | norm 0.1781 | lr 1.19e-03 | (9896.00 ms | 52980 tok/s) step 1591/76294 | train loss 3.798487 | norm 0.1814 | lr 1.19e-03 | (9961.52 ms | 52631 tok/s) step 1592/76294 | train loss 3.741916 | norm 0.2057 | lr 1.19e-03 | (9899.01 ms | 52964 tok/s) step 1593/76294 | train loss 3.911056 | norm 0.2388 | lr 1.19e-03 | (9897.41 ms | 52972 tok/s) step 1594/76294 | train loss 3.753057 | norm 0.3091 | lr 1.19e-03 | (9934.16 ms | 52776 tok/s) step 1595/76294 | train loss 3.742572 | norm 0.2302 | lr 1.19e-03 | (9895.05 ms | 52985 tok/s) step 1596/76294 | train loss 3.779026 | norm 0.1901 | lr 1.19e-03 | (9934.71 ms | 52773 tok/s) step 1597/76294 | train loss 3.743483 | norm 0.1918 | lr 1.19e-03 | (9897.43 ms | 52972 tok/s) step 1598/76294 | train loss 3.784394 | norm 0.2100 | lr 1.19e-03 | (9902.84 ms | 52943 tok/s) step 1599/76294 | train loss 3.741143 | norm 0.1785 | lr 1.19e-03 | (9899.90 ms | 52959 tok/s) step 1600/76294 | train loss 3.788315 | norm 0.1890 | lr 1.19e-03 | (9896.32 ms | 52978 tok/s) step 1601/76294 | train loss 3.725522 | norm 0.1650 | lr 1.19e-03 | (9908.47 ms | 52913 tok/s) step 1602/76294 | train loss 3.771744 | norm 0.2461 | lr 1.19e-03 | (9889.36 ms | 53015 tok/s) step 1603/76294 | train loss 3.733164 | norm 0.1655 | lr 1.19e-03 | (9932.41 ms | 52786 tok/s) step 1604/76294 | train loss 
3.766299 | norm 0.2202 | lr 1.19e-03 | (9895.11 ms | 52985 tok/s) step 1605/76294 | train loss 3.755312 | norm 0.2304 | lr 1.19e-03 | (9897.09 ms | 52974 tok/s) step 1606/76294 | train loss 3.734553 | norm 0.2363 | lr 1.19e-03 | (9899.67 ms | 52960 tok/s) step 1607/76294 | train loss 3.790221 | norm 0.2497 | lr 1.19e-03 | (9897.71 ms | 52971 tok/s) step 1608/76294 | train loss 3.762708 | norm 0.1923 | lr 1.19e-03 | (9926.77 ms | 52816 tok/s) step 1609/76294 | train loss 3.836529 | norm 0.1737 | lr 1.19e-03 | (9896.85 ms | 52975 tok/s) step 1610/76294 | train loss 3.798318 | norm 0.1790 | lr 1.19e-03 | (9897.31 ms | 52973 tok/s) step 1611/76294 | train loss 3.723884 | norm 0.1771 | lr 1.19e-03 | (9899.76 ms | 52960 tok/s) step 1612/76294 | train loss 3.815309 | norm 0.1854 | lr 1.19e-03 | (9897.09 ms | 52974 tok/s) step 1613/76294 | train loss 3.788615 | norm 0.1915 | lr 1.19e-03 | (9896.03 ms | 52980 tok/s) step 1614/76294 | train loss 3.683850 | norm 0.2174 | lr 1.19e-03 | (9922.19 ms | 52840 tok/s) step 1615/76294 | train loss 3.754937 | norm 0.2195 | lr 1.19e-03 | (9894.17 ms | 52990 tok/s) step 1616/76294 | train loss 3.784570 | norm 0.2189 | lr 1.19e-03 | (9894.82 ms | 52986 tok/s) step 1617/76294 | train loss 3.763564 | norm 0.1904 | lr 1.19e-03 | (9896.80 ms | 52976 tok/s) step 1618/76294 | train loss 3.745631 | norm 0.1841 | lr 1.19e-03 | (9897.47 ms | 52972 tok/s) step 1619/76294 | train loss 3.778301 | norm 0.2030 | lr 1.19e-03 | (9895.34 ms | 52983 tok/s) step 1620/76294 | train loss 3.739397 | norm 0.1787 | lr 1.19e-03 | (9907.94 ms | 52916 tok/s) step 1621/76294 | train loss 3.787178 | norm 0.1825 | lr 1.19e-03 | (9905.42 ms | 52929 tok/s) step 1622/76294 | train loss 4.019023 | norm 0.1768 | lr 1.19e-03 | (9916.35 ms | 52871 tok/s) step 1623/76294 | train loss 3.767507 | norm 0.3768 | lr 1.19e-03 | (9931.12 ms | 52792 tok/s) step 1624/76294 | train loss 3.809376 | norm 0.1588 | lr 1.19e-03 | (9895.21 ms | 52984 tok/s) step 1625/76294 | train loss 
3.897466 | norm 4.0116 | lr 1.19e-03 | (9906.16 ms | 52925 tok/s) step 1626/76294 | train loss 3.774428 | norm 0.2814 | lr 1.19e-03 | (9894.96 ms | 52985 tok/s) step 1627/76294 | train loss 3.775470 | norm 0.3148 | lr 1.19e-03 | (9924.18 ms | 52829 tok/s) step 1628/76294 | train loss 3.735041 | norm 0.3237 | lr 1.19e-03 | (9889.27 ms | 53016 tok/s) step 1629/76294 | train loss 3.795859 | norm 0.2361 | lr 1.19e-03 | (9905.00 ms | 52932 tok/s) step 1630/76294 | train loss 3.804432 | norm 0.2925 | lr 1.19e-03 | (9958.30 ms | 52648 tok/s) step 1631/76294 | train loss 3.724218 | norm 0.2592 | lr 1.19e-03 | (9905.40 ms | 52930 tok/s) step 1632/76294 | train loss 3.804814 | norm 0.2081 | lr 1.19e-03 | (9896.93 ms | 52975 tok/s) step 1633/76294 | train loss 3.738688 | norm 0.1912 | lr 1.19e-03 | (9938.38 ms | 52754 tok/s) step 1634/76294 | train loss 3.842727 | norm 0.1885 | lr 1.19e-03 | (9891.54 ms | 53004 tok/s) step 1635/76294 | train loss 3.787182 | norm 0.1907 | lr 1.19e-03 | (9903.75 ms | 52938 tok/s) step 1636/76294 | train loss 3.828159 | norm 0.1746 | lr 1.19e-03 | (9890.40 ms | 53010 tok/s) step 1637/76294 | train loss 3.677550 | norm 0.1606 | lr 1.19e-03 | (9901.19 ms | 52952 tok/s) step 1638/76294 | train loss 3.782363 | norm 0.1631 | lr 1.19e-03 | (9897.73 ms | 52971 tok/s) step 1639/76294 | train loss 3.749263 | norm 0.1906 | lr 1.19e-03 | (9898.21 ms | 52968 tok/s) step 1640/76294 | train loss 3.696138 | norm 0.1795 | lr 1.19e-03 | (9895.10 ms | 52985 tok/s) step 1641/76294 | train loss 3.780723 | norm 0.1835 | lr 1.19e-03 | (9915.84 ms | 52874 tok/s) step 1642/76294 | train loss 3.795999 | norm 0.3418 | lr 1.19e-03 | (9896.95 ms | 52975 tok/s) step 1643/76294 | train loss 3.778432 | norm 0.2592 | lr 1.19e-03 | (9899.09 ms | 52963 tok/s) step 1644/76294 | train loss 3.715452 | norm 0.2200 | lr 1.19e-03 | (9895.85 ms | 52981 tok/s) step 1645/76294 | train loss 3.730484 | norm 0.2533 | lr 1.19e-03 | (9903.12 ms | 52942 tok/s) step 1646/76294 | train loss 
3.774059 | norm 0.2707 | lr 1.19e-03 | (9894.34 ms | 52989 tok/s) step 1647/76294 | train loss 3.798823 | norm 0.2793 | lr 1.19e-03 | (9901.23 ms | 52952 tok/s) step 1648/76294 | train loss 3.751215 | norm 0.2748 | lr 1.19e-03 | (9911.35 ms | 52898 tok/s) step 1649/76294 | train loss 3.788659 | norm 0.2387 | lr 1.19e-03 | (9901.22 ms | 52952 tok/s) step 1650/76294 | train loss 3.745591 | norm 0.2005 | lr 1.19e-03 | (9912.17 ms | 52893 tok/s) step 1651/76294 | train loss 3.784445 | norm 0.2188 | lr 1.19e-03 | (9895.25 ms | 52984 tok/s) step 1652/76294 | train loss 3.709760 | norm 0.1639 | lr 1.19e-03 | (9920.81 ms | 52847 tok/s) step 1653/76294 | train loss 3.688299 | norm 0.1834 | lr 1.19e-03 | (9895.04 ms | 52985 tok/s) step 1654/76294 | train loss 3.761768 | norm 0.1835 | lr 1.19e-03 | (9938.91 ms | 52751 tok/s) step 1655/76294 | train loss 3.790688 | norm 0.1998 | lr 1.19e-03 | (9897.46 ms | 52972 tok/s) step 1656/76294 | train loss 3.725336 | norm 0.1575 | lr 1.19e-03 | (9904.50 ms | 52934 tok/s) step 1657/76294 | train loss 3.737732 | norm 0.1904 | lr 1.19e-03 | (9896.97 ms | 52975 tok/s) step 1658/76294 | train loss 3.733099 | norm 0.1698 | lr 1.19e-03 | (11345.66 ms | 46210 tok/s) step 1659/76294 | train loss 3.722117 | norm 0.1598 | lr 1.19e-03 | (9887.97 ms | 53023 tok/s) step 1660/76294 | train loss 3.750653 | norm 0.1580 | lr 1.19e-03 | (9892.39 ms | 52999 tok/s) step 1661/76294 | train loss 3.759162 | norm 0.1595 | lr 1.19e-03 | (9890.95 ms | 53007 tok/s) step 1662/76294 | train loss 3.718464 | norm 0.1744 | lr 1.19e-03 | (9894.21 ms | 52989 tok/s) step 1663/76294 | train loss 3.743518 | norm 0.1708 | lr 1.19e-03 | (9897.80 ms | 52970 tok/s) step 1664/76294 | train loss 3.712977 | norm 0.1782 | lr 1.19e-03 | (9891.91 ms | 53002 tok/s) step 1665/76294 | train loss 3.741613 | norm 0.1729 | lr 1.19e-03 | (9892.36 ms | 52999 tok/s) step 1666/76294 | train loss 3.759496 | norm 0.1677 | lr 1.19e-03 | (9921.02 ms | 52846 tok/s) step 1667/76294 | train loss 
3.691478 | norm 0.1675 | lr 1.19e-03 | (9893.84 ms | 52991 tok/s) step 1668/76294 | train loss 3.758034 | norm 0.1535 | lr 1.19e-03 | (9921.18 ms | 52845 tok/s) step 1669/76294 | train loss 3.784002 | norm 0.1781 | lr 1.19e-03 | (9926.81 ms | 52815 tok/s) step 1670/76294 | train loss 3.737714 | norm 0.1469 | lr 1.19e-03 | (9927.18 ms | 52813 tok/s) step 1671/76294 | train loss 3.750658 | norm 0.1682 | lr 1.19e-03 | (9956.81 ms | 52656 tok/s) step 1672/76294 | train loss 3.821257 | norm 0.1694 | lr 1.19e-03 | (9888.94 ms | 53018 tok/s) step 1673/76294 | train loss 3.674373 | norm 0.1643 | lr 1.19e-03 | (9962.90 ms | 52624 tok/s) step 1674/76294 | train loss 3.663020 | norm 0.1802 | lr 1.19e-03 | (9895.55 ms | 52982 tok/s) step 1675/76294 | train loss 3.790628 | norm 0.1756 | lr 1.19e-03 | (9900.16 ms | 52958 tok/s) step 1676/76294 | train loss 3.693283 | norm 0.1694 | lr 1.19e-03 | (9938.70 ms | 52752 tok/s) step 1677/76294 | train loss 3.701673 | norm 0.1598 | lr 1.19e-03 | (9893.85 ms | 52991 tok/s) step 1678/76294 | train loss 3.747066 | norm 0.1804 | lr 1.19e-03 | (9898.89 ms | 52964 tok/s) step 1679/76294 | train loss 3.720355 | norm 0.2142 | lr 1.19e-03 | (9895.02 ms | 52985 tok/s) step 1680/76294 | train loss 3.715824 | norm 0.1838 | lr 1.19e-03 | (9903.18 ms | 52941 tok/s) step 1681/76294 | train loss 3.730656 | norm 0.1729 | lr 1.19e-03 | (9908.27 ms | 52914 tok/s) step 1682/76294 | train loss 3.691914 | norm 0.1861 | lr 1.19e-03 | (9939.23 ms | 52749 tok/s) step 1683/76294 | train loss 3.730054 | norm 0.1776 | lr 1.19e-03 | (9894.94 ms | 52985 tok/s) step 1684/76294 | train loss 3.690187 | norm 0.1655 | lr 1.19e-03 | (9895.07 ms | 52985 tok/s) step 1685/76294 | train loss 3.732650 | norm 0.1860 | lr 1.19e-03 | (9901.58 ms | 52950 tok/s) step 1686/76294 | train loss 3.752513 | norm 0.1735 | lr 1.19e-03 | (9939.55 ms | 52748 tok/s) step 1687/76294 | train loss 3.720593 | norm 0.1829 | lr 1.19e-03 | (9895.99 ms | 52980 tok/s) step 1688/76294 | train loss 
3.691689 | norm 0.2007 | lr 1.19e-03 | (9904.37 ms | 52935 tok/s) step 1689/76294 | train loss 3.717174 | norm 0.1965 | lr 1.19e-03 | (9897.46 ms | 52972 tok/s) step 1690/76294 | train loss 3.738871 | norm 0.1693 | lr 1.19e-03 | (9941.35 ms | 52738 tok/s) step 1691/76294 | train loss 3.696967 | norm 0.2038 | lr 1.19e-03 | (9895.55 ms | 52982 tok/s) step 1692/76294 | train loss 3.753653 | norm 0.1798 | lr 1.19e-03 | (9899.43 ms | 52961 tok/s) step 1693/76294 | train loss 3.725214 | norm 0.1846 | lr 1.19e-03 | (9898.83 ms | 52965 tok/s) step 1694/76294 | train loss 3.729913 | norm 0.1952 | lr 1.19e-03 | (9929.27 ms | 52802 tok/s) step 1695/76294 | train loss 3.718144 | norm 0.1623 | lr 1.19e-03 | (9898.71 ms | 52965 tok/s) step 1696/76294 | train loss 3.767423 | norm 0.1796 | lr 1.19e-03 | (9898.26 ms | 52968 tok/s) step 1697/76294 | train loss 3.701655 | norm 0.1785 | lr 1.19e-03 | (9900.27 ms | 52957 tok/s) step 1698/76294 | train loss 3.686082 | norm 0.1720 | lr 1.19e-03 | (9900.42 ms | 52956 tok/s) step 1699/76294 | train loss 3.773249 | norm 0.1825 | lr 1.19e-03 | (9898.89 ms | 52964 tok/s) step 1700/76294 | train loss 3.710698 | norm 0.1597 | lr 1.19e-03 | (9900.86 ms | 52954 tok/s) step 1701/76294 | train loss 3.682619 | norm 0.1851 | lr 1.19e-03 | (9899.33 ms | 52962 tok/s) step 1702/76294 | train loss 3.744301 | norm 0.1943 | lr 1.19e-03 | (9909.99 ms | 52905 tok/s) step 1703/76294 | train loss 3.751893 | norm 0.1698 | lr 1.19e-03 | (9911.49 ms | 52897 tok/s) step 1704/76294 | train loss 3.713697 | norm 0.1788 | lr 1.19e-03 | (9981.40 ms | 52526 tok/s) step 1705/76294 | train loss 3.720821 | norm 0.1696 | lr 1.19e-03 | (9901.49 ms | 52950 tok/s) step 1706/76294 | train loss 3.835296 | norm 0.1799 | lr 1.19e-03 | (9906.88 ms | 52922 tok/s) step 1707/76294 | train loss 3.721749 | norm 0.2048 | lr 1.19e-03 | (9938.29 ms | 52754 tok/s) step 1708/76294 | train loss 3.649726 | norm 0.1917 | lr 1.19e-03 | (9893.89 ms | 52991 tok/s) step 1709/76294 | train loss 
3.726958 | norm 0.2039 | lr 1.19e-03 | (9902.46 ms | 52945 tok/s) step 1710/76294 | train loss 3.662152 | norm 0.2293 | lr 1.19e-03 | (9891.60 ms | 53003 tok/s) step 1711/76294 | train loss 3.717663 | norm 0.2636 | lr 1.19e-03 | (9893.01 ms | 52996 tok/s) step 1712/76294 | train loss 3.756061 | norm 0.3438 | lr 1.19e-03 | (9893.34 ms | 52994 tok/s) step 1713/76294 | train loss 3.921021 | norm 0.3079 | lr 1.19e-03 | (9919.10 ms | 52856 tok/s) step 1714/76294 | train loss 3.692369 | norm 0.2362 | lr 1.19e-03 | (9891.31 ms | 53005 tok/s) step 1715/76294 | train loss 3.699846 | norm 0.2189 | lr 1.19e-03 | (9911.25 ms | 52898 tok/s) step 1716/76294 | train loss 3.818057 | norm 0.1965 | lr 1.19e-03 | (9940.21 ms | 52744 tok/s) step 1717/76294 | train loss 3.707886 | norm 0.1897 | lr 1.19e-03 | (10737.21 ms | 48829 tok/s) step 1718/76294 | train loss 3.696654 | norm 0.2153 | lr 1.19e-03 | (9885.92 ms | 53034 tok/s) step 1719/76294 | train loss 3.709142 | norm 0.2047 | lr 1.19e-03 | (9901.70 ms | 52949 tok/s) step 1720/76294 | train loss 3.700739 | norm 0.2104 | lr 1.19e-03 | (9920.03 ms | 52851 tok/s) step 1721/76294 | train loss 3.710313 | norm 0.2175 | lr 1.19e-03 | (9899.38 ms | 52962 tok/s) step 1722/76294 | train loss 3.709631 | norm 0.2512 | lr 1.19e-03 | (9898.83 ms | 52965 tok/s) step 1723/76294 | train loss 3.685102 | norm 0.1991 | lr 1.19e-03 | (9893.28 ms | 52994 tok/s) step 1724/76294 | train loss 3.760135 | norm 0.1851 | lr 1.19e-03 | (9902.88 ms | 52943 tok/s) step 1725/76294 | train loss 3.651692 | norm 0.1781 | lr 1.19e-03 | (9897.09 ms | 52974 tok/s) step 1726/76294 | train loss 3.725818 | norm 0.1918 | lr 1.19e-03 | (9895.73 ms | 52981 tok/s) step 1727/76294 | train loss 3.700889 | norm 0.1816 | lr 1.19e-03 | (9898.78 ms | 52965 tok/s) step 1728/76294 | train loss 3.693586 | norm 0.1633 | lr 1.19e-03 | (9900.56 ms | 52955 tok/s) step 1729/76294 | train loss 3.750000 | norm 0.1618 | lr 1.19e-03 | (9892.97 ms | 52996 tok/s) step 1730/76294 | train loss 
3.673839 | norm 0.2079 | lr 1.19e-03 | (9935.24 ms | 52771 tok/s) step 1731/76294 | train loss 3.739579 | norm 0.2147 | lr 1.19e-03 | (9894.72 ms | 52987 tok/s) step 1732/76294 | train loss 3.730244 | norm 0.2006 | lr 1.19e-03 | (9897.06 ms | 52974 tok/s) step 1733/76294 | train loss 3.673712 | norm 0.1777 | lr 1.19e-03 | (9897.97 ms | 52969 tok/s) step 1734/76294 | train loss 3.636475 | norm 0.1689 | lr 1.19e-03 | (9900.82 ms | 52954 tok/s) step 1735/76294 | train loss 3.736837 | norm 0.1687 | lr 1.19e-03 | (9899.63 ms | 52960 tok/s) step 1736/76294 | train loss 3.685422 | norm 0.1631 | lr 1.19e-03 | (9897.45 ms | 52972 tok/s) step 1737/76294 | train loss 3.709610 | norm 0.1805 | lr 1.19e-03 | (9894.93 ms | 52986 tok/s) step 1738/76294 | train loss 3.653398 | norm 0.1912 | lr 1.19e-03 | (9898.63 ms | 52966 tok/s) step 1739/76294 | train loss 3.791061 | norm 0.1665 | lr 1.19e-03 | (9897.80 ms | 52970 tok/s) step 1740/76294 | train loss 3.658679 | norm 0.1847 | lr 1.19e-03 | (9899.66 ms | 52960 tok/s) step 1741/76294 | train loss 3.673548 | norm 0.1805 | lr 1.19e-03 | (9902.53 ms | 52945 tok/s) step 1742/76294 | train loss 3.653311 | norm 0.1741 | lr 1.19e-03 | (10264.54 ms | 51078 tok/s) step 1743/76294 | train loss 3.665251 | norm 0.1759 | lr 1.19e-03 | (9927.15 ms | 52814 tok/s) step 1744/76294 | train loss 3.711894 | norm 0.1814 | lr 1.19e-03 | (9892.00 ms | 53001 tok/s) step 1745/76294 | train loss 3.660555 | norm 0.1883 | lr 1.19e-03 | (9899.83 ms | 52959 tok/s) step 1746/76294 | train loss 3.702150 | norm 0.2117 | lr 1.19e-03 | (9891.75 ms | 53003 tok/s) step 1747/76294 | train loss 3.695133 | norm 0.1652 | lr 1.19e-03 | (9903.45 ms | 52940 tok/s) step 1748/76294 | train loss 3.695091 | norm 0.2085 | lr 1.19e-03 | (9914.36 ms | 52882 tok/s) step 1749/76294 | train loss 3.725230 | norm 0.2003 | lr 1.19e-03 | (9894.54 ms | 52988 tok/s) step 1750/76294 | train loss 3.608872 | norm 0.1743 | lr 1.19e-03 | (9917.77 ms | 52863 tok/s) val loss: 3.712293 saving model 
checkpoint to ./results/gpt2-350M-gqa/step_1750.pth step 1751/76294 | train loss 3.671614 | norm 0.1934 | lr 1.19e-03 | (9971.84 ms | 52577 tok/s) step 1752/76294 | train loss 3.653767 | norm 0.1873 | lr 1.19e-03 | (9880.00 ms | 53066 tok/s) step 1753/76294 | train loss 3.808505 | norm 0.1840 | lr 1.19e-03 | (9925.50 ms | 52822 tok/s) step 1754/76294 | train loss 3.697887 | norm 0.1813 | lr 1.19e-03 | (9878.55 ms | 53073 tok/s) step 1755/76294 | train loss 3.720360 | norm 0.1778 | lr 1.19e-03 | (11172.20 ms | 46928 tok/s) step 1756/76294 | train loss 3.717443 | norm 0.1887 | lr 1.19e-03 | (9892.24 ms | 53000 tok/s) step 1757/76294 | train loss 3.649206 | norm 0.1594 | lr 1.19e-03 | (9883.20 ms | 53048 tok/s) step 1758/76294 | train loss 3.722079 | norm 0.1920 | lr 1.19e-03 | (9880.15 ms | 53065 tok/s) step 1759/76294 | train loss 3.577124 | norm 0.1682 | lr 1.19e-03 | (9887.97 ms | 53023 tok/s) step 1760/76294 | train loss 3.740966 | norm 0.1896 | lr 1.19e-03 | (9881.31 ms | 53059 tok/s) step 1761/76294 | train loss 3.703621 | norm 0.1664 | lr 1.19e-03 | (9969.22 ms | 52591 tok/s) step 1762/76294 | train loss 3.701469 | norm 0.1697 | lr 1.19e-03 | (9938.24 ms | 52755 tok/s) step 1763/76294 | train loss 3.664074 | norm 0.2196 | lr 1.19e-03 | (9957.14 ms | 52654 tok/s) step 1764/76294 | train loss 3.685322 | norm 0.2082 | lr 1.19e-03 | (9898.09 ms | 52969 tok/s) step 1765/76294 | train loss 3.679721 | norm 0.1824 | lr 1.19e-03 | (9891.80 ms | 53002 tok/s) step 1766/76294 | train loss 3.676983 | norm 0.1720 | lr 1.19e-03 | (9944.51 ms | 52721 tok/s) step 1767/76294 | train loss 3.705547 | norm 0.1774 | lr 1.19e-03 | (9894.04 ms | 52990 tok/s) step 1768/76294 | train loss 3.669118 | norm 0.1864 | lr 1.19e-03 | (9932.92 ms | 52783 tok/s) step 1769/76294 | train loss 3.738413 | norm 0.1615 | lr 1.19e-03 | (9893.59 ms | 52993 tok/s) step 1770/76294 | train loss 3.650638 | norm 0.1644 | lr 1.19e-03 | (9903.29 ms | 52941 tok/s) step 1771/76294 | train loss 3.677111 | norm 
0.1638 | lr 1.19e-03 | (9893.92 ms | 52991 tok/s) step 1772/76294 | train loss 3.707371 | norm 0.1972 | lr 1.19e-03 | (9896.11 ms | 52979 tok/s) step 1773/76294 | train loss 3.707437 | norm 0.2635 | lr 1.19e-03 | (9893.87 ms | 52991 tok/s) step 1774/76294 | train loss 3.660035 | norm 0.3679 | lr 1.19e-03 | (9903.94 ms | 52937 tok/s) step 1775/76294 | train loss 3.669261 | norm 0.3314 | lr 1.19e-03 | (9902.84 ms | 52943 tok/s) step 1776/76294 | train loss 3.746416 | norm 0.2791 | lr 1.19e-03 | (9935.43 ms | 52770 tok/s) step 1777/76294 | train loss 3.637603 | norm 0.2092 | lr 1.19e-03 | (9895.15 ms | 52984 tok/s) step 1778/76294 | train loss 3.777123 | norm 0.2554 | lr 1.19e-03 | (9897.05 ms | 52974 tok/s) step 1779/76294 | train loss 3.694607 | norm 0.2524 | lr 1.19e-03 | (9897.81 ms | 52970 tok/s) step 1780/76294 | train loss 3.783753 | norm 0.2693 | lr 1.19e-03 | (9896.65 ms | 52976 tok/s) step 1781/76294 | train loss 3.773857 | norm 0.2020 | lr 1.19e-03 | (9898.21 ms | 52968 tok/s) step 1782/76294 | train loss 3.772363 | norm 0.2376 | lr 1.19e-03 | (9908.63 ms | 52912 tok/s) step 1783/76294 | train loss 3.693452 | norm 0.1944 | lr 1.19e-03 | (9895.60 ms | 52982 tok/s) step 1784/76294 | train loss 3.710031 | norm 0.2022 | lr 1.19e-03 | (9901.68 ms | 52949 tok/s) step 1785/76294 | train loss 3.652003 | norm 0.2128 | lr 1.19e-03 | (9894.49 ms | 52988 tok/s) step 1786/76294 | train loss 3.685248 | norm 0.2787 | lr 1.19e-03 | (9897.76 ms | 52970 tok/s) step 1787/76294 | train loss 3.648636 | norm 0.1596 | lr 1.19e-03 | (9918.90 ms | 52857 tok/s) step 1788/76294 | train loss 3.727831 | norm 0.1738 | lr 1.19e-03 | (9902.01 ms | 52948 tok/s) step 1789/76294 | train loss 3.672429 | norm 0.1668 | lr 1.19e-03 | (9918.85 ms | 52858 tok/s) step 1790/76294 | train loss 3.602868 | norm 0.1802 | lr 1.19e-03 | (9960.85 ms | 52635 tok/s) step 1791/76294 | train loss 3.746112 | norm 0.1746 | lr 1.19e-03 | (9907.19 ms | 52920 tok/s) step 1792/76294 | train loss 3.599947 | norm 
0.1725 | lr 1.19e-03 | (9958.44 ms | 52648 tok/s) step 1793/76294 | train loss 3.766703 | norm 0.1760 | lr 1.19e-03 | (9893.85 ms | 52991 tok/s) step 1794/76294 | train loss 3.676897 | norm 0.2150 | lr 1.19e-03 | (9973.97 ms | 52566 tok/s) step 1795/76294 | train loss 3.630800 | norm 0.2058 | lr 1.19e-03 | (9918.32 ms | 52861 tok/s) step 1796/76294 | train loss 3.707882 | norm 0.1738 | lr 1.19e-03 | (9892.23 ms | 53000 tok/s) step 1797/76294 | train loss 3.680664 | norm 0.1736 | lr 1.19e-03 | (9895.19 ms | 52984 tok/s) step 1798/76294 | train loss 3.737747 | norm 0.1649 | lr 1.19e-03 | (9898.16 ms | 52968 tok/s) step 1799/76294 | train loss 3.753030 | norm 0.1488 | lr 1.19e-03 | (9900.54 ms | 52955 tok/s) step 1800/76294 | train loss 3.721296 | norm 0.1700 | lr 1.19e-03 | (9904.59 ms | 52934 tok/s) step 1801/76294 | train loss 3.700855 | norm 0.1687 | lr 1.19e-03 | (9897.60 ms | 52971 tok/s) step 1802/76294 | train loss 3.672491 | norm 0.1573 | lr 1.19e-03 | (9893.89 ms | 52991 tok/s) step 1803/76294 | train loss 3.632121 | norm 0.1602 | lr 1.19e-03 | (9966.42 ms | 52605 tok/s) step 1804/76294 | train loss 3.618964 | norm 0.1859 | lr 1.19e-03 | (9913.60 ms | 52886 tok/s) step 1805/76294 | train loss 3.706816 | norm 0.1694 | lr 1.19e-03 | (9972.06 ms | 52576 tok/s) step 1806/76294 | train loss 3.826452 | norm 0.2174 | lr 1.19e-03 | (9897.56 ms | 52971 tok/s) step 1807/76294 | train loss 3.736953 | norm 0.2185 | lr 1.19e-03 | (9897.07 ms | 52974 tok/s) step 1808/76294 | train loss 3.640245 | norm 0.1970 | lr 1.19e-03 | (9940.76 ms | 52741 tok/s) step 1809/76294 | train loss 3.739754 | norm 0.1721 | lr 1.19e-03 | (9893.62 ms | 52993 tok/s) step 1810/76294 | train loss 3.638011 | norm 0.1958 | lr 1.19e-03 | (9900.22 ms | 52957 tok/s) step 1811/76294 | train loss 3.653158 | norm 0.1684 | lr 1.19e-03 | (9895.79 ms | 52981 tok/s) step 1812/76294 | train loss 3.645919 | norm 0.1596 | lr 1.19e-03 | (9898.73 ms | 52965 tok/s) step 1813/76294 | train loss 3.717753 | norm 
0.2031 | lr 1.19e-03 | (9897.05 ms | 52974 tok/s) step 1814/76294 | train loss 3.685658 | norm 0.1569 | lr 1.19e-03 | (9899.62 ms | 52960 tok/s) step 1815/76294 | train loss 3.571206 | norm 0.1701 | lr 1.19e-03 | (9926.98 ms | 52814 tok/s) step 1816/76294 | train loss 3.692082 | norm 0.1754 | lr 1.19e-03 | (9956.03 ms | 52660 tok/s) step 1817/76294 | train loss 3.618772 | norm 0.1916 | lr 1.19e-03 | (9898.88 ms | 52964 tok/s) step 1818/76294 | train loss 3.689890 | norm 0.1883 | lr 1.19e-03 | (9894.79 ms | 52986 tok/s) step 1819/76294 | train loss 3.575891 | norm 0.1557 | lr 1.19e-03 | (9937.51 ms | 52759 tok/s) step 1820/76294 | train loss 3.643374 | norm 0.1748 | lr 1.19e-03 | (9895.92 ms | 52980 tok/s) step 1821/76294 | train loss 3.665565 | norm 0.1721 | lr 1.19e-03 | (10547.32 ms | 49708 tok/s) step 1822/76294 | train loss 3.628326 | norm 0.1700 | lr 1.19e-03 | (9886.82 ms | 53029 tok/s) step 1823/76294 | train loss 3.700732 | norm 0.4653 | lr 1.19e-03 | (9918.85 ms | 52858 tok/s) step 1824/76294 | train loss 3.609800 | norm 0.1869 | lr 1.19e-03 | (9893.86 ms | 52991 tok/s) step 1825/76294 | train loss 3.679260 | norm 0.1577 | lr 1.19e-03 | (9892.28 ms | 53000 tok/s) step 1826/76294 | train loss 3.626951 | norm 0.1921 | lr 1.19e-03 | (9893.44 ms | 52993 tok/s) step 1827/76294 | train loss 3.698116 | norm 0.1853 | lr 1.19e-03 | (9936.10 ms | 52766 tok/s) step 1828/76294 | train loss 3.719957 | norm 0.2039 | lr 1.19e-03 | (9893.30 ms | 52994 tok/s) step 1829/76294 | train loss 3.642483 | norm 0.1667 | lr 1.19e-03 | (9904.55 ms | 52934 tok/s) step 1830/76294 | train loss 3.697612 | norm 0.1713 | lr 1.19e-03 | (9893.38 ms | 52994 tok/s) step 1831/76294 | train loss 3.624704 | norm 0.1499 | lr 1.19e-03 | (9898.72 ms | 52965 tok/s) step 1832/76294 | train loss 3.648913 | norm 0.1659 | lr 1.19e-03 | (9891.23 ms | 53005 tok/s) step 1833/76294 | train loss 3.660260 | norm 0.1768 | lr 1.19e-03 | (9900.14 ms | 52958 tok/s) step 1834/76294 | train loss 3.650554 | norm 
0.1915 | lr 1.19e-03 | (9892.52 ms | 52998 tok/s) step 1835/76294 | train loss 3.692098 | norm 0.1801 | lr 1.19e-03 | (9903.68 ms | 52939 tok/s) step 1836/76294 | train loss 3.563643 | norm 0.1998 | lr 1.19e-03 | (9896.38 ms | 52978 tok/s) step 1837/76294 | train loss 3.669502 | norm 0.1998 | lr 1.19e-03 | (9918.34 ms | 52860 tok/s) step 1838/76294 | train loss 3.663667 | norm 0.1528 | lr 1.19e-03 | (9897.97 ms | 52969 tok/s) step 1839/76294 | train loss 3.616024 | norm 0.1906 | lr 1.19e-03 | (9960.34 ms | 52638 tok/s) step 1840/76294 | train loss 3.689729 | norm 0.1910 | lr 1.19e-03 | (9904.43 ms | 52935 tok/s) step 1841/76294 | train loss 3.648091 | norm 0.2061 | lr 1.19e-03 | (9905.09 ms | 52931 tok/s) step 1842/76294 | train loss 3.784066 | norm 0.2122 | lr 1.19e-03 | (9894.75 ms | 52987 tok/s) step 1843/76294 | train loss 3.603213 | norm 0.2089 | lr 1.19e-03 | (9902.44 ms | 52945 tok/s) step 1844/76294 | train loss 3.659866 | norm 0.1967 | lr 1.19e-03 | (9893.26 ms | 52994 tok/s) step 1845/76294 | train loss 3.585981 | norm 0.1721 | lr 1.19e-03 | (9901.72 ms | 52949 tok/s) step 1846/76294 | train loss 3.639815 | norm 0.1905 | lr 1.19e-03 | (9899.30 ms | 52962 tok/s) step 1847/76294 | train loss 3.681542 | norm 0.2008 | lr 1.19e-03 | (9941.25 ms | 52739 tok/s) step 1848/76294 | train loss 3.604799 | norm 0.1938 | lr 1.19e-03 | (9892.98 ms | 52996 tok/s) step 1849/76294 | train loss 3.675890 | norm 0.2027 | lr 1.19e-03 | (9912.68 ms | 52891 tok/s) step 1850/76294 | train loss 3.632604 | norm 0.1841 | lr 1.19e-03 | (9959.38 ms | 52643 tok/s) step 1851/76294 | train loss 3.748049 | norm 0.2195 | lr 1.19e-03 | (9894.96 ms | 52985 tok/s) step 1852/76294 | train loss 3.619501 | norm 0.2635 | lr 1.19e-03 | (9956.59 ms | 52657 tok/s) step 1853/76294 | train loss 3.687049 | norm 0.1939 | lr 1.19e-03 | (12949.46 ms | 40487 tok/s) step 1854/76294 | train loss 3.621229 | norm 0.1765 | lr 1.19e-03 | (9886.50 ms | 53031 tok/s) step 1855/76294 | train loss 3.620351 | norm 
0.1530 | lr 1.19e-03 | (9930.46 ms | 52796 tok/s) step 1856/76294 | train loss 3.731797 | norm 0.1692 | lr 1.19e-03 | (9888.35 ms | 53021 tok/s) step 1857/76294 | train loss 3.660418 | norm 0.1722 | lr 1.19e-03 | (9925.63 ms | 52822 tok/s) step 1858/76294 | train loss 3.659075 | norm 0.1755 | lr 1.19e-03 | (9897.37 ms | 52972 tok/s) step 1859/76294 | train loss 3.609827 | norm 0.1618 | lr 1.19e-03 | (9917.17 ms | 52867 tok/s) step 1860/76294 | train loss 3.657556 | norm 0.1675 | lr 1.19e-03 | (9901.45 ms | 52951 tok/s) step 1861/76294 | train loss 3.671340 | norm 0.1626 | lr 1.19e-03 | (9918.94 ms | 52857 tok/s) step 1862/76294 | train loss 3.611989 | norm 0.1668 | lr 1.19e-03 | (9920.56 ms | 52849 tok/s) step 1863/76294 | train loss 3.722300 | norm 0.1837 | lr 1.19e-03 | (9909.47 ms | 52908 tok/s) step 1864/76294 | train loss 3.608126 | norm 0.1797 | lr 1.19e-03 | (9897.70 ms | 52971 tok/s) step 1865/76294 | train loss 3.773520 | norm 0.1935 | lr 1.19e-03 | (9914.80 ms | 52879 tok/s) step 1866/76294 | train loss 3.626138 | norm 0.1768 | lr 1.19e-03 | (9893.33 ms | 52994 tok/s) step 1867/76294 | train loss 3.664155 | norm 0.1795 | lr 1.19e-03 | (9906.33 ms | 52925 tok/s) step 1868/76294 | train loss 3.688113 | norm 0.1977 | lr 1.19e-03 | (9898.72 ms | 52965 tok/s) step 1869/76294 | train loss 3.645146 | norm 0.1990 | lr 1.19e-03 | (9902.94 ms | 52943 tok/s) step 1870/76294 | train loss 3.728511 | norm 0.1748 | lr 1.19e-03 | (9893.69 ms | 52992 tok/s) step 1871/76294 | train loss 3.596388 | norm 0.1657 | lr 1.19e-03 | (9901.57 ms | 52950 tok/s) step 1872/76294 | train loss 3.653850 | norm 0.1877 | lr 1.19e-03 | (9889.54 ms | 53014 tok/s) step 1873/76294 | train loss 3.637421 | norm 0.1783 | lr 1.19e-03 | (9906.18 ms | 52925 tok/s) step 1874/76294 | train loss 3.636322 | norm 0.1845 | lr 1.19e-03 | (9893.70 ms | 52992 tok/s) step 1875/76294 | train loss 3.627547 | norm 0.1697 | lr 1.19e-03 | (9972.51 ms | 52573 tok/s) step 1876/76294 | train loss 3.594001 | norm 
0.1870 | lr 1.19e-03 | (9895.88 ms | 52980 tok/s) step 1877/76294 | train loss 3.680906 | norm 0.2102 | lr 1.19e-03 | (9937.43 ms | 52759 tok/s) step 1878/76294 | train loss 3.605180 | norm 0.2422 | lr 1.19e-03 | (9891.74 ms | 53003 tok/s) step 1879/76294 | train loss 3.684796 | norm 0.3107 | lr 1.19e-03 | (9904.14 ms | 52936 tok/s) step 1880/76294 | train loss 3.638115 | norm 0.2057 | lr 1.19e-03 | (9894.13 ms | 52990 tok/s) step 1881/76294 | train loss 3.694427 | norm 0.2183 | lr 1.19e-03 | (9900.07 ms | 52958 tok/s) step 1882/76294 | train loss 3.631264 | norm 0.1884 | lr 1.19e-03 | (9894.90 ms | 52986 tok/s) step 1883/76294 | train loss 3.656426 | norm 0.1728 | lr 1.19e-03 | (9899.00 ms | 52964 tok/s) step 1884/76294 | train loss 3.693625 | norm 0.1673 | lr 1.19e-03 | (9890.58 ms | 53009 tok/s) step 1885/76294 | train loss 3.643860 | norm 0.1434 | lr 1.19e-03 | (9901.67 ms | 52949 tok/s) step 1886/76294 | train loss 3.652874 | norm 0.1609 | lr 1.19e-03 | (9891.71 ms | 53003 tok/s) step 1887/76294 | train loss 3.688315 | norm 0.1822 | lr 1.19e-03 | (9914.72 ms | 52880 tok/s) step 1888/76294 | train loss 3.665381 | norm 0.1593 | lr 1.19e-03 | (9894.96 ms | 52985 tok/s) step 1889/76294 | train loss 3.665333 | norm 0.1826 | lr 1.19e-03 | (9896.14 ms | 52979 tok/s) step 1890/76294 | train loss 3.625285 | norm 0.1823 | lr 1.19e-03 | (9923.44 ms | 52833 tok/s) step 1891/76294 | train loss 3.697343 | norm 0.1635 | lr 1.19e-03 | (9960.93 ms | 52634 tok/s) step 1892/76294 | train loss 3.772648 | norm 0.1596 | lr 1.19e-03 | (10682.76 ms | 49078 tok/s) step 1893/76294 | train loss 3.679005 | norm 0.2008 | lr 1.19e-03 | (9888.64 ms | 53019 tok/s) step 1894/76294 | train loss 3.535896 | norm 0.1627 | lr 1.19e-03 | (9890.42 ms | 53010 tok/s) step 1895/76294 | train loss 3.656577 | norm 0.1758 | lr 1.19e-03 | (9899.25 ms | 52962 tok/s) step 1896/76294 | train loss 3.661743 | norm 0.1829 | lr 1.19e-03 | (9888.06 ms | 53022 tok/s) step 1897/76294 | train loss 3.628766 | norm 
0.1741 | lr 1.19e-03 | (9912.81 ms | 52890 tok/s) step 1898/76294 | train loss 3.724288 | norm 0.1649 | lr 1.19e-03 | (9898.38 ms | 52967 tok/s) step 1899/76294 | train loss 3.595402 | norm 0.1543 | lr 1.19e-03 | (9889.85 ms | 53013 tok/s) step 1900/76294 | train loss 3.672155 | norm 0.1538 | lr 1.19e-03 | (9942.59 ms | 52732 tok/s) step 1901/76294 | train loss 3.655036 | norm 0.1520 | lr 1.19e-03 | (9895.23 ms | 52984 tok/s) step 1902/76294 | train loss 3.605313 | norm 0.1707 | lr 1.19e-03 | (9890.87 ms | 53007 tok/s) step 1903/76294 | train loss 3.617887 | norm 0.1517 | lr 1.19e-03 | (9937.43 ms | 52759 tok/s) step 1904/76294 | train loss 3.603025 | norm 0.1847 | lr 1.19e-03 | (9889.84 ms | 53013 tok/s) step 1905/76294 | train loss 3.654890 | norm 0.1562 | lr 1.19e-03 | (9973.02 ms | 52571 tok/s) step 1906/76294 | train loss 3.579966 | norm 0.1912 | lr 1.19e-03 | (9875.26 ms | 53091 tok/s) step 1907/76294 | train loss 3.583337 | norm 0.3010 | lr 1.19e-03 | (9887.60 ms | 53025 tok/s) step 1908/76294 | train loss 3.665259 | norm 0.2194 | lr 1.19e-03 | (10166.57 ms | 51570 tok/s) step 1909/76294 | train loss 3.673877 | norm 0.2227 | lr 1.19e-03 | (9904.77 ms | 52933 tok/s) step 1910/76294 | train loss 3.629025 | norm 0.2405 | lr 1.19e-03 | (9903.87 ms | 52938 tok/s) step 1911/76294 | train loss 3.681154 | norm 0.1754 | lr 1.19e-03 | (9939.38 ms | 52749 tok/s) step 1912/76294 | train loss 3.620393 | norm 0.1860 | lr 1.19e-03 | (9903.85 ms | 52938 tok/s) step 1913/76294 | train loss 3.706202 | norm 0.1755 | lr 1.19e-03 | (9900.55 ms | 52955 tok/s) step 1914/76294 | train loss 3.679047 | norm 0.2056 | lr 1.19e-03 | (9891.37 ms | 53005 tok/s) step 1915/76294 | train loss 3.713341 | norm 0.1828 | lr 1.19e-03 | (9897.29 ms | 52973 tok/s) step 1916/76294 | train loss 3.689629 | norm 0.1831 | lr 1.19e-03 | (9915.89 ms | 52874 tok/s) step 1917/76294 | train loss 3.741860 | norm 0.1723 | lr 1.19e-03 | (9894.23 ms | 52989 tok/s) step 1918/76294 | train loss 3.673907 | norm 
0.1745 | lr 1.19e-03 | (9893.58 ms | 52993 tok/s) step 1919/76294 | train loss 3.714293 | norm 0.1737 | lr 1.19e-03 | (9919.68 ms | 52853 tok/s) step 1920/76294 | train loss 3.748657 | norm 0.2051 | lr 1.19e-03 | (9899.11 ms | 52963 tok/s) step 1921/76294 | train loss 3.702904 | norm 0.2053 | lr 1.19e-03 | (9925.57 ms | 52822 tok/s) step 1922/76294 | train loss 3.694048 | norm 1.3971 | lr 1.19e-03 | (9894.30 ms | 52989 tok/s) step 1923/76294 | train loss 3.633181 | norm 0.2067 | lr 1.19e-03 | (9903.57 ms | 52939 tok/s) step 1924/76294 | train loss 3.656078 | norm 0.2262 | lr 1.19e-03 | (9894.41 ms | 52988 tok/s) step 1925/76294 | train loss 3.665718 | norm 0.1906 | lr 1.19e-03 | (9899.22 ms | 52963 tok/s) step 1926/76294 | train loss 3.720549 | norm 0.1797 | lr 1.19e-03 | (9894.38 ms | 52988 tok/s) step 1927/76294 | train loss 3.652162 | norm 0.1977 | lr 1.19e-03 | (9930.99 ms | 52793 tok/s) step 1928/76294 | train loss 3.751057 | norm 0.7268 | lr 1.19e-03 | (9891.21 ms | 53005 tok/s) step 1929/76294 | train loss 3.660361 | norm 0.1965 | lr 1.19e-03 | (9933.45 ms | 52780 tok/s) step 1930/76294 | train loss 3.620222 | norm 0.1849 | lr 1.19e-03 | (9896.32 ms | 52978 tok/s) step 1931/76294 | train loss 3.747689 | norm 0.2218 | lr 1.19e-03 | (9943.96 ms | 52724 tok/s) step 1932/76294 | train loss 3.613860 | norm 0.2027 | lr 1.19e-03 | (9893.30 ms | 52994 tok/s) step 1933/76294 | train loss 3.705935 | norm 2.0700 | lr 1.19e-03 | (9909.99 ms | 52905 tok/s) step 1934/76294 | train loss 3.642278 | norm 0.2163 | lr 1.19e-03 | (9958.38 ms | 52648 tok/s) step 1935/76294 | train loss 3.760458 | norm 0.3097 | lr 1.19e-03 | (9896.10 ms | 52979 tok/s) step 1936/76294 | train loss 3.717158 | norm 0.2772 | lr 1.19e-03 | (9904.52 ms | 52934 tok/s) step 1937/76294 | train loss 3.651129 | norm 0.3520 | lr 1.19e-03 | (9964.43 ms | 52616 tok/s) step 1938/76294 | train loss 3.687407 | norm 0.5405 | lr 1.19e-03 | (9897.28 ms | 52973 tok/s) step 1939/76294 | train loss 3.746609 | norm 
0.3186 | lr 1.19e-03 | (9898.88 ms | 52964 tok/s) step 1940/76294 | train loss 3.673073 | norm 0.2951 | lr 1.19e-03 | (9940.71 ms | 52742 tok/s) step 1941/76294 | train loss 3.698248 | norm 0.2528 | lr 1.19e-03 | (9897.39 ms | 52972 tok/s) step 1942/76294 | train loss 3.659908 | norm 0.2227 | lr 1.19e-03 | (9900.62 ms | 52955 tok/s) step 1943/76294 | train loss 3.633755 | norm 0.1896 | lr 1.19e-03 | (9898.12 ms | 52968 tok/s) step 1944/76294 | train loss 3.731535 | norm 0.1954 | lr 1.19e-03 | (9902.94 ms | 52943 tok/s) step 1945/76294 | train loss 3.666860 | norm 0.1988 | lr 1.19e-03 | (9898.60 ms | 52966 tok/s) step 1946/76294 | train loss 3.671229 | norm 0.1668 | lr 1.19e-03 | (9905.01 ms | 52932 tok/s) step 1947/76294 | train loss 3.600747 | norm 0.1836 | lr 1.19e-03 | (9933.88 ms | 52778 tok/s) step 1948/76294 | train loss 3.674051 | norm 0.1706 | lr 1.19e-03 | (9899.46 ms | 52961 tok/s) step 1949/76294 | train loss 3.685321 | norm 0.1491 | lr 1.19e-03 | (9909.15 ms | 52909 tok/s) step 1950/76294 | train loss 3.697942 | norm 0.1611 | lr 1.19e-03 | (9933.15 ms | 52782 tok/s) step 1951/76294 | train loss 3.700889 | norm 0.1636 | lr 1.19e-03 | (11078.26 ms | 47326 tok/s) step 1952/76294 | train loss 3.689610 | norm 0.1617 | lr 1.19e-03 | (9888.16 ms | 53022 tok/s) step 1953/76294 | train loss 3.633568 | norm 0.1443 | lr 1.19e-03 | (9898.54 ms | 52966 tok/s) step 1954/76294 | train loss 3.677570 | norm 0.1560 | lr 1.19e-03 | (9885.80 ms | 53034 tok/s) step 1955/76294 | train loss 3.677379 | norm 0.2426 | lr 1.19e-03 | (9898.28 ms | 52968 tok/s) step 1956/76294 | train loss 3.652744 | norm 0.1875 | lr 1.19e-03 | (9891.16 ms | 53006 tok/s) step 1957/76294 | train loss 3.735442 | norm 0.1584 | lr 1.19e-03 | (9902.21 ms | 52947 tok/s) step 1958/76294 | train loss 3.582699 | norm 0.1875 | lr 1.19e-03 | (9913.37 ms | 52887 tok/s) step 1959/76294 | train loss 3.625347 | norm 0.1855 | lr 1.19e-03 | (13038.91 ms | 40209 tok/s) step 1960/76294 | train loss 3.641846 | norm 
0.1535 | lr 1.19e-03 | (9865.11 ms | 53146 tok/s) step 1961/76294 | train loss 3.651103 | norm 0.1619 | lr 1.19e-03 | (11411.81 ms | 45943 tok/s) step 1962/76294 | train loss 3.650315 | norm 0.1791 | lr 1.19e-03 | (9866.03 ms | 53141 tok/s) step 1963/76294 | train loss 3.679008 | norm 0.1593 | lr 1.19e-03 | (9946.88 ms | 52709 tok/s) step 1964/76294 | train loss 3.646389 | norm 0.1609 | lr 1.19e-03 | (9876.44 ms | 53085 tok/s) step 1965/76294 | train loss 3.581685 | norm 0.1654 | lr 1.19e-03 | (9925.00 ms | 52825 tok/s) step 1966/76294 | train loss 3.667960 | norm 0.3050 | lr 1.19e-03 | (9911.44 ms | 52897 tok/s) step 1967/76294 | train loss 3.621609 | norm 0.1941 | lr 1.19e-03 | (9908.69 ms | 52912 tok/s) step 1968/76294 | train loss 3.637924 | norm 0.2098 | lr 1.19e-03 | (9885.53 ms | 53036 tok/s) step 1969/76294 | train loss 3.651966 | norm 0.1688 | lr 1.19e-03 | (9886.72 ms | 53029 tok/s) step 1970/76294 | train loss 3.671040 | norm 0.1614 | lr 1.19e-03 | (9939.16 ms | 52750 tok/s) step 1971/76294 | train loss 3.576559 | norm 0.1759 | lr 1.19e-03 | (9897.45 ms | 52972 tok/s) step 1972/76294 | train loss 3.658629 | norm 0.1615 | lr 1.19e-03 | (9888.66 ms | 53019 tok/s) step 1973/76294 | train loss 3.690737 | norm 0.1886 | lr 1.19e-03 | (9897.90 ms | 52970 tok/s) step 1974/76294 | train loss 3.628361 | norm 0.1695 | lr 1.19e-03 | (9890.50 ms | 53009 tok/s) step 1975/76294 | train loss 3.791422 | norm 0.1432 | lr 1.19e-03 | (9900.39 ms | 52956 tok/s) step 1976/76294 | train loss 3.654996 | norm 0.2065 | lr 1.19e-03 | (9955.80 ms | 52662 tok/s) step 1977/76294 | train loss 3.644630 | norm 0.1836 | lr 1.19e-03 | (9906.96 ms | 52921 tok/s) step 1978/76294 | train loss 3.663697 | norm 0.1845 | lr 1.19e-03 | (9919.53 ms | 52854 tok/s) step 1979/76294 | train loss 3.672241 | norm 0.2035 | lr 1.19e-03 | (9905.71 ms | 52928 tok/s) step 1980/76294 | train loss 3.640364 | norm 0.1737 | lr 1.19e-03 | (9910.47 ms | 52902 tok/s) step 1981/76294 | train loss 3.670228 | norm 
0.1715 | lr 1.19e-03 | (9914.72 ms | 52880 tok/s) step 1982/76294 | train loss 3.681647 | norm 0.1595 | lr 1.19e-03 | (9928.68 ms | 52805 tok/s) step 1983/76294 | train loss 3.725586 | norm 0.1562 | lr 1.19e-03 | (10114.48 ms | 51835 tok/s) step 1984/76294 | train loss 3.789119 | norm 0.1548 | lr 1.19e-03 | (9900.25 ms | 52957 tok/s) step 1985/76294 | train loss 3.629384 | norm 0.1654 | lr 1.19e-03 | (9957.03 ms | 52655 tok/s) step 1986/76294 | train loss 3.601501 | norm 0.1579 | lr 1.19e-03 | (9894.25 ms | 52989 tok/s) step 1987/76294 | train loss 3.788886 | norm 0.1818 | lr 1.19e-03 | (9904.63 ms | 52934 tok/s) step 1988/76294 | train loss 3.624457 | norm 0.1601 | lr 1.19e-03 | (9894.40 ms | 52988 tok/s) step 1989/76294 | train loss 3.654616 | norm 0.1672 | lr 1.19e-03 | (9909.29 ms | 52909 tok/s) step 1990/76294 | train loss 3.721102 | norm 0.2217 | lr 1.19e-03 | (9898.39 ms | 52967 tok/s) step 1991/76294 | train loss 3.642790 | norm 0.2239 | lr 1.19e-03 | (9901.78 ms | 52949 tok/s) step 1992/76294 | train loss 3.632756 | norm 0.1847 | lr 1.19e-03 | (9898.44 ms | 52967 tok/s) step 1993/76294 | train loss 3.717705 | norm 0.2265 | lr 1.19e-03 | (9912.09 ms | 52894 tok/s) step 1994/76294 | train loss 3.662855 | norm 0.2047 | lr 1.19e-03 | (9898.69 ms | 52965 tok/s) step 1995/76294 | train loss 3.706348 | norm 0.1886 | lr 1.19e-03 | (9941.54 ms | 52737 tok/s) step 1996/76294 | train loss 3.628682 | norm 0.2101 | lr 1.19e-03 | (9896.07 ms | 52979 tok/s) step 1997/76294 | train loss 3.676298 | norm 0.1791 | lr 1.19e-03 | (9905.41 ms | 52929 tok/s) step 1998/76294 | train loss 3.669562 | norm 0.1680 | lr 1.19e-03 | (9894.29 ms | 52989 tok/s) step 1999/76294 | train loss 3.638134 | norm 0.1749 | lr 1.19e-03 | (9940.00 ms | 52745 tok/s) step 2000/76294 | train loss 3.751380 | norm 0.1853 | lr 1.19e-03 | (9895.24 ms | 52984 tok/s) val loss: 3.651815 saving model checkpoint to ./results/gpt2-350M-gqa/step_2000.pth step 2001/76294 | train loss 3.597297 | norm 0.1909 | lr 
1.19e-03 | (9996.80 ms | 52446 tok/s) step 2002/76294 | train loss 3.659423 | norm 0.1753 | lr 1.19e-03 | (9876.78 ms | 53083 tok/s) step 2003/76294 | train loss 3.787827 | norm 0.1746 | lr 1.19e-03 | (9901.09 ms | 52953 tok/s) step 2004/76294 | train loss 3.699352 | norm 0.1815 | lr 1.19e-03 | (9878.56 ms | 53073 tok/s) step 2005/76294 | train loss 3.616355 | norm 0.1777 | lr 1.19e-03 | (9886.28 ms | 53032 tok/s) step 2006/76294 | train loss 3.664366 | norm 0.1815 | lr 1.19e-03 | (9877.90 ms | 53077 tok/s) step 2007/76294 | train loss 3.691607 | norm 0.1543 | lr 1.19e-03 | (9889.00 ms | 53017 tok/s) step 2008/76294 | train loss 3.678746 | norm 0.1521 | lr 1.19e-03 | (9925.43 ms | 52823 tok/s) step 2009/76294 | train loss 3.653779 | norm 0.1527 | lr 1.19e-03 | (9889.25 ms | 53016 tok/s) step 2010/76294 | train loss 3.677061 | norm 0.1630 | lr 1.19e-03 | (9949.85 ms | 52693 tok/s) step 2011/76294 | train loss 3.624212 | norm 0.1674 | lr 1.19e-03 | (9955.15 ms | 52665 tok/s) step 2012/76294 | train loss 3.586465 | norm 0.1547 | lr 1.19e-03 | (9890.86 ms | 53007 tok/s) step 2013/76294 | train loss 3.662057 | norm 0.1677 | lr 1.19e-03 | (9898.51 ms | 52966 tok/s) step 2014/76294 | train loss 3.607593 | norm 0.1737 | lr 1.19e-03 | (9950.89 ms | 52688 tok/s) step 2015/76294 | train loss 3.628056 | norm 0.1523 | lr 1.19e-03 | (9895.96 ms | 52980 tok/s) step 2016/76294 | train loss 3.624267 | norm 0.1456 | lr 1.19e-03 | (9902.32 ms | 52946 tok/s) step 2017/76294 | train loss 3.599569 | norm 0.1574 | lr 1.19e-03 | (9898.59 ms | 52966 tok/s) step 2018/76294 | train loss 3.609077 | norm 0.1506 | lr 1.19e-03 | (9933.95 ms | 52777 tok/s) step 2019/76294 | train loss 3.570705 | norm 0.1633 | lr 1.19e-03 | (9926.87 ms | 52815 tok/s) step 2020/76294 | train loss 3.636214 | norm 0.1625 | lr 1.19e-03 | (9896.76 ms | 52976 tok/s) step 2021/76294 | train loss 3.643468 | norm 0.1776 | lr 1.19e-03 | (9938.86 ms | 52751 tok/s) step 2022/76294 | train loss 3.681064 | norm 0.2028 | lr 
1.19e-03 | (9894.57 ms | 52987 tok/s) step 2023/76294 | train loss 3.634727 | norm 0.1933 | lr 1.19e-03 | (9898.62 ms | 52966 tok/s) step 2024/76294 | train loss 3.668793 | norm 0.1677 | lr 1.19e-03 | (9905.20 ms | 52931 tok/s) step 2025/76294 | train loss 3.588804 | norm 0.1880 | lr 1.19e-03 | (9894.97 ms | 52985 tok/s) step 2026/76294 | train loss 3.649544 | norm 0.2097 | lr 1.19e-03 | (9921.58 ms | 52843 tok/s) step 2027/76294 | train loss 3.608478 | norm 0.1727 | lr 1.19e-03 | (9897.51 ms | 52972 tok/s) step 2028/76294 | train loss 3.650183 | norm 0.2040 | lr 1.19e-03 | (9893.35 ms | 52994 tok/s) step 2029/76294 | train loss 3.636384 | norm 0.1858 | lr 1.19e-03 | (9905.02 ms | 52932 tok/s) step 2030/76294 | train loss 3.600917 | norm 0.2232 | lr 1.19e-03 | (9892.25 ms | 53000 tok/s) step 2031/76294 | train loss 3.599740 | norm 0.2731 | lr 1.19e-03 | (9901.84 ms | 52949 tok/s) step 2032/76294 | train loss 3.725222 | norm 0.3370 | lr 1.19e-03 | (9890.48 ms | 53009 tok/s) step 2033/76294 | train loss 3.693526 | norm 0.2504 | lr 1.19e-03 | (9904.10 ms | 52936 tok/s) step 2034/76294 | train loss 3.624174 | norm 0.2340 | lr 1.19e-03 | (9889.36 ms | 53015 tok/s) step 2035/76294 | train loss 3.618130 | norm 0.2086 | lr 1.19e-03 | (9915.41 ms | 52876 tok/s) step 2036/76294 | train loss 3.662072 | norm 0.2643 | lr 1.19e-03 | (9891.36 ms | 53005 tok/s) step 2037/76294 | train loss 3.693444 | norm 0.3244 | lr 1.19e-03 | (9963.53 ms | 52621 tok/s) step 2038/76294 | train loss 3.685241 | norm 0.2223 | lr 1.19e-03 | (9887.96 ms | 53023 tok/s) step 2039/76294 | train loss 3.782328 | norm 0.1971 | lr 1.19e-03 | (9909.72 ms | 52906 tok/s) step 2040/76294 | train loss 3.657234 | norm 0.2184 | lr 1.19e-03 | (9890.77 ms | 53008 tok/s) step 2041/76294 | train loss 3.634052 | norm 0.1611 | lr 1.19e-03 | (9921.41 ms | 52844 tok/s) step 2042/76294 | train loss 3.652779 | norm 0.1884 | lr 1.19e-03 | (9891.37 ms | 53005 tok/s) step 2043/76294 | train loss 3.709441 | norm 0.1769 | lr 
1.19e-03 | (9905.61 ms | 52928 tok/s) step 2044/76294 | train loss 3.658304 | norm 0.1633 | lr 1.19e-03 | (9886.88 ms | 53029 tok/s) step 2045/76294 | train loss 3.653347 | norm 0.1644 | lr 1.19e-03 | (9919.36 ms | 52855 tok/s) step 2046/76294 | train loss 3.654415 | norm 0.1683 | lr 1.19e-03 | (9886.88 ms | 53029 tok/s) step 2047/76294 | train loss 3.608302 | norm 0.1850 | lr 1.19e-03 | (9898.17 ms | 52968 tok/s) step 2048/76294 | train loss 3.781902 | norm 0.2093 | lr 1.19e-03 | (11527.05 ms | 45483 tok/s) step 2049/76294 | train loss 3.624193 | norm 0.1720 | lr 1.19e-03 | (9965.85 ms | 52608 tok/s) step 2050/76294 | train loss 3.735736 | norm 0.1696 | lr 1.19e-03 | (9872.75 ms | 53105 tok/s) step 2051/76294 | train loss 3.635171 | norm 0.1722 | lr 1.19e-03 | (9877.88 ms | 53077 tok/s) step 2052/76294 | train loss 3.605699 | norm 0.1674 | lr 1.19e-03 | (9892.52 ms | 52998 tok/s) step 2053/76294 | train loss 3.685450 | norm 0.1915 | lr 1.19e-03 | (9884.69 ms | 53040 tok/s) step 2054/76294 | train loss 3.620550 | norm 0.1772 | lr 1.19e-03 | (9875.90 ms | 53088 tok/s) step 2055/76294 | train loss 3.636217 | norm 0.1758 | lr 1.19e-03 | (9922.31 ms | 52839 tok/s) step 2056/76294 | train loss 3.656898 | norm 0.1480 | lr 1.19e-03 | (9910.43 ms | 52903 tok/s) step 2057/76294 | train loss 3.655279 | norm 0.1556 | lr 1.19e-03 | (9904.02 ms | 52937 tok/s) step 2058/76294 | train loss 3.595686 | norm 0.1500 | lr 1.19e-03 | (9873.00 ms | 53103 tok/s) step 2059/76294 | train loss 3.700614 | norm 0.1614 | lr 1.19e-03 | (9948.44 ms | 52700 tok/s) step 2060/76294 | train loss 3.723562 | norm 0.1750 | lr 1.19e-03 | (9877.56 ms | 53079 tok/s) step 2061/76294 | train loss 3.632663 | norm 0.1592 | lr 1.19e-03 | (9894.68 ms | 52987 tok/s) step 2062/76294 | train loss 3.653339 | norm 0.1550 | lr 1.19e-03 | (9901.63 ms | 52950 tok/s) step 2063/76294 | train loss 3.720189 | norm 0.1748 | lr 1.19e-03 | (9877.38 ms | 53080 tok/s) step 2064/76294 | train loss 3.579893 | norm 0.1520 | lr 
1.19e-03 | (9879.82 ms | 53067 tok/s) step 2065/76294 | train loss 3.667081 | norm 0.1660 | lr 1.19e-03 | (9878.85 ms | 53072 tok/s) step 2066/76294 | train loss 3.654369 | norm 0.1817 | lr 1.19e-03 | (10407.34 ms | 50377 tok/s) step 2067/76294 | train loss 3.648160 | norm 0.1575 | lr 1.19e-03 | (9885.64 ms | 53035 tok/s) step 2068/76294 | train loss 3.625027 | norm 0.1653 | lr 1.19e-03 | (9879.54 ms | 53068 tok/s) step 2069/76294 | train loss 3.631226 | norm 0.1559 | lr 1.19e-03 | (9899.62 ms | 52960 tok/s) step 2070/76294 | train loss 3.595863 | norm 0.1852 | lr 1.19e-03 | (9878.42 ms | 53074 tok/s) step 2071/76294 | train loss 3.617007 | norm 0.1815 | lr 1.19e-03 | (9878.10 ms | 53076 tok/s) step 2072/76294 | train loss 3.635311 | norm 0.1746 | lr 1.19e-03 | (9882.85 ms | 53050 tok/s) step 2073/76294 | train loss 3.563754 | norm 0.1790 | lr 1.19e-03 | (9891.46 ms | 53004 tok/s) step 2074/76294 | train loss 3.597342 | norm 0.1397 | lr 1.19e-03 | (9873.62 ms | 53100 tok/s) step 2075/76294 | train loss 3.650282 | norm 0.1461 | lr 1.19e-03 | (9896.55 ms | 52977 tok/s) step 2076/76294 | train loss 3.609513 | norm 0.1645 | lr 1.19e-03 | (9878.11 ms | 53076 tok/s) step 2077/76294 | train loss 3.585649 | norm 0.1475 | lr 1.19e-03 | (9878.50 ms | 53074 tok/s) step 2078/76294 | train loss 3.701074 | norm 0.1543 | lr 1.19e-03 | (9880.50 ms | 53063 tok/s) step 2079/76294 | train loss 3.638251 | norm 0.1562 | lr 1.19e-03 | (9876.93 ms | 53082 tok/s) step 2080/76294 | train loss 3.605966 | norm 0.1733 | lr 1.19e-03 | (9879.38 ms | 53069 tok/s) step 2081/76294 | train loss 3.666307 | norm 0.1588 | lr 1.19e-03 | (9940.93 ms | 52740 tok/s) step 2082/76294 | train loss 3.736201 | norm 0.1758 | lr 1.19e-03 | (9916.66 ms | 52869 tok/s) step 2083/76294 | train loss 3.708621 | norm 0.1766 | lr 1.19e-03 | (9903.05 ms | 52942 tok/s) step 2084/76294 | train loss 3.640794 | norm 0.1910 | lr 1.19e-03 | (9913.85 ms | 52884 tok/s) step 2085/76294 | train loss 3.622798 | norm 0.1527 | lr 
1.19e-03 | (9875.54 ms | 53090 tok/s) step 2086/76294 | train loss 3.611531 | norm 0.1539 | lr 1.19e-03 | (9878.88 ms | 53072 tok/s) step 2087/76294 | train loss 3.615928 | norm 0.1645 | lr 1.19e-03 | (9875.53 ms | 53090 tok/s) step 2088/76294 | train loss 3.684482 | norm 0.1526 | lr 1.19e-03 | (9873.95 ms | 53098 tok/s) step 2089/76294 | train loss 3.556182 | norm 0.1493 | lr 1.19e-03 | (9874.86 ms | 53093 tok/s) step 2090/76294 | train loss 3.640614 | norm 0.1677 | lr 1.19e-03 | (9878.61 ms | 53073 tok/s) step 2091/76294 | train loss 3.631601 | norm 0.1643 | lr 1.19e-03 | (9875.57 ms | 53089 tok/s) step 2092/76294 | train loss 3.650816 | norm 0.1779 | lr 1.19e-03 | (9910.33 ms | 52903 tok/s) step 2093/76294 | train loss 3.595929 | norm 0.1737 | lr 1.19e-03 | (9874.01 ms | 53098 tok/s) step 2094/76294 | train loss 3.582272 | norm 0.1845 | lr 1.19e-03 | (9896.88 ms | 52975 tok/s) step 2095/76294 | train loss 3.653787 | norm 0.1933 | lr 1.19e-03 | (9874.91 ms | 53093 tok/s) step 2096/76294 | train loss 3.601057 | norm 0.1663 | lr 1.19e-03 | (9875.51 ms | 53090 tok/s) step 2097/76294 | train loss 3.619102 | norm 0.1790 | lr 1.19e-03 | (9874.45 ms | 53095 tok/s) step 2098/76294 | train loss 3.615377 | norm 0.1669 | lr 1.19e-03 | (9878.06 ms | 53076 tok/s) step 2099/76294 | train loss 3.625961 | norm 0.1549 | lr 1.19e-03 | (10761.78 ms | 48718 tok/s) step 2100/76294 | train loss 3.610963 | norm 0.1585 | lr 1.19e-03 | (9860.04 ms | 53173 tok/s) step 2101/76294 | train loss 3.641785 | norm 0.1527 | lr 1.19e-03 | (9879.07 ms | 53071 tok/s) step 2102/76294 | train loss 3.588237 | norm 0.1434 | lr 1.19e-03 | (9864.94 ms | 53147 tok/s) step 2103/76294 | train loss 3.638442 | norm 0.1728 | lr 1.19e-03 | (9871.31 ms | 53112 tok/s) step 2104/76294 | train loss 3.705258 | norm 0.2238 | lr 1.19e-03 | (9872.86 ms | 53104 tok/s) step 2105/76294 | train loss 3.666172 | norm 0.2331 | lr 1.19e-03 | (9871.30 ms | 53112 tok/s) step 2106/76294 | train loss 3.655401 | norm 0.1800 | lr 
1.19e-03 | (9870.40 ms | 53117 tok/s) step 2107/76294 | train loss 3.689373 | norm 0.2129 | lr 1.19e-03 | (9878.01 ms | 53076 tok/s) step 2108/76294 | train loss 3.706878 | norm 0.1846 | lr 1.19e-03 | (9875.51 ms | 53090 tok/s) step 2109/76294 | train loss 3.608084 | norm 0.2322 | lr 1.19e-03 | (9882.79 ms | 53051 tok/s) step 2110/76294 | train loss 3.654655 | norm 0.2368 | lr 1.19e-03 | (9911.74 ms | 52896 tok/s) step 2111/76294 | train loss 3.569223 | norm 0.2401 | lr 1.19e-03 | (9935.27 ms | 52770 tok/s) step 2112/76294 | train loss 3.590089 | norm 0.1966 | lr 1.19e-03 | (9869.01 ms | 53125 tok/s) step 2113/76294 | train loss 3.604513 | norm 0.1644 | lr 1.19e-03 | (9867.93 ms | 53131 tok/s) step 2114/76294 | train loss 3.713091 | norm 0.1664 | lr 1.19e-03 | (9881.16 ms | 53059 tok/s) step 2115/76294 | train loss 3.571095 | norm 0.1626 | lr 1.19e-03 | (9877.30 ms | 53080 tok/s) step 2116/76294 | train loss 3.651078 | norm 0.1576 | lr 1.19e-03 | (9870.01 ms | 53119 tok/s) step 2117/76294 | train loss 3.596918 | norm 0.1575 | lr 1.19e-03 | (9882.14 ms | 53054 tok/s) step 2118/76294 | train loss 3.643582 | norm 0.1494 | lr 1.19e-03 | (9872.03 ms | 53108 tok/s) step 2119/76294 | train loss 3.668793 | norm 0.1508 | lr 1.19e-03 | (9892.17 ms | 53000 tok/s) step 2120/76294 | train loss 3.637753 | norm 0.1680 | lr 1.19e-03 | (9869.13 ms | 53124 tok/s) step 2121/76294 | train loss 3.693775 | norm 0.1727 | lr 1.19e-03 | (9867.92 ms | 53131 tok/s) step 2122/76294 | train loss 3.611717 | norm 0.1819 | lr 1.19e-03 | (9867.54 ms | 53133 tok/s) step 2123/76294 | train loss 3.772582 | norm 0.1951 | lr 1.19e-03 | (9884.54 ms | 53041 tok/s) step 2124/76294 | train loss 3.649377 | norm 0.2087 | lr 1.19e-03 | (9870.48 ms | 53117 tok/s) step 2125/76294 | train loss 3.648658 | norm 0.2530 | lr 1.19e-03 | (9880.26 ms | 53064 tok/s) step 2126/76294 | train loss 3.552769 | norm 0.2333 | lr 1.19e-03 | (9871.06 ms | 53114 tok/s) step 2127/76294 | train loss 3.644985 | norm 0.1970 | lr 
1.19e-03 | (9878.95 ms | 53071 tok/s) step 2128/76294 | train loss 3.649444 | norm 0.2086 | lr 1.19e-03 | (9869.19 ms | 53124 tok/s) step 2129/76294 | train loss 3.672969 | norm 0.1591 | lr 1.19e-03 | (9890.59 ms | 53009 tok/s) step 2130/76294 | train loss 3.589589 | norm 0.1528 | lr 1.19e-03 | (9868.94 ms | 53125 tok/s) step 2131/76294 | train loss 3.659125 | norm 0.1656 | lr 1.19e-03 | (9868.50 ms | 53127 tok/s) step 2132/76294 | train loss 3.633138 | norm 0.1748 | lr 1.19e-03 | (9872.37 ms | 53107 tok/s) step 2133/76294 | train loss 3.619491 | norm 0.1648 | lr 1.19e-03 | (9870.52 ms | 53117 tok/s) step 2134/76294 | train loss 3.638464 | norm 0.1605 | lr 1.18e-03 | (9869.94 ms | 53120 tok/s) step 2135/76294 | train loss 3.599938 | norm 0.1629 | lr 1.18e-03 | (9870.16 ms | 53119 tok/s) step 2136/76294 | train loss 3.674865 | norm 0.1970 | lr 1.18e-03 | (9871.20 ms | 53113 tok/s) step 2137/76294 | train loss 3.553159 | norm 0.1858 | lr 1.18e-03 | (9876.84 ms | 53083 tok/s) step 2138/76294 | train loss 3.661751 | norm 0.1642 | lr 1.18e-03 | (9871.94 ms | 53109 tok/s) step 2139/76294 | train loss 3.648920 | norm 0.1798 | lr 1.18e-03 | (9899.38 ms | 52962 tok/s) step 2140/76294 | train loss 3.629171 | norm 0.1795 | lr 1.18e-03 | (9868.02 ms | 53130 tok/s) step 2141/76294 | train loss 3.596424 | norm 0.1894 | lr 1.18e-03 | (9872.71 ms | 53105 tok/s) step 2142/76294 | train loss 3.582220 | norm 0.1564 | lr 1.18e-03 | (9867.20 ms | 53134 tok/s) step 2143/76294 | train loss 3.617284 | norm 0.1691 | lr 1.18e-03 | (9875.70 ms | 53089 tok/s) step 2144/76294 | train loss 3.581536 | norm 0.1514 | lr 1.18e-03 | (9877.32 ms | 53080 tok/s) step 2145/76294 | train loss 3.665089 | norm 0.1679 | lr 1.18e-03 | (9875.88 ms | 53088 tok/s) step 2146/76294 | train loss 3.610426 | norm 0.1742 | lr 1.18e-03 | (11372.76 ms | 46100 tok/s) step 2147/76294 | train loss 3.646144 | norm 0.1456 | lr 1.18e-03 | (9863.72 ms | 53153 tok/s) step 2148/76294 | train loss 3.618284 | norm 0.1603 | lr 
1.18e-03 | (9878.55 ms | 53073 tok/s) step 2149/76294 | train loss 3.694387 | norm 0.1617 | lr 1.18e-03 | (9866.66 ms | 53137 tok/s) step 2150/76294 | train loss 3.662130 | norm 0.1869 | lr 1.18e-03 | (9870.08 ms | 53119 tok/s) step 2151/76294 | train loss 3.638373 | norm 0.1596 | lr 1.18e-03 | (9879.07 ms | 53071 tok/s) step 2152/76294 | train loss 3.648384 | norm 0.1542 | lr 1.18e-03 | (9874.40 ms | 53096 tok/s) step 2153/76294 | train loss 3.628745 | norm 0.1588 | lr 1.18e-03 | (9867.02 ms | 53135 tok/s) step 2154/76294 | train loss 3.682561 | norm 0.1658 | lr 1.18e-03 | (9881.57 ms | 53057 tok/s) step 2155/76294 | train loss 3.589152 | norm 0.1665 | lr 1.18e-03 | (9906.35 ms | 52924 tok/s) step 2156/76294 | train loss 3.747478 | norm 0.1745 | lr 1.18e-03 | (9869.27 ms | 53123 tok/s) step 2157/76294 | train loss 3.608112 | norm 0.1974 | lr 1.18e-03 | (9876.89 ms | 53082 tok/s) step 2158/76294 | train loss 3.622555 | norm 0.2287 | lr 1.18e-03 | (9869.03 ms | 53125 tok/s) step 2159/76294 | train loss 3.599963 | norm 0.2419 | lr 1.18e-03 | (9874.79 ms | 53094 tok/s) step 2160/76294 | train loss 3.668760 | norm 0.1893 | lr 1.18e-03 | (9879.37 ms | 53069 tok/s) step 2161/76294 | train loss 3.577063 | norm 0.1989 | lr 1.18e-03 | (9872.11 ms | 53108 tok/s) step 2162/76294 | train loss 3.674567 | norm 0.1890 | lr 1.18e-03 | (9872.82 ms | 53104 tok/s) step 2163/76294 | train loss 3.624025 | norm 0.2036 | lr 1.18e-03 | (9876.89 ms | 53082 tok/s) step 2164/76294 | train loss 3.636970 | norm 0.1947 | lr 1.18e-03 | (9868.70 ms | 53126 tok/s) step 2165/76294 | train loss 3.656857 | norm 0.1773 | lr 1.18e-03 | (9905.08 ms | 52931 tok/s) step 2166/76294 | train loss 3.578056 | norm 0.2026 | lr 1.18e-03 | (9872.93 ms | 53104 tok/s) step 2167/76294 | train loss 3.637035 | norm 0.1763 | lr 1.18e-03 | (9876.44 ms | 53085 tok/s) step 2168/76294 | train loss 3.569846 | norm 0.1755 | lr 1.18e-03 | (9870.22 ms | 53118 tok/s) step 2169/76294 | train loss 3.799219 | norm 0.1774 | lr 
1.18e-03 | (9882.60 ms | 53052 tok/s) step 2170/76294 | train loss 3.624339 | norm 0.1675 | lr 1.18e-03 | (9924.14 ms | 52830 tok/s) step 2171/76294 | train loss 3.677576 | norm 0.1598 | lr 1.18e-03 | (9877.76 ms | 53078 tok/s) step 2172/76294 | train loss 3.593318 | norm 0.1509 | lr 1.18e-03 | (9869.88 ms | 53120 tok/s) step 2173/76294 | train loss 3.620539 | norm 0.1519 | lr 1.18e-03 | (9878.95 ms | 53071 tok/s) step 2174/76294 | train loss 3.584659 | norm 0.1516 | lr 1.18e-03 | (9870.12 ms | 53119 tok/s) step 2175/76294 | train loss 3.638957 | norm 0.1454 | lr 1.18e-03 | (9883.04 ms | 53049 tok/s) step 2176/76294 | train loss 3.613514 | norm 0.1606 | lr 1.18e-03 | (9871.49 ms | 53111 tok/s) step 2177/76294 | train loss 3.675081 | norm 0.1779 | lr 1.18e-03 | (9879.58 ms | 53068 tok/s) step 2178/76294 | train loss 3.678788 | norm 0.1985 | lr 1.18e-03 | (9886.15 ms | 53033 tok/s) step 2179/76294 | train loss 3.646201 | norm 0.1949 | lr 1.18e-03 | (9914.91 ms | 52879 tok/s) step 2180/76294 | train loss 3.585109 | norm 0.1508 | lr 1.18e-03 | (9877.63 ms | 53078 tok/s) step 2181/76294 | train loss 3.666751 | norm 0.1866 | lr 1.18e-03 | (9872.05 ms | 53108 tok/s) step 2182/76294 | train loss 3.627635 | norm 0.1883 | lr 1.18e-03 | (9876.97 ms | 53082 tok/s) step 2183/76294 | train loss 3.621562 | norm 0.1924 | lr 1.18e-03 | (9906.54 ms | 52923 tok/s) step 2184/76294 | train loss 3.632654 | norm 0.1643 | lr 1.18e-03 | (9946.42 ms | 52711 tok/s) step 2185/76294 | train loss 3.583985 | norm 0.1565 | lr 1.18e-03 | (9877.99 ms | 53076 tok/s) step 2186/76294 | train loss 3.628121 | norm 0.2021 | lr 1.18e-03 | (9877.90 ms | 53077 tok/s) step 2187/76294 | train loss 3.579627 | norm 0.2436 | lr 1.18e-03 | (9877.23 ms | 53080 tok/s) step 2188/76294 | train loss 3.547331 | norm 0.1900 | lr 1.18e-03 | (9867.81 ms | 53131 tok/s) step 2189/76294 | train loss 3.609792 | norm 0.1940 | lr 1.18e-03 | (9878.87 ms | 53072 tok/s) step 2190/76294 | train loss 3.591791 | norm 0.1603 | lr 
1.18e-03 | (9912.02 ms | 52894 tok/s) step 2191/76294 | train loss 3.624963 | norm 0.1698 | lr 1.18e-03 | (9936.88 ms | 52762 tok/s) step 2192/76294 | train loss 3.606364 | norm 0.1591 | lr 1.18e-03 | (9875.06 ms | 53092 tok/s) step 2193/76294 | train loss 3.624532 | norm 0.1700 | lr 1.18e-03 | (9874.98 ms | 53093 tok/s) step 2194/76294 | train loss 3.597050 | norm 0.1515 | lr 1.18e-03 | (9880.80 ms | 53061 tok/s) step 2195/76294 | train loss 3.585610 | norm 0.1714 | lr 1.18e-03 | (9893.16 ms | 52995 tok/s) step 2196/76294 | train loss 3.628523 | norm 0.1954 | lr 1.18e-03 | (9873.16 ms | 53102 tok/s) step 2197/76294 | train loss 3.648945 | norm 0.1666 | lr 1.18e-03 | (9882.76 ms | 53051 tok/s) step 2198/76294 | train loss 3.644534 | norm 0.1803 | lr 1.18e-03 | (9870.34 ms | 53118 tok/s) step 2199/76294 | train loss 3.584679 | norm 0.1604 | lr 1.18e-03 | (9888.48 ms | 53020 tok/s) step 2200/76294 | train loss 3.638247 | norm 0.1601 | lr 1.18e-03 | (9879.12 ms | 53070 tok/s) step 2201/76294 | train loss 3.591164 | norm 0.1971 | lr 1.18e-03 | (9909.04 ms | 52910 tok/s) step 2202/76294 | train loss 3.580403 | norm 0.1539 | lr 1.18e-03 | (9864.84 ms | 53147 tok/s) step 2203/76294 | train loss 3.602910 | norm 0.1659 | lr 1.18e-03 | (9881.93 ms | 53055 tok/s) step 2204/76294 | train loss 3.759030 | norm 0.1926 | lr 1.18e-03 | (9866.04 ms | 53141 tok/s) step 2205/76294 | train loss 3.587081 | norm 0.1682 | lr 1.18e-03 | (9879.97 ms | 53066 tok/s) step 2206/76294 | train loss 3.597702 | norm 0.1663 | lr 1.18e-03 | (9871.30 ms | 53112 tok/s) step 2207/76294 | train loss 3.635757 | norm 0.1747 | lr 1.18e-03 | (9881.27 ms | 53059 tok/s) step 2208/76294 | train loss 3.610985 | norm 0.1444 | lr 1.18e-03 | (9867.77 ms | 53131 tok/s) step 2209/76294 | train loss 3.660118 | norm 0.1483 | lr 1.18e-03 | (9880.06 ms | 53065 tok/s) step 2210/76294 | train loss 3.623268 | norm 0.1555 | lr 1.18e-03 | (9865.84 ms | 53142 tok/s) step 2211/76294 | train loss 3.630986 | norm 0.1436 | lr 
1.18e-03 | (9875.33 ms | 53091 tok/s) step 2212/76294 | train loss 3.636904 | norm 0.1592 | lr 1.18e-03 | (9869.10 ms | 53124 tok/s) step 2213/76294 | train loss 3.624701 | norm 0.1556 | lr 1.18e-03 | (9880.27 ms | 53064 tok/s) step 2214/76294 | train loss 3.688469 | norm 0.1829 | lr 1.18e-03 | (9869.38 ms | 53123 tok/s) step 2215/76294 | train loss 3.566428 | norm 0.1753 | lr 1.18e-03 | (9874.19 ms | 53097 tok/s) step 2216/76294 | train loss 3.635579 | norm 0.1462 | lr 1.18e-03 | (9866.98 ms | 53136 tok/s) step 2217/76294 | train loss 3.655054 | norm 0.1990 | lr 1.18e-03 | (9885.53 ms | 53036 tok/s) step 2218/76294 | train loss 3.688485 | norm 0.2296 | lr 1.18e-03 | (9868.56 ms | 53127 tok/s) step 2219/76294 | train loss 3.597271 | norm 0.2131 | lr 1.18e-03 | (9882.58 ms | 53052 tok/s) step 2220/76294 | train loss 3.594554 | norm 0.2040 | lr 1.18e-03 | (9867.59 ms | 53132 tok/s) step 2221/76294 | train loss 3.617062 | norm 0.2143 | lr 1.18e-03 | (9878.89 ms | 53072 tok/s) step 2222/76294 | train loss 3.622493 | norm 0.1745 | lr 1.18e-03 | (9868.31 ms | 53128 tok/s) step 2223/76294 | train loss 3.637894 | norm 0.1883 | lr 1.18e-03 | (9907.92 ms | 52916 tok/s) step 2224/76294 | train loss 3.610236 | norm 0.1621 | lr 1.18e-03 | (9928.86 ms | 52804 tok/s) step 2225/76294 | train loss 3.686447 | norm 0.1915 | lr 1.18e-03 | (9879.42 ms | 53069 tok/s) step 2226/76294 | train loss 3.584026 | norm 0.1597 | lr 1.18e-03 | (9874.21 ms | 53097 tok/s) step 2227/76294 | train loss 3.580980 | norm 0.1654 | lr 1.18e-03 | (9953.37 ms | 52674 tok/s) step 2228/76294 | train loss 3.661101 | norm 0.1856 | lr 1.18e-03 | (9864.42 ms | 53149 tok/s) step 2229/76294 | train loss 3.578073 | norm 0.5450 | lr 1.18e-03 | (11331.84 ms | 46267 tok/s) step 2230/76294 | train loss 3.679893 | norm 0.1794 | lr 1.18e-03 | (9864.05 ms | 53151 tok/s) step 2231/76294 | train loss 3.550963 | norm 0.1675 | lr 1.18e-03 | (9876.76 ms | 53083 tok/s) step 2232/76294 | train loss 3.614665 | norm 0.1709 | lr 
1.18e-03 | (9865.95 ms | 53141 tok/s) step 2233/76294 | train loss 3.611287 | norm 0.1611 | lr 1.18e-03 | (9917.87 ms | 52863 tok/s) step 2234/76294 | train loss 3.636145 | norm 0.1840 | lr 1.18e-03 | (9879.93 ms | 53066 tok/s) step 2235/76294 | train loss 3.553388 | norm 0.2164 | lr 1.18e-03 | (9869.07 ms | 53124 tok/s) step 2236/76294 | train loss 3.604504 | norm 0.2238 | lr 1.18e-03 | (9867.30 ms | 53134 tok/s) step 2237/76294 | train loss 3.599789 | norm 0.1788 | lr 1.18e-03 | (9906.10 ms | 52926 tok/s) step 2238/76294 | train loss 3.584956 | norm 0.2227 | lr 1.18e-03 | (9868.14 ms | 53129 tok/s) step 2239/76294 | train loss 3.714066 | norm 0.1863 | lr 1.18e-03 | (9892.85 ms | 52997 tok/s) step 2240/76294 | train loss 3.566067 | norm 0.2944 | lr 1.18e-03 | (9871.32 ms | 53112 tok/s) step 2241/76294 | train loss 3.638939 | norm 0.2835 | lr 1.18e-03 | (9878.62 ms | 53073 tok/s) step 2242/76294 | train loss 3.640512 | norm 0.1795 | lr 1.18e-03 | (9872.58 ms | 53105 tok/s) step 2243/76294 | train loss 3.636387 | norm 0.2903 | lr 1.18e-03 | (9927.51 ms | 52812 tok/s) step 2244/76294 | train loss 3.525239 | norm 0.9211 | lr 1.18e-03 | (10938.13 ms | 47932 tok/s) step 2245/76294 | train loss 3.592133 | norm 0.2677 | lr 1.18e-03 | (9965.88 ms | 52608 tok/s) step 2246/76294 | train loss 3.656141 | norm 0.3702 | lr 1.18e-03 | (9874.81 ms | 53093 tok/s) step 2247/76294 | train loss 3.599699 | norm 0.3237 | lr 1.18e-03 | (9876.48 ms | 53084 tok/s) step 2248/76294 | train loss 3.593100 | norm 0.2075 | lr 1.18e-03 | (9909.22 ms | 52909 tok/s) step 2249/76294 | train loss 3.627547 | norm 0.2735 | lr 1.18e-03 | (9872.16 ms | 53108 tok/s) step 2250/76294 | train loss 3.517586 | norm 0.1871 | lr 1.18e-03 | (9880.75 ms | 53062 tok/s) val loss: 3.623547 saving model checkpoint to ./results/gpt2-350M-gqa/step_2250.pth step 2251/76294 | train loss 3.579279 | norm 0.1724 | lr 1.18e-03 | (9944.02 ms | 52724 tok/s) step 2252/76294 | train loss 3.549778 | norm 0.1754 | lr 1.18e-03 | 
(9853.23 ms | 53210 tok/s) step 2253/76294 | train loss 3.650326 | norm 0.1650 | lr 1.18e-03 | (9901.70 ms | 52949 tok/s) step 2254/76294 | train loss 3.628745 | norm 0.1717 | lr 1.18e-03 | (9867.56 ms | 53132 tok/s) step 2255/76294 | train loss 3.636616 | norm 0.1566 | lr 1.18e-03 | (9922.84 ms | 52837 tok/s) step 2256/76294 | train loss 3.632545 | norm 0.2305 | lr 1.18e-03 | (9874.38 ms | 53096 tok/s) step 2257/76294 | train loss 3.603031 | norm 0.1698 | lr 1.18e-03 | (9883.37 ms | 53047 tok/s) step 2258/76294 | train loss 3.582569 | norm 0.1617 | lr 1.18e-03 | (9881.70 ms | 53056 tok/s) step 2259/76294 | train loss 3.577563 | norm 0.1770 | lr 1.18e-03 | (9891.30 ms | 53005 tok/s) step 2260/76294 | train loss 3.601373 | norm 0.1506 | lr 1.18e-03 | (9887.87 ms | 53023 tok/s) step 2261/76294 | train loss 3.643275 | norm 0.1770 | lr 1.18e-03 | (9887.11 ms | 53027 tok/s) step 2262/76294 | train loss 3.609882 | norm 0.1826 | lr 1.18e-03 | (9937.53 ms | 52758 tok/s) step 2263/76294 | train loss 3.606222 | norm 0.1641 | lr 1.18e-03 | (9892.15 ms | 53000 tok/s) step 2264/76294 | train loss 3.573905 | norm 0.1738 | lr 1.18e-03 | (9893.80 ms | 52992 tok/s) step 2265/76294 | train loss 3.667854 | norm 0.2045 | lr 1.18e-03 | (9895.72 ms | 52981 tok/s) step 2266/76294 | train loss 3.601129 | norm 0.1717 | lr 1.18e-03 | (10001.37 ms | 52422 tok/s) step 2267/76294 | train loss 3.654581 | norm 0.1673 | lr 1.18e-03 | (9904.73 ms | 52933 tok/s) step 2268/76294 | train loss 3.556349 | norm 0.1610 | lr 1.18e-03 | (9901.08 ms | 52953 tok/s) step 2269/76294 | train loss 3.557864 | norm 0.1756 | lr 1.18e-03 | (9898.14 ms | 52968 tok/s) step 2270/76294 | train loss 3.641167 | norm 0.1660 | lr 1.18e-03 | (9906.35 ms | 52924 tok/s) step 2271/76294 | train loss 3.605787 | norm 0.1748 | lr 1.18e-03 | (9901.08 ms | 52953 tok/s) step 2272/76294 | train loss 3.594760 | norm 0.1571 | lr 1.18e-03 | (9896.05 ms | 52979 tok/s) step 2273/76294 | train loss 3.553265 | norm 0.1768 | lr 1.18e-03 | 
(10083.62 ms | 51994 tok/s) step 2274/76294 | train loss 3.640479 | norm 0.1511 | lr 1.18e-03 | (9957.61 ms | 52652 tok/s) step 2275/76294 | train loss 3.613438 | norm 0.1549 | lr 1.18e-03 | (9894.00 ms | 52991 tok/s) step 2276/76294 | train loss 3.577293 | norm 0.1413 | lr 1.18e-03 | (9903.44 ms | 52940 tok/s) step 2277/76294 | train loss 3.591552 | norm 0.1471 | lr 1.18e-03 | (9958.67 ms | 52646 tok/s) step 2278/76294 | train loss 3.563379 | norm 0.1657 | lr 1.18e-03 | (9906.60 ms | 52923 tok/s) step 2279/76294 | train loss 3.541442 | norm 0.1543 | lr 1.18e-03 | (9963.96 ms | 52618 tok/s) step 2280/76294 | train loss 3.577724 | norm 0.1563 | lr 1.18e-03 | (9899.07 ms | 52963 tok/s) step 2281/76294 | train loss 3.646130 | norm 0.1559 | lr 1.18e-03 | (9911.71 ms | 52896 tok/s) step 2282/76294 | train loss 3.580924 | norm 0.1824 | lr 1.18e-03 | (9933.47 ms | 52780 tok/s) step 2283/76294 | train loss 3.610137 | norm 0.3259 | lr 1.18e-03 | (9896.40 ms | 52978 tok/s) step 2284/76294 | train loss 3.585058 | norm 0.1600 | lr 1.18e-03 | (9899.45 ms | 52961 tok/s) step 2285/76294 | train loss 3.610770 | norm 0.2010 | lr 1.18e-03 | (9899.45 ms | 52961 tok/s) step 2286/76294 | train loss 3.712196 | norm 0.2053 | lr 1.18e-03 | (9895.99 ms | 52980 tok/s) step 2287/76294 | train loss 3.715702 | norm 0.1934 | lr 1.18e-03 | (9903.36 ms | 52940 tok/s) step 2288/76294 | train loss 3.610116 | norm 0.2312 | lr 1.18e-03 | (9892.15 ms | 53000 tok/s) step 2289/76294 | train loss 3.613443 | norm 0.1836 | lr 1.18e-03 | (10131.46 ms | 51749 tok/s) step 2290/76294 | train loss 3.720969 | norm 0.1995 | lr 1.18e-03 | (9898.78 ms | 52965 tok/s) step 2291/76294 | train loss 3.621463 | norm 0.1930 | lr 1.18e-03 | (9949.26 ms | 52696 tok/s) step 2292/76294 | train loss 3.625862 | norm 0.2081 | lr 1.18e-03 | (9928.79 ms | 52805 tok/s) step 2293/76294 | train loss 3.669294 | norm 0.1632 | lr 1.18e-03 | (9908.36 ms | 52914 tok/s) step 2294/76294 | train loss 3.576522 | norm 0.2146 | lr 1.18e-03 | 
(9902.59 ms | 52945 tok/s) step 2295/76294 | train loss 3.635478 | norm 0.1937 | lr 1.18e-03 | (9923.06 ms | 52835 tok/s) step 2296/76294 | train loss 3.645790 | norm 0.1755 | lr 1.18e-03 | (9890.02 ms | 53012 tok/s) step 2297/76294 | train loss 3.589089 | norm 0.1606 | lr 1.18e-03 | (9900.35 ms | 52956 tok/s) step 2298/76294 | train loss 3.644183 | norm 0.1744 | lr 1.18e-03 | (9894.75 ms | 52986 tok/s) step 2299/76294 | train loss 3.678187 | norm 0.3016 | lr 1.18e-03 | (9900.66 ms | 52955 tok/s) step 2300/76294 | train loss 3.604484 | norm 0.1730 | lr 1.18e-03 | (9888.95 ms | 53018 tok/s) step 2301/76294 | train loss 3.580192 | norm 0.1651 | lr 1.18e-03 | (9904.44 ms | 52935 tok/s) step 2302/76294 | train loss 3.622997 | norm 0.1524 | lr 1.18e-03 | (9893.61 ms | 52993 tok/s) step 2303/76294 | train loss 3.601703 | norm 0.1730 | lr 1.18e-03 | (9935.40 ms | 52770 tok/s) step 2304/76294 | train loss 3.618609 | norm 0.1728 | lr 1.18e-03 | (9890.63 ms | 53009 tok/s) step 2305/76294 | train loss 3.636567 | norm 0.1793 | lr 1.18e-03 | (9901.34 ms | 52951 tok/s) step 2306/76294 | train loss 3.625862 | norm 0.1778 | lr 1.18e-03 | (9892.78 ms | 52997 tok/s) step 2307/76294 | train loss 3.600933 | norm 0.1677 | lr 1.18e-03 | (9933.44 ms | 52780 tok/s) step 2308/76294 | train loss 3.596255 | norm 0.1537 | lr 1.18e-03 | (9896.97 ms | 52975 tok/s) step 2309/76294 | train loss 3.679053 | norm 0.1640 | lr 1.18e-03 | (9910.01 ms | 52905 tok/s) step 2310/76294 | train loss 3.623617 | norm 0.1424 | lr 1.18e-03 | (9893.14 ms | 52995 tok/s) step 2311/76294 | train loss 3.600375 | norm 0.1557 | lr 1.18e-03 | (9923.61 ms | 52832 tok/s) step 2312/76294 | train loss 3.658550 | norm 0.1628 | lr 1.18e-03 | (9892.85 ms | 52997 tok/s) step 2313/76294 | train loss 3.643420 | norm 0.1598 | lr 1.18e-03 | (9909.35 ms | 52908 tok/s) step 2314/76294 | train loss 3.608090 | norm 0.1552 | lr 1.18e-03 | (9900.05 ms | 52958 tok/s) step 2315/76294 | train loss 3.597907 | norm 0.1692 | lr 1.18e-03 | 
(9900.97 ms | 52953 tok/s) step 2316/76294 | train loss 3.648672 | norm 0.1516 | lr 1.18e-03 | (9961.94 ms | 52629 tok/s) step 2317/76294 | train loss 3.664092 | norm 0.1648 | lr 1.18e-03 | (9903.61 ms | 52939 tok/s) step 2318/76294 | train loss 3.613198 | norm 0.1516 | lr 1.18e-03 | (9887.96 ms | 53023 tok/s) step 2319/76294 | train loss 3.630534 | norm 0.1681 | lr 1.18e-03 | (9952.94 ms | 52677 tok/s) step 2320/76294 | train loss 3.632498 | norm 0.1673 | lr 1.18e-03 | (9892.22 ms | 53000 tok/s) step 2321/76294 | train loss 3.608661 | norm 0.1573 | lr 1.18e-03 | (9907.51 ms | 52918 tok/s) step 2322/76294 | train loss 3.587716 | norm 0.1651 | lr 1.18e-03 | (9949.83 ms | 52693 tok/s) step 2323/76294 | train loss 3.647161 | norm 0.1755 | lr 1.18e-03 | (9901.46 ms | 52951 tok/s) step 2324/76294 | train loss 3.604455 | norm 0.2054 | lr 1.18e-03 | (9894.67 ms | 52987 tok/s) step 2325/76294 | train loss 3.598753 | norm 0.1884 | lr 1.18e-03 | (9934.39 ms | 52775 tok/s) step 2326/76294 | train loss 3.616801 | norm 0.1683 | lr 1.18e-03 | (9891.81 ms | 53002 tok/s) step 2327/76294 | train loss 3.686234 | norm 0.1634 | lr 1.18e-03 | (9901.67 ms | 52949 tok/s) step 2328/76294 | train loss 3.619400 | norm 0.1500 | lr 1.18e-03 | (9928.88 ms | 52804 tok/s) step 2329/76294 | train loss 3.583540 | norm 0.1804 | lr 1.18e-03 | (9909.55 ms | 52907 tok/s) step 2330/76294 | train loss 3.632633 | norm 0.1678 | lr 1.18e-03 | (9931.89 ms | 52788 tok/s) step 2331/76294 | train loss 3.555200 | norm 0.2361 | lr 1.18e-03 | (9896.02 ms | 52980 tok/s) step 2332/76294 | train loss 3.674886 | norm 0.2270 | lr 1.18e-03 | (9889.47 ms | 53015 tok/s) step 2333/76294 | train loss 3.678099 | norm 0.1938 | lr 1.18e-03 | (9947.66 ms | 52705 tok/s) step 2334/76294 | train loss 3.601043 | norm 0.2096 | lr 1.18e-03 | (9916.63 ms | 52870 tok/s) step 2335/76294 | train loss 3.556976 | norm 0.1791 | lr 1.18e-03 | (9896.46 ms | 52977 tok/s) step 2336/76294 | train loss 3.735576 | norm 0.2015 | lr 1.18e-03 | 
(9924.39 ms | 52828 tok/s) step 2337/76294 | train loss 3.577119 | norm 0.1962 | lr 1.18e-03 | (9891.05 ms | 53006 tok/s) step 2338/76294 | train loss 3.664478 | norm 0.1793 | lr 1.18e-03 | (9985.97 ms | 52502 tok/s) step 2339/76294 | train loss 3.656331 | norm 0.1855 | lr 1.18e-03 | (9889.50 ms | 53015 tok/s) step 2340/76294 | train loss 3.570414 | norm 0.2107 | lr 1.18e-03 | (9896.77 ms | 52976 tok/s) step 2341/76294 | train loss 3.589083 | norm 0.1619 | lr 1.18e-03 | (11286.50 ms | 46453 tok/s) step 2342/76294 | train loss 3.642326 | norm 0.1880 | lr 1.18e-03 | (9875.16 ms | 53092 tok/s) step 2343/76294 | train loss 3.589358 | norm 0.1667 | lr 1.18e-03 | (11536.26 ms | 45447 tok/s) step 2344/76294 | train loss 3.580318 | norm 0.1662 | lr 1.18e-03 | (9879.92 ms | 53066 tok/s) step 2345/76294 | train loss 3.594229 | norm 0.2060 | lr 1.18e-03 | (9884.12 ms | 53043 tok/s) step 2346/76294 | train loss 3.560868 | norm 0.1590 | lr 1.18e-03 | (9906.64 ms | 52923 tok/s) step 2347/76294 | train loss 3.658517 | norm 0.2203 | lr 1.18e-03 | (9878.72 ms | 53072 tok/s) step 2348/76294 | train loss 3.649817 | norm 0.1892 | lr 1.18e-03 | (9892.32 ms | 52999 tok/s) step 2349/76294 | train loss 3.608780 | norm 0.1619 | lr 1.18e-03 | (9884.82 ms | 53040 tok/s) step 2350/76294 | train loss 3.610357 | norm 0.1659 | lr 1.18e-03 | (9924.27 ms | 52829 tok/s) step 2351/76294 | train loss 3.568245 | norm 0.1552 | lr 1.18e-03 | (12665.10 ms | 41396 tok/s) step 2352/76294 | train loss 3.589375 | norm 0.1584 | lr 1.18e-03 | (9862.75 ms | 53158 tok/s) step 2353/76294 | train loss 3.600728 | norm 0.1414 | lr 1.18e-03 | (9881.98 ms | 53055 tok/s) step 2354/76294 | train loss 3.607029 | norm 0.1492 | lr 1.18e-03 | (10877.25 ms | 48200 tok/s) step 2355/76294 | train loss 3.652212 | norm 0.1565 | lr 1.18e-03 | (9865.25 ms | 53145 tok/s) step 2356/76294 | train loss 3.565928 | norm 0.1421 | lr 1.18e-03 | (9871.02 ms | 53114 tok/s) step 2357/76294 | train loss 3.597287 | norm 0.1583 | lr 1.18e-03 | 
(9874.22 ms | 53097 tok/s) step 2358/76294 | train loss 3.592480 | norm 0.1487 | lr 1.18e-03 | (9880.86 ms | 53061 tok/s) step 2359/76294 | train loss 3.569582 | norm 0.1818 | lr 1.18e-03 | (9879.49 ms | 53068 tok/s) step 2360/76294 | train loss 3.581079 | norm 0.1838 | lr 1.18e-03 | (9887.57 ms | 53025 tok/s) step 2361/76294 | train loss 3.620583 | norm 0.1822 | lr 1.18e-03 | (9947.70 ms | 52704 tok/s) step 2362/76294 | train loss 3.728648 | norm 0.1916 | lr 1.18e-03 | (9878.27 ms | 53075 tok/s) step 2363/76294 | train loss 3.597684 | norm 0.1720 | lr 1.18e-03 | (9917.31 ms | 52866 tok/s) step 2364/76294 | train loss 3.578594 | norm 0.2066 | lr 1.18e-03 | (9907.81 ms | 52917 tok/s) step 2365/76294 | train loss 3.589330 | norm 0.1691 | lr 1.18e-03 | (9883.70 ms | 53046 tok/s) step 2366/76294 | train loss 3.632857 | norm 0.1593 | lr 1.18e-03 | (9878.44 ms | 53074 tok/s) step 2367/76294 | train loss 3.587810 | norm 0.1608 | lr 1.18e-03 | (9885.63 ms | 53035 tok/s) step 2368/76294 | train loss 3.776659 | norm 0.2315 | lr 1.18e-03 | (9896.80 ms | 52976 tok/s) step 2369/76294 | train loss 3.627362 | norm 0.2972 | lr 1.18e-03 | (9889.81 ms | 53013 tok/s) step 2370/76294 | train loss 3.675359 | norm 0.2291 | lr 1.18e-03 | (9890.82 ms | 53008 tok/s) step 2371/76294 | train loss 3.607703 | norm 0.1932 | lr 1.18e-03 | (9952.83 ms | 52677 tok/s) step 2372/76294 | train loss 3.612801 | norm 0.1809 | lr 1.18e-03 | (9891.86 ms | 53002 tok/s) step 2373/76294 | train loss 3.625135 | norm 0.1900 | lr 1.18e-03 | (9887.86 ms | 53023 tok/s) step 2374/76294 | train loss 3.584540 | norm 0.1582 | lr 1.18e-03 | (9922.98 ms | 52836 tok/s) step 2375/76294 | train loss 3.619799 | norm 0.1628 | lr 1.18e-03 | (9933.79 ms | 52778 tok/s) step 2376/76294 | train loss 3.579270 | norm 0.1517 | lr 1.18e-03 | (9916.69 ms | 52869 tok/s) step 2377/76294 | train loss 3.582988 | norm 0.1766 | lr 1.18e-03 | (9931.49 ms | 52790 tok/s) step 2378/76294 | train loss 3.596726 | norm 0.1932 | lr 1.18e-03 | 
(9888.52 ms | 53020 tok/s) step 2379/76294 | train loss 3.594299 | norm 0.1542 | lr 1.18e-03 | (9904.62 ms | 52934 tok/s) step 2380/76294 | train loss 3.664057 | norm 0.1622 | lr 1.18e-03 | (9891.53 ms | 53004 tok/s) step 2381/76294 | train loss 3.618722 | norm 0.1714 | lr 1.18e-03 | (9943.77 ms | 52725 tok/s) step 2382/76294 | train loss 3.600802 | norm 0.1663 | lr 1.18e-03 | (9932.70 ms | 52784 tok/s) step 2383/76294 | train loss 3.631089 | norm 0.1533 | lr 1.18e-03 | (9916.87 ms | 52868 tok/s) step 2384/76294 | train loss 3.698267 | norm 0.1722 | lr 1.18e-03 | (9894.61 ms | 52987 tok/s) step 2385/76294 | train loss 3.586241 | norm 0.1904 | lr 1.18e-03 | (9898.64 ms | 52966 tok/s) step 2386/76294 | train loss 3.602852 | norm 0.2113 | lr 1.18e-03 | (9888.70 ms | 53019 tok/s) step 2387/76294 | train loss 3.620041 | norm 0.1657 | lr 1.18e-03 | (9908.87 ms | 52911 tok/s) step 2388/76294 | train loss 3.614297 | norm 0.1600 | lr 1.18e-03 | (9930.35 ms | 52797 tok/s) step 2389/76294 | train loss 3.636033 | norm 0.1741 | lr 1.18e-03 | (9889.49 ms | 53015 tok/s) step 2390/76294 | train loss 3.604998 | norm 0.1693 | lr 1.18e-03 | (9927.50 ms | 52812 tok/s) step 2391/76294 | train loss 3.565524 | norm 0.1563 | lr 1.18e-03 | (9888.33 ms | 53021 tok/s) step 2392/76294 | train loss 3.603389 | norm 0.1669 | lr 1.18e-03 | (9892.80 ms | 52997 tok/s) step 2393/76294 | train loss 3.600672 | norm 0.1645 | lr 1.18e-03 | (9891.82 ms | 53002 tok/s) step 2394/76294 | train loss 3.684140 | norm 0.1684 | lr 1.18e-03 | (9910.35 ms | 52903 tok/s) step 2395/76294 | train loss 3.536064 | norm 0.1725 | lr 1.18e-03 | (10736.02 ms | 48835 tok/s) step 2396/76294 | train loss 3.588351 | norm 0.1697 | lr 1.18e-03 | (9911.20 ms | 52899 tok/s) step 2397/76294 | train loss 3.644312 | norm 0.1893 | lr 1.18e-03 | (9889.15 ms | 53016 tok/s) step 2398/76294 | train loss 3.614538 | norm 0.1575 | lr 1.18e-03 | (9882.43 ms | 53053 tok/s) step 2399/76294 | train loss 3.577148 | norm 0.1773 | lr 1.18e-03 | 
(9913.43 ms | 52887 tok/s) step 2400/76294 | train loss 3.591547 | norm 0.1885 | lr 1.18e-03 | (9890.92 ms | 53007 tok/s) step 2401/76294 | train loss 3.611442 | norm 0.1613 | lr 1.18e-03 | (9907.27 ms | 52920 tok/s) step 2402/76294 | train loss 3.587521 | norm 0.1737 | lr 1.18e-03 | (9894.77 ms | 52986 tok/s) step 2403/76294 | train loss 3.601100 | norm 0.1599 | lr 1.18e-03 | (9889.12 ms | 53017 tok/s) step 2404/76294 | train loss 3.607245 | norm 0.1789 | lr 1.18e-03 | (9903.43 ms | 52940 tok/s) step 2405/76294 | train loss 3.600397 | norm 0.2061 | lr 1.18e-03 | (9890.07 ms | 53012 tok/s) step 2406/76294 | train loss 3.672376 | norm 0.1650 | lr 1.18e-03 | (9894.78 ms | 52986 tok/s) step 2407/76294 | train loss 3.566678 | norm 0.1743 | lr 1.18e-03 | (9887.91 ms | 53023 tok/s) step 2408/76294 | train loss 3.595684 | norm 0.1558 | lr 1.18e-03 | (9898.69 ms | 52965 tok/s) step 2409/76294 | train loss 3.567196 | norm 0.1581 | lr 1.18e-03 | (9890.31 ms | 53010 tok/s) step 2410/76294 | train loss 3.655175 | norm 0.1652 | lr 1.18e-03 | (9892.30 ms | 53000 tok/s) step 2411/76294 | train loss 3.591531 | norm 0.1733 | lr 1.18e-03 | (9887.98 ms | 53023 tok/s) step 2412/76294 | train loss 3.625003 | norm 0.1553 | lr 1.18e-03 | (9936.53 ms | 52764 tok/s) step 2413/76294 | train loss 3.648375 | norm 0.1525 | lr 1.18e-03 | (9896.54 ms | 52977 tok/s) step 2414/76294 | train loss 3.623607 | norm 0.1881 | lr 1.18e-03 | (9890.69 ms | 53008 tok/s) step 2415/76294 | train loss 3.560944 | norm 0.1710 | lr 1.18e-03 | (9890.96 ms | 53007 tok/s) step 2416/76294 | train loss 3.562058 | norm 0.2095 | lr 1.18e-03 | (9952.51 ms | 52679 tok/s) step 2417/76294 | train loss 3.569631 | norm 0.2621 | lr 1.18e-03 | (9898.55 ms | 52966 tok/s) step 2418/76294 | train loss 3.545935 | norm 0.2182 | lr 1.18e-03 | (9924.70 ms | 52827 tok/s) step 2419/76294 | train loss 3.702034 | norm 0.2261 | lr 1.18e-03 | (9889.06 ms | 53017 tok/s) step 2420/76294 | train loss 3.572223 | norm 0.2148 | lr 1.18e-03 | 
(9895.22 ms | 52984 tok/s) step 2421/76294 | train loss 3.571945 | norm 0.2092 | lr 1.18e-03 | (9894.84 ms | 52986 tok/s) step 2422/76294 | train loss 3.637505 | norm 0.1701 | lr 1.18e-03 | (9889.50 ms | 53015 tok/s) step 2423/76294 | train loss 3.577717 | norm 0.1861 | lr 1.18e-03 | (9950.09 ms | 52692 tok/s) step 2424/76294 | train loss 3.579258 | norm 0.1803 | lr 1.18e-03 | (9890.09 ms | 53011 tok/s) step 2425/76294 | train loss 3.581048 | norm 0.1637 | lr 1.18e-03 | (9887.86 ms | 53023 tok/s) step 2426/76294 | train loss 3.568109 | norm 0.1517 | lr 1.18e-03 | (9948.75 ms | 52699 tok/s) step 2427/76294 | train loss 3.653480 | norm 0.1585 | lr 1.18e-03 | (9891.31 ms | 53005 tok/s) step 2428/76294 | train loss 3.643276 | norm 0.1471 | lr 1.18e-03 | (9899.88 ms | 52959 tok/s) step 2429/76294 | train loss 3.600647 | norm 0.1505 | lr 1.18e-03 | (9926.89 ms | 52815 tok/s) step 2430/76294 | train loss 3.552986 | norm 0.1675 | lr 1.18e-03 | (9884.36 ms | 53042 tok/s) step 2431/76294 | train loss 3.606646 | norm 0.2017 | lr 1.18e-03 | (9891.14 ms | 53006 tok/s) step 2432/76294 | train loss 3.586649 | norm 0.2206 | lr 1.18e-03 | (9884.94 ms | 53039 tok/s) step 2433/76294 | train loss 3.586032 | norm 0.2237 | lr 1.18e-03 | (9888.30 ms | 53021 tok/s) step 2434/76294 | train loss 3.573596 | norm 0.2141 | lr 1.18e-03 | (9953.83 ms | 52672 tok/s) step 2435/76294 | train loss 3.604146 | norm 0.1879 | lr 1.18e-03 | (9883.57 ms | 53046 tok/s) step 2436/76294 | train loss 3.519258 | norm 0.1922 | lr 1.18e-03 | (9917.28 ms | 52866 tok/s) step 2437/76294 | train loss 3.558607 | norm 0.1861 | lr 1.18e-03 | (9889.57 ms | 53014 tok/s) step 2438/76294 | train loss 3.586174 | norm 0.1785 | lr 1.18e-03 | (9922.86 ms | 52836 tok/s) step 2439/76294 | train loss 3.597469 | norm 0.1796 | lr 1.18e-03 | (11312.02 ms | 46348 tok/s) step 2440/76294 | train loss 3.594532 | norm 0.1618 | lr 1.18e-03 | (9877.47 ms | 53079 tok/s) step 2441/76294 | train loss 3.532081 | norm 0.1818 | lr 1.18e-03 | 
(9897.44 ms | 52972 tok/s) step 2442/76294 | train loss 3.597541 | norm 0.1694 | lr 1.18e-03 | (9900.53 ms | 52956 tok/s) step 2443/76294 | train loss 3.631111 | norm 0.1632 | lr 1.18e-03 | (9885.72 ms | 53035 tok/s) step 2444/76294 | train loss 3.588393 | norm 0.1764 | lr 1.18e-03 | (9892.83 ms | 52997 tok/s) step 2445/76294 | train loss 3.659441 | norm 0.1594 | lr 1.18e-03 | (9883.72 ms | 53046 tok/s) step 2446/76294 | train loss 3.603513 | norm 0.1560 | lr 1.18e-03 | (9888.53 ms | 53020 tok/s) step 2447/76294 | train loss 3.598040 | norm 0.1603 | lr 1.18e-03 | (9888.73 ms | 53019 tok/s) step 2448/76294 | train loss 3.595480 | norm 0.1710 | lr 1.18e-03 | (9890.13 ms | 53011 tok/s) step 2449/76294 | train loss 3.582804 | norm 0.1761 | lr 1.18e-03 | (9890.99 ms | 53007 tok/s) step 2450/76294 | train loss 3.612175 | norm 0.1776 | lr 1.18e-03 | (9898.63 ms | 52966 tok/s) step 2451/76294 | train loss 3.628970 | norm 0.1759 | lr 1.18e-03 | (9911.98 ms | 52894 tok/s) step 2452/76294 | train loss 3.541107 | norm 0.1640 | lr 1.18e-03 | (9893.96 ms | 52991 tok/s) step 2453/76294 | train loss 3.574959 | norm 0.1691 | lr 1.18e-03 | (9893.56 ms | 52993 tok/s) step 2454/76294 | train loss 3.546337 | norm 0.1526 | lr 1.18e-03 | (9936.45 ms | 52764 tok/s) step 2455/76294 | train loss 3.583252 | norm 0.1640 | lr 1.18e-03 | (9894.62 ms | 52987 tok/s) step 2456/76294 | train loss 3.551900 | norm 0.1434 | lr 1.18e-03 | (9960.65 ms | 52636 tok/s) step 2457/76294 | train loss 3.594975 | norm 0.1509 | lr 1.18e-03 | (9889.16 ms | 53016 tok/s) step 2458/76294 | train loss 3.592717 | norm 0.1717 | lr 1.18e-03 | (9904.12 ms | 52936 tok/s) step 2459/76294 | train loss 3.575316 | norm 0.1727 | lr 1.18e-03 | (9891.48 ms | 53004 tok/s) step 2460/76294 | train loss 3.542078 | norm 0.1613 | lr 1.18e-03 | (9894.39 ms | 52988 tok/s) step 2461/76294 | train loss 3.516644 | norm 0.1609 | lr 1.18e-03 | (9908.25 ms | 52914 tok/s) step 2462/76294 | train loss 3.563845 | norm 0.1597 | lr 1.18e-03 | 
(9894.82 ms | 52986 tok/s) step 2463/76294 | train loss 3.585673 | norm 0.1400 | lr 1.18e-03 | (9897.72 ms | 52971 tok/s) step 2464/76294 | train loss 3.589014 | norm 0.1529 | lr 1.18e-03 | (9889.80 ms | 53013 tok/s) step 2465/76294 | train loss 3.631405 | norm 0.1479 | lr 1.18e-03 | (9956.54 ms | 52658 tok/s) step 2466/76294 | train loss 3.545464 | norm 0.1544 | lr 1.18e-03 | (9900.71 ms | 52955 tok/s) step 2467/76294 | train loss 3.576749 | norm 0.1698 | lr 1.18e-03 | (9906.77 ms | 52922 tok/s) step 2468/76294 | train loss 3.592490 | norm 0.1588 | lr 1.18e-03 | (10150.02 ms | 51654 tok/s) step 2469/76294 | train loss 3.617379 | norm 0.1676 | lr 1.18e-03 | (10111.21 ms | 51852 tok/s) step 2470/76294 | train loss 3.583270 | norm 0.1838 | lr 1.18e-03 | (9887.20 ms | 53027 tok/s) step 2471/76294 | train loss 3.645157 | norm 0.1891 | lr 1.18e-03 | (9892.29 ms | 53000 tok/s) step 2472/76294 | train loss 3.642051 | norm 0.1803 | lr 1.18e-03 | (9899.74 ms | 52960 tok/s) step 2473/76294 | train loss 3.581014 | norm 0.1873 | lr 1.18e-03 | (9935.16 ms | 52771 tok/s) step 2474/76294 | train loss 3.547951 | norm 0.1648 | lr 1.18e-03 | (9893.34 ms | 52994 tok/s) step 2475/76294 | train loss 3.537883 | norm 0.1818 | lr 1.18e-03 | (9893.68 ms | 52992 tok/s) step 2476/76294 | train loss 3.539575 | norm 0.1767 | lr 1.18e-03 | (9896.78 ms | 52976 tok/s) step 2477/76294 | train loss 3.589415 | norm 0.1687 | lr 1.18e-03 | (9894.94 ms | 52985 tok/s) step 2478/76294 | train loss 3.529781 | norm 0.1675 | lr 1.18e-03 | (9906.36 ms | 52924 tok/s) step 2479/76294 | train loss 3.604516 | norm 0.1534 | lr 1.18e-03 | (9893.63 ms | 52992 tok/s) step 2480/76294 | train loss 3.590513 | norm 0.1617 | lr 1.18e-03 | (9893.26 ms | 52994 tok/s) step 2481/76294 | train loss 3.550867 | norm 0.1898 | lr 1.18e-03 | (9897.97 ms | 52969 tok/s) step 2482/76294 | train loss 3.636245 | norm 0.2559 | lr 1.18e-03 | (9892.60 ms | 52998 tok/s) step 2483/76294 | train loss 3.599190 | norm 0.2233 | lr 1.18e-03 | 
(9902.07 ms | 52947 tok/s) step 2484/76294 | train loss 3.658762 | norm 0.1723 | lr 1.18e-03 | (9890.04 ms | 53012 tok/s) step 2485/76294 | train loss 3.629344 | norm 0.2049 | lr 1.18e-03 | (9890.61 ms | 53009 tok/s) step 2486/76294 | train loss 3.623940 | norm 0.1725 | lr 1.18e-03 | (9906.18 ms | 52925 tok/s) step 2487/76294 | train loss 3.582578 | norm 0.1667 | lr 1.18e-03 | (9929.44 ms | 52801 tok/s) step 2488/76294 | train loss 3.534307 | norm 0.1836 | lr 1.18e-03 | (9892.89 ms | 52996 tok/s) step 2489/76294 | train loss 3.572110 | norm 0.3067 | lr 1.18e-03 | (9919.64 ms | 52854 tok/s) step 2490/76294 | train loss 3.573917 | norm 0.1501 | lr 1.18e-03 | (9896.81 ms | 52975 tok/s) step 2491/76294 | train loss 3.621342 | norm 0.1596 | lr 1.18e-03 | (9910.48 ms | 52902 tok/s) step 2492/76294 | train loss 3.594633 | norm 0.1792 | lr 1.18e-03 | (9895.67 ms | 52982 tok/s) step 2493/76294 | train loss 3.604690 | norm 0.1477 | lr 1.18e-03 | (9891.95 ms | 53002 tok/s) step 2494/76294 | train loss 3.570990 | norm 0.1601 | lr 1.18e-03 | (9897.90 ms | 52970 tok/s) step 2495/76294 | train loss 3.562353 | norm 0.1862 | lr 1.18e-03 | (9895.34 ms | 52983 tok/s) step 2496/76294 | train loss 3.681093 | norm 0.1673 | lr 1.18e-03 | (9892.46 ms | 52999 tok/s) step 2497/76294 | train loss 3.593633 | norm 0.1574 | lr 1.18e-03 | (9891.61 ms | 53003 tok/s) step 2498/76294 | train loss 3.565330 | norm 0.1637 | lr 1.18e-03 | (9889.67 ms | 53014 tok/s) step 2499/76294 | train loss 3.557631 | norm 0.1588 | lr 1.18e-03 | (9910.73 ms | 52901 tok/s) step 2500/76294 | train loss 3.612518 | norm 0.2224 | lr 1.18e-03 | (9897.51 ms | 52972 tok/s) val loss: 3.558307 saving model checkpoint to ./results/gpt2-350M-gqa/step_2500.pth step 2501/76294 | train loss 3.543079 | norm 0.1585 | lr 1.18e-03 | (9980.29 ms | 52532 tok/s) step 2502/76294 | train loss 3.569981 | norm 0.1453 | lr 1.18e-03 | (9879.94 ms | 53066 tok/s) step 2503/76294 | train loss 3.582916 | norm 0.1613 | lr 1.18e-03 | (9871.09 ms | 
53113 tok/s) step 2504/76294 | train loss 3.600351 | norm 0.1590 | lr 1.18e-03 | (9882.85 ms | 53050 tok/s) step 2505/76294 | train loss 3.507373 | norm 0.7239 | lr 1.18e-03 | (9874.07 ms | 53097 tok/s) step 2506/76294 | train loss 3.628415 | norm 0.1530 | lr 1.18e-03 | (9874.80 ms | 53094 tok/s) step 2507/76294 | train loss 3.653106 | norm 0.3427 | lr 1.18e-03 | (9914.28 ms | 52882 tok/s) step 2508/76294 | train loss 3.607085 | norm 0.7036 | lr 1.18e-03 | (9953.69 ms | 52673 tok/s) step 2509/76294 | train loss 3.649722 | norm 0.8930 | lr 1.18e-03 | (9882.07 ms | 53054 tok/s) step 2510/76294 | train loss 3.578721 | norm 0.2930 | lr 1.18e-03 | (9887.72 ms | 53024 tok/s) step 2511/76294 | train loss 3.633396 | norm 0.2480 | lr 1.18e-03 | (9888.15 ms | 53022 tok/s) step 2512/76294 | train loss 3.523506 | norm 0.4408 | lr 1.18e-03 | (9880.39 ms | 53063 tok/s) step 2513/76294 | train loss 3.573449 | norm 0.1961 | lr 1.18e-03 | (9962.19 ms | 52628 tok/s) step 2514/76294 | train loss 3.618009 | norm 0.1855 | lr 1.18e-03 | (9909.80 ms | 52906 tok/s) step 2515/76294 | train loss 3.570346 | norm 0.1810 | lr 1.18e-03 | (9896.76 ms | 52976 tok/s) step 2516/76294 | train loss 3.571642 | norm 0.1677 | lr 1.18e-03 | (9927.97 ms | 52809 tok/s) step 2517/76294 | train loss 3.610416 | norm 0.1853 | lr 1.18e-03 | (9894.49 ms | 52988 tok/s) step 2518/76294 | train loss 3.541749 | norm 0.1602 | lr 1.18e-03 | (9901.76 ms | 52949 tok/s) step 2519/76294 | train loss 3.522950 | norm 0.1431 | lr 1.18e-03 | (9895.57 ms | 52982 tok/s) step 2520/76294 | train loss 3.646340 | norm 0.1458 | lr 1.18e-03 | (9899.68 ms | 52960 tok/s) step 2521/76294 | train loss 3.552135 | norm 0.1622 | lr 1.18e-03 | (9901.90 ms | 52948 tok/s) step 2522/76294 | train loss 3.551470 | norm 0.1540 | lr 1.18e-03 | (9901.86 ms | 52948 tok/s) step 2523/76294 | train loss 3.583251 | norm 0.1525 | lr 1.18e-03 | (9899.10 ms | 52963 tok/s) step 2524/76294 | train loss 3.585854 | norm 0.1589 | lr 1.18e-03 | (9891.55 ms | 
53004 tok/s) step 2525/76294 | train loss 3.544755 | norm 0.2182 | lr 1.18e-03 | (9897.29 ms | 52973 tok/s) step 2526/76294 | train loss 3.558155 | norm 0.1624 | lr 1.18e-03 | (9968.03 ms | 52597 tok/s) step 2527/76294 | train loss 3.481051 | norm 0.1543 | lr 1.18e-03 | (9893.83 ms | 52991 tok/s) step 2528/76294 | train loss 3.600862 | norm 0.1739 | lr 1.18e-03 | (9896.24 ms | 52979 tok/s) step 2529/76294 | train loss 3.504828 | norm 0.1604 | lr 1.18e-03 | (9893.11 ms | 52995 tok/s) step 2530/76294 | train loss 3.509578 | norm 0.1546 | lr 1.18e-03 | (9896.88 ms | 52975 tok/s) step 2531/76294 | train loss 3.566957 | norm 0.1491 | lr 1.18e-03 | (9893.48 ms | 52993 tok/s) step 2532/76294 | train loss 3.589595 | norm 0.1511 | lr 1.18e-03 | (9896.39 ms | 52978 tok/s) step 2533/76294 | train loss 3.594606 | norm 0.1637 | lr 1.18e-03 | (9892.56 ms | 52998 tok/s) step 2534/76294 | train loss 3.521312 | norm 0.1614 | lr 1.18e-03 | (9900.52 ms | 52956 tok/s) step 2535/76294 | train loss 3.582409 | norm 0.1837 | lr 1.18e-03 | (9891.73 ms | 53003 tok/s) step 2536/76294 | train loss 3.534314 | norm 0.2024 | lr 1.18e-03 | (10971.67 ms | 47786 tok/s) step 2537/76294 | train loss 3.569237 | norm 0.2209 | lr 1.18e-03 | (9901.97 ms | 52948 tok/s) step 2538/76294 | train loss 3.533889 | norm 0.2208 | lr 1.18e-03 | (9893.19 ms | 52995 tok/s) step 2539/76294 | train loss 3.586589 | norm 0.1905 | lr 1.18e-03 | (9891.78 ms | 53002 tok/s) step 2540/76294 | train loss 3.588358 | norm 0.1895 | lr 1.18e-03 | (9886.75 ms | 53029 tok/s) step 2541/76294 | train loss 3.609696 | norm 0.1977 | lr 1.18e-03 | (9888.28 ms | 53021 tok/s) step 2542/76294 | train loss 3.596359 | norm 0.1674 | lr 1.18e-03 | (9894.84 ms | 52986 tok/s) step 2543/76294 | train loss 3.559471 | norm 0.1570 | lr 1.18e-03 | (9894.04 ms | 52990 tok/s) step 2544/76294 | train loss 3.658801 | norm 0.1632 | lr 1.18e-03 | (10177.32 ms | 51515 tok/s) step 2545/76294 | train loss 3.537445 | norm 0.2226 | lr 1.18e-03 | (9884.72 ms | 
53040 tok/s) step 2546/76294 | train loss 3.598415 | norm 0.1877 | lr 1.18e-03 | (9928.91 ms | 52804 tok/s) step 2547/76294 | train loss 3.659971 | norm 0.1817 | lr 1.18e-03 | (9930.13 ms | 52798 tok/s) step 2548/76294 | train loss 3.489448 | norm 0.1793 | lr 1.18e-03 | (9890.11 ms | 53011 tok/s) step 2549/76294 | train loss 3.583601 | norm 0.1892 | lr 1.17e-03 | (9928.34 ms | 52807 tok/s) step 2550/76294 | train loss 3.551489 | norm 0.1538 | lr 1.17e-03 | (9886.30 ms | 53032 tok/s) step 2551/76294 | train loss 3.537964 | norm 0.1696 | lr 1.17e-03 | (9901.05 ms | 52953 tok/s) step 2552/76294 | train loss 3.569534 | norm 0.1626 | lr 1.17e-03 | (9891.86 ms | 53002 tok/s) step 2553/76294 | train loss 3.573304 | norm 0.1659 | lr 1.17e-03 | (9898.15 ms | 52968 tok/s) step 2554/76294 | train loss 3.637412 | norm 0.1546 | lr 1.17e-03 | (9897.28 ms | 52973 tok/s) step 2555/76294 | train loss 3.610552 | norm 0.1602 | lr 1.17e-03 | (9909.94 ms | 52905 tok/s) step 2556/76294 | train loss 3.545188 | norm 0.2049 | lr 1.17e-03 | (9952.49 ms | 52679 tok/s) step 2557/76294 | train loss 3.657024 | norm 0.2115 | lr 1.17e-03 | (9898.78 ms | 52965 tok/s) step 2558/76294 | train loss 3.664395 | norm 0.1826 | lr 1.17e-03 | (9892.85 ms | 52997 tok/s) step 2559/76294 | train loss 3.553725 | norm 0.1714 | lr 1.17e-03 | (9931.88 ms | 52788 tok/s) step 2560/76294 | train loss 3.575091 | norm 0.1610 | lr 1.17e-03 | (9892.29 ms | 53000 tok/s) step 2561/76294 | train loss 3.625437 | norm 0.1783 | lr 1.17e-03 | (9902.97 ms | 52943 tok/s) step 2562/76294 | train loss 3.570997 | norm 0.1698 | lr 1.17e-03 | (9892.82 ms | 52997 tok/s) step 2563/76294 | train loss 3.573593 | norm 0.1512 | lr 1.17e-03 | (9953.60 ms | 52673 tok/s) step 2564/76294 | train loss 3.680922 | norm 0.1890 | lr 1.17e-03 | (9900.77 ms | 52954 tok/s) step 2565/76294 | train loss 3.543433 | norm 0.2118 | lr 1.17e-03 | (9896.37 ms | 52978 tok/s) step 2566/76294 | train loss 3.581417 | norm 0.1763 | lr 1.17e-03 | (9940.99 ms | 
52740 tok/s) step 2567/76294 | train loss 3.541319 | norm 0.1689 | lr 1.17e-03 | (9910.34 ms | 52903 tok/s) step 2568/76294 | train loss 3.551792 | norm 0.1889 | lr 1.17e-03 | (9898.84 ms | 52965 tok/s) step 2569/76294 | train loss 3.541395 | norm 0.1962 | lr 1.17e-03 | (9891.51 ms | 53004 tok/s) step 2570/76294 | train loss 3.565650 | norm 0.1869 | lr 1.17e-03 | (9891.67 ms | 53003 tok/s) step 2571/76294 | train loss 3.558446 | norm 0.1585 | lr 1.17e-03 | (9898.31 ms | 52967 tok/s) step 2572/76294 | train loss 3.584666 | norm 0.1662 | lr 1.17e-03 | (9894.92 ms | 52986 tok/s) step 2573/76294 | train loss 3.524556 | norm 0.1947 | lr 1.17e-03 | (9899.07 ms | 52963 tok/s) step 2574/76294 | train loss 3.577327 | norm 0.1539 | lr 1.17e-03 | (9888.10 ms | 53022 tok/s) step 2575/76294 | train loss 3.567201 | norm 0.1825 | lr 1.17e-03 | (9899.90 ms | 52959 tok/s) step 2576/76294 | train loss 3.537301 | norm 0.2021 | lr 1.17e-03 | (9895.89 ms | 52980 tok/s) step 2577/76294 | train loss 3.573209 | norm 0.1635 | lr 1.17e-03 | (9965.96 ms | 52608 tok/s) step 2578/76294 | train loss 3.620302 | norm 0.1511 | lr 1.17e-03 | (9892.82 ms | 52997 tok/s) step 2579/76294 | train loss 3.624717 | norm 0.1593 | lr 1.17e-03 | (9893.85 ms | 52991 tok/s) step 2580/76294 | train loss 3.622936 | norm 0.1602 | lr 1.17e-03 | (9929.50 ms | 52801 tok/s) step 2581/76294 | train loss 3.631229 | norm 0.1651 | lr 1.17e-03 | (9888.90 ms | 53018 tok/s) step 2582/76294 | train loss 3.565043 | norm 0.1595 | lr 1.17e-03 | (9911.64 ms | 52896 tok/s) step 2583/76294 | train loss 3.535814 | norm 0.1689 | lr 1.17e-03 | (9895.81 ms | 52981 tok/s) step 2584/76294 | train loss 3.577200 | norm 0.1775 | lr 1.17e-03 | (9891.44 ms | 53004 tok/s) step 2585/76294 | train loss 3.567640 | norm 0.1764 | lr 1.17e-03 | (9895.93 ms | 52980 tok/s) step 2586/76294 | train loss 3.536964 | norm 0.2595 | lr 1.17e-03 | (10268.28 ms | 51059 tok/s) step 2587/76294 | train loss 3.574563 | norm 0.2349 | lr 1.17e-03 | (9896.42 ms | 
52978 tok/s) step 2588/76294 | train loss 3.595883 | norm 0.1716 | lr 1.17e-03 | (9887.60 ms | 53025 tok/s) step 2589/76294 | train loss 3.528881 | norm 0.1801 | lr 1.17e-03 | (9893.19 ms | 52995 tok/s) step 2590/76294 | train loss 3.598533 | norm 0.1788 | lr 1.17e-03 | (9890.20 ms | 53011 tok/s) step 2591/76294 | train loss 3.520195 | norm 0.1523 | lr 1.17e-03 | (9895.63 ms | 52982 tok/s) step 2592/76294 | train loss 3.596833 | norm 0.1655 | lr 1.17e-03 | (9886.64 ms | 53030 tok/s) step 2593/76294 | train loss 3.534905 | norm 0.1861 | lr 1.17e-03 | (9894.82 ms | 52986 tok/s) step 2594/76294 | train loss 3.622366 | norm 0.1957 | lr 1.17e-03 | (9906.20 ms | 52925 tok/s) step 2595/76294 | train loss 3.500116 | norm 0.1609 | lr 1.17e-03 | (9892.02 ms | 53001 tok/s) step 2596/76294 | train loss 3.599390 | norm 0.1788 | lr 1.17e-03 | (9889.74 ms | 53013 tok/s) step 2597/76294 | train loss 3.569846 | norm 0.1898 | lr 1.17e-03 | (9892.17 ms | 53000 tok/s) step 2598/76294 | train loss 3.576607 | norm 0.1899 | lr 1.17e-03 | (9888.51 ms | 53020 tok/s) step 2599/76294 | train loss 3.563333 | norm 0.1952 | lr 1.17e-03 | (9949.05 ms | 52697 tok/s) step 2600/76294 | train loss 3.588111 | norm 0.1693 | lr 1.17e-03 | (9906.01 ms | 52926 tok/s) step 2601/76294 | train loss 3.546257 | norm 0.1752 | lr 1.17e-03 | (9900.04 ms | 52958 tok/s) step 2602/76294 | train loss 3.540616 | norm 0.1475 | lr 1.17e-03 | (9891.62 ms | 53003 tok/s) step 2603/76294 | train loss 3.617390 | norm 0.1852 | lr 1.17e-03 | (9899.92 ms | 52959 tok/s) step 2604/76294 | train loss 3.637073 | norm 0.2231 | lr 1.17e-03 | (9894.56 ms | 52988 tok/s) step 2605/76294 | train loss 3.604201 | norm 0.1720 | lr 1.17e-03 | (9897.29 ms | 52973 tok/s) step 2606/76294 | train loss 3.610801 | norm 0.1536 | lr 1.17e-03 | (9915.13 ms | 52878 tok/s) step 2607/76294 | train loss 3.651378 | norm 0.1691 | lr 1.17e-03 | (9966.45 ms | 52605 tok/s) step 2608/76294 | train loss 3.556275 | norm 0.1993 | lr 1.17e-03 | (9886.51 ms | 
53031 tok/s) step 2609/76294 | train loss 3.566746 | norm 0.1764 | lr 1.17e-03 | (9960.23 ms | 52638 tok/s) step 2610/76294 | train loss 3.596210 | norm 0.2102 | lr 1.17e-03 | (9888.38 ms | 53021 tok/s) step 2611/76294 | train loss 3.581025 | norm 0.2307 | lr 1.17e-03 | (9886.66 ms | 53030 tok/s) step 2612/76294 | train loss 3.627417 | norm 0.1755 | lr 1.17e-03 | (9954.64 ms | 52668 tok/s) step 2613/76294 | train loss 3.502614 | norm 0.1646 | lr 1.17e-03 | (9892.97 ms | 52996 tok/s) step 2614/76294 | train loss 3.628403 | norm 0.1697 | lr 1.17e-03 | (9882.44 ms | 53052 tok/s) step 2615/76294 | train loss 3.522007 | norm 0.1583 | lr 1.17e-03 | (9891.93 ms | 53002 tok/s) step 2616/76294 | train loss 3.583829 | norm 0.1587 | lr 1.17e-03 | (9888.20 ms | 53022 tok/s) step 2617/76294 | train loss 3.528924 | norm 0.1804 | lr 1.17e-03 | (9972.67 ms | 52572 tok/s) step 2618/76294 | train loss 3.516846 | norm 0.1734 | lr 1.17e-03 | (9884.55 ms | 53041 tok/s) step 2619/76294 | train loss 3.543782 | norm 0.1576 | lr 1.17e-03 | (9937.32 ms | 52760 tok/s) step 2620/76294 | train loss 3.589793 | norm 0.1598 | lr 1.17e-03 | (9882.91 ms | 53050 tok/s) step 2621/76294 | train loss 3.539796 | norm 0.1588 | lr 1.17e-03 | (9881.22 ms | 53059 tok/s) step 2622/76294 | train loss 3.616461 | norm 0.1596 | lr 1.17e-03 | (9879.10 ms | 53070 tok/s) step 2623/76294 | train loss 3.559262 | norm 0.1568 | lr 1.17e-03 | (9892.80 ms | 52997 tok/s) step 2624/76294 | train loss 3.606287 | norm 0.1431 | lr 1.17e-03 | (9925.40 ms | 52823 tok/s) step 2625/76294 | train loss 3.479927 | norm 0.1526 | lr 1.17e-03 | (9888.25 ms | 53021 tok/s) step 2626/76294 | train loss 3.572836 | norm 0.1325 | lr 1.17e-03 | (9894.65 ms | 52987 tok/s) step 2627/76294 | train loss 3.531499 | norm 0.1757 | lr 1.17e-03 | (9884.60 ms | 53041 tok/s) step 2628/76294 | train loss 3.575397 | norm 0.1605 | lr 1.17e-03 | (9890.87 ms | 53007 tok/s) step 2629/76294 | train loss 3.512169 | norm 0.1573 | lr 1.17e-03 | (9907.26 ms | 
52920 tok/s) step 2630/76294 | train loss 3.505292 | norm 0.1587 | lr 1.17e-03 | (9916.50 ms | 52870 tok/s) step 2631/76294 | train loss 3.562487 | norm 0.1699 | lr 1.17e-03 | (9890.48 ms | 53009 tok/s) step 2632/76294 | train loss 3.515726 | norm 0.1647 | lr 1.17e-03 | (10286.41 ms | 50969 tok/s) step 2633/76294 | train loss 3.591362 | norm 0.1674 | lr 1.17e-03 | (9983.48 ms | 52516 tok/s) step 2634/76294 | train loss 3.481684 | norm 0.1778 | lr 1.17e-03 | (11063.42 ms | 47389 tok/s) step 2635/76294 | train loss 3.562362 | norm 0.1630 | lr 1.17e-03 | (9925.88 ms | 52820 tok/s) step 2636/76294 | train loss 3.494616 | norm 0.1875 | lr 1.17e-03 | (9909.94 ms | 52905 tok/s) step 2637/76294 | train loss 3.594008 | norm 0.1984 | lr 1.17e-03 | (9886.22 ms | 53032 tok/s) step 2638/76294 | train loss 3.563700 | norm 0.4286 | lr 1.17e-03 | (9883.05 ms | 53049 tok/s) step 2639/76294 | train loss 3.531250 | norm 0.3805 | lr 1.17e-03 | (9901.73 ms | 52949 tok/s) step 2640/76294 | train loss 3.554598 | norm 0.2509 | lr 1.17e-03 | (9916.36 ms | 52871 tok/s) step 2641/76294 | train loss 3.552454 | norm 0.2091 | lr 1.17e-03 | (9885.50 ms | 53036 tok/s) step 2642/76294 | train loss 3.523901 | norm 0.2076 | lr 1.17e-03 | (9909.58 ms | 52907 tok/s) step 2643/76294 | train loss 3.532396 | norm 0.1781 | lr 1.17e-03 | (9890.71 ms | 53008 tok/s) step 2644/76294 | train loss 3.536231 | norm 0.1762 | lr 1.17e-03 | (9892.99 ms | 52996 tok/s) step 2645/76294 | train loss 3.553622 | norm 0.1681 | lr 1.17e-03 | (9888.73 ms | 53019 tok/s) step 2646/76294 | train loss 3.522356 | norm 0.1512 | lr 1.17e-03 | (9890.85 ms | 53007 tok/s) step 2647/76294 | train loss 3.550400 | norm 0.1452 | lr 1.17e-03 | (9887.04 ms | 53028 tok/s) step 2648/76294 | train loss 3.572194 | norm 0.1761 | lr 1.17e-03 | (9888.47 ms | 53020 tok/s) step 2649/76294 | train loss 3.563656 | norm 0.7901 | lr 1.17e-03 | (9887.60 ms | 53025 tok/s) step 2650/76294 | train loss 3.525322 | norm 0.2062 | lr 1.17e-03 | (9893.63 ms | 
52992 tok/s) step 2651/76294 | train loss 3.570272 | norm 0.1910 | lr 1.17e-03 | (9887.63 ms | 53025 tok/s) step 2652/76294 | train loss 3.514792 | norm 0.1828 | lr 1.17e-03 | (9930.32 ms | 52797 tok/s) step 2653/76294 | train loss 3.658908 | norm 0.2045 | lr 1.17e-03 | (9963.72 ms | 52620 tok/s) step 2654/76294 | train loss 3.494637 | norm 0.2044 | lr 1.17e-03 | (9891.89 ms | 53002 tok/s) step 2655/76294 | train loss 3.511941 | norm 0.1601 | lr 1.17e-03 | (9891.97 ms | 53001 tok/s) step 2656/76294 | train loss 3.525280 | norm 0.1701 | lr 1.17e-03 | (9919.35 ms | 52855 tok/s) step 2657/76294 | train loss 3.576534 | norm 0.1591 | lr 1.17e-03 | (9889.67 ms | 53014 tok/s) step 2658/76294 | train loss 3.509144 | norm 0.1601 | lr 1.17e-03 | (9878.82 ms | 53072 tok/s) step 2659/76294 | train loss 3.577902 | norm 0.1443 | lr 1.17e-03 | (9890.40 ms | 53010 tok/s) step 2660/76294 | train loss 3.508217 | norm 0.1595 | lr 1.17e-03 | (9925.64 ms | 52822 tok/s) step 2661/76294 | train loss 3.502828 | norm 0.1520 | lr 1.17e-03 | (9888.58 ms | 53020 tok/s) step 2662/76294 | train loss 3.542877 | norm 0.1566 | lr 1.17e-03 | (9899.52 ms | 52961 tok/s) step 2663/76294 | train loss 3.611095 | norm 0.1431 | lr 1.17e-03 | (9959.24 ms | 52643 tok/s) step 2664/76294 | train loss 3.503281 | norm 0.1886 | lr 1.17e-03 | (9887.36 ms | 53026 tok/s) step 2665/76294 | train loss 3.516939 | norm 0.2293 | lr 1.17e-03 | (9978.11 ms | 52544 tok/s) step 2666/76294 | train loss 3.502275 | norm 0.1861 | lr 1.17e-03 | (9881.28 ms | 53059 tok/s) step 2667/76294 | train loss 3.685155 | norm 0.2140 | lr 1.17e-03 | (9884.48 ms | 53042 tok/s) step 2668/76294 | train loss 3.542377 | norm 0.1785 | lr 1.17e-03 | (9941.41 ms | 52738 tok/s) step 2669/76294 | train loss 3.505135 | norm 0.2042 | lr 1.17e-03 | (9890.94 ms | 53007 tok/s) step 2670/76294 | train loss 3.510550 | norm 0.2389 | lr 1.17e-03 | (9892.98 ms | 52996 tok/s) step 2671/76294 | train loss 3.624490 | norm 0.1649 | lr 1.17e-03 | (10042.51 ms | 
52207 tok/s) step 2672/76294 | train loss 3.570650 | norm 0.1593 | lr 1.17e-03 | (9890.41 ms | 53010 tok/s) step 2673/76294 | train loss 3.541358 | norm 0.1603 | lr 1.17e-03 | (9896.08 ms | 52979 tok/s) step 2674/76294 | train loss 3.652866 | norm 0.1544 | lr 1.17e-03 | (9887.64 ms | 53025 tok/s) step 2675/76294 | train loss 3.506002 | norm 0.1484 | lr 1.17e-03 | (9900.73 ms | 52954 tok/s) step 2676/76294 | train loss 3.512845 | norm 0.1494 | lr 1.17e-03 | (9954.00 ms | 52671 tok/s) step 2677/76294 | train loss 3.550810 | norm 0.1394 | lr 1.17e-03 | (9889.41 ms | 53015 tok/s) step 2678/76294 | train loss 3.490631 | norm 0.1460 | lr 1.17e-03 | (9951.05 ms | 52687 tok/s) step 2679/76294 | train loss 3.562811 | norm 0.1580 | lr 1.17e-03 | (9887.01 ms | 53028 tok/s) step 2680/76294 | train loss 3.556265 | norm 0.1633 | lr 1.17e-03 | (9883.35 ms | 53048 tok/s) step 2681/76294 | train loss 3.516839 | norm 0.1600 | lr 1.17e-03 | (9897.42 ms | 52972 tok/s) step 2682/76294 | train loss 3.538831 | norm 0.1566 | lr 1.17e-03 | (9958.33 ms | 52648 tok/s) step 2683/76294 | train loss 3.587371 | norm 0.1541 | lr 1.17e-03 | (9891.22 ms | 53005 tok/s) step 2684/76294 | train loss 3.531448 | norm 0.1455 | lr 1.17e-03 | (9899.43 ms | 52961 tok/s) step 2685/76294 | train loss 3.527652 | norm 0.1675 | lr 1.17e-03 | (9891.19 ms | 53006 tok/s) step 2686/76294 | train loss 3.407069 | norm 0.1779 | lr 1.17e-03 | (9892.56 ms | 52998 tok/s) step 2687/76294 | train loss 3.563741 | norm 0.1668 | lr 1.17e-03 | (9893.05 ms | 52996 tok/s) step 2688/76294 | train loss 3.552995 | norm 0.1764 | lr 1.17e-03 | (9925.84 ms | 52821 tok/s) step 2689/76294 | train loss 3.686978 | norm 0.1791 | lr 1.17e-03 | (9891.43 ms | 53004 tok/s) step 2690/76294 | train loss 3.582597 | norm 0.1872 | lr 1.17e-03 | (9941.29 ms | 52738 tok/s) step 2691/76294 | train loss 3.542553 | norm 0.1680 | lr 1.17e-03 | (9888.71 ms | 53019 tok/s) step 2692/76294 | train loss 3.544579 | norm 0.1692 | lr 1.17e-03 | (9899.12 ms | 
52963 tok/s) step 2693/76294 | train loss 3.486344 | norm 0.1643 | lr 1.17e-03 | (9932.18 ms | 52787 tok/s) step 2694/76294 | train loss 3.563970 | norm 0.1656 | lr 1.17e-03 | (9883.55 ms | 53047 tok/s) step 2695/76294 | train loss 3.565161 | norm 0.1569 | lr 1.17e-03 | (9896.74 ms | 52976 tok/s) step 2696/76294 | train loss 3.558396 | norm 0.1492 | lr 1.17e-03 | (9886.21 ms | 53032 tok/s) step 2697/76294 | train loss 3.554367 | norm 0.1724 | lr 1.17e-03 | (9897.55 ms | 52972 tok/s) step 2698/76294 | train loss 3.604146 | norm 0.1583 | lr 1.17e-03 | (9913.26 ms | 52888 tok/s) step 2699/76294 | train loss 3.588570 | norm 0.1402 | lr 1.17e-03 | (9896.02 ms | 52980 tok/s) step 2700/76294 | train loss 3.541602 | norm 0.1752 | lr 1.17e-03 | (9885.95 ms | 53034 tok/s) step 2701/76294 | train loss 3.646591 | norm 0.1703 | lr 1.17e-03 | (9896.35 ms | 52978 tok/s) step 2702/76294 | train loss 3.507833 | norm 0.1507 | lr 1.17e-03 | (9892.31 ms | 53000 tok/s) step 2703/76294 | train loss 3.584451 | norm 0.1671 | lr 1.17e-03 | (9932.78 ms | 52784 tok/s) step 2704/76294 | train loss 3.511557 | norm 0.1977 | lr 1.17e-03 | (9888.17 ms | 53022 tok/s) step 2705/76294 | train loss 3.518726 | norm 0.1505 | lr 1.17e-03 | (9899.82 ms | 52959 tok/s) step 2706/76294 | train loss 3.522043 | norm 0.1770 | lr 1.17e-03 | (9890.35 ms | 53010 tok/s) step 2707/76294 | train loss 3.497014 | norm 0.1619 | lr 1.17e-03 | (9894.90 ms | 52986 tok/s) step 2708/76294 | train loss 3.518576 | norm 0.1722 | lr 1.17e-03 | (9956.13 ms | 52660 tok/s) step 2709/76294 | train loss 3.546255 | norm 0.2225 | lr 1.17e-03 | (9887.70 ms | 53024 tok/s) step 2710/76294 | train loss 3.556259 | norm 0.1874 | lr 1.17e-03 | (9893.81 ms | 52992 tok/s) step 2711/76294 | train loss 3.546549 | norm 0.1898 | lr 1.17e-03 | (9895.99 ms | 52980 tok/s) step 2712/76294 | train loss 3.605923 | norm 0.2290 | lr 1.17e-03 | (9891.51 ms | 53004 tok/s) step 2713/76294 | train loss 3.555892 | norm 0.1698 | lr 1.17e-03 | (10229.48 ms | 
51253 tok/s) step 2714/76294 | train loss 3.603116 | norm 0.1976 | lr 1.17e-03 | (9886.13 ms | 53033 tok/s) step 2715/76294 | train loss 3.447867 | norm 0.1705 | lr 1.17e-03 | (10843.81 ms | 48349 tok/s) step 2716/76294 | train loss 3.539882 | norm 0.1884 | lr 1.17e-03 | (9883.26 ms | 53048 tok/s) step 2717/76294 | train loss 3.509584 | norm 0.1722 | lr 1.17e-03 | (9911.07 ms | 52899 tok/s) step 2718/76294 | train loss 3.536076 | norm 0.1457 | lr 1.17e-03 | (9933.30 ms | 52781 tok/s) step 2719/76294 | train loss 3.518673 | norm 0.1752 | lr 1.17e-03 | (9887.35 ms | 53026 tok/s) step 2720/76294 | train loss 3.573051 | norm 0.1696 | lr 1.17e-03 | (9892.74 ms | 52997 tok/s) step 2721/76294 | train loss 3.513886 | norm 0.1918 | lr 1.17e-03 | (9882.15 ms | 53054 tok/s) step 2722/76294 | train loss 3.565999 | norm 0.1891 | lr 1.17e-03 | (9950.94 ms | 52687 tok/s) step 2723/76294 | train loss 3.558125 | norm 0.1692 | lr 1.17e-03 | (9892.73 ms | 52997 tok/s) step 2724/76294 | train loss 3.572410 | norm 0.1604 | lr 1.17e-03 | (9891.44 ms | 53004 tok/s) step 2725/76294 | train loss 3.592704 | norm 0.1573 | lr 1.17e-03 | (9933.55 ms | 52780 tok/s) step 2726/76294 | train loss 3.529791 | norm 0.1590 | lr 1.17e-03 | (9928.89 ms | 52804 tok/s) step 2727/76294 | train loss 3.616736 | norm 0.1627 | lr 1.17e-03 | (9888.54 ms | 53020 tok/s) step 2728/76294 | train loss 3.535813 | norm 0.1763 | lr 1.17e-03 | (9897.67 ms | 52971 tok/s) step 2729/76294 | train loss 3.559450 | norm 0.1476 | lr 1.17e-03 | (9892.63 ms | 52998 tok/s) step 2730/76294 | train loss 3.527946 | norm 0.1657 | lr 1.17e-03 | (9883.51 ms | 53047 tok/s) step 2731/76294 | train loss 3.498607 | norm 0.1420 | lr 1.17e-03 | (9890.60 ms | 53009 tok/s) step 2732/76294 | train loss 3.524917 | norm 0.1660 | lr 1.17e-03 | (11095.91 ms | 47251 tok/s) step 2733/76294 | train loss 3.530262 | norm 0.1439 | lr 1.17e-03 | (9887.96 ms | 53023 tok/s) step 2734/76294 | train loss 3.561043 | norm 0.1624 | lr 1.17e-03 | (9882.46 ms | 
53052 tok/s) step 2735/76294 | train loss 3.585973 | norm 0.1914 | lr 1.17e-03 | (9894.92 ms | 52986 tok/s) step 2736/76294 | train loss 3.519659 | norm 0.2368 | lr 1.17e-03 | (9928.05 ms | 52809 tok/s) step 2737/76294 | train loss 3.482760 | norm 0.1996 | lr 1.17e-03 | (9896.08 ms | 52979 tok/s) step 2738/76294 | train loss 3.486141 | norm 0.1612 | lr 1.17e-03 | (9900.86 ms | 52954 tok/s) step 2739/76294 | train loss 3.534907 | norm 0.1973 | lr 1.17e-03 | (9903.73 ms | 52938 tok/s) step 2740/76294 | train loss 3.581270 | norm 0.1631 | lr 1.17e-03 | (9896.66 ms | 52976 tok/s) step 2741/76294 | train loss 3.551189 | norm 0.1652 | lr 1.17e-03 | (9914.63 ms | 52880 tok/s) step 2742/76294 | train loss 3.563319 | norm 0.1678 | lr 1.17e-03 | (13525.91 ms | 38762 tok/s) step 2743/76294 | train loss 3.492710 | norm 0.1524 | lr 1.17e-03 | (9861.90 ms | 53163 tok/s) step 2744/76294 | train loss 3.562979 | norm 0.1695 | lr 1.17e-03 | (9939.03 ms | 52750 tok/s) step 2745/76294 | train loss 3.515162 | norm 0.1788 | lr 1.17e-03 | (9894.65 ms | 52987 tok/s) step 2746/76294 | train loss 3.559653 | norm 0.1426 | lr 1.17e-03 | (9880.60 ms | 53062 tok/s) step 2747/76294 | train loss 3.560044 | norm 0.1736 | lr 1.17e-03 | (11068.25 ms | 47369 tok/s) step 2748/76294 | train loss 3.522446 | norm 0.1633 | lr 1.17e-03 | (9888.43 ms | 53020 tok/s) step 2749/76294 | train loss 3.566180 | norm 0.1710 | lr 1.17e-03 | (9880.41 ms | 53063 tok/s) step 2750/76294 | train loss 3.485949 | norm 0.1650 | lr 1.17e-03 | (9880.52 ms | 53063 tok/s) val loss: 3.517690 saving model checkpoint to ./results/gpt2-350M-gqa/step_2750.pth step 2751/76294 | train loss 3.573099 | norm 0.1527 | lr 1.17e-03 | (9992.03 ms | 52471 tok/s) step 2752/76294 | train loss 3.474056 | norm 0.1677 | lr 1.17e-03 | (9861.26 ms | 53166 tok/s) step 2753/76294 | train loss 3.518146 | norm 0.1628 | lr 1.17e-03 | (9873.34 ms | 53101 tok/s) step 2754/76294 | train loss 3.501002 | norm 0.1367 | lr 1.17e-03 | (9885.09 ms | 53038 tok/s) 
step 2755/76294 | train loss 3.519011 | norm 0.1783 | lr 1.17e-03 | (9875.82 ms | 53088 tok/s) step 2756/76294 | train loss 3.485121 | norm 0.1701 | lr 1.17e-03 | (9886.72 ms | 53030 tok/s) step 2757/76294 | train loss 3.580116 | norm 0.1837 | lr 1.17e-03 | (9884.96 ms | 53039 tok/s) step 2758/76294 | train loss 3.521688 | norm 0.1856 | lr 1.17e-03 | (9894.17 ms | 52990 tok/s) step 2759/76294 | train loss 3.560939 | norm 0.1835 | lr 1.17e-03 | (9890.17 ms | 53011 tok/s) step 2760/76294 | train loss 3.536170 | norm 0.1627 | lr 1.17e-03 | (9896.89 ms | 52975 tok/s) step 2761/76294 | train loss 3.518028 | norm 0.1685 | lr 1.17e-03 | (9932.06 ms | 52787 tok/s) step 2762/76294 | train loss 3.531717 | norm 0.1674 | lr 1.17e-03 | (9888.09 ms | 53022 tok/s) step 2763/76294 | train loss 3.535476 | norm 0.1592 | lr 1.17e-03 | (9908.14 ms | 52915 tok/s) step 2764/76294 | train loss 3.550271 | norm 0.1846 | lr 1.17e-03 | (9917.50 ms | 52865 tok/s) step 2765/76294 | train loss 3.518252 | norm 0.1932 | lr 1.17e-03 | (9903.87 ms | 52938 tok/s) step 2766/76294 | train loss 3.508039 | norm 0.2021 | lr 1.17e-03 | (9964.04 ms | 52618 tok/s) step 2767/76294 | train loss 3.542503 | norm 0.1627 | lr 1.17e-03 | (9892.95 ms | 52996 tok/s) step 2768/76294 | train loss 3.488670 | norm 0.2071 | lr 1.17e-03 | (9918.01 ms | 52862 tok/s) step 2769/76294 | train loss 3.557490 | norm 0.1692 | lr 1.17e-03 | (9900.09 ms | 52958 tok/s) step 2770/76294 | train loss 3.482282 | norm 0.1701 | lr 1.17e-03 | (9890.81 ms | 53008 tok/s) step 2771/76294 | train loss 3.514597 | norm 0.1622 | lr 1.17e-03 | (9908.68 ms | 52912 tok/s) step 2772/76294 | train loss 3.548455 | norm 0.1659 | lr 1.17e-03 | (9928.36 ms | 52807 tok/s) step 2773/76294 | train loss 3.533083 | norm 0.1595 | lr 1.17e-03 | (9893.33 ms | 52994 tok/s) step 2774/76294 | train loss 3.499690 | norm 0.1805 | lr 1.17e-03 | (9912.55 ms | 52891 tok/s) step 2775/76294 | train loss 3.511681 | norm 0.2000 | lr 1.17e-03 | (9957.19 ms | 52654 tok/s) step 
2776/76294 | train loss 3.598676 | norm 0.1789 | lr 1.17e-03 | (10733.68 ms | 48845 tok/s) step 2777/76294 | train loss 3.539586 | norm 0.1877 | lr 1.17e-03 | (9950.36 ms | 52690 tok/s) step 2778/76294 | train loss 3.468057 | norm 0.1684 | lr 1.17e-03 | (9887.43 ms | 53026 tok/s) step 2779/76294 | train loss 3.566357 | norm 0.1604 | lr 1.17e-03 | (9884.20 ms | 53043 tok/s) step 2780/76294 | train loss 3.491490 | norm 0.1788 | lr 1.17e-03 | (9903.31 ms | 52941 tok/s) step 2781/76294 | train loss 3.528217 | norm 0.1406 | lr 1.17e-03 | (9893.85 ms | 52991 tok/s) step 2782/76294 | train loss 3.511804 | norm 0.1565 | lr 1.17e-03 | (9890.05 ms | 53012 tok/s) step 2783/76294 | train loss 3.562195 | norm 0.1539 | lr 1.17e-03 | (9892.03 ms | 53001 tok/s) step 2784/76294 | train loss 3.552342 | norm 0.1598 | lr 1.17e-03 | (9893.02 ms | 52996 tok/s) step 2785/76294 | train loss 3.519044 | norm 0.1685 | lr 1.17e-03 | (9966.28 ms | 52606 tok/s) step 2786/76294 | train loss 3.572458 | norm 0.1667 | lr 1.17e-03 | (11738.54 ms | 44664 tok/s) step 2787/76294 | train loss 3.498328 | norm 0.1792 | lr 1.17e-03 | (9938.51 ms | 52753 tok/s) step 2788/76294 | train loss 3.638925 | norm 0.1503 | lr 1.17e-03 | (9881.60 ms | 53057 tok/s) step 2789/76294 | train loss 3.484958 | norm 0.1777 | lr 1.17e-03 | (9885.07 ms | 53038 tok/s) step 2790/76294 | train loss 3.451149 | norm 0.1498 | lr 1.17e-03 | (9926.01 ms | 52820 tok/s) step 2791/76294 | train loss 3.542836 | norm 0.1645 | lr 1.17e-03 | (9923.56 ms | 52833 tok/s) step 2792/76294 | train loss 3.507656 | norm 0.1543 | lr 1.17e-03 | (9891.21 ms | 53005 tok/s) step 2793/76294 | train loss 3.553571 | norm 0.1620 | lr 1.17e-03 | (9898.99 ms | 52964 tok/s) step 2794/76294 | train loss 3.497417 | norm 0.1617 | lr 1.17e-03 | (10059.30 ms | 52120 tok/s) step 2795/76294 | train loss 3.585208 | norm 0.1692 | lr 1.17e-03 | (9956.07 ms | 52660 tok/s) step 2796/76294 | train loss 3.480935 | norm 0.1641 | lr 1.17e-03 | (9919.15 ms | 52856 tok/s) step 
2797/76294 | train loss 3.438093 | norm 0.1501 | lr 1.17e-03 | (10064.92 ms | 52091 tok/s) step 2798/76294 | train loss 3.682758 | norm 0.1641 | lr 1.17e-03 | (9893.32 ms | 52994 tok/s) step 2799/76294 | train loss 3.481277 | norm 0.1785 | lr 1.17e-03 | (9902.11 ms | 52947 tok/s) step 2800/76294 | train loss 3.511394 | norm 0.2018 | lr 1.17e-03 | (9897.00 ms | 52974 tok/s) step 2801/76294 | train loss 3.502843 | norm 0.1940 | lr 1.17e-03 | (9928.35 ms | 52807 tok/s) step 2802/76294 | train loss 3.586166 | norm 0.1619 | lr 1.17e-03 | (9891.37 ms | 53005 tok/s) step 2803/76294 | train loss 3.579232 | norm 0.1868 | lr 1.17e-03 | (9899.24 ms | 52962 tok/s) step 2804/76294 | train loss 3.518522 | norm 0.1764 | lr 1.17e-03 | (9896.66 ms | 52976 tok/s) step 2805/76294 | train loss 3.530679 | norm 0.1851 | lr 1.17e-03 | (9893.36 ms | 52994 tok/s) step 2806/76294 | train loss 3.482801 | norm 0.1949 | lr 1.17e-03 | (9890.02 ms | 53012 tok/s) step 2807/76294 | train loss 3.571089 | norm 0.1677 | lr 1.17e-03 | (9889.80 ms | 53013 tok/s) step 2808/76294 | train loss 3.511246 | norm 0.1566 | lr 1.17e-03 | (9902.67 ms | 52944 tok/s) step 2809/76294 | train loss 3.511791 | norm 0.1666 | lr 1.17e-03 | (9888.49 ms | 53020 tok/s) step 2810/76294 | train loss 3.510688 | norm 0.1648 | lr 1.17e-03 | (9893.74 ms | 52992 tok/s) step 2811/76294 | train loss 3.523079 | norm 0.1884 | lr 1.17e-03 | (9892.40 ms | 52999 tok/s) step 2812/76294 | train loss 3.491316 | norm 0.1510 | lr 1.17e-03 | (9890.92 ms | 53007 tok/s) step 2813/76294 | train loss 3.485846 | norm 0.1668 | lr 1.17e-03 | (9888.09 ms | 53022 tok/s) step 2814/76294 | train loss 3.557091 | norm 0.1616 | lr 1.17e-03 | (9912.95 ms | 52889 tok/s) step 2815/76294 | train loss 3.523922 | norm 0.1735 | lr 1.17e-03 | (9916.94 ms | 52868 tok/s) step 2816/76294 | train loss 3.508591 | norm 0.1613 | lr 1.17e-03 | (9908.61 ms | 52912 tok/s) step 2817/76294 | train loss 3.503129 | norm 0.1794 | lr 1.17e-03 | (9890.13 ms | 53011 tok/s) step 
2818/76294 | train loss 3.513266 | norm 0.1832 | lr 1.17e-03 | (9933.44 ms | 52780 tok/s) step 2819/76294 | train loss 3.577847 | norm 0.1964 | lr 1.17e-03 | (9954.61 ms | 52668 tok/s) step 2820/76294 | train loss 3.517009 | norm 0.1976 | lr 1.17e-03 | (9888.04 ms | 53022 tok/s) step 2821/76294 | train loss 3.502774 | norm 0.1589 | lr 1.17e-03 | (9887.74 ms | 53024 tok/s) step 2822/76294 | train loss 3.465446 | norm 0.1637 | lr 1.17e-03 | (9887.56 ms | 53025 tok/s) step 2823/76294 | train loss 3.480090 | norm 0.1881 | lr 1.17e-03 | (9932.77 ms | 52784 tok/s) step 2824/76294 | train loss 3.561912 | norm 0.1395 | lr 1.17e-03 | (9881.62 ms | 53057 tok/s) step 2825/76294 | train loss 3.521800 | norm 0.1884 | lr 1.17e-03 | (9896.69 ms | 52976 tok/s) step 2826/76294 | train loss 3.542807 | norm 0.1698 | lr 1.17e-03 | (9901.13 ms | 52952 tok/s) step 2827/76294 | train loss 3.459055 | norm 0.1596 | lr 1.17e-03 | (9953.34 ms | 52675 tok/s) step 2828/76294 | train loss 3.557976 | norm 0.1897 | lr 1.17e-03 | (9888.02 ms | 53023 tok/s) step 2829/76294 | train loss 3.496609 | norm 0.1908 | lr 1.17e-03 | (11451.07 ms | 45785 tok/s) step 2830/76294 | train loss 3.546299 | norm 0.1632 | lr 1.17e-03 | (9869.50 ms | 53122 tok/s) step 2831/76294 | train loss 3.497573 | norm 0.1820 | lr 1.17e-03 | (9889.25 ms | 53016 tok/s) step 2832/76294 | train loss 3.442744 | norm 0.2206 | lr 1.17e-03 | (9950.35 ms | 52690 tok/s) step 2833/76294 | train loss 3.551583 | norm 0.1697 | lr 1.17e-03 | (9922.05 ms | 52841 tok/s) step 2834/76294 | train loss 3.479706 | norm 0.1996 | lr 1.17e-03 | (9919.49 ms | 52854 tok/s) step 2835/76294 | train loss 3.502095 | norm 0.2058 | lr 1.17e-03 | (9884.48 ms | 53042 tok/s) step 2836/76294 | train loss 3.584327 | norm 0.1776 | lr 1.17e-03 | (9877.29 ms | 53080 tok/s) step 2837/76294 | train loss 3.485012 | norm 0.1849 | lr 1.17e-03 | (9902.61 ms | 52944 tok/s) step 2838/76294 | train loss 3.512440 | norm 0.1870 | lr 1.17e-03 | (9930.24 ms | 52797 tok/s) step 
2839/76294 | train loss 3.495137 | norm 0.1673 | lr 1.17e-03 | (9892.15 ms | 53000 tok/s) step 2840/76294 | train loss 3.475681 | norm 0.1927 | lr 1.17e-03 | (9894.37 ms | 52989 tok/s) step 2841/76294 | train loss 3.530400 | norm 0.1768 | lr 1.17e-03 | (9893.65 ms | 52992 tok/s) step 2842/76294 | train loss 3.468727 | norm 0.1883 | lr 1.17e-03 | (9931.41 ms | 52791 tok/s) step 2843/76294 | train loss 3.461743 | norm 0.1910 | lr 1.17e-03 | (9888.16 ms | 53022 tok/s) step 2844/76294 | train loss 3.469610 | norm 0.1538 | lr 1.17e-03 | (9897.60 ms | 52971 tok/s) step 2845/76294 | train loss 3.556380 | norm 0.2024 | lr 1.17e-03 | (9900.45 ms | 52956 tok/s) step 2846/76294 | train loss 3.547345 | norm 0.2071 | lr 1.17e-03 | (9964.74 ms | 52614 tok/s) step 2847/76294 | train loss 3.468785 | norm 0.2010 | lr 1.17e-03 | (9922.48 ms | 52838 tok/s) step 2848/76294 | train loss 3.441118 | norm 0.1597 | lr 1.17e-03 | (9884.80 ms | 53040 tok/s) step 2849/76294 | train loss 3.532180 | norm 0.1687 | lr 1.17e-03 | (9892.45 ms | 52999 tok/s) step 2850/76294 | train loss 3.485277 | norm 0.1589 | lr 1.17e-03 | (9894.53 ms | 52988 tok/s) step 2851/76294 | train loss 3.492630 | norm 0.1579 | lr 1.17e-03 | (9894.44 ms | 52988 tok/s) step 2852/76294 | train loss 3.507136 | norm 0.1677 | lr 1.17e-03 | (9933.44 ms | 52780 tok/s) step 2853/76294 | train loss 3.483472 | norm 0.1884 | lr 1.17e-03 | (9893.42 ms | 52994 tok/s) step 2854/76294 | train loss 3.541584 | norm 0.1663 | lr 1.17e-03 | (9895.83 ms | 52981 tok/s) step 2855/76294 | train loss 3.597004 | norm 0.1862 | lr 1.17e-03 | (9891.55 ms | 53004 tok/s) step 2856/76294 | train loss 3.516106 | norm 0.1819 | lr 1.17e-03 | (9901.82 ms | 52949 tok/s) step 2857/76294 | train loss 3.512048 | norm 0.1631 | lr 1.17e-03 | (9889.49 ms | 53015 tok/s) step 2858/76294 | train loss 3.525398 | norm 0.1764 | lr 1.17e-03 | (9897.46 ms | 52972 tok/s) step 2859/76294 | train loss 3.428454 | norm 0.1464 | lr 1.17e-03 | (9931.97 ms | 52788 tok/s) step 
2860/76294 | train loss 3.496223 | norm 0.1533 | lr 1.17e-03 | (9890.19 ms | 53011 tok/s) step 2861/76294 | train loss 3.506677 | norm 0.1448 | lr 1.17e-03 | (9897.06 ms | 52974 tok/s) step 2862/76294 | train loss 3.483050 | norm 0.1635 | lr 1.17e-03 | (9890.04 ms | 53012 tok/s) step 2863/76294 | train loss 3.562285 | norm 0.1665 | lr 1.17e-03 | (9898.57 ms | 52966 tok/s) step 2864/76294 | train loss 3.499938 | norm 0.1590 | lr 1.17e-03 | (9889.85 ms | 53013 tok/s) step 2865/76294 | train loss 3.535701 | norm 0.1703 | lr 1.17e-03 | (9901.02 ms | 52953 tok/s) step 2866/76294 | train loss 3.685921 | norm 0.1686 | lr 1.17e-03 | (9887.74 ms | 53024 tok/s) step 2867/76294 | train loss 3.456046 | norm 0.1838 | lr 1.17e-03 | (9903.65 ms | 52939 tok/s) step 2868/76294 | train loss 3.521709 | norm 0.1730 | lr 1.17e-03 | (9897.52 ms | 52972 tok/s) step 2869/76294 | train loss 3.537128 | norm 0.1644 | lr 1.17e-03 | (9931.42 ms | 52791 tok/s) step 2870/76294 | train loss 3.514751 | norm 0.1823 | lr 1.17e-03 | (9890.68 ms | 53008 tok/s) step 2871/76294 | train loss 3.488560 | norm 0.1634 | lr 1.17e-03 | (9903.68 ms | 52939 tok/s) step 2872/76294 | train loss 3.517138 | norm 0.1551 | lr 1.17e-03 | (9895.03 ms | 52985 tok/s) step 2873/76294 | train loss 3.516233 | norm 0.1802 | lr 1.17e-03 | (9893.46 ms | 52993 tok/s) step 2874/76294 | train loss 3.487404 | norm 0.1624 | lr 1.17e-03 | (9896.24 ms | 52978 tok/s) step 2875/76294 | train loss 3.493582 | norm 0.1689 | lr 1.17e-03 | (9894.92 ms | 52986 tok/s) step 2876/76294 | train loss 3.473686 | norm 0.1609 | lr 1.17e-03 | (9957.97 ms | 52650 tok/s) step 2877/76294 | train loss 3.579027 | norm 0.1801 | lr 1.17e-03 | (9962.60 ms | 52626 tok/s) step 2878/76294 | train loss 3.491876 | norm 0.1780 | lr 1.17e-03 | (9928.06 ms | 52809 tok/s) step 2879/76294 | train loss 3.450391 | norm 0.1816 | lr 1.17e-03 | (9887.08 ms | 53028 tok/s) step 2880/76294 | train loss 3.454203 | norm 0.1587 | lr 1.17e-03 | (9893.00 ms | 52996 tok/s) step 
2881/76294 | train loss 3.536354 | norm 0.1761 | lr 1.17e-03 | (9893.16 ms | 52995 tok/s) step 2882/76294 | train loss 3.545903 | norm 0.1503 | lr 1.17e-03 | (9893.82 ms | 52991 tok/s) step 2883/76294 | train loss 3.485061 | norm 0.1689 | lr 1.17e-03 | (10004.00 ms | 52408 tok/s) step 2884/76294 | train loss 3.496616 | norm 0.1624 | lr 1.17e-03 | (9948.03 ms | 52703 tok/s) step 2885/76294 | train loss 3.463561 | norm 0.1729 | lr 1.17e-03 | (9900.29 ms | 52957 tok/s) step 2886/76294 | train loss 3.496037 | norm 0.1904 | lr 1.17e-03 | (9885.37 ms | 53037 tok/s) step 2887/76294 | train loss 3.527401 | norm 0.2019 | lr 1.17e-03 | (9888.17 ms | 53022 tok/s) step 2888/76294 | train loss 3.469248 | norm 0.1755 | lr 1.16e-03 | (9888.55 ms | 53020 tok/s) step 2889/76294 | train loss 3.549351 | norm 0.1976 | lr 1.16e-03 | (9909.76 ms | 52906 tok/s) step 2890/76294 | train loss 3.521695 | norm 0.2010 | lr 1.16e-03 | (9891.65 ms | 53003 tok/s) step 2891/76294 | train loss 3.441891 | norm 0.2205 | lr 1.16e-03 | (9889.01 ms | 53017 tok/s) step 2892/76294 | train loss 3.519063 | norm 0.1740 | lr 1.16e-03 | (9897.34 ms | 52973 tok/s) step 2893/76294 | train loss 3.477587 | norm 0.1733 | lr 1.16e-03 | (9896.05 ms | 52980 tok/s) step 2894/76294 | train loss 3.589969 | norm 0.1640 | lr 1.16e-03 | (9903.12 ms | 52942 tok/s) step 2895/76294 | train loss 3.483070 | norm 0.1607 | lr 1.16e-03 | (9890.61 ms | 53009 tok/s) step 2896/76294 | train loss 3.488814 | norm 0.1641 | lr 1.16e-03 | (9884.08 ms | 53044 tok/s) step 2897/76294 | train loss 3.524780 | norm 0.1482 | lr 1.16e-03 | (9894.45 ms | 52988 tok/s) step 2898/76294 | train loss 3.451379 | norm 0.1679 | lr 1.16e-03 | (9939.15 ms | 52750 tok/s) step 2899/76294 | train loss 3.517470 | norm 0.1375 | lr 1.16e-03 | (9896.16 ms | 52979 tok/s) step 2900/76294 | train loss 3.459988 | norm 0.1937 | lr 1.16e-03 | (9897.02 ms | 52974 tok/s) step 2901/76294 | train loss 3.514336 | norm 0.1664 | lr 1.16e-03 | (9894.54 ms | 52988 tok/s) step 
2902/76294 | train loss 3.503609 | norm 0.1665 | lr 1.16e-03 | (9892.77 ms | 52997 tok/s) step 2903/76294 | train loss 3.462023 | norm 0.1413 | lr 1.16e-03 | (9892.67 ms | 52998 tok/s) step 2904/76294 | train loss 3.543700 | norm 0.1701 | lr 1.16e-03 | (9897.55 ms | 52972 tok/s) step 2905/76294 | train loss 3.455639 | norm 0.1515 | lr 1.16e-03 | (9947.68 ms | 52705 tok/s) step 2906/76294 | train loss 3.515513 | norm 0.1501 | lr 1.16e-03 | (10014.03 ms | 52355 tok/s) step 2907/76294 | train loss 3.510944 | norm 0.1527 | lr 1.16e-03 | (9902.83 ms | 52943 tok/s) step 2908/76294 | train loss 3.453824 | norm 0.1505 | lr 1.16e-03 | (9894.65 ms | 52987 tok/s) step 2909/76294 | train loss 3.544035 | norm 0.1613 | lr 1.16e-03 | (9894.37 ms | 52989 tok/s) step 2910/76294 | train loss 3.500710 | norm 0.1608 | lr 1.16e-03 | (9894.26 ms | 52989 tok/s) step 2911/76294 | train loss 3.622777 | norm 0.2083 | lr 1.16e-03 | (9915.13 ms | 52878 tok/s) step 2912/76294 | train loss 3.501554 | norm 0.2242 | lr 1.16e-03 | (9897.44 ms | 52972 tok/s) step 2913/76294 | train loss 3.561656 | norm 0.2035 | lr 1.16e-03 | (9895.84 ms | 52981 tok/s) step 2914/76294 | train loss 3.528377 | norm 0.1634 | lr 1.16e-03 | (9895.12 ms | 52984 tok/s) step 2915/76294 | train loss 3.472660 | norm 0.1860 | lr 1.16e-03 | (9893.65 ms | 52992 tok/s) step 2916/76294 | train loss 3.507983 | norm 0.1845 | lr 1.16e-03 | (9893.34 ms | 52994 tok/s) step 2917/76294 | train loss 3.524166 | norm 0.1464 | lr 1.16e-03 | (9889.54 ms | 53014 tok/s) step 2918/76294 | train loss 3.491104 | norm 0.1792 | lr 1.16e-03 | (9896.29 ms | 52978 tok/s) step 2919/76294 | train loss 3.472179 | norm 0.1710 | lr 1.16e-03 | (9904.86 ms | 52932 tok/s) step 2920/76294 | train loss 3.437951 | norm 0.1628 | lr 1.16e-03 | (9887.92 ms | 53023 tok/s) step 2921/76294 | train loss 3.519053 | norm 0.1513 | lr 1.16e-03 | (9946.00 ms | 52713 tok/s) step 2922/76294 | train loss 3.460210 | norm 0.1829 | lr 1.16e-03 | (9889.04 ms | 53017 tok/s) step 
2923/76294 | train loss 3.532509 | norm 0.1756 | lr 1.16e-03 | (9894.86 ms | 52986 tok/s) step 2924/76294 | train loss 3.528339 | norm 0.1901 | lr 1.16e-03 | (9957.07 ms | 52655 tok/s) step 2925/76294 | train loss 3.443413 | norm 0.1729 | lr 1.16e-03 | (9892.18 ms | 53000 tok/s) step 2926/76294 | train loss 3.585813 | norm 0.1571 | lr 1.16e-03 | (9896.72 ms | 52976 tok/s) step 2927/76294 | train loss 3.419069 | norm 0.1981 | lr 1.16e-03 | (11033.47 ms | 47518 tok/s) step 2928/76294 | train loss 3.513407 | norm 0.2167 | lr 1.16e-03 | (9906.36 ms | 52924 tok/s) step 2929/76294 | train loss 3.490984 | norm 0.1968 | lr 1.16e-03 | (9883.13 ms | 53049 tok/s) step 2930/76294 | train loss 3.541925 | norm 0.1770 | lr 1.16e-03 | (9885.12 ms | 53038 tok/s) step 2931/76294 | train loss 3.580541 | norm 0.2025 | lr 1.16e-03 | (9894.72 ms | 52987 tok/s) step 2932/76294 | train loss 3.439504 | norm 0.1765 | lr 1.16e-03 | (9883.99 ms | 53044 tok/s) step 2933/76294 | train loss 3.524686 | norm 0.1817 | lr 1.16e-03 | (9889.17 ms | 53016 tok/s) step 2934/76294 | train loss 3.506404 | norm 0.1620 | lr 1.16e-03 | (9889.41 ms | 53015 tok/s) step 2935/76294 | train loss 3.447656 | norm 0.1910 | lr 1.16e-03 | (9896.58 ms | 52977 tok/s) step 2936/76294 | train loss 3.608711 | norm 0.2117 | lr 1.16e-03 | (9928.89 ms | 52804 tok/s) step 2937/76294 | train loss 3.449680 | norm 0.1884 | lr 1.16e-03 | (9889.47 ms | 53015 tok/s) step 2938/76294 | train loss 3.531806 | norm 0.1915 | lr 1.16e-03 | (9895.41 ms | 52983 tok/s) step 2939/76294 | train loss 3.476060 | norm 0.1399 | lr 1.16e-03 | (9891.37 ms | 53005 tok/s) step 2940/76294 | train loss 3.470958 | norm 0.1856 | lr 1.16e-03 | (9893.25 ms | 52995 tok/s) step 2941/76294 | train loss 3.603312 | norm 0.1776 | lr 1.16e-03 | (9891.23 ms | 53005 tok/s) step 2942/76294 | train loss 3.465960 | norm 0.1714 | lr 1.16e-03 | (9932.74 ms | 52784 tok/s) step 2943/76294 | train loss 3.486557 | norm 0.1609 | lr 1.16e-03 | (9953.04 ms | 52676 tok/s) step 
2944/76294 | train loss 3.528471 | norm 0.1489 | lr 1.16e-03 | (9890.58 ms | 53009 tok/s) step 2945/76294 | train loss 3.511411 | norm 0.1979 | lr 1.16e-03 | (9897.28 ms | 52973 tok/s) step 2946/76294 | train loss 3.510672 | norm 0.1837 | lr 1.16e-03 | (9931.16 ms | 52792 tok/s) step 2947/76294 | train loss 3.447889 | norm 0.1775 | lr 1.16e-03 | (9894.54 ms | 52988 tok/s) step 2948/76294 | train loss 3.572771 | norm 0.1717 | lr 1.16e-03 | (9897.32 ms | 52973 tok/s) step 2949/76294 | train loss 3.491525 | norm 0.1813 | lr 1.16e-03 | (9891.84 ms | 53002 tok/s) step 2950/76294 | train loss 3.506985 | norm 0.1737 | lr 1.16e-03 | (9895.07 ms | 52985 tok/s) step 2951/76294 | train loss 3.502890 | norm 0.1639 | lr 1.16e-03 | (9889.70 ms | 53014 tok/s) step 2952/76294 | train loss 3.474921 | norm 0.1912 | lr 1.16e-03 | (9896.38 ms | 52978 tok/s) step 2953/76294 | train loss 3.640656 | norm 0.1683 | lr 1.16e-03 | (9892.84 ms | 52997 tok/s) step 2954/76294 | train loss 3.490618 | norm 0.1600 | lr 1.16e-03 | (9893.00 ms | 52996 tok/s) step 2955/76294 | train loss 3.459334 | norm 0.1716 | lr 1.16e-03 | (9897.25 ms | 52973 tok/s) step 2956/76294 | train loss 3.487360 | norm 0.1658 | lr 1.16e-03 | (9894.76 ms | 52986 tok/s) step 2957/76294 | train loss 3.436485 | norm 0.1536 | lr 1.16e-03 | (9892.87 ms | 52997 tok/s) step 2958/76294 | train loss 3.475328 | norm 0.1523 | lr 1.16e-03 | (9890.69 ms | 53008 tok/s) step 2959/76294 | train loss 3.570804 | norm 0.1486 | lr 1.16e-03 | (9894.42 ms | 52988 tok/s) step 2960/76294 | train loss 3.457562 | norm 0.1499 | lr 1.16e-03 | (9902.02 ms | 52948 tok/s) step 2961/76294 | train loss 3.553067 | norm 0.1623 | lr 1.16e-03 | (9890.54 ms | 53009 tok/s) step 2962/76294 | train loss 3.467907 | norm 0.1900 | lr 1.16e-03 | (9960.16 ms | 52639 tok/s) step 2963/76294 | train loss 3.573118 | norm 0.1608 | lr 1.16e-03 | (9888.95 ms | 53018 tok/s) step 2964/76294 | train loss 3.448038 | norm 0.1649 | lr 1.16e-03 | (9888.54 ms | 53020 tok/s) step 
2965/76294 | train loss 3.377435 | norm 0.2003 | lr 1.16e-03 | (9902.50 ms | 52945 tok/s) step 2966/76294 | train loss 3.589295 | norm 0.1971 | lr 1.16e-03 | (9886.20 ms | 53032 tok/s) step 2967/76294 | train loss 3.489449 | norm 0.1893 | lr 1.16e-03 | (10166.73 ms | 51569 tok/s) step 2968/76294 | train loss 3.546768 | norm 0.1576 | lr 1.16e-03 | (9908.25 ms | 52914 tok/s) step 2969/76294 | train loss 3.477791 | norm 0.1770 | lr 1.16e-03 | (9894.99 ms | 52985 tok/s) step 2970/76294 | train loss 3.523182 | norm 0.1655 | lr 1.16e-03 | (9886.82 ms | 53029 tok/s) step 2971/76294 | train loss 3.507846 | norm 0.1731 | lr 1.16e-03 | (9932.92 ms | 52783 tok/s) step 2972/76294 | train loss 3.542050 | norm 0.2032 | lr 1.16e-03 | (9886.72 ms | 53030 tok/s) step 2973/76294 | train loss 3.446267 | norm 0.1632 | lr 1.16e-03 | (9889.56 ms | 53014 tok/s) step 2974/76294 | train loss 3.466687 | norm 0.1919 | lr 1.16e-03 | (9887.01 ms | 53028 tok/s) step 2975/76294 | train loss 3.499665 | norm 0.1894 | lr 1.16e-03 | (9895.56 ms | 52982 tok/s) step 2976/76294 | train loss 3.519041 | norm 0.1699 | lr 1.16e-03 | (9928.35 ms | 52807 tok/s) step 2977/76294 | train loss 3.488319 | norm 0.1676 | lr 1.16e-03 | (9892.79 ms | 52997 tok/s) step 2978/76294 | train loss 3.500385 | norm 0.1490 | lr 1.16e-03 | (9895.96 ms | 52980 tok/s) step 2979/76294 | train loss 3.532331 | norm 0.1502 | lr 1.16e-03 | (9952.92 ms | 52677 tok/s) step 2980/76294 | train loss 3.557850 | norm 0.1599 | lr 1.16e-03 | (9896.06 ms | 52979 tok/s) step 2981/76294 | train loss 3.532241 | norm 0.1564 | lr 1.16e-03 | (9899.04 ms | 52964 tok/s) step 2982/76294 | train loss 3.532264 | norm 0.1458 | lr 1.16e-03 | (9888.48 ms | 53020 tok/s) step 2983/76294 | train loss 3.543660 | norm 0.1523 | lr 1.16e-03 | (9917.46 ms | 52865 tok/s) step 2984/76294 | train loss 3.486703 | norm 0.1431 | lr 1.16e-03 | (9939.57 ms | 52748 tok/s) step 2985/76294 | train loss 3.468293 | norm 0.1583 | lr 1.16e-03 | (9890.78 ms | 53008 tok/s) step 
2986/76294 | train loss 3.418826 | norm 0.1537 | lr 1.16e-03 | (9911.46 ms | 52897 tok/s) step 2987/76294 | train loss 3.465418 | norm 0.1629 | lr 1.16e-03 | (9898.68 ms | 52965 tok/s) step 2988/76294 | train loss 3.497474 | norm 0.1668 | lr 1.16e-03 | (9892.87 ms | 52997 tok/s) step 2989/76294 | train loss 3.489790 | norm 0.1578 | lr 1.16e-03 | (9896.79 ms | 52976 tok/s) step 2990/76294 | train loss 3.506884 | norm 0.1971 | lr 1.16e-03 | (9888.55 ms | 53020 tok/s) step 2991/76294 | train loss 3.485824 | norm 0.1767 | lr 1.16e-03 | (9897.65 ms | 52971 tok/s) step 2992/76294 | train loss 3.496608 | norm 0.1540 | lr 1.16e-03 | (9886.56 ms | 53030 tok/s) step 2993/76294 | train loss 3.593327 | norm 0.1803 | lr 1.16e-03 | (9946.29 ms | 52712 tok/s) step 2994/76294 | train loss 3.558887 | norm 0.1780 | lr 1.16e-03 | (9886.82 ms | 53029 tok/s) step 2995/76294 | train loss 3.522538 | norm 0.1771 | lr 1.16e-03 | (9900.41 ms | 52956 tok/s) step 2996/76294 | train loss 3.478746 | norm 0.1619 | lr 1.16e-03 | (9884.66 ms | 53041 tok/s) step 2997/76294 | train loss 3.518407 | norm 0.1524 | lr 1.16e-03 | (9903.76 ms | 52938 tok/s) step 2998/76294 | train loss 3.460411 | norm 0.1560 | lr 1.16e-03 | (9895.40 ms | 52983 tok/s) step 2999/76294 | train loss 3.491915 | norm 0.1538 | lr 1.16e-03 | (9926.88 ms | 52815 tok/s) step 3000/76294 | train loss 3.484529 | norm 0.1518 | lr 1.16e-03 | (9880.22 ms | 53064 tok/s) val loss: 3.494366 saving model checkpoint to ./results/gpt2-350M-gqa/step_3000.pth step 3001/76294 | train loss 3.585737 | norm 0.1585 | lr 1.16e-03 | (9978.59 ms | 52541 tok/s) step 3002/76294 | train loss 3.479417 | norm 0.1574 | lr 1.16e-03 | (9865.06 ms | 53146 tok/s) step 3003/76294 | train loss 3.549010 | norm 0.1540 | lr 1.16e-03 | (9884.57 ms | 53041 tok/s) step 3004/76294 | train loss 3.506432 | norm 0.1565 | lr 1.16e-03 | (9885.68 ms | 53035 tok/s) step 3005/76294 | train loss 3.480402 | norm 0.1562 | lr 1.16e-03 | (9881.73 ms | 53056 tok/s) step 3006/76294 | 
train loss 3.437771 | norm 0.1665 | lr 1.16e-03 | (9896.42 ms | 52978 tok/s) step 3007/76294 | train loss 3.547905 | norm 0.1741 | lr 1.16e-03 | (9887.07 ms | 53028 tok/s) step 3008/76294 | train loss 3.513721 | norm 0.1607 | lr 1.16e-03 | (9896.49 ms | 52977 tok/s) step 3009/76294 | train loss 3.501379 | norm 0.1979 | lr 1.16e-03 | (9956.74 ms | 52657 tok/s) step 3010/76294 | train loss 3.558552 | norm 0.2017 | lr 1.16e-03 | (9892.63 ms | 52998 tok/s) step 3011/76294 | train loss 3.491142 | norm 0.1989 | lr 1.16e-03 | (9908.22 ms | 52914 tok/s) step 3012/76294 | train loss 3.473123 | norm 0.1992 | lr 1.16e-03 | (9932.92 ms | 52783 tok/s) step 3013/76294 | train loss 3.486276 | norm 0.1670 | lr 1.16e-03 | (9897.42 ms | 52972 tok/s) step 3014/76294 | train loss 3.471198 | norm 0.1524 | lr 1.16e-03 | (9902.71 ms | 52944 tok/s) step 3015/76294 | train loss 3.499504 | norm 0.1868 | lr 1.16e-03 | (9933.77 ms | 52778 tok/s) step 3016/76294 | train loss 3.543060 | norm 0.1744 | lr 1.16e-03 | (9898.30 ms | 52967 tok/s) step 3017/76294 | train loss 3.573532 | norm 0.1769 | lr 1.16e-03 | (9901.39 ms | 52951 tok/s) step 3018/76294 | train loss 3.558849 | norm 0.1912 | lr 1.16e-03 | (9904.27 ms | 52936 tok/s) step 3019/76294 | train loss 3.478736 | norm 0.2026 | lr 1.16e-03 | (9934.76 ms | 52773 tok/s) step 3020/76294 | train loss 3.497927 | norm 0.1614 | lr 1.16e-03 | (9901.63 ms | 52950 tok/s) step 3021/76294 | train loss 3.518857 | norm 0.1692 | lr 1.16e-03 | (9895.04 ms | 52985 tok/s) step 3022/76294 | train loss 3.479971 | norm 0.1900 | lr 1.16e-03 | (9906.98 ms | 52921 tok/s) step 3023/76294 | train loss 3.512946 | norm 0.1797 | lr 1.16e-03 | (9938.56 ms | 52753 tok/s) step 3024/76294 | train loss 3.464423 | norm 0.1779 | lr 1.16e-03 | (11204.81 ms | 46791 tok/s) step 3025/76294 | train loss 3.473637 | norm 0.1627 | lr 1.16e-03 | (9903.54 ms | 52939 tok/s) step 3026/76294 | train loss 3.445317 | norm 0.1792 | lr 1.16e-03 | (9885.54 ms | 53036 tok/s) step 3027/76294 | 
train loss 3.438302 | norm 0.1606 | lr 1.16e-03 | (9889.76 ms | 53013 tok/s) step 3028/76294 | train loss 3.454346 | norm 0.1505 | lr 1.16e-03 | (9891.50 ms | 53004 tok/s) step 3029/76294 | train loss 3.472611 | norm 0.1698 | lr 1.16e-03 | (9900.63 ms | 52955 tok/s) step 3030/76294 | train loss 3.474650 | norm 0.1478 | lr 1.16e-03 | (9897.25 ms | 52973 tok/s) step 3031/76294 | train loss 3.441269 | norm 0.1660 | lr 1.16e-03 | (10470.70 ms | 50072 tok/s) step 3032/76294 | train loss 3.490532 | norm 0.1603 | lr 1.16e-03 | (9895.18 ms | 52984 tok/s) step 3033/76294 | train loss 3.532979 | norm 0.1484 | lr 1.16e-03 | (9928.08 ms | 52809 tok/s) step 3034/76294 | train loss 3.484431 | norm 0.1630 | lr 1.16e-03 | (9910.43 ms | 52903 tok/s) step 3035/76294 | train loss 3.467696 | norm 0.1485 | lr 1.16e-03 | (9899.88 ms | 52959 tok/s) step 3036/76294 | train loss 3.489173 | norm 0.1875 | lr 1.16e-03 | (9895.07 ms | 52985 tok/s) step 3037/76294 | train loss 3.491932 | norm 0.1649 | lr 1.16e-03 | (9899.99 ms | 52958 tok/s) step 3038/76294 | train loss 3.578396 | norm 0.1625 | lr 1.16e-03 | (9891.94 ms | 53002 tok/s) step 3039/76294 | train loss 3.505782 | norm 0.1579 | lr 1.16e-03 | (9901.74 ms | 52949 tok/s) step 3040/76294 | train loss 3.490879 | norm 0.1485 | lr 1.16e-03 | (9896.13 ms | 52979 tok/s) step 3041/76294 | train loss 3.513423 | norm 0.1539 | lr 1.16e-03 | (9903.12 ms | 52942 tok/s) step 3042/76294 | train loss 3.550295 | norm 0.1433 | lr 1.16e-03 | (9896.32 ms | 52978 tok/s) step 3043/76294 | train loss 3.465440 | norm 0.1578 | lr 1.16e-03 | (9899.73 ms | 52960 tok/s) step 3044/76294 | train loss 3.521777 | norm 0.1820 | lr 1.16e-03 | (9891.75 ms | 53003 tok/s) step 3045/76294 | train loss 3.447543 | norm 0.1539 | lr 1.16e-03 | (9964.87 ms | 52614 tok/s) step 3046/76294 | train loss 3.505601 | norm 0.1456 | lr 1.16e-03 | (9891.65 ms | 53003 tok/s) step 3047/76294 | train loss 3.515535 | norm 0.1867 | lr 1.16e-03 | (9896.30 ms | 52978 tok/s) step 3048/76294 | 
train loss 3.429589 | norm 0.1511 | lr 1.16e-03 | (9946.90 ms | 52709 tok/s) step 3049/76294 | train loss 3.506196 | norm 0.1922 | lr 1.16e-03 | (9893.94 ms | 52991 tok/s) step 3050/76294 | train loss 3.501182 | norm 0.1963 | lr 1.16e-03 | (9887.35 ms | 53026 tok/s) step 3051/76294 | train loss 3.445724 | norm 0.1895 | lr 1.16e-03 | (9943.00 ms | 52729 tok/s) step 3052/76294 | train loss 3.505898 | norm 0.1555 | lr 1.16e-03 | (9889.13 ms | 53017 tok/s) step 3053/76294 | train loss 3.489991 | norm 0.1761 | lr 1.16e-03 | (9895.37 ms | 52983 tok/s) step 3054/76294 | train loss 3.508086 | norm 0.1520 | lr 1.16e-03 | (9895.68 ms | 52982 tok/s) step 3055/76294 | train loss 3.495745 | norm 0.1778 | lr 1.16e-03 | (9899.41 ms | 52962 tok/s) step 3056/76294 | train loss 3.456939 | norm 0.2058 | lr 1.16e-03 | (9961.34 ms | 52632 tok/s) step 3057/76294 | train loss 3.554678 | norm 0.1637 | lr 1.16e-03 | (9893.78 ms | 52992 tok/s) step 3058/76294 | train loss 3.442267 | norm 0.1687 | lr 1.16e-03 | (9885.01 ms | 53039 tok/s) step 3059/76294 | train loss 3.538022 | norm 0.1774 | lr 1.16e-03 | (9897.02 ms | 52974 tok/s) step 3060/76294 | train loss 3.521489 | norm 0.1657 | lr 1.16e-03 | (9933.48 ms | 52780 tok/s) step 3061/76294 | train loss 3.490579 | norm 0.1531 | lr 1.16e-03 | (9889.77 ms | 53013 tok/s) step 3062/76294 | train loss 3.512404 | norm 0.1761 | lr 1.16e-03 | (9897.06 ms | 52974 tok/s) step 3063/76294 | train loss 3.508208 | norm 0.1540 | lr 1.16e-03 | (9893.54 ms | 52993 tok/s) step 3064/76294 | train loss 3.490891 | norm 0.1534 | lr 1.16e-03 | (9899.23 ms | 52963 tok/s) step 3065/76294 | train loss 3.636683 | norm 0.1908 | lr 1.16e-03 | (9892.15 ms | 53000 tok/s) step 3066/76294 | train loss 3.452384 | norm 0.1819 | lr 1.16e-03 | (9895.96 ms | 52980 tok/s) step 3067/76294 | train loss 3.518154 | norm 0.1622 | lr 1.16e-03 | (9965.41 ms | 52611 tok/s) step 3068/76294 | train loss 3.498584 | norm 0.1763 | lr 1.16e-03 | (9895.69 ms | 52981 tok/s) step 3069/76294 | 
train loss 3.464320 | norm 0.1858 | lr 1.16e-03 | (9924.54 ms | 52827 tok/s) step 3070/76294 | train loss 3.437798 | norm 0.1730 | lr 1.16e-03 | (9887.22 ms | 53027 tok/s) step 3071/76294 | train loss 3.518860 | norm 0.1781 | lr 1.16e-03 | (9896.56 ms | 52977 tok/s) step 3072/76294 | train loss 3.473888 | norm 0.1962 | lr 1.16e-03 | (9884.91 ms | 53039 tok/s) step 3073/76294 | train loss 3.490212 | norm 0.1538 | lr 1.16e-03 | (9893.69 ms | 52992 tok/s) step 3074/76294 | train loss 3.496728 | norm 0.1718 | lr 1.16e-03 | (9887.27 ms | 53027 tok/s) step 3075/76294 | train loss 3.546952 | norm 0.2466 | lr 1.16e-03 | (9945.63 ms | 52715 tok/s) step 3076/76294 | train loss 3.519961 | norm 0.2511 | lr 1.16e-03 | (9887.84 ms | 53024 tok/s) step 3077/76294 | train loss 3.497310 | norm 0.3379 | lr 1.16e-03 | (9896.22 ms | 52979 tok/s) step 3078/76294 | train loss 3.550907 | norm 0.2719 | lr 1.16e-03 | (9889.77 ms | 53013 tok/s) step 3079/76294 | train loss 3.485728 | norm 0.2378 | lr 1.16e-03 | (9899.10 ms | 52963 tok/s) step 3080/76294 | train loss 3.425670 | norm 0.2003 | lr 1.16e-03 | (9892.46 ms | 52999 tok/s) step 3081/76294 | train loss 3.530193 | norm 0.1922 | lr 1.16e-03 | (9962.96 ms | 52624 tok/s) step 3082/76294 | train loss 3.470992 | norm 0.2065 | lr 1.16e-03 | (9892.98 ms | 52996 tok/s) step 3083/76294 | train loss 3.466530 | norm 0.1659 | lr 1.16e-03 | (9916.24 ms | 52872 tok/s) step 3084/76294 | train loss 3.518669 | norm 0.1730 | lr 1.16e-03 | (9950.19 ms | 52691 tok/s) step 3085/76294 | train loss 3.418700 | norm 0.1725 | lr 1.16e-03 | (9900.80 ms | 52954 tok/s) step 3086/76294 | train loss 3.500146 | norm 0.1575 | lr 1.16e-03 | (9886.28 ms | 53032 tok/s) step 3087/76294 | train loss 3.483162 | norm 0.1893 | lr 1.16e-03 | (9904.52 ms | 52934 tok/s) step 3088/76294 | train loss 3.498405 | norm 0.1585 | lr 1.16e-03 | (9889.45 ms | 53015 tok/s) step 3089/76294 | train loss 3.511738 | norm 0.1819 | lr 1.16e-03 | (9890.94 ms | 53007 tok/s) step 3090/76294 | 
train loss 3.463071 | norm 0.1614 | lr 1.16e-03 | (9898.68 ms | 52965 tok/s) step 3091/76294 | train loss 3.459378 | norm 0.1819 | lr 1.16e-03 | (9896.78 ms | 52976 tok/s) step 3092/76294 | train loss 3.522344 | norm 0.1588 | lr 1.16e-03 | (9930.91 ms | 52794 tok/s) step 3093/76294 | train loss 3.462649 | norm 0.1587 | lr 1.16e-03 | (9903.95 ms | 52937 tok/s) step 3094/76294 | train loss 3.549384 | norm 0.1785 | lr 1.16e-03 | (9889.72 ms | 53013 tok/s) step 3095/76294 | train loss 3.481319 | norm 0.1619 | lr 1.16e-03 | (9968.85 ms | 52593 tok/s) step 3096/76294 | train loss 3.437015 | norm 0.1780 | lr 1.16e-03 | (9902.33 ms | 52946 tok/s) step 3097/76294 | train loss 3.518548 | norm 0.1647 | lr 1.16e-03 | (9911.55 ms | 52897 tok/s) step 3098/76294 | train loss 3.411043 | norm 0.1686 | lr 1.16e-03 | (9895.11 ms | 52985 tok/s) step 3099/76294 | train loss 3.493743 | norm 0.1579 | lr 1.16e-03 | (9940.79 ms | 52741 tok/s) step 3100/76294 | train loss 3.528685 | norm 0.1748 | lr 1.16e-03 | (9892.55 ms | 52998 tok/s) step 3101/76294 | train loss 3.451177 | norm 0.1659 | lr 1.16e-03 | (9897.79 ms | 52970 tok/s) step 3102/76294 | train loss 3.499147 | norm 0.1615 | lr 1.16e-03 | (9890.92 ms | 53007 tok/s) step 3103/76294 | train loss 3.479477 | norm 0.1717 | lr 1.16e-03 | (9898.67 ms | 52965 tok/s) step 3104/76294 | train loss 3.423594 | norm 0.1457 | lr 1.16e-03 | (9896.20 ms | 52979 tok/s) step 3105/76294 | train loss 3.494968 | norm 0.1643 | lr 1.16e-03 | (9936.07 ms | 52766 tok/s) step 3106/76294 | train loss 3.447746 | norm 0.1666 | lr 1.16e-03 | (9973.47 ms | 52568 tok/s) step 3107/76294 | train loss 3.470001 | norm 0.1455 | lr 1.16e-03 | (10447.67 ms | 50182 tok/s) step 3108/76294 | train loss 3.591948 | norm 0.1836 | lr 1.16e-03 | (9882.63 ms | 53051 tok/s) step 3109/76294 | train loss 3.447195 | norm 0.1721 | lr 1.16e-03 | (9895.47 ms | 52983 tok/s) step 3110/76294 | train loss 3.496323 | norm 0.1603 | lr 1.16e-03 | (9884.54 ms | 53041 tok/s) step 3111/76294 | 
train loss 3.480209 | norm 0.1684 | lr 1.16e-03 | (9895.31 ms | 52983 tok/s) step 3112/76294 | train loss 3.459496 | norm 0.1413 | lr 1.16e-03 | (9888.38 ms | 53021 tok/s) step 3113/76294 | train loss 3.480892 | norm 0.1625 | lr 1.16e-03 | (9896.13 ms | 52979 tok/s) step 3114/76294 | train loss 3.554935 | norm 0.1686 | lr 1.16e-03 | (9926.71 ms | 52816 tok/s) step 3115/76294 | train loss 3.488362 | norm 0.1661 | lr 1.16e-03 | (9892.98 ms | 52996 tok/s) step 3116/76294 | train loss 3.477616 | norm 0.1655 | lr 1.16e-03 | (9890.52 ms | 53009 tok/s) step 3117/76294 | train loss 3.467129 | norm 0.1688 | lr 1.16e-03 | (9892.61 ms | 52998 tok/s) step 3118/76294 | train loss 3.443538 | norm 0.1675 | lr 1.16e-03 | (9890.57 ms | 53009 tok/s) step 3119/76294 | train loss 3.516490 | norm 0.1610 | lr 1.16e-03 | (9894.55 ms | 52988 tok/s) step 3120/76294 | train loss 3.455887 | norm 0.1687 | lr 1.16e-03 | (9891.08 ms | 53006 tok/s) step 3121/76294 | train loss 3.476581 | norm 0.1685 | lr 1.16e-03 | (9890.53 ms | 53009 tok/s) step 3122/76294 | train loss 3.428719 | norm 0.1570 | lr 1.16e-03 | (11113.83 ms | 47174 tok/s) step 3123/76294 | train loss 3.545731 | norm 0.1741 | lr 1.16e-03 | (9918.52 ms | 52860 tok/s) step 3124/76294 | train loss 3.381218 | norm 0.1698 | lr 1.16e-03 | (9888.77 ms | 53019 tok/s) step 3125/76294 | train loss 3.489753 | norm 0.1511 | lr 1.16e-03 | (9891.27 ms | 53005 tok/s) step 3126/76294 | train loss 3.475589 | norm 0.1664 | lr 1.16e-03 | (9893.14 ms | 52995 tok/s) step 3127/76294 | train loss 3.501824 | norm 0.1554 | lr 1.16e-03 | (9885.91 ms | 53034 tok/s) step 3128/76294 | train loss 3.544331 | norm 0.1998 | lr 1.16e-03 | (9893.80 ms | 52992 tok/s) step 3129/76294 | train loss 3.562793 | norm 0.1952 | lr 1.16e-03 | (9905.25 ms | 52930 tok/s) step 3130/76294 | train loss 3.530907 | norm 0.1710 | lr 1.16e-03 | (9888.59 ms | 53019 tok/s) step 3131/76294 | train loss 3.491904 | norm 0.1630 | lr 1.16e-03 | (9901.37 ms | 52951 tok/s) step 3132/76294 | 
train loss 3.450479 | norm 0.1738 | lr 1.16e-03 | (9886.56 ms | 53030 tok/s) step 3133/76294 | train loss 3.530405 | norm 0.1670 | lr 1.16e-03 | (9898.36 ms | 52967 tok/s) step 3134/76294 | train loss 3.477834 | norm 0.1970 | lr 1.16e-03 | (11595.28 ms | 45216 tok/s) step 3135/76294 | train loss 3.475257 | norm 0.1537 | lr 1.16e-03 | (9879.07 ms | 53071 tok/s) step 3136/76294 | train loss 3.529312 | norm 0.1714 | lr 1.16e-03 | (9944.77 ms | 52720 tok/s) step 3137/76294 | train loss 3.498707 | norm 0.1556 | lr 1.16e-03 | (9882.61 ms | 53052 tok/s) step 3138/76294 | train loss 3.472646 | norm 0.1602 | lr 1.16e-03 | (9893.21 ms | 52995 tok/s) step 3139/76294 | train loss 3.441725 | norm 0.1664 | lr 1.16e-03 | (9893.06 ms | 52996 tok/s) step 3140/76294 | train loss 3.494247 | norm 0.1730 | lr 1.16e-03 | (11294.69 ms | 46419 tok/s) step 3141/76294 | train loss 3.483947 | norm 0.1770 | lr 1.16e-03 | (9923.35 ms | 52834 tok/s) step 3142/76294 | train loss 3.558229 | norm 0.1839 | lr 1.16e-03 | (9889.31 ms | 53016 tok/s) step 3143/76294 | train loss 3.471064 | norm 0.1598 | lr 1.16e-03 | (11105.90 ms | 47208 tok/s) step 3144/76294 | train loss 3.441696 | norm 0.1555 | lr 1.16e-03 | (9880.32 ms | 53064 tok/s) step 3145/76294 | train loss 3.472239 | norm 0.1381 | lr 1.16e-03 | (9898.75 ms | 52965 tok/s) step 3146/76294 | train loss 3.487204 | norm 0.1491 | lr 1.16e-03 | (9889.06 ms | 53017 tok/s) step 3147/76294 | train loss 3.450016 | norm 0.1258 | lr 1.16e-03 | (9905.35 ms | 52930 tok/s) step 3148/76294 | train loss 3.543043 | norm 0.1417 | lr 1.16e-03 | (9961.38 ms | 52632 tok/s) step 3149/76294 | train loss 3.475603 | norm 0.1534 | lr 1.16e-03 | (9897.10 ms | 52974 tok/s) step 3150/76294 | train loss 3.444543 | norm 0.1489 | lr 1.16e-03 | (9896.33 ms | 52978 tok/s) step 3151/76294 | train loss 3.475607 | norm 0.1642 | lr 1.16e-03 | (9906.87 ms | 52922 tok/s) step 3152/76294 | train loss 3.521724 | norm 0.1703 | lr 1.16e-03 | (9939.89 ms | 52746 tok/s) step 3153/76294 | 
train loss 3.477257 | norm 0.1816 | lr 1.16e-03 | (9902.78 ms | 52944 tok/s) step 3154/76294 | train loss 3.497786 | norm 0.1477 | lr 1.16e-03 | (9907.83 ms | 52917 tok/s) step 3155/76294 | train loss 3.510304 | norm 0.1612 | lr 1.16e-03 | (9898.98 ms | 52964 tok/s) step 3156/76294 | train loss 3.472247 | norm 0.1475 | lr 1.16e-03 | (9920.64 ms | 52848 tok/s) step 3157/76294 | train loss 3.496413 | norm 0.1849 | lr 1.16e-03 | (9902.97 ms | 52942 tok/s) step 3158/76294 | train loss 3.544871 | norm 0.1622 | lr 1.16e-03 | (10706.90 ms | 48967 tok/s) step 3159/76294 | train loss 3.470442 | norm 0.1614 | lr 1.16e-03 | (9901.54 ms | 52950 tok/s) step 3160/76294 | train loss 3.566307 | norm 0.1544 | lr 1.16e-03 | (9957.93 ms | 52650 tok/s) step 3161/76294 | train loss 3.458416 | norm 0.1754 | lr 1.16e-03 | (9922.55 ms | 52838 tok/s) step 3162/76294 | train loss 3.447729 | norm 0.1754 | lr 1.16e-03 | (9911.65 ms | 52896 tok/s) step 3163/76294 | train loss 3.582578 | norm 0.2094 | lr 1.16e-03 | (9901.17 ms | 52952 tok/s) step 3164/76294 | train loss 3.490496 | norm 0.2018 | lr 1.16e-03 | (9888.01 ms | 53023 tok/s) step 3165/76294 | train loss 3.498269 | norm 0.1947 | lr 1.16e-03 | (9898.60 ms | 52966 tok/s) step 3166/76294 | train loss 3.428034 | norm 0.1789 | lr 1.16e-03 | (9891.40 ms | 53004 tok/s) step 3167/76294 | train loss 3.520769 | norm 0.1823 | lr 1.16e-03 | (9902.33 ms | 52946 tok/s) step 3168/76294 | train loss 3.577755 | norm 0.1662 | lr 1.16e-03 | (9911.47 ms | 52897 tok/s) step 3169/76294 | train loss 3.543716 | norm 0.1626 | lr 1.16e-03 | (9901.94 ms | 52948 tok/s) step 3170/76294 | train loss 3.498449 | norm 0.1558 | lr 1.16e-03 | (9892.55 ms | 52998 tok/s) step 3171/76294 | train loss 3.581497 | norm 0.1473 | lr 1.16e-03 | (9898.52 ms | 52966 tok/s) step 3172/76294 | train loss 3.496814 | norm 0.1790 | lr 1.16e-03 | (9902.38 ms | 52946 tok/s) step 3173/76294 | train loss 3.531950 | norm 0.1697 | lr 1.16e-03 | (9896.27 ms | 52978 tok/s) step 3174/76294 | 
train loss 3.424292 | norm 0.1460 | lr 1.16e-03 | (9891.48 ms | 53004 tok/s) step 3175/76294 | train loss 3.439597 | norm 0.1631 | lr 1.16e-03 | (9959.32 ms | 52643 tok/s) step 3176/76294 | train loss 3.455490 | norm 0.2184 | lr 1.16e-03 | (9906.87 ms | 52922 tok/s) step 3177/76294 | train loss 3.501659 | norm 0.2021 | lr 1.16e-03 | (9932.72 ms | 52784 tok/s) step 3178/76294 | train loss 3.496700 | norm 0.1857 | lr 1.16e-03 | (9895.21 ms | 52984 tok/s) step 3179/76294 | train loss 3.490915 | norm 0.1881 | lr 1.16e-03 | (9891.30 ms | 53005 tok/s) step 3180/76294 | train loss 3.500201 | norm 0.1982 | lr 1.16e-03 | (9895.58 ms | 52982 tok/s) step 3181/76294 | train loss 3.544575 | norm 0.1913 | lr 1.16e-03 | (9902.65 ms | 52944 tok/s) step 3182/76294 | train loss 3.496166 | norm 0.1663 | lr 1.16e-03 | (9889.76 ms | 53013 tok/s) step 3183/76294 | train loss 3.510494 | norm 0.1646 | lr 1.15e-03 | (9895.95 ms | 52980 tok/s) step 3184/76294 | train loss 3.498784 | norm 0.1656 | lr 1.15e-03 | (9888.84 ms | 53018 tok/s) step 3185/76294 | train loss 3.478430 | norm 0.1698 | lr 1.15e-03 | (9894.86 ms | 52986 tok/s) step 3186/76294 | train loss 3.546288 | norm 0.1569 | lr 1.15e-03 | (9933.30 ms | 52781 tok/s) step 3187/76294 | train loss 3.476804 | norm 0.1395 | lr 1.15e-03 | (9900.70 ms | 52955 tok/s) step 3188/76294 | train loss 3.475022 | norm 0.1712 | lr 1.15e-03 | (9883.52 ms | 53047 tok/s) step 3189/76294 | train loss 3.473170 | norm 0.1917 | lr 1.15e-03 | (9985.18 ms | 52507 tok/s) step 3190/76294 | train loss 3.560032 | norm 0.1533 | lr 1.15e-03 | (9925.69 ms | 52821 tok/s) step 3191/76294 | train loss 3.571667 | norm 0.1603 | lr 1.15e-03 | (9889.12 ms | 53017 tok/s) step 3192/76294 | train loss 3.527140 | norm 0.1450 | lr 1.15e-03 | (9896.19 ms | 52979 tok/s) step 3193/76294 | train loss 3.486839 | norm 0.1495 | lr 1.15e-03 | (9888.24 ms | 53021 tok/s) step 3194/76294 | train loss 3.513021 | norm 0.1762 | lr 1.15e-03 | (9896.76 ms | 52976 tok/s) step 3195/76294 | 
train loss 3.446334 | norm 0.2075 | lr 1.15e-03 | (9885.00 ms | 53039 tok/s) step 3196/76294 | train loss 3.516551 | norm 0.2100 | lr 1.15e-03 | (9889.14 ms | 53017 tok/s) step 3197/76294 | train loss 3.475059 | norm 0.1764 | lr 1.15e-03 | (9974.82 ms | 52561 tok/s) step 3198/76294 | train loss 3.519165 | norm 0.1832 | lr 1.15e-03 | (9886.77 ms | 53029 tok/s) step 3199/76294 | train loss 3.452863 | norm 0.1903 | lr 1.15e-03 | (9886.84 ms | 53029 tok/s) step 3200/76294 | train loss 3.492980 | norm 0.1481 | lr 1.15e-03 | (9906.05 ms | 52926 tok/s) step 3201/76294 | train loss 3.455315 | norm 0.1880 | lr 1.15e-03 | (9901.24 ms | 52952 tok/s) step 3202/76294 | train loss 3.459112 | norm 0.1814 | lr 1.15e-03 | (9888.61 ms | 53019 tok/s) step 3203/76294 | train loss 3.453438 | norm 0.1598 | lr 1.15e-03 | (9891.70 ms | 53003 tok/s) step 3204/76294 | train loss 3.516331 | norm 0.1844 | lr 1.15e-03 | (9893.13 ms | 52995 tok/s) step 3205/76294 | train loss 3.472382 | norm 0.1647 | lr 1.15e-03 | (9889.62 ms | 53014 tok/s) step 3206/76294 | train loss 3.496220 | norm 0.1847 | lr 1.15e-03 | (9898.28 ms | 52968 tok/s) step 3207/76294 | train loss 3.415048 | norm 0.1746 | lr 1.15e-03 | (9888.84 ms | 53018 tok/s) step 3208/76294 | train loss 3.497155 | norm 0.1674 | lr 1.15e-03 | (9891.72 ms | 53003 tok/s) step 3209/76294 | train loss 3.480369 | norm 0.1484 | lr 1.15e-03 | (9893.52 ms | 52993 tok/s) step 3210/76294 | train loss 3.538581 | norm 0.1602 | lr 1.15e-03 | (9891.96 ms | 53001 tok/s) step 3211/76294 | train loss 3.468614 | norm 0.1715 | lr 1.15e-03 | (9936.65 ms | 52763 tok/s) step 3212/76294 | train loss 3.501176 | norm 0.1625 | lr 1.15e-03 | (9889.48 ms | 53015 tok/s) step 3213/76294 | train loss 3.445369 | norm 0.1637 | lr 1.15e-03 | (9899.60 ms | 52961 tok/s) step 3214/76294 | train loss 3.481532 | norm 0.1632 | lr 1.15e-03 | (9926.39 ms | 52818 tok/s) step 3215/76294 | train loss 3.590994 | norm 0.1794 | lr 1.15e-03 | (9887.76 ms | 53024 tok/s) step 3216/76294 | 
train loss 3.484945 | norm 0.1806 | lr 1.15e-03 | (9890.48 ms | 53009 tok/s) step 3217/76294 | train loss 3.412703 | norm 0.2020 | lr 1.15e-03 | (9887.60 ms | 53025 tok/s) step 3218/76294 | train loss 3.519365 | norm 0.1739 | lr 1.15e-03 | (9897.25 ms | 52973 tok/s) step 3219/76294 | train loss 3.500721 | norm 0.2350 | lr 1.15e-03 | (9888.99 ms | 53017 tok/s) step 3220/76294 | train loss 3.483888 | norm 0.2594 | lr 1.15e-03 | (11198.89 ms | 46816 tok/s) step 3221/76294 | train loss 3.480592 | norm 0.1939 | lr 1.15e-03 | (9885.16 ms | 53038 tok/s) step 3222/76294 | train loss 3.524838 | norm 0.1741 | lr 1.15e-03 | (11183.80 ms | 46879 tok/s) step 3223/76294 | train loss 3.499812 | norm 0.2102 | lr 1.15e-03 | (9882.98 ms | 53050 tok/s) step 3224/76294 | train loss 3.508439 | norm 0.1697 | lr 1.15e-03 | (9880.64 ms | 53062 tok/s) step 3225/76294 | train loss 3.452127 | norm 0.1708 | lr 1.15e-03 | (9887.07 ms | 53028 tok/s) step 3226/76294 | train loss 3.508877 | norm 0.1713 | lr 1.15e-03 | (9889.24 ms | 53016 tok/s) step 3227/76294 | train loss 3.442334 | norm 0.1582 | lr 1.15e-03 | (9937.55 ms | 52758 tok/s) step 3228/76294 | train loss 3.487183 | norm 0.1774 | lr 1.15e-03 | (9891.18 ms | 53006 tok/s) step 3229/76294 | train loss 3.488515 | norm 0.1723 | lr 1.15e-03 | (9908.32 ms | 52914 tok/s) step 3230/76294 | train loss 3.479765 | norm 0.1424 | lr 1.15e-03 | (9888.50 ms | 53020 tok/s) step 3231/76294 | train loss 3.531673 | norm 0.1811 | lr 1.15e-03 | (9922.54 ms | 52838 tok/s) step 3232/76294 | train loss 3.514256 | norm 0.1863 | lr 1.15e-03 | (9919.79 ms | 52853 tok/s) step 3233/76294 | train loss 3.442868 | norm 0.1589 | lr 1.15e-03 | (9955.21 ms | 52665 tok/s) step 3234/76294 | train loss 3.503710 | norm 0.1461 | lr 1.15e-03 | (9896.29 ms | 52978 tok/s) step 3235/76294 | train loss 3.507345 | norm 0.1579 | lr 1.15e-03 | (9891.57 ms | 53004 tok/s) step 3236/76294 | train loss 3.562776 | norm 0.1578 | lr 1.15e-03 | (9903.06 ms | 52942 tok/s) step 3237/76294 | 
train loss 3.518110 | norm 0.1344 | lr 1.15e-03 | (9940.19 ms | 52744 tok/s) step 3238/76294 | train loss 3.541154 | norm 0.1708 | lr 1.15e-03 | (9891.22 ms | 53005 tok/s) step 3239/76294 | train loss 3.483134 | norm 0.1508 | lr 1.15e-03 | (9908.71 ms | 52912 tok/s) step 3240/76294 | train loss 3.622292 | norm 0.1375 | lr 1.15e-03 | (9889.23 ms | 53016 tok/s) step 3241/76294 | train loss 3.519549 | norm 0.1620 | lr 1.15e-03 | (9930.35 ms | 52797 tok/s) step 3242/76294 | train loss 3.514470 | norm 0.1743 | lr 1.15e-03 | (9890.68 ms | 53008 tok/s) step 3243/76294 | train loss 3.509405 | norm 0.1543 | lr 1.15e-03 | (9899.17 ms | 52963 tok/s) step 3244/76294 | train loss 3.484420 | norm 0.1600 | lr 1.15e-03 | (9893.90 ms | 52991 tok/s) step 3245/76294 | train loss 3.519213 | norm 0.1794 | lr 1.15e-03 | (9985.51 ms | 52505 tok/s) step 3246/76294 | train loss 3.451865 | norm 0.1656 | lr 1.15e-03 | (9894.73 ms | 52987 tok/s) step 3247/76294 | train loss 3.507917 | norm 0.1557 | lr 1.15e-03 | (9895.04 ms | 52985 tok/s) step 3248/76294 | train loss 3.574740 | norm 0.1437 | lr 1.15e-03 | (9932.48 ms | 52785 tok/s) step 3249/76294 | train loss 3.512916 | norm 0.1741 | lr 1.15e-03 | (9891.05 ms | 53006 tok/s) step 3250/76294 | train loss 3.488436 | norm 0.1580 | lr 1.15e-03 | (9933.99 ms | 52777 tok/s) val loss: 3.471848 saving model checkpoint to ./results/gpt2-350M-gqa/step_3250.pth step 3251/76294 | train loss 3.480916 | norm 0.1519 | lr 1.15e-03 | (9971.02 ms | 52581 tok/s) step 3252/76294 | train loss 3.437401 | norm 0.1710 | lr 1.15e-03 | (9871.93 ms | 53109 tok/s) step 3253/76294 | train loss 3.543797 | norm 0.1585 | lr 1.15e-03 | (9893.87 ms | 52991 tok/s) step 3254/76294 | train loss 3.443370 | norm 0.1579 | lr 1.15e-03 | (9880.34 ms | 53064 tok/s) step 3255/76294 | train loss 3.505445 | norm 0.1400 | lr 1.15e-03 | (9889.97 ms | 53012 tok/s) step 3256/76294 | train loss 3.484538 | norm 0.1643 | lr 1.15e-03 | (9898.86 ms | 52964 tok/s) step 3257/76294 | train loss 
3.498244 | norm 0.1334 | lr 1.15e-03 | (9896.95 ms | 52975 tok/s) step 3258/76294 | train loss 3.471895 | norm 0.1827 | lr 1.15e-03 | (9894.79 ms | 52986 tok/s) step 3259/76294 | train loss 3.498414 | norm 0.2028 | lr 1.15e-03 | (9911.80 ms | 52895 tok/s) step 3260/76294 | train loss 3.480897 | norm 0.2082 | lr 1.15e-03 | (9935.18 ms | 52771 tok/s) step 3261/76294 | train loss 3.459477 | norm 0.1605 | lr 1.15e-03 | (9897.25 ms | 52973 tok/s) step 3262/76294 | train loss 3.497450 | norm 0.1948 | lr 1.15e-03 | (9906.78 ms | 52922 tok/s) step 3263/76294 | train loss 3.414432 | norm 0.1698 | lr 1.15e-03 | (9898.04 ms | 52969 tok/s) step 3264/76294 | train loss 3.551487 | norm 0.1995 | lr 1.15e-03 | (9902.45 ms | 52945 tok/s) step 3265/76294 | train loss 3.438188 | norm 0.1646 | lr 1.15e-03 | (9897.98 ms | 52969 tok/s) step 3266/76294 | train loss 3.499553 | norm 0.1746 | lr 1.15e-03 | (9899.68 ms | 52960 tok/s) step 3267/76294 | train loss 3.526533 | norm 0.1494 | lr 1.15e-03 | (9896.90 ms | 52975 tok/s) step 3268/76294 | train loss 3.538827 | norm 0.1632 | lr 1.15e-03 | (9898.49 ms | 52966 tok/s) step 3269/76294 | train loss 3.583755 | norm 0.1746 | lr 1.15e-03 | (9898.95 ms | 52964 tok/s) step 3270/76294 | train loss 3.511775 | norm 0.1637 | lr 1.15e-03 | (9899.63 ms | 52960 tok/s) step 3271/76294 | train loss 3.468277 | norm 0.1584 | lr 1.15e-03 | (9896.45 ms | 52977 tok/s) step 3272/76294 | train loss 3.529246 | norm 0.2385 | lr 1.15e-03 | (9894.94 ms | 52985 tok/s) step 3273/76294 | train loss 3.524799 | norm 0.1908 | lr 1.15e-03 | (9897.56 ms | 52971 tok/s) step 3274/76294 | train loss 3.487943 | norm 0.1792 | lr 1.15e-03 | (9899.38 ms | 52962 tok/s) step 3275/76294 | train loss 3.458852 | norm 0.2042 | lr 1.15e-03 | (9893.87 ms | 52991 tok/s) step 3276/76294 | train loss 3.490286 | norm 0.1497 | lr 1.15e-03 | (9903.06 ms | 52942 tok/s) step 3277/76294 | train loss 3.495602 | norm 0.1921 | lr 1.15e-03 | (9903.79 ms | 52938 tok/s) step 3278/76294 | train loss 
3.449030 | norm 0.1689 | lr 1.15e-03 | (9891.28 ms | 53005 tok/s) step 3279/76294 | train loss 3.509779 | norm 0.1607 | lr 1.15e-03 | (9897.66 ms | 52971 tok/s) step 3280/76294 | train loss 3.426593 | norm 0.1577 | lr 1.15e-03 | (9924.96 ms | 52825 tok/s) step 3281/76294 | train loss 3.485470 | norm 0.1441 | lr 1.15e-03 | (9896.22 ms | 52979 tok/s) step 3282/76294 | train loss 3.452509 | norm 0.1373 | lr 1.15e-03 | (9901.19 ms | 52952 tok/s) step 3283/76294 | train loss 3.532576 | norm 0.1454 | lr 1.15e-03 | (9895.48 ms | 52983 tok/s) step 3284/76294 | train loss 3.480650 | norm 0.1324 | lr 1.15e-03 | (9948.28 ms | 52701 tok/s) step 3285/76294 | train loss 3.567863 | norm 0.1521 | lr 1.15e-03 | (9895.42 ms | 52983 tok/s) step 3286/76294 | train loss 3.455673 | norm 0.1333 | lr 1.15e-03 | (9895.57 ms | 52982 tok/s) step 3287/76294 | train loss 3.611379 | norm 0.3319 | lr 1.15e-03 | (9902.54 ms | 52945 tok/s) step 3288/76294 | train loss 3.482188 | norm 0.2196 | lr 1.15e-03 | (9891.48 ms | 53004 tok/s) step 3289/76294 | train loss 3.518101 | norm 0.2348 | lr 1.15e-03 | (9987.21 ms | 52496 tok/s) step 3290/76294 | train loss 3.541250 | norm 0.2066 | lr 1.15e-03 | (9892.99 ms | 52996 tok/s) step 3291/76294 | train loss 3.547619 | norm 0.1933 | lr 1.15e-03 | (9902.90 ms | 52943 tok/s) step 3292/76294 | train loss 3.412718 | norm 0.2031 | lr 1.15e-03 | (9896.19 ms | 52979 tok/s) step 3293/76294 | train loss 3.555154 | norm 0.1688 | lr 1.15e-03 | (9901.73 ms | 52949 tok/s) step 3294/76294 | train loss 3.463046 | norm 0.1898 | lr 1.15e-03 | (10020.04 ms | 52324 tok/s) step 3295/76294 | train loss 3.469390 | norm 0.1854 | lr 1.15e-03 | (9900.04 ms | 52958 tok/s) step 3296/76294 | train loss 3.582014 | norm 0.1698 | lr 1.15e-03 | (9893.84 ms | 52991 tok/s) step 3297/76294 | train loss 3.485330 | norm 0.1775 | lr 1.15e-03 | (9959.57 ms | 52642 tok/s) step 3298/76294 | train loss 3.541929 | norm 0.1459 | lr 1.15e-03 | (9888.65 ms | 53019 tok/s) step 3299/76294 | train loss 
3.464517 | norm 0.1534 | lr 1.15e-03 | (10010.95 ms | 52371 tok/s) step 3300/76294 | train loss 3.483930 | norm 0.1654 | lr 1.15e-03 | (9892.71 ms | 52997 tok/s) step 3301/76294 | train loss 3.470453 | norm 0.1427 | lr 1.15e-03 | (9897.06 ms | 52974 tok/s) step 3302/76294 | train loss 3.530680 | norm 0.1614 | lr 1.15e-03 | (9929.62 ms | 52800 tok/s) step 3303/76294 | train loss 3.473915 | norm 0.1690 | lr 1.15e-03 | (9898.60 ms | 52966 tok/s) step 3304/76294 | train loss 3.528280 | norm 0.1838 | lr 1.15e-03 | (9896.22 ms | 52979 tok/s) step 3305/76294 | train loss 3.421693 | norm 0.1716 | lr 1.15e-03 | (9917.68 ms | 52864 tok/s) step 3306/76294 | train loss 3.685299 | norm 0.2513 | lr 1.15e-03 | (9892.57 ms | 52998 tok/s) step 3307/76294 | train loss 3.419838 | norm 0.3376 | lr 1.15e-03 | (9892.16 ms | 53000 tok/s) step 3308/76294 | train loss 3.520385 | norm 0.2175 | lr 1.15e-03 | (9893.05 ms | 52996 tok/s) step 3309/76294 | train loss 3.449241 | norm 0.1972 | lr 1.15e-03 | (9894.10 ms | 52990 tok/s) step 3310/76294 | train loss 3.509461 | norm 0.1675 | lr 1.15e-03 | (9888.90 ms | 53018 tok/s) step 3311/76294 | train loss 3.465498 | norm 0.1774 | lr 1.15e-03 | (9898.80 ms | 52965 tok/s) step 3312/76294 | train loss 3.500585 | norm 0.1905 | lr 1.15e-03 | (9933.67 ms | 52779 tok/s) step 3313/76294 | train loss 3.451093 | norm 0.1465 | lr 1.15e-03 | (9895.87 ms | 52980 tok/s) step 3314/76294 | train loss 3.517794 | norm 0.1792 | lr 1.15e-03 | (9894.49 ms | 52988 tok/s) step 3315/76294 | train loss 3.490292 | norm 0.1644 | lr 1.15e-03 | (9892.84 ms | 52997 tok/s) step 3316/76294 | train loss 3.448818 | norm 0.1318 | lr 1.15e-03 | (9899.01 ms | 52964 tok/s) step 3317/76294 | train loss 3.485640 | norm 0.1697 | lr 1.15e-03 | (11277.99 ms | 46488 tok/s) step 3318/76294 | train loss 3.528270 | norm 0.1613 | lr 1.15e-03 | (9874.58 ms | 53095 tok/s) step 3319/76294 | train loss 3.639742 | norm 0.1553 | lr 1.15e-03 | (9887.83 ms | 53024 tok/s) step 3320/76294 | train loss 
3.632946 | norm 0.1519 | lr 1.15e-03 | (9884.99 ms | 53039 tok/s) step 3321/76294 | train loss 3.527391 | norm 0.1810 | lr 1.15e-03 | (9888.35 ms | 53021 tok/s) step 3322/76294 | train loss 3.478123 | norm 0.1765 | lr 1.15e-03 | (9923.18 ms | 52835 tok/s) step 3323/76294 | train loss 3.513769 | norm 0.1339 | lr 1.15e-03 | (9887.21 ms | 53027 tok/s) step 3324/76294 | train loss 3.441430 | norm 0.1482 | lr 1.15e-03 | (9888.38 ms | 53021 tok/s) step 3325/76294 | train loss 3.434669 | norm 0.1440 | lr 1.15e-03 | (9890.87 ms | 53007 tok/s) step 3326/76294 | train loss 3.395267 | norm 0.1527 | lr 1.15e-03 | (9893.24 ms | 52995 tok/s) step 3327/76294 | train loss 3.509781 | norm 0.1512 | lr 1.15e-03 | (9890.17 ms | 53011 tok/s) step 3328/76294 | train loss 3.463847 | norm 0.1565 | lr 1.15e-03 | (9895.82 ms | 52981 tok/s) step 3329/76294 | train loss 3.470828 | norm 0.1843 | lr 1.15e-03 | (9885.58 ms | 53036 tok/s) step 3330/76294 | train loss 3.434395 | norm 0.1547 | lr 1.15e-03 | (9893.01 ms | 52996 tok/s) step 3331/76294 | train loss 3.467100 | norm 0.1496 | lr 1.15e-03 | (9894.63 ms | 52987 tok/s) step 3332/76294 | train loss 3.494917 | norm 0.1586 | lr 1.15e-03 | (9894.81 ms | 52986 tok/s) step 3333/76294 | train loss 3.428874 | norm 0.1453 | lr 1.15e-03 | (9893.67 ms | 52992 tok/s) step 3334/76294 | train loss 3.575320 | norm 0.1513 | lr 1.15e-03 | (9894.50 ms | 52988 tok/s) step 3335/76294 | train loss 3.502370 | norm 0.1691 | lr 1.15e-03 | (9959.37 ms | 52643 tok/s) step 3336/76294 | train loss 3.484649 | norm 0.1639 | lr 1.15e-03 | (9886.52 ms | 53031 tok/s) step 3337/76294 | train loss 3.467709 | norm 0.1493 | lr 1.15e-03 | (9923.74 ms | 52832 tok/s) step 3338/76294 | train loss 3.533258 | norm 0.1517 | lr 1.15e-03 | (9896.26 ms | 52978 tok/s) step 3339/76294 | train loss 3.498330 | norm 0.1774 | lr 1.15e-03 | (9898.28 ms | 52968 tok/s) step 3340/76294 | train loss 3.505147 | norm 0.1435 | lr 1.15e-03 | (9955.63 ms | 52662 tok/s) step 3341/76294 | train loss 
3.410787 | norm 0.1559 | lr 1.15e-03 | (9893.22 ms | 52995 tok/s) step 3342/76294 | train loss 3.435266 | norm 0.1603 | lr 1.15e-03 | (9889.99 ms | 53012 tok/s) step 3343/76294 | train loss 3.497320 | norm 0.1535 | lr 1.15e-03 | (9891.55 ms | 53004 tok/s) step 3344/76294 | train loss 3.442964 | norm 0.1564 | lr 1.15e-03 | (9925.55 ms | 52822 tok/s) step 3345/76294 | train loss 3.455158 | norm 0.1739 | lr 1.15e-03 | (9895.21 ms | 52984 tok/s) step 3346/76294 | train loss 3.446257 | norm 0.1472 | lr 1.15e-03 | (9896.95 ms | 52975 tok/s) step 3347/76294 | train loss 3.460460 | norm 0.1591 | lr 1.15e-03 | (10269.37 ms | 51054 tok/s) step 3348/76294 | train loss 3.520812 | norm 0.1669 | lr 1.15e-03 | (10162.67 ms | 51590 tok/s) step 3349/76294 | train loss 3.469028 | norm 0.1433 | lr 1.15e-03 | (9887.28 ms | 53027 tok/s) step 3350/76294 | train loss 3.474469 | norm 0.1803 | lr 1.15e-03 | (9890.31 ms | 53010 tok/s) step 3351/76294 | train loss 3.482472 | norm 0.1809 | lr 1.15e-03 | (9888.58 ms | 53020 tok/s) step 3352/76294 | train loss 3.482566 | norm 0.1499 | lr 1.15e-03 | (9933.69 ms | 52779 tok/s) step 3353/76294 | train loss 3.511701 | norm 0.1775 | lr 1.15e-03 | (9890.77 ms | 53008 tok/s) step 3354/76294 | train loss 3.378550 | norm 0.1664 | lr 1.15e-03 | (9891.99 ms | 53001 tok/s) step 3355/76294 | train loss 3.528244 | norm 0.1550 | lr 1.15e-03 | (9894.03 ms | 52990 tok/s) step 3356/76294 | train loss 3.508738 | norm 0.1719 | lr 1.15e-03 | (9895.58 ms | 52982 tok/s) step 3357/76294 | train loss 3.456542 | norm 0.2115 | lr 1.15e-03 | (9888.86 ms | 53018 tok/s) step 3358/76294 | train loss 3.499683 | norm 0.2260 | lr 1.15e-03 | (10705.94 ms | 48972 tok/s) step 3359/76294 | train loss 3.473755 | norm 0.1729 | lr 1.15e-03 | (9898.62 ms | 52966 tok/s) step 3360/76294 | train loss 3.471488 | norm 0.1792 | lr 1.15e-03 | (9892.05 ms | 53001 tok/s) step 3361/76294 | train loss 3.522243 | norm 0.1920 | lr 1.15e-03 | (9889.04 ms | 53017 tok/s) step 3362/76294 | train loss 
3.454570 | norm 0.1741 | lr 1.15e-03 | (9892.48 ms | 52999 tok/s) step 3363/76294 | train loss 3.567032 | norm 0.2138 | lr 1.15e-03 | (9891.57 ms | 53004 tok/s) step 3364/76294 | train loss 3.484140 | norm 0.1947 | lr 1.15e-03 | (9890.82 ms | 53008 tok/s) step 3365/76294 | train loss 3.458286 | norm 0.1539 | lr 1.15e-03 | (9893.03 ms | 52996 tok/s) step 3366/76294 | train loss 3.456242 | norm 0.1771 | lr 1.15e-03 | (9928.63 ms | 52806 tok/s) step 3367/76294 | train loss 3.471203 | norm 0.1679 | lr 1.15e-03 | (9889.89 ms | 53012 tok/s) step 3368/76294 | train loss 3.487463 | norm 0.1653 | lr 1.15e-03 | (9894.84 ms | 52986 tok/s) step 3369/76294 | train loss 3.405637 | norm 0.1766 | lr 1.15e-03 | (9888.55 ms | 53020 tok/s) step 3370/76294 | train loss 3.506690 | norm 0.1725 | lr 1.15e-03 | (9925.54 ms | 52822 tok/s) step 3371/76294 | train loss 3.406544 | norm 0.1801 | lr 1.15e-03 | (9890.46 ms | 53009 tok/s) step 3372/76294 | train loss 3.531124 | norm 0.1997 | lr 1.15e-03 | (9889.25 ms | 53016 tok/s) step 3373/76294 | train loss 3.541200 | norm 0.1622 | lr 1.15e-03 | (9890.70 ms | 53008 tok/s) step 3374/76294 | train loss 3.526412 | norm 0.2124 | lr 1.15e-03 | (9892.76 ms | 52997 tok/s) step 3375/76294 | train loss 3.494190 | norm 0.2226 | lr 1.15e-03 | (9889.00 ms | 53017 tok/s) step 3376/76294 | train loss 3.500707 | norm 0.1855 | lr 1.15e-03 | (9890.03 ms | 53012 tok/s) step 3377/76294 | train loss 3.456606 | norm 0.1546 | lr 1.15e-03 | (9907.65 ms | 52917 tok/s) step 3378/76294 | train loss 3.493565 | norm 0.1662 | lr 1.15e-03 | (9885.60 ms | 53036 tok/s) step 3379/76294 | train loss 3.531843 | norm 0.1563 | lr 1.15e-03 | (9885.74 ms | 53035 tok/s) step 3380/76294 | train loss 3.484527 | norm 0.1489 | lr 1.15e-03 | (9891.51 ms | 53004 tok/s) step 3381/76294 | train loss 3.509099 | norm 0.1487 | lr 1.15e-03 | (9887.04 ms | 53028 tok/s) step 3382/76294 | train loss 3.467543 | norm 0.1521 | lr 1.15e-03 | (9887.84 ms | 53024 tok/s) step 3383/76294 | train loss 
3.490728 | norm 0.1420 | lr 1.15e-03 | (9949.09 ms | 52697 tok/s) step 3384/76294 | train loss 3.461865 | norm 0.1528 | lr 1.15e-03 | (9887.66 ms | 53024 tok/s) step 3385/76294 | train loss 3.412710 | norm 0.1693 | lr 1.15e-03 | (9895.41 ms | 52983 tok/s) step 3386/76294 | train loss 3.507506 | norm 0.1550 | lr 1.15e-03 | (9916.10 ms | 52872 tok/s) step 3387/76294 | train loss 3.484228 | norm 0.2054 | lr 1.15e-03 | (9888.79 ms | 53018 tok/s) step 3388/76294 | train loss 3.441674 | norm 0.1622 | lr 1.15e-03 | (9894.30 ms | 52989 tok/s) step 3389/76294 | train loss 3.447676 | norm 0.1544 | lr 1.15e-03 | (9885.40 ms | 53037 tok/s) step 3390/76294 | train loss 3.441998 | norm 0.1623 | lr 1.15e-03 | (9894.58 ms | 52987 tok/s) step 3391/76294 | train loss 3.486115 | norm 0.1636 | lr 1.15e-03 | (9882.99 ms | 53050 tok/s) step 3392/76294 | train loss 3.436569 | norm 0.1763 | lr 1.15e-03 | (9900.49 ms | 52956 tok/s) step 3393/76294 | train loss 3.539151 | norm 0.1471 | lr 1.15e-03 | (9904.58 ms | 52934 tok/s) step 3394/76294 | train loss 3.441025 | norm 0.1758 | lr 1.15e-03 | (9890.49 ms | 53009 tok/s) step 3395/76294 | train loss 3.446683 | norm 0.1702 | lr 1.15e-03 | (9921.42 ms | 52844 tok/s) step 3396/76294 | train loss 3.469064 | norm 0.1636 | lr 1.15e-03 | (9904.99 ms | 52932 tok/s) step 3397/76294 | train loss 3.477383 | norm 0.1578 | lr 1.15e-03 | (9894.92 ms | 52986 tok/s) step 3398/76294 | train loss 3.474011 | norm 0.1576 | lr 1.15e-03 | (9947.29 ms | 52707 tok/s) step 3399/76294 | train loss 3.476520 | norm 0.1469 | lr 1.15e-03 | (9889.90 ms | 53012 tok/s) step 3400/76294 | train loss 3.493815 | norm 0.1617 | lr 1.15e-03 | (9885.55 ms | 53036 tok/s) step 3401/76294 | train loss 3.483871 | norm 0.1548 | lr 1.15e-03 | (9947.97 ms | 52703 tok/s) step 3402/76294 | train loss 3.488687 | norm 0.1675 | lr 1.15e-03 | (9888.64 ms | 53019 tok/s) step 3403/76294 | train loss 3.546115 | norm 0.1420 | lr 1.15e-03 | (9896.64 ms | 52976 tok/s) step 3404/76294 | train loss 
3.415487 | norm 0.1458 | lr 1.15e-03 | (9893.92 ms | 52991 tok/s) step 3405/76294 | train loss 3.465838 | norm 0.1661 | lr 1.15e-03 | (9890.14 ms | 53011 tok/s) step 3406/76294 | train loss 3.471514 | norm 0.1645 | lr 1.15e-03 | (9895.77 ms | 52981 tok/s) step 3407/76294 | train loss 3.440027 | norm 0.1698 | lr 1.15e-03 | (9891.09 ms | 53006 tok/s) step 3408/76294 | train loss 3.504267 | norm 0.1633 | lr 1.15e-03 | (9895.86 ms | 52981 tok/s) step 3409/76294 | train loss 3.437793 | norm 0.1696 | lr 1.15e-03 | (9892.93 ms | 52996 tok/s) step 3410/76294 | train loss 3.507181 | norm 0.1556 | lr 1.15e-03 | (9926.77 ms | 52816 tok/s) step 3411/76294 | train loss 3.452554 | norm 0.1505 | lr 1.15e-03 | (9893.03 ms | 52996 tok/s) step 3412/76294 | train loss 3.423002 | norm 0.1624 | lr 1.15e-03 | (9899.80 ms | 52959 tok/s) step 3413/76294 | train loss 3.429890 | norm 0.1990 | lr 1.15e-03 | (9968.20 ms | 52596 tok/s) step 3414/76294 | train loss 3.484876 | norm 0.2453 | lr 1.15e-03 | (9890.93 ms | 53007 tok/s) step 3415/76294 | train loss 3.449930 | norm 0.2126 | lr 1.15e-03 | (11123.42 ms | 47134 tok/s) step 3416/76294 | train loss 3.414566 | norm 0.2200 | lr 1.15e-03 | (9901.13 ms | 52952 tok/s) step 3417/76294 | train loss 3.496809 | norm 0.1743 | lr 1.15e-03 | (9887.17 ms | 53027 tok/s) step 3418/76294 | train loss 3.470781 | norm 0.2118 | lr 1.15e-03 | (9880.34 ms | 53064 tok/s) step 3419/76294 | train loss 3.497123 | norm 0.2186 | lr 1.15e-03 | (9893.94 ms | 52991 tok/s) step 3420/76294 | train loss 3.421217 | norm 0.1896 | lr 1.15e-03 | (9886.91 ms | 53028 tok/s) step 3421/76294 | train loss 3.466701 | norm 0.1845 | lr 1.15e-03 | (9891.82 ms | 53002 tok/s) step 3422/76294 | train loss 3.477839 | norm 0.1796 | lr 1.15e-03 | (9885.75 ms | 53035 tok/s) step 3423/76294 | train loss 3.432859 | norm 0.1752 | lr 1.15e-03 | (9897.95 ms | 52969 tok/s) step 3424/76294 | train loss 3.437549 | norm 0.1584 | lr 1.15e-03 | (9888.35 ms | 53021 tok/s) step 3425/76294 | train loss 
3.559090 | norm 0.1651 | lr 1.15e-03 | (9892.99 ms | 52996 tok/s) step 3426/76294 | train loss 3.443406 | norm 0.1921 | lr 1.15e-03 | (9887.83 ms | 53024 tok/s) step 3427/76294 | train loss 3.546447 | norm 0.1810 | lr 1.15e-03 | (9931.88 ms | 52788 tok/s) step 3428/76294 | train loss 3.394876 | norm 0.2935 | lr 1.15e-03 | (9889.59 ms | 53014 tok/s) step 3429/76294 | train loss 3.519180 | norm 0.2967 | lr 1.15e-03 | (9905.91 ms | 52927 tok/s) step 3430/76294 | train loss 3.419566 | norm 0.2596 | lr 1.15e-03 | (9893.70 ms | 52992 tok/s) step 3431/76294 | train loss 3.465342 | norm 0.1889 | lr 1.15e-03 | (9925.47 ms | 52822 tok/s) step 3432/76294 | train loss 3.456746 | norm 0.1767 | lr 1.15e-03 | (9890.13 ms | 53011 tok/s) step 3433/76294 | train loss 3.532055 | norm 0.1640 | lr 1.15e-03 | (9889.09 ms | 53017 tok/s) step 3434/76294 | train loss 3.580288 | norm 0.1538 | lr 1.15e-03 | (9904.51 ms | 52934 tok/s) step 3435/76294 | train loss 3.452745 | norm 0.1426 | lr 1.15e-03 | (9932.40 ms | 52786 tok/s) step 3436/76294 | train loss 3.491744 | norm 0.1438 | lr 1.15e-03 | (9889.81 ms | 53013 tok/s) step 3437/76294 | train loss 3.448513 | norm 0.1464 | lr 1.15e-03 | (9922.54 ms | 52838 tok/s) step 3438/76294 | train loss 3.451607 | norm 0.1453 | lr 1.15e-03 | (9961.38 ms | 52632 tok/s) step 3439/76294 | train loss 3.445934 | norm 0.1493 | lr 1.15e-03 | (9903.49 ms | 52940 tok/s) step 3440/76294 | train loss 3.441071 | norm 0.1383 | lr 1.15e-03 | (9951.19 ms | 52686 tok/s) step 3441/76294 | train loss 3.487963 | norm 0.1533 | lr 1.15e-03 | (9887.99 ms | 53023 tok/s) step 3442/76294 | train loss 3.453216 | norm 0.1369 | lr 1.15e-03 | (9947.36 ms | 52706 tok/s) step 3443/76294 | train loss 3.431157 | norm 0.1436 | lr 1.15e-03 | (9892.32 ms | 52999 tok/s) step 3444/76294 | train loss 3.425249 | norm 0.1476 | lr 1.15e-03 | (9885.73 ms | 53035 tok/s) step 3445/76294 | train loss 3.483181 | norm 0.1486 | lr 1.15e-03 | (9937.87 ms | 52757 tok/s) step 3446/76294 | train loss 
3.506631 | norm 0.1529 | lr 1.15e-03 | (9927.81 ms | 52810 tok/s) step 3447/76294 | train loss 3.528932 | norm 0.1368 | lr 1.14e-03 | (9889.36 ms | 53015 tok/s) step 3448/76294 | train loss 3.490139 | norm 0.1384 | lr 1.14e-03 | (9929.33 ms | 52802 tok/s) step 3449/76294 | train loss 3.463702 | norm 0.1577 | lr 1.14e-03 | (9922.53 ms | 52838 tok/s) step 3450/76294 | train loss 3.482556 | norm 0.1589 | lr 1.14e-03 | (9903.37 ms | 52940 tok/s) step 3451/76294 | train loss 3.461980 | norm 0.1759 | lr 1.14e-03 | (9892.46 ms | 52999 tok/s) step 3452/76294 | train loss 3.502083 | norm 0.1418 | lr 1.14e-03 | (9887.23 ms | 53027 tok/s) step 3453/76294 | train loss 3.524287 | norm 0.1727 | lr 1.14e-03 | (9896.43 ms | 52977 tok/s) step 3454/76294 | train loss 3.456739 | norm 0.1663 | lr 1.14e-03 | (9886.61 ms | 53030 tok/s) step 3455/76294 | train loss 3.537950 | norm 0.1555 | lr 1.14e-03 | (9895.85 ms | 52981 tok/s) step 3456/76294 | train loss 3.477866 | norm 0.1900 | lr 1.14e-03 | (9927.12 ms | 52814 tok/s) step 3457/76294 | train loss 3.426161 | norm 0.1803 | lr 1.14e-03 | (9892.03 ms | 53001 tok/s) step 3458/76294 | train loss 3.424667 | norm 0.1683 | lr 1.14e-03 | (9894.74 ms | 52987 tok/s) step 3459/76294 | train loss 3.448019 | norm 0.1475 | lr 1.14e-03 | (9892.65 ms | 52998 tok/s) step 3460/76294 | train loss 3.434245 | norm 0.1728 | lr 1.14e-03 | (9894.18 ms | 52990 tok/s) step 3461/76294 | train loss 3.509099 | norm 0.1544 | lr 1.14e-03 | (9891.69 ms | 53003 tok/s) step 3462/76294 | train loss 3.507007 | norm 0.1472 | lr 1.14e-03 | (9895.14 ms | 52984 tok/s) step 3463/76294 | train loss 3.486060 | norm 0.1540 | lr 1.14e-03 | (9890.63 ms | 53009 tok/s) step 3464/76294 | train loss 3.472589 | norm 0.1364 | lr 1.14e-03 | (9893.96 ms | 52991 tok/s) step 3465/76294 | train loss 3.508689 | norm 0.1670 | lr 1.14e-03 | (9930.76 ms | 52794 tok/s) step 3466/76294 | train loss 3.451899 | norm 0.1566 | lr 1.14e-03 | (9893.45 ms | 52993 tok/s) step 3467/76294 | train loss 
3.431657 | norm 0.1766 | lr 1.14e-03 | (9959.12 ms | 52644 tok/s) step 3468/76294 | train loss 3.470263 | norm 0.1686 | lr 1.14e-03 | (9899.23 ms | 52963 tok/s) step 3469/76294 | train loss 3.415417 | norm 0.1843 | lr 1.14e-03 | (9899.11 ms | 52963 tok/s) step 3470/76294 | train loss 3.536372 | norm 0.1934 | lr 1.14e-03 | (9889.03 ms | 53017 tok/s) step 3471/76294 | train loss 3.491846 | norm 0.1635 | lr 1.14e-03 | (9898.51 ms | 52966 tok/s) step 3472/76294 | train loss 3.450809 | norm 0.1848 | lr 1.14e-03 | (9885.92 ms | 53034 tok/s) step 3473/76294 | train loss 3.539025 | norm 0.1711 | lr 1.14e-03 | (9920.15 ms | 52851 tok/s) step 3474/76294 | train loss 3.398749 | norm 0.1617 | lr 1.14e-03 | (9897.87 ms | 52970 tok/s) step 3475/76294 | train loss 3.491231 | norm 0.1742 | lr 1.14e-03 | (9936.79 ms | 52762 tok/s) step 3476/76294 | train loss 3.478175 | norm 0.1523 | lr 1.14e-03 | (9890.72 ms | 53008 tok/s) step 3477/76294 | train loss 3.404456 | norm 0.1682 | lr 1.14e-03 | (9902.56 ms | 52945 tok/s) step 3478/76294 | train loss 3.471104 | norm 0.1548 | lr 1.14e-03 | (9924.16 ms | 52829 tok/s) step 3479/76294 | train loss 3.453837 | norm 0.1497 | lr 1.14e-03 | (9964.91 ms | 52613 tok/s) step 3480/76294 | train loss 3.472587 | norm 0.1329 | lr 1.14e-03 | (9912.04 ms | 52894 tok/s) step 3481/76294 | train loss 3.428538 | norm 0.1605 | lr 1.14e-03 | (9954.68 ms | 52668 tok/s) step 3482/76294 | train loss 3.449193 | norm 0.1625 | lr 1.14e-03 | (9886.26 ms | 53032 tok/s) step 3483/76294 | train loss 3.442451 | norm 0.1396 | lr 1.14e-03 | (9958.88 ms | 52645 tok/s) step 3484/76294 | train loss 3.436223 | norm 0.1490 | lr 1.14e-03 | (9895.42 ms | 52983 tok/s) step 3485/76294 | train loss 3.542698 | norm 0.1499 | lr 1.14e-03 | (9919.81 ms | 52853 tok/s) step 3486/76294 | train loss 3.449766 | norm 0.1603 | lr 1.14e-03 | (9904.43 ms | 52935 tok/s) step 3487/76294 | train loss 3.483584 | norm 0.1630 | lr 1.14e-03 | (9897.22 ms | 52973 tok/s) step 3488/76294 | train loss 
3.480806 | norm 0.1630 | lr 1.14e-03 | (9892.67 ms | 52998 tok/s) step 3489/76294 | train loss 3.436461 | norm 0.1837 | lr 1.14e-03 | (9900.86 ms | 52954 tok/s) step 3490/76294 | train loss 3.495491 | norm 0.1962 | lr 1.14e-03 | (9909.24 ms | 52909 tok/s) step 3491/76294 | train loss 3.463007 | norm 0.1933 | lr 1.14e-03 | (9966.41 ms | 52605 tok/s) step 3492/76294 | train loss 3.472002 | norm 0.1910 | lr 1.14e-03 | (9886.29 ms | 53032 tok/s) step 3493/76294 | train loss 3.482264 | norm 0.1528 | lr 1.14e-03 | (9967.59 ms | 52599 tok/s) step 3494/76294 | train loss 3.466284 | norm 0.1540 | lr 1.14e-03 | (9894.38 ms | 52988 tok/s) step 3495/76294 | train loss 3.473897 | norm 0.1650 | lr 1.14e-03 | (9897.95 ms | 52969 tok/s) step 3496/76294 | train loss 3.460444 | norm 0.1448 | lr 1.14e-03 | (9954.15 ms | 52670 tok/s) step 3497/76294 | train loss 3.459843 | norm 0.1665 | lr 1.14e-03 | (9891.64 ms | 53003 tok/s) step 3498/76294 | train loss 3.492814 | norm 0.2264 | lr 1.14e-03 | (9893.16 ms | 52995 tok/s) step 3499/76294 | train loss 3.427118 | norm 0.1981 | lr 1.14e-03 | (9912.24 ms | 52893 tok/s) step 3500/76294 | train loss 3.488379 | norm 0.1915 | lr 1.14e-03 | (9898.55 ms | 52966 tok/s) val loss: 3.452205 saving model checkpoint to ./results/gpt2-350M-gqa/step_3500.pth step 3501/76294 | train loss 3.606283 | norm 0.1812 | lr 1.14e-03 | (9989.25 ms | 52485 tok/s) step 3502/76294 | train loss 3.526917 | norm 0.2702 | lr 1.14e-03 | (9884.42 ms | 53042 tok/s) step 3503/76294 | train loss 3.504137 | norm 0.2059 | lr 1.14e-03 | (9904.81 ms | 52933 tok/s) step 3504/76294 | train loss 3.498272 | norm 0.1952 | lr 1.14e-03 | (9883.70 ms | 53046 tok/s) step 3505/76294 | train loss 3.478067 | norm 0.1757 | lr 1.14e-03 | (9930.27 ms | 52797 tok/s) step 3506/76294 | train loss 3.460420 | norm 0.1889 | lr 1.14e-03 | (9884.04 ms | 53044 tok/s) step 3507/76294 | train loss 3.505116 | norm 0.1894 | lr 1.14e-03 | (9895.25 ms | 52984 tok/s) step 3508/76294 | train loss 3.447245 | norm 
0.1808 | lr 1.14e-03 | (9886.00 ms | 53033 tok/s) step 3509/76294 | train loss 3.509625 | norm 0.1719 | lr 1.14e-03 | (9931.05 ms | 52793 tok/s) step 3510/76294 | train loss 3.416519 | norm 0.1966 | lr 1.14e-03 | (9886.57 ms | 53030 tok/s) step 3511/76294 | train loss 3.512739 | norm 0.1600 | lr 1.14e-03 | (9895.72 ms | 52981 tok/s) step 3512/76294 | train loss 3.422668 | norm 0.1860 | lr 1.14e-03 | (9890.14 ms | 53011 tok/s) step 3513/76294 | train loss 3.513492 | norm 0.1858 | lr 1.14e-03 | (11050.53 ms | 47445 tok/s) step 3514/76294 | train loss 3.501446 | norm 0.1488 | lr 1.14e-03 | (9883.16 ms | 53049 tok/s) step 3515/76294 | train loss 3.491237 | norm 0.1477 | lr 1.14e-03 | (9890.76 ms | 53008 tok/s) step 3516/76294 | train loss 3.518145 | norm 0.1455 | lr 1.14e-03 | (9884.84 ms | 53040 tok/s) step 3517/76294 | train loss 3.467353 | norm 0.1485 | lr 1.14e-03 | (9905.47 ms | 52929 tok/s) step 3518/76294 | train loss 3.457596 | norm 0.1430 | lr 1.14e-03 | (9951.47 ms | 52684 tok/s) step 3519/76294 | train loss 3.476224 | norm 0.1565 | lr 1.14e-03 | (9887.22 ms | 53027 tok/s) step 3520/76294 | train loss 3.412922 | norm 0.1545 | lr 1.14e-03 | (9910.62 ms | 52902 tok/s) step 3521/76294 | train loss 3.492988 | norm 0.1552 | lr 1.14e-03 | (9916.51 ms | 52870 tok/s) step 3522/76294 | train loss 3.450028 | norm 0.1381 | lr 1.14e-03 | (9903.56 ms | 52939 tok/s) step 3523/76294 | train loss 3.449977 | norm 0.1574 | lr 1.14e-03 | (9902.98 ms | 52942 tok/s) step 3524/76294 | train loss 3.434832 | norm 0.1587 | lr 1.14e-03 | (9897.15 ms | 52974 tok/s) step 3525/76294 | train loss 3.458322 | norm 0.1882 | lr 1.14e-03 | (9940.13 ms | 52745 tok/s) step 3526/76294 | train loss 3.358660 | norm 0.1553 | lr 1.14e-03 | (11455.77 ms | 45766 tok/s) step 3527/76294 | train loss 3.378536 | norm 0.2474 | lr 1.14e-03 | (9887.89 ms | 53023 tok/s) step 3528/76294 | train loss 3.501499 | norm 0.2227 | lr 1.14e-03 | (9882.67 ms | 53051 tok/s) step 3529/76294 | train loss 3.445961 | norm 
0.1849 | lr 1.14e-03 | (9947.62 ms | 52705 tok/s) step 3530/76294 | train loss 3.417034 | norm 0.1825 | lr 1.14e-03 | (9887.44 ms | 53026 tok/s) step 3531/76294 | train loss 3.556553 | norm 0.1803 | lr 1.14e-03 | (9886.66 ms | 53030 tok/s) step 3532/76294 | train loss 3.477721 | norm 0.1653 | lr 1.14e-03 | (9890.62 ms | 53009 tok/s) step 3533/76294 | train loss 3.478452 | norm 0.1860 | lr 1.14e-03 | (11498.62 ms | 45596 tok/s) step 3534/76294 | train loss 3.451019 | norm 0.1939 | lr 1.14e-03 | (11181.77 ms | 46888 tok/s) step 3535/76294 | train loss 3.472175 | norm 0.1494 | lr 1.14e-03 | (9874.72 ms | 53094 tok/s) step 3536/76294 | train loss 3.453333 | norm 0.1720 | lr 1.14e-03 | (9885.15 ms | 53038 tok/s) step 3537/76294 | train loss 3.373583 | norm 0.1353 | lr 1.14e-03 | (9885.09 ms | 53038 tok/s) step 3538/76294 | train loss 3.445222 | norm 0.1715 | lr 1.14e-03 | (9881.65 ms | 53057 tok/s) step 3539/76294 | train loss 3.413491 | norm 0.1645 | lr 1.14e-03 | (10768.03 ms | 48689 tok/s) step 3540/76294 | train loss 3.486570 | norm 0.1681 | lr 1.14e-03 | (9877.70 ms | 53078 tok/s) step 3541/76294 | train loss 3.457104 | norm 0.1616 | lr 1.14e-03 | (9887.06 ms | 53028 tok/s) step 3542/76294 | train loss 3.478466 | norm 0.1623 | lr 1.14e-03 | (9883.84 ms | 53045 tok/s) step 3543/76294 | train loss 3.474277 | norm 0.1502 | lr 1.14e-03 | (9913.03 ms | 52889 tok/s) step 3544/76294 | train loss 3.461315 | norm 0.1658 | lr 1.14e-03 | (9885.12 ms | 53038 tok/s) step 3545/76294 | train loss 3.463750 | norm 0.1548 | lr 1.14e-03 | (9895.84 ms | 52981 tok/s) step 3546/76294 | train loss 3.405656 | norm 0.1594 | lr 1.14e-03 | (9884.98 ms | 53039 tok/s) step 3547/76294 | train loss 3.445206 | norm 0.1535 | lr 1.14e-03 | (9895.30 ms | 52984 tok/s) step 3548/76294 | train loss 3.385145 | norm 0.1626 | lr 1.14e-03 | (9893.74 ms | 52992 tok/s) step 3549/76294 | train loss 3.465677 | norm 0.1647 | lr 1.14e-03 | (9931.41 ms | 52791 tok/s) step 3550/76294 | train loss 3.440729 | norm 
0.1531 | lr 1.14e-03 | (9886.82 ms | 53029 tok/s) step 3551/76294 | train loss 3.433115 | norm 0.1492 | lr 1.14e-03 | (9940.31 ms | 52744 tok/s) step 3552/76294 | train loss 3.487429 | norm 0.1612 | lr 1.14e-03 | (9890.91 ms | 53007 tok/s) step 3553/76294 | train loss 3.451612 | norm 0.1664 | lr 1.14e-03 | (9898.32 ms | 52967 tok/s) step 3554/76294 | train loss 3.478586 | norm 0.1674 | lr 1.14e-03 | (9894.88 ms | 52986 tok/s) step 3555/76294 | train loss 3.516355 | norm 0.1614 | lr 1.14e-03 | (9906.49 ms | 52924 tok/s) step 3556/76294 | train loss 3.437116 | norm 0.1721 | lr 1.14e-03 | (9902.41 ms | 52946 tok/s) step 3557/76294 | train loss 3.434823 | norm 0.1349 | lr 1.14e-03 | (9937.36 ms | 52759 tok/s) step 3558/76294 | train loss 3.450801 | norm 0.1576 | lr 1.14e-03 | (9894.48 ms | 52988 tok/s) step 3559/76294 | train loss 3.576509 | norm 0.1735 | lr 1.14e-03 | (9894.94 ms | 52985 tok/s) step 3560/76294 | train loss 3.414556 | norm 0.1724 | lr 1.14e-03 | (9931.60 ms | 52790 tok/s) step 3561/76294 | train loss 3.473323 | norm 0.1571 | lr 1.14e-03 | (9904.84 ms | 52933 tok/s) step 3562/76294 | train loss 3.453693 | norm 0.1799 | lr 1.14e-03 | (9888.31 ms | 53021 tok/s) step 3563/76294 | train loss 3.500119 | norm 0.1678 | lr 1.14e-03 | (9961.84 ms | 52630 tok/s) step 3564/76294 | train loss 3.571810 | norm 0.1747 | lr 1.14e-03 | (9888.97 ms | 53017 tok/s) step 3565/76294 | train loss 3.511141 | norm 0.1804 | lr 1.14e-03 | (9892.76 ms | 52997 tok/s) step 3566/76294 | train loss 3.553502 | norm 0.2203 | lr 1.14e-03 | (9954.95 ms | 52666 tok/s) step 3567/76294 | train loss 3.450804 | norm 0.1952 | lr 1.14e-03 | (9894.15 ms | 52990 tok/s) step 3568/76294 | train loss 3.406430 | norm 0.1768 | lr 1.14e-03 | (9889.85 ms | 53013 tok/s) step 3569/76294 | train loss 3.597397 | norm 0.1672 | lr 1.14e-03 | (9898.70 ms | 52965 tok/s) step 3570/76294 | train loss 3.423432 | norm 0.1645 | lr 1.14e-03 | (9888.05 ms | 53022 tok/s) step 3571/76294 | train loss 3.432365 | norm 
0.1867 | lr 1.14e-03 | (9895.27 ms | 52984 tok/s) step 3572/76294 | train loss 3.420798 | norm 0.1684 | lr 1.14e-03 | (9953.36 ms | 52674 tok/s) step 3573/76294 | train loss 3.482242 | norm 0.2008 | lr 1.14e-03 | (9903.67 ms | 52939 tok/s) step 3574/76294 | train loss 3.416907 | norm 0.1747 | lr 1.14e-03 | (9950.76 ms | 52688 tok/s) step 3575/76294 | train loss 3.506926 | norm 0.1677 | lr 1.14e-03 | (9942.04 ms | 52734 tok/s) step 3576/76294 | train loss 3.510535 | norm 0.1835 | lr 1.14e-03 | (9949.07 ms | 52697 tok/s) step 3577/76294 | train loss 3.496873 | norm 0.1687 | lr 1.14e-03 | (9894.86 ms | 52986 tok/s) step 3578/76294 | train loss 3.521951 | norm 0.1764 | lr 1.14e-03 | (9880.73 ms | 53062 tok/s) step 3579/76294 | train loss 3.435040 | norm 0.1408 | lr 1.14e-03 | (9891.00 ms | 53007 tok/s) step 3580/76294 | train loss 3.518078 | norm 0.1688 | lr 1.14e-03 | (9887.29 ms | 53026 tok/s) step 3581/76294 | train loss 3.412228 | norm 0.1538 | lr 1.14e-03 | (9893.40 ms | 52994 tok/s) step 3582/76294 | train loss 3.563305 | norm 0.1682 | lr 1.14e-03 | (9884.17 ms | 53043 tok/s) step 3583/76294 | train loss 3.455639 | norm 0.1703 | lr 1.14e-03 | (9900.55 ms | 52955 tok/s) step 3584/76294 | train loss 3.500072 | norm 0.1577 | lr 1.14e-03 | (9885.72 ms | 53035 tok/s) step 3585/76294 | train loss 3.524467 | norm 0.1672 | lr 1.14e-03 | (9893.17 ms | 52995 tok/s) step 3586/76294 | train loss 3.483935 | norm 0.1525 | lr 1.14e-03 | (9894.11 ms | 52990 tok/s) step 3587/76294 | train loss 3.485530 | norm 0.1778 | lr 1.14e-03 | (9891.65 ms | 53003 tok/s) step 3588/76294 | train loss 3.431431 | norm 0.1540 | lr 1.14e-03 | (9902.01 ms | 52948 tok/s) step 3589/76294 | train loss 3.477209 | norm 0.1473 | lr 1.14e-03 | (9935.80 ms | 52768 tok/s) step 3590/76294 | train loss 3.422278 | norm 0.1490 | lr 1.14e-03 | (9892.10 ms | 53001 tok/s) step 3591/76294 | train loss 3.433049 | norm 0.1429 | lr 1.14e-03 | (9898.21 ms | 52968 tok/s) step 3592/76294 | train loss 3.481371 | norm 
0.1360 | lr 1.14e-03 | (9891.67 ms | 53003 tok/s) step 3593/76294 | train loss 3.449919 | norm 0.1528 | lr 1.14e-03 | (9929.62 ms | 52800 tok/s) step 3594/76294 | train loss 3.419691 | norm 0.1216 | lr 1.14e-03 | (9885.70 ms | 53035 tok/s) step 3595/76294 | train loss 3.439009 | norm 0.1341 | lr 1.14e-03 | (9913.76 ms | 52885 tok/s) step 3596/76294 | train loss 3.634711 | norm 0.1415 | lr 1.14e-03 | (9890.72 ms | 53008 tok/s) step 3597/76294 | train loss 3.483417 | norm 0.1829 | lr 1.14e-03 | (9934.77 ms | 52773 tok/s) step 3598/76294 | train loss 3.450187 | norm 0.1765 | lr 1.14e-03 | (9889.10 ms | 53017 tok/s) step 3599/76294 | train loss 3.464649 | norm 0.1454 | lr 1.14e-03 | (9905.41 ms | 52929 tok/s) step 3600/76294 | train loss 3.474979 | norm 0.1599 | lr 1.14e-03 | (9884.93 ms | 53039 tok/s) step 3601/76294 | train loss 3.579641 | norm 0.1974 | lr 1.14e-03 | (9914.97 ms | 52878 tok/s) step 3602/76294 | train loss 3.446610 | norm 0.2369 | lr 1.14e-03 | (9887.82 ms | 53024 tok/s) step 3603/76294 | train loss 3.419532 | norm 0.1874 | lr 1.14e-03 | (9899.50 ms | 52961 tok/s) step 3604/76294 | train loss 3.524364 | norm 0.1856 | lr 1.14e-03 | (9889.47 ms | 53015 tok/s) step 3605/76294 | train loss 3.406661 | norm 0.2147 | lr 1.14e-03 | (9898.14 ms | 52968 tok/s) step 3606/76294 | train loss 3.366597 | norm 0.1857 | lr 1.14e-03 | (9890.36 ms | 53010 tok/s) step 3607/76294 | train loss 3.482319 | norm 0.1842 | lr 1.14e-03 | (9898.22 ms | 52968 tok/s) step 3608/76294 | train loss 3.434233 | norm 0.1834 | lr 1.14e-03 | (9890.94 ms | 53007 tok/s) step 3609/76294 | train loss 3.435633 | norm 0.1801 | lr 1.14e-03 | (9898.92 ms | 52964 tok/s) step 3610/76294 | train loss 3.460047 | norm 0.1696 | lr 1.14e-03 | (11364.60 ms | 46133 tok/s) step 3611/76294 | train loss 3.368646 | norm 0.1489 | lr 1.14e-03 | (9917.81 ms | 52863 tok/s) step 3612/76294 | train loss 3.512676 | norm 0.1760 | lr 1.14e-03 | (9889.99 ms | 53012 tok/s) step 3613/76294 | train loss 3.483517 | norm 
0.1773 | lr 1.14e-03 | (9907.20 ms | 52920 tok/s) step 3614/76294 | train loss 3.460852 | norm 0.1667 | lr 1.14e-03 | (9927.28 ms | 52813 tok/s) step 3615/76294 | train loss 3.427245 | norm 0.1817 | lr 1.14e-03 | (9896.73 ms | 52976 tok/s) step 3616/76294 | train loss 3.435215 | norm 0.1799 | lr 1.14e-03 | (9898.95 ms | 52964 tok/s) step 3617/76294 | train loss 3.494488 | norm 0.1791 | lr 1.14e-03 | (9958.79 ms | 52646 tok/s) step 3618/76294 | train loss 3.400612 | norm 0.1737 | lr 1.14e-03 | (9903.51 ms | 52940 tok/s) step 3619/76294 | train loss 3.446601 | norm 0.1725 | lr 1.14e-03 | (9948.04 ms | 52703 tok/s) step 3620/76294 | train loss 3.452622 | norm 0.1824 | lr 1.14e-03 | (9917.79 ms | 52863 tok/s) step 3621/76294 | train loss 3.448743 | norm 0.1513 | lr 1.14e-03 | (9966.47 ms | 52605 tok/s) step 3622/76294 | train loss 3.470101 | norm 0.1764 | lr 1.14e-03 | (9904.96 ms | 52932 tok/s) step 3623/76294 | train loss 3.460637 | norm 0.1561 | lr 1.14e-03 | (9951.23 ms | 52686 tok/s) step 3624/76294 | train loss 3.365009 | norm 0.1615 | lr 1.14e-03 | (9898.09 ms | 52969 tok/s) step 3625/76294 | train loss 3.479236 | norm 0.1476 | lr 1.14e-03 | (9928.79 ms | 52805 tok/s) step 3626/76294 | train loss 3.428514 | norm 0.1527 | lr 1.14e-03 | (9934.07 ms | 52777 tok/s) step 3627/76294 | train loss 3.408446 | norm 0.1483 | lr 1.14e-03 | (9892.92 ms | 52996 tok/s) step 3628/76294 | train loss 3.488661 | norm 0.1498 | lr 1.14e-03 | (9899.62 ms | 52960 tok/s) step 3629/76294 | train loss 3.475034 | norm 0.1474 | lr 1.14e-03 | (9897.17 ms | 52974 tok/s) step 3630/76294 | train loss 3.501765 | norm 0.1498 | lr 1.14e-03 | (9905.35 ms | 52930 tok/s) step 3631/76294 | train loss 3.468972 | norm 0.1619 | lr 1.14e-03 | (9896.20 ms | 52979 tok/s) step 3632/76294 | train loss 3.470543 | norm 0.1354 | lr 1.14e-03 | (9893.43 ms | 52994 tok/s) step 3633/76294 | train loss 3.491086 | norm 0.1493 | lr 1.14e-03 | (9887.88 ms | 53023 tok/s) step 3634/76294 | train loss 3.488567 | norm 
0.1426 | lr 1.14e-03 | (9892.43 ms | 52999 tok/s) step 3635/76294 | train loss 3.416899 | norm 0.1539 | lr 1.14e-03 | (9936.35 ms | 52765 tok/s) step 3636/76294 | train loss 3.518343 | norm 0.1602 | lr 1.14e-03 | (9890.84 ms | 53007 tok/s) step 3637/76294 | train loss 3.458083 | norm 0.1520 | lr 1.14e-03 | (9898.39 ms | 52967 tok/s) step 3638/76294 | train loss 3.477199 | norm 0.1686 | lr 1.14e-03 | (9890.62 ms | 53009 tok/s) step 3639/76294 | train loss 3.483484 | norm 0.1465 | lr 1.14e-03 | (9902.81 ms | 52943 tok/s) step 3640/76294 | train loss 3.492110 | norm 0.1606 | lr 1.14e-03 | (9899.28 ms | 52962 tok/s) step 3641/76294 | train loss 3.394854 | norm 0.1612 | lr 1.14e-03 | (9897.75 ms | 52970 tok/s) step 3642/76294 | train loss 3.562093 | norm 0.1681 | lr 1.14e-03 | (9889.02 ms | 53017 tok/s) step 3643/76294 | train loss 3.468917 | norm 0.1861 | lr 1.14e-03 | (9905.42 ms | 52929 tok/s) step 3644/76294 | train loss 3.407067 | norm 0.1620 | lr 1.14e-03 | (9892.29 ms | 53000 tok/s) step 3645/76294 | train loss 3.474504 | norm 0.1581 | lr 1.14e-03 | (9923.26 ms | 52834 tok/s) step 3646/76294 | train loss 3.472741 | norm 0.1621 | lr 1.14e-03 | (9889.74 ms | 53013 tok/s) step 3647/76294 | train loss 3.509876 | norm 0.1765 | lr 1.14e-03 | (9905.42 ms | 52929 tok/s) step 3648/76294 | train loss 3.516021 | norm 0.1554 | lr 1.14e-03 | (9931.95 ms | 52788 tok/s) step 3649/76294 | train loss 3.403278 | norm 0.1674 | lr 1.14e-03 | (9896.59 ms | 52977 tok/s) step 3650/76294 | train loss 3.444160 | norm 0.1538 | lr 1.14e-03 | (9896.80 ms | 52976 tok/s) step 3651/76294 | train loss 3.489742 | norm 0.1577 | lr 1.14e-03 | (9892.57 ms | 52998 tok/s) step 3652/76294 | train loss 3.479459 | norm 0.1764 | lr 1.14e-03 | (9897.41 ms | 52972 tok/s) step 3653/76294 | train loss 3.613068 | norm 0.1970 | lr 1.14e-03 | (9889.13 ms | 53017 tok/s) step 3654/76294 | train loss 3.441909 | norm 0.1952 | lr 1.14e-03 | (9894.32 ms | 52989 tok/s) step 3655/76294 | train loss 3.502745 | norm 
0.2065 | lr 1.14e-03 | (9891.72 ms | 53003 tok/s) step 3656/76294 | train loss 3.462213 | norm 0.2359 | lr 1.14e-03 | (9898.30 ms | 52967 tok/s) step 3657/76294 | train loss 3.423973 | norm 0.1920 | lr 1.14e-03 | (9887.13 ms | 53027 tok/s) step 3658/76294 | train loss 3.477563 | norm 0.2092 | lr 1.14e-03 | (9900.18 ms | 52957 tok/s) step 3659/76294 | train loss 3.457173 | norm 0.1838 | lr 1.14e-03 | (9889.78 ms | 53013 tok/s) step 3660/76294 | train loss 3.438525 | norm 0.1762 | lr 1.14e-03 | (9897.33 ms | 52973 tok/s) step 3661/76294 | train loss 3.488364 | norm 0.1811 | lr 1.14e-03 | (9900.13 ms | 52958 tok/s) step 3662/76294 | train loss 3.443446 | norm 0.1625 | lr 1.14e-03 | (9929.16 ms | 52803 tok/s) step 3663/76294 | train loss 3.516801 | norm 0.1668 | lr 1.14e-03 | (9892.75 ms | 52997 tok/s) step 3664/76294 | train loss 3.426699 | norm 0.1588 | lr 1.14e-03 | (9919.19 ms | 52856 tok/s) step 3665/76294 | train loss 3.466417 | norm 0.1786 | lr 1.14e-03 | (9891.16 ms | 53006 tok/s) step 3666/76294 | train loss 3.409380 | norm 0.1792 | lr 1.14e-03 | (9927.26 ms | 52813 tok/s) step 3667/76294 | train loss 3.454554 | norm 0.2017 | lr 1.14e-03 | (9891.61 ms | 53003 tok/s) step 3668/76294 | train loss 3.452923 | norm 0.1466 | lr 1.14e-03 | (9968.64 ms | 52594 tok/s) step 3669/76294 | train loss 3.476718 | norm 0.1839 | lr 1.14e-03 | (9895.61 ms | 52982 tok/s) step 3670/76294 | train loss 3.396192 | norm 0.1691 | lr 1.14e-03 | (9959.07 ms | 52644 tok/s) step 3671/76294 | train loss 3.442699 | norm 0.1579 | lr 1.14e-03 | (9896.27 ms | 52978 tok/s) step 3672/76294 | train loss 3.426533 | norm 0.1872 | lr 1.14e-03 | (9890.72 ms | 53008 tok/s) step 3673/76294 | train loss 3.453209 | norm 0.1574 | lr 1.14e-03 | (9932.40 ms | 52786 tok/s) step 3674/76294 | train loss 3.449767 | norm 0.1567 | lr 1.14e-03 | (9889.39 ms | 53015 tok/s) step 3675/76294 | train loss 3.479301 | norm 0.1933 | lr 1.14e-03 | (9909.76 ms | 52906 tok/s) step 3676/76294 | train loss 3.458828 | norm 
0.1691 | lr 1.14e-03 | (9887.69 ms | 53024 tok/s) step 3677/76294 | train loss 3.426756 | norm 0.1711 | lr 1.14e-03 | (9899.88 ms | 52959 tok/s) step 3678/76294 | train loss 3.467612 | norm 0.1497 | lr 1.14e-03 | (9891.11 ms | 53006 tok/s) step 3679/76294 | train loss 3.398532 | norm 0.1613 | lr 1.14e-03 | (9897.13 ms | 52974 tok/s) step 3680/76294 | train loss 3.447640 | norm 0.1563 | lr 1.14e-03 | (9893.23 ms | 52995 tok/s) step 3681/76294 | train loss 3.481454 | norm 0.1590 | lr 1.14e-03 | (9888.46 ms | 53020 tok/s) step 3682/76294 | train loss 3.478153 | norm 0.1497 | lr 1.14e-03 | (9887.18 ms | 53027 tok/s) step 3683/76294 | train loss 3.451530 | norm 0.1508 | lr 1.14e-03 | (9896.08 ms | 52979 tok/s) step 3684/76294 | train loss 3.451725 | norm 0.1554 | lr 1.14e-03 | (9912.93 ms | 52889 tok/s) step 3685/76294 | train loss 3.458747 | norm 0.1464 | lr 1.14e-03 | (9891.10 ms | 53006 tok/s) step 3686/76294 | train loss 3.396191 | norm 0.1455 | lr 1.14e-03 | (9895.73 ms | 52981 tok/s) step 3687/76294 | train loss 3.505831 | norm 0.1580 | lr 1.14e-03 | (9891.74 ms | 53003 tok/s) step 3688/76294 | train loss 3.551359 | norm 0.1640 | lr 1.14e-03 | (9890.50 ms | 53009 tok/s) step 3689/76294 | train loss 3.432097 | norm 0.1781 | lr 1.14e-03 | (9890.33 ms | 53010 tok/s) step 3690/76294 | train loss 3.416206 | norm 0.1828 | lr 1.13e-03 | (9891.11 ms | 53006 tok/s) step 3691/76294 | train loss 3.423859 | norm 0.1635 | lr 1.13e-03 | (9896.36 ms | 52978 tok/s) step 3692/76294 | train loss 3.462456 | norm 0.1611 | lr 1.13e-03 | (9928.56 ms | 52806 tok/s) step 3693/76294 | train loss 3.453923 | norm 0.1751 | lr 1.13e-03 | (9889.38 ms | 53015 tok/s) step 3694/76294 | train loss 3.470523 | norm 0.1628 | lr 1.13e-03 | (9896.76 ms | 52976 tok/s) step 3695/76294 | train loss 3.475938 | norm 0.1630 | lr 1.13e-03 | (9891.86 ms | 53002 tok/s) step 3696/76294 | train loss 3.472101 | norm 0.1775 | lr 1.13e-03 | (9896.35 ms | 52978 tok/s) step 3697/76294 | train loss 3.468276 | norm 
0.1629 | lr 1.13e-03 | (9890.96 ms | 53007 tok/s) step 3698/76294 | train loss 3.401078 | norm 0.1540 | lr 1.13e-03 | (9895.00 ms | 52985 tok/s) step 3699/76294 | train loss 3.670763 | norm 0.1941 | lr 1.13e-03 | (9904.04 ms | 52937 tok/s) step 3700/76294 | train loss 3.509580 | norm 0.1667 | lr 1.13e-03 | (9895.29 ms | 52984 tok/s) step 3701/76294 | train loss 3.424216 | norm 0.1778 | lr 1.13e-03 | (9893.41 ms | 52994 tok/s) step 3702/76294 | train loss 3.390214 | norm 0.1820 | lr 1.13e-03 | (9944.27 ms | 52723 tok/s) step 3703/76294 | train loss 3.482874 | norm 0.1730 | lr 1.13e-03 | (9890.09 ms | 53011 tok/s) step 3704/76294 | train loss 3.433306 | norm 0.1773 | lr 1.13e-03 | (9909.78 ms | 52906 tok/s) step 3705/76294 | train loss 3.424432 | norm 0.1621 | lr 1.13e-03 | (9889.66 ms | 53014 tok/s) step 3706/76294 | train loss 3.433837 | norm 0.1640 | lr 1.13e-03 | (9953.59 ms | 52673 tok/s) step 3707/76294 | train loss 3.435448 | norm 0.1424 | lr 1.13e-03 | (9922.79 ms | 52837 tok/s) step 3708/76294 | train loss 3.400797 | norm 0.1547 | lr 1.13e-03 | (10935.00 ms | 47946 tok/s) step 3709/76294 | train loss 3.430169 | norm 0.1478 | lr 1.13e-03 | (11589.18 ms | 45239 tok/s) step 3710/76294 | train loss 3.465266 | norm 0.1619 | lr 1.13e-03 | (9869.80 ms | 53120 tok/s) step 3711/76294 | train loss 3.397638 | norm 0.1572 | lr 1.13e-03 | (9890.84 ms | 53007 tok/s) step 3712/76294 | train loss 3.468434 | norm 0.1338 | lr 1.13e-03 | (9884.11 ms | 53044 tok/s) step 3713/76294 | train loss 3.479890 | norm 0.1463 | lr 1.13e-03 | (9880.30 ms | 53064 tok/s) step 3714/76294 | train loss 3.564401 | norm 0.1502 | lr 1.13e-03 | (9889.73 ms | 53013 tok/s) step 3715/76294 | train loss 3.402098 | norm 0.1583 | lr 1.13e-03 | (9881.61 ms | 53057 tok/s) step 3716/76294 | train loss 3.374314 | norm 0.1428 | lr 1.13e-03 | (9880.55 ms | 53063 tok/s) step 3717/76294 | train loss 3.364620 | norm 0.1393 | lr 1.13e-03 | (9890.75 ms | 53008 tok/s) step 3718/76294 | train loss 3.426125 | norm 
0.1533 | lr 1.13e-03 | (9887.47 ms | 53026 tok/s) step 3719/76294 | train loss 3.438008 | norm 0.1561 | lr 1.13e-03 | (9891.66 ms | 53003 tok/s) step 3720/76294 | train loss 3.419094 | norm 0.1539 | lr 1.13e-03 | (9959.47 ms | 52642 tok/s) step 3721/76294 | train loss 3.447490 | norm 0.1607 | lr 1.13e-03 | (9912.72 ms | 52890 tok/s) step 3722/76294 | train loss 3.450269 | norm 0.1614 | lr 1.13e-03 | (9885.84 ms | 53034 tok/s) step 3723/76294 | train loss 3.487792 | norm 0.1468 | lr 1.13e-03 | (9928.06 ms | 52809 tok/s) step 3724/76294 | train loss 3.491113 | norm 0.1700 | lr 1.13e-03 | (9883.13 ms | 53049 tok/s) step 3725/76294 | train loss 3.429883 | norm 0.1799 | lr 1.13e-03 | (9885.79 ms | 53035 tok/s) step 3726/76294 | train loss 3.441941 | norm 0.1493 | lr 1.13e-03 | (9892.83 ms | 52997 tok/s) step 3727/76294 | train loss 3.434109 | norm 0.1593 | lr 1.13e-03 | (9931.29 ms | 52792 tok/s) step 3728/76294 | train loss 3.477432 | norm 0.1505 | lr 1.13e-03 | (9888.42 ms | 53020 tok/s) step 3729/76294 | train loss 3.420367 | norm 0.1474 | lr 1.13e-03 | (9903.23 ms | 52941 tok/s) step 3730/76294 | train loss 3.392607 | norm 0.1507 | lr 1.13e-03 | (10277.08 ms | 51015 tok/s) step 3731/76294 | train loss 3.464459 | norm 0.1944 | lr 1.13e-03 | (9927.36 ms | 52812 tok/s) step 3732/76294 | train loss 3.396769 | norm 0.1502 | lr 1.13e-03 | (9886.39 ms | 53031 tok/s) step 3733/76294 | train loss 3.464887 | norm 0.1881 | lr 1.13e-03 | (9892.82 ms | 52997 tok/s) step 3734/76294 | train loss 3.450326 | norm 0.1896 | lr 1.13e-03 | (9890.85 ms | 53007 tok/s) step 3735/76294 | train loss 3.436945 | norm 0.1679 | lr 1.13e-03 | (9889.60 ms | 53014 tok/s) step 3736/76294 | train loss 3.362410 | norm 0.1654 | lr 1.13e-03 | (9891.97 ms | 53001 tok/s) step 3737/76294 | train loss 3.450204 | norm 0.1677 | lr 1.13e-03 | (9890.64 ms | 53009 tok/s) step 3738/76294 | train loss 3.439017 | norm 0.1720 | lr 1.13e-03 | (9906.82 ms | 52922 tok/s) step 3739/76294 | train loss 3.399907 | norm 
0.1836 | lr 1.13e-03 | (9893.44 ms | 52993 tok/s) step 3740/76294 | train loss 3.437978 | norm 0.1612 | lr 1.13e-03 | (9897.36 ms | 52972 tok/s) step 3741/76294 | train loss 3.415482 | norm 0.1873 | lr 1.13e-03 | (9893.62 ms | 52993 tok/s) step 3742/76294 | train loss 3.433962 | norm 0.2041 | lr 1.13e-03 | (9894.43 ms | 52988 tok/s) step 3743/76294 | train loss 3.445681 | norm 0.1605 | lr 1.13e-03 | (9892.29 ms | 53000 tok/s) step 3744/76294 | train loss 3.410992 | norm 0.1945 | lr 1.13e-03 | (9895.11 ms | 52985 tok/s) step 3745/76294 | train loss 3.456321 | norm 0.1668 | lr 1.13e-03 | (9891.10 ms | 53006 tok/s) step 3746/76294 | train loss 3.411839 | norm 0.2023 | lr 1.13e-03 | (9892.64 ms | 52998 tok/s) step 3747/76294 | train loss 3.424555 | norm 0.2295 | lr 1.13e-03 | (9891.17 ms | 53006 tok/s) step 3748/76294 | train loss 3.428047 | norm 0.1994 | lr 1.13e-03 | (9890.63 ms | 53009 tok/s) step 3749/76294 | train loss 3.458491 | norm 0.2092 | lr 1.13e-03 | (9891.97 ms | 53001 tok/s) step 3750/76294 | train loss 3.439170 | norm 0.2035 | lr 1.13e-03 | (9890.40 ms | 53010 tok/s) val loss: 3.432487 saving model checkpoint to ./results/gpt2-350M-gqa/step_3750.pth step 3751/76294 | train loss 3.471729 | norm 0.1835 | lr 1.13e-03 | (9976.92 ms | 52550 tok/s) step 3752/76294 | train loss 3.439197 | norm 0.1978 | lr 1.13e-03 | (9868.88 ms | 53125 tok/s) step 3753/76294 | train loss 3.433019 | norm 0.1719 | lr 1.13e-03 | (9882.58 ms | 53052 tok/s) step 3754/76294 | train loss 3.428486 | norm 0.1988 | lr 1.13e-03 | (9935.20 ms | 52771 tok/s) step 3755/76294 | train loss 3.478492 | norm 0.1598 | lr 1.13e-03 | (9880.58 ms | 53062 tok/s) step 3756/76294 | train loss 3.419648 | norm 0.1937 | lr 1.13e-03 | (9875.44 ms | 53090 tok/s) step 3757/76294 | train loss 3.352057 | norm 0.1434 | lr 1.13e-03 | (9989.34 ms | 52485 tok/s) step 3758/76294 | train loss 3.484348 | norm 0.1814 | lr 1.13e-03 | (9883.72 ms | 53046 tok/s) step 3759/76294 | train loss 3.437770 | norm 0.1928 | lr 
1.13e-03 | (9881.64 ms | 53057 tok/s) step 3760/76294 | train loss 3.389855 | norm 0.1619 | lr 1.13e-03 | (9890.00 ms | 53012 tok/s) step 3761/76294 | train loss 3.424030 | norm 0.1597 | lr 1.13e-03 | (9926.51 ms | 52817 tok/s) step 3762/76294 | train loss 3.430486 | norm 0.1619 | lr 1.13e-03 | (9886.07 ms | 53033 tok/s) step 3763/76294 | train loss 3.363198 | norm 0.1851 | lr 1.13e-03 | (9896.87 ms | 52975 tok/s) step 3764/76294 | train loss 3.417584 | norm 0.1419 | lr 1.13e-03 | (9961.52 ms | 52631 tok/s) step 3765/76294 | train loss 3.397625 | norm 0.1682 | lr 1.13e-03 | (9894.64 ms | 52987 tok/s) step 3766/76294 | train loss 3.464666 | norm 0.1534 | lr 1.13e-03 | (9974.34 ms | 52564 tok/s) step 3767/76294 | train loss 3.493442 | norm 0.1627 | lr 1.13e-03 | (9890.23 ms | 53011 tok/s) step 3768/76294 | train loss 3.441280 | norm 0.1661 | lr 1.13e-03 | (9917.95 ms | 52863 tok/s) step 3769/76294 | train loss 3.487279 | norm 0.1876 | lr 1.13e-03 | (9900.26 ms | 52957 tok/s) step 3770/76294 | train loss 3.443186 | norm 0.1459 | lr 1.13e-03 | (9902.00 ms | 52948 tok/s) step 3771/76294 | train loss 3.487599 | norm 0.2026 | lr 1.13e-03 | (9896.81 ms | 52975 tok/s) step 3772/76294 | train loss 3.488689 | norm 0.1928 | lr 1.13e-03 | (9916.16 ms | 52872 tok/s) step 3773/76294 | train loss 3.388482 | norm 0.1821 | lr 1.13e-03 | (9895.40 ms | 52983 tok/s) step 3774/76294 | train loss 3.424530 | norm 0.1635 | lr 1.13e-03 | (9899.56 ms | 52961 tok/s) step 3775/76294 | train loss 3.483023 | norm 0.1713 | lr 1.13e-03 | (9898.06 ms | 52969 tok/s) step 3776/76294 | train loss 3.372170 | norm 0.1544 | lr 1.13e-03 | (9921.04 ms | 52846 tok/s) step 3777/76294 | train loss 3.446993 | norm 0.1696 | lr 1.13e-03 | (9898.39 ms | 52967 tok/s) step 3778/76294 | train loss 3.371073 | norm 0.1449 | lr 1.13e-03 | (9896.94 ms | 52975 tok/s) step 3779/76294 | train loss 3.422827 | norm 0.1823 | lr 1.13e-03 | (9897.13 ms | 52974 tok/s) step 3780/76294 | train loss 3.433698 | norm 0.1583 | lr 
1.13e-03 | (9945.27 ms | 52717 tok/s) step 3781/76294 | train loss 3.477303 | norm 0.1952 | lr 1.13e-03 | (9894.93 ms | 52986 tok/s) step 3782/76294 | train loss 3.495975 | norm 0.1710 | lr 1.13e-03 | (9894.64 ms | 52987 tok/s) step 3783/76294 | train loss 3.356318 | norm 0.1896 | lr 1.13e-03 | (9904.34 ms | 52935 tok/s) step 3784/76294 | train loss 3.439992 | norm 0.1896 | lr 1.13e-03 | (9887.10 ms | 53027 tok/s) step 3785/76294 | train loss 3.447284 | norm 0.1720 | lr 1.13e-03 | (9901.04 ms | 52953 tok/s) step 3786/76294 | train loss 3.420285 | norm 0.1787 | lr 1.13e-03 | (9891.21 ms | 53005 tok/s) step 3787/76294 | train loss 3.405358 | norm 0.1741 | lr 1.13e-03 | (9917.35 ms | 52866 tok/s) step 3788/76294 | train loss 3.409261 | norm 0.1649 | lr 1.13e-03 | (9892.05 ms | 53001 tok/s) step 3789/76294 | train loss 3.426109 | norm 0.1737 | lr 1.13e-03 | (9901.72 ms | 52949 tok/s) step 3790/76294 | train loss 3.485271 | norm 0.1462 | lr 1.13e-03 | (9925.46 ms | 52823 tok/s) step 3791/76294 | train loss 3.386268 | norm 0.1807 | lr 1.13e-03 | (9896.63 ms | 52976 tok/s) step 3792/76294 | train loss 3.458877 | norm 0.1599 | lr 1.13e-03 | (9905.69 ms | 52928 tok/s) step 3793/76294 | train loss 3.456461 | norm 0.1642 | lr 1.13e-03 | (9975.24 ms | 52559 tok/s) step 3794/76294 | train loss 3.417320 | norm 0.1643 | lr 1.13e-03 | (9891.39 ms | 53005 tok/s) step 3795/76294 | train loss 3.448769 | norm 0.1561 | lr 1.13e-03 | (10056.57 ms | 52134 tok/s) step 3796/76294 | train loss 3.378575 | norm 0.1805 | lr 1.13e-03 | (9887.29 ms | 53026 tok/s) step 3797/76294 | train loss 3.498958 | norm 0.1661 | lr 1.13e-03 | (9901.32 ms | 52951 tok/s) step 3798/76294 | train loss 3.453812 | norm 0.1742 | lr 1.13e-03 | (9890.15 ms | 53011 tok/s) step 3799/76294 | train loss 3.407144 | norm 0.1637 | lr 1.13e-03 | (9903.57 ms | 52939 tok/s) step 3800/76294 | train loss 3.480173 | norm 0.1741 | lr 1.13e-03 | (9893.24 ms | 52995 tok/s) step 3801/76294 | train loss 3.349383 | norm 0.1820 | lr 
1.13e-03 | (9919.82 ms | 52853 tok/s) step 3802/76294 | train loss 3.464342 | norm 0.1766 | lr 1.13e-03 | (9892.77 ms | 52997 tok/s) step 3803/76294 | train loss 3.372452 | norm 0.1736 | lr 1.13e-03 | (9897.22 ms | 52973 tok/s) step 3804/76294 | train loss 3.437795 | norm 0.1865 | lr 1.13e-03 | (9892.48 ms | 52999 tok/s) step 3805/76294 | train loss 3.441570 | norm 0.1851 | lr 1.13e-03 | (9903.65 ms | 52939 tok/s) step 3806/76294 | train loss 3.428784 | norm 0.1551 | lr 1.13e-03 | (11252.11 ms | 46595 tok/s) step 3807/76294 | train loss 3.433337 | norm 0.1731 | lr 1.13e-03 | (9886.72 ms | 53029 tok/s) step 3808/76294 | train loss 3.346635 | norm 0.1646 | lr 1.13e-03 | (9912.18 ms | 52893 tok/s) step 3809/76294 | train loss 3.406696 | norm 0.1717 | lr 1.13e-03 | (9887.33 ms | 53026 tok/s) step 3810/76294 | train loss 3.527465 | norm 0.1483 | lr 1.13e-03 | (9891.62 ms | 53003 tok/s) step 3811/76294 | train loss 3.507682 | norm 0.1676 | lr 1.13e-03 | (9934.31 ms | 52775 tok/s) step 3812/76294 | train loss 3.458705 | norm 0.1552 | lr 1.13e-03 | (9919.67 ms | 52853 tok/s) step 3813/76294 | train loss 3.499891 | norm 0.1758 | lr 1.13e-03 | (9891.46 ms | 53004 tok/s) step 3814/76294 | train loss 3.389127 | norm 0.1789 | lr 1.13e-03 | (9942.43 ms | 52732 tok/s) step 3815/76294 | train loss 3.459610 | norm 0.1740 | lr 1.13e-03 | (9922.43 ms | 52839 tok/s) step 3816/76294 | train loss 3.431065 | norm 0.1581 | lr 1.13e-03 | (9884.59 ms | 53041 tok/s) step 3817/76294 | train loss 3.443157 | norm 0.1834 | lr 1.13e-03 | (9897.52 ms | 52972 tok/s) step 3818/76294 | train loss 3.413706 | norm 0.1746 | lr 1.13e-03 | (9928.89 ms | 52804 tok/s) step 3819/76294 | train loss 3.365212 | norm 0.1717 | lr 1.13e-03 | (9892.03 ms | 53001 tok/s) step 3820/76294 | train loss 3.406600 | norm 0.1857 | lr 1.13e-03 | (9893.37 ms | 52994 tok/s) step 3821/76294 | train loss 3.493984 | norm 0.1974 | lr 1.13e-03 | (9889.45 ms | 53015 tok/s) step 3822/76294 | train loss 3.486983 | norm 0.1868 | lr 
1.13e-03 | (9922.40 ms | 52839 tok/s) step 3823/76294 | train loss 3.471151 | norm 0.1800 | lr 1.13e-03 | (9886.45 ms | 53031 tok/s) step 3824/76294 | train loss 3.435878 | norm 0.1673 | lr 1.13e-03 | (9886.79 ms | 53029 tok/s) step 3825/76294 | train loss 3.450253 | norm 0.1995 | lr 1.13e-03 | (9964.06 ms | 52618 tok/s) step 3826/76294 | train loss 3.437335 | norm 0.2146 | lr 1.13e-03 | (9900.03 ms | 52958 tok/s) step 3827/76294 | train loss 3.363018 | norm 0.1973 | lr 1.13e-03 | (9932.08 ms | 52787 tok/s) step 3828/76294 | train loss 3.468404 | norm 0.2239 | lr 1.13e-03 | (9989.55 ms | 52484 tok/s) step 3829/76294 | train loss 3.384459 | norm 0.2004 | lr 1.13e-03 | (9924.86 ms | 52826 tok/s) step 3830/76294 | train loss 3.437989 | norm 0.1942 | lr 1.13e-03 | (9902.45 ms | 52945 tok/s) step 3831/76294 | train loss 3.426006 | norm 0.1945 | lr 1.13e-03 | (9895.31 ms | 52984 tok/s) step 3832/76294 | train loss 3.318697 | norm 0.1571 | lr 1.13e-03 | (9889.99 ms | 53012 tok/s) step 3833/76294 | train loss 3.469692 | norm 0.1977 | lr 1.13e-03 | (9897.16 ms | 52974 tok/s) step 3834/76294 | train loss 3.397641 | norm 0.2008 | lr 1.13e-03 | (9891.65 ms | 53003 tok/s) step 3835/76294 | train loss 3.450244 | norm 0.1677 | lr 1.13e-03 | (9894.93 ms | 52986 tok/s) step 3836/76294 | train loss 3.469519 | norm 0.1839 | lr 1.13e-03 | (9950.71 ms | 52689 tok/s) step 3837/76294 | train loss 3.666060 | norm 0.1983 | lr 1.13e-03 | (9894.82 ms | 52986 tok/s) step 3838/76294 | train loss 3.435004 | norm 0.1763 | lr 1.13e-03 | (9892.61 ms | 52998 tok/s) step 3839/76294 | train loss 3.401291 | norm 0.1707 | lr 1.13e-03 | (9918.87 ms | 52858 tok/s) step 3840/76294 | train loss 3.421521 | norm 0.1612 | lr 1.13e-03 | (9892.45 ms | 52999 tok/s) step 3841/76294 | train loss 3.411159 | norm 0.1665 | lr 1.13e-03 | (9901.51 ms | 52950 tok/s) step 3842/76294 | train loss 3.424599 | norm 0.1442 | lr 1.13e-03 | (9890.28 ms | 53010 tok/s) step 3843/76294 | train loss 3.455817 | norm 0.1519 | lr 
1.13e-03 | (9894.60 ms | 52987 tok/s) step 3844/76294 | train loss 3.378049 | norm 0.1528 | lr 1.13e-03 | (9889.40 ms | 53015 tok/s) step 3845/76294 | train loss 3.479087 | norm 0.1579 | lr 1.13e-03 | (9901.01 ms | 52953 tok/s) step 3846/76294 | train loss 3.435117 | norm 0.1364 | lr 1.13e-03 | (9919.81 ms | 52853 tok/s) step 3847/76294 | train loss 3.370746 | norm 0.1389 | lr 1.13e-03 | (9956.02 ms | 52660 tok/s) step 3848/76294 | train loss 3.458922 | norm 0.1348 | lr 1.13e-03 | (9890.91 ms | 53007 tok/s) step 3849/76294 | train loss 3.396271 | norm 0.1574 | lr 1.13e-03 | (9933.77 ms | 52778 tok/s) step 3850/76294 | train loss 3.533339 | norm 0.1756 | lr 1.13e-03 | (9889.93 ms | 53012 tok/s) step 3851/76294 | train loss 3.414244 | norm 0.1736 | lr 1.13e-03 | (9898.46 ms | 52967 tok/s) step 3852/76294 | train loss 3.396405 | norm 0.1571 | lr 1.13e-03 | (9891.10 ms | 53006 tok/s) step 3853/76294 | train loss 3.484328 | norm 0.1511 | lr 1.13e-03 | (9895.58 ms | 52982 tok/s) step 3854/76294 | train loss 3.381314 | norm 0.1478 | lr 1.13e-03 | (9887.96 ms | 53023 tok/s) step 3855/76294 | train loss 3.409806 | norm 0.1853 | lr 1.13e-03 | (9894.13 ms | 52990 tok/s) step 3856/76294 | train loss 3.412433 | norm 0.1408 | lr 1.13e-03 | (9887.66 ms | 53024 tok/s) step 3857/76294 | train loss 3.379505 | norm 0.1684 | lr 1.13e-03 | (9905.07 ms | 52931 tok/s) step 3858/76294 | train loss 3.456997 | norm 0.1722 | lr 1.13e-03 | (9890.99 ms | 53007 tok/s) step 3859/76294 | train loss 3.378606 | norm 0.1569 | lr 1.13e-03 | (9922.70 ms | 52837 tok/s) step 3860/76294 | train loss 3.406192 | norm 0.1558 | lr 1.13e-03 | (9953.74 ms | 52672 tok/s) step 3861/76294 | train loss 3.382232 | norm 0.1539 | lr 1.13e-03 | (9894.15 ms | 52990 tok/s) step 3862/76294 | train loss 3.374279 | norm 0.1621 | lr 1.13e-03 | (9968.00 ms | 52597 tok/s) step 3863/76294 | train loss 3.513944 | norm 0.1499 | lr 1.13e-03 | (9891.41 ms | 53004 tok/s) step 3864/76294 | train loss 3.409326 | norm 0.1589 | lr 
1.13e-03 | (9887.69 ms | 53024 tok/s) step 3865/76294 | train loss 3.389470 | norm 0.1473 | lr 1.13e-03 | (9939.03 ms | 52750 tok/s) step 3866/76294 | train loss 3.398313 | norm 0.1492 | lr 1.13e-03 | (9905.13 ms | 52931 tok/s) step 3867/76294 | train loss 3.496784 | norm 0.1729 | lr 1.13e-03 | (9963.31 ms | 52622 tok/s) step 3868/76294 | train loss 3.453905 | norm 0.1516 | lr 1.13e-03 | (9894.80 ms | 52986 tok/s) step 3869/76294 | train loss 3.411657 | norm 0.1570 | lr 1.13e-03 | (9897.74 ms | 52970 tok/s) step 3870/76294 | train loss 3.421689 | norm 0.1542 | lr 1.13e-03 | (9928.87 ms | 52804 tok/s) step 3871/76294 | train loss 3.402137 | norm 0.1515 | lr 1.13e-03 | (9896.11 ms | 52979 tok/s) step 3872/76294 | train loss 3.460216 | norm 0.1497 | lr 1.13e-03 | (9943.86 ms | 52725 tok/s) step 3873/76294 | train loss 3.392118 | norm 0.1472 | lr 1.13e-03 | (9893.43 ms | 52994 tok/s) step 3874/76294 | train loss 3.434577 | norm 0.1616 | lr 1.13e-03 | (9894.64 ms | 52987 tok/s) step 3875/76294 | train loss 3.459268 | norm 0.1747 | lr 1.13e-03 | (9921.14 ms | 52846 tok/s) step 3876/76294 | train loss 3.420786 | norm 0.1666 | lr 1.13e-03 | (9893.44 ms | 52993 tok/s) step 3877/76294 | train loss 3.430130 | norm 0.1621 | lr 1.13e-03 | (9984.27 ms | 52511 tok/s) step 3878/76294 | train loss 3.426096 | norm 0.1416 | lr 1.13e-03 | (9889.31 ms | 53016 tok/s) step 3879/76294 | train loss 3.402762 | norm 0.1611 | lr 1.13e-03 | (10054.45 ms | 52145 tok/s) step 3880/76294 | train loss 3.399549 | norm 0.1656 | lr 1.13e-03 | (9909.97 ms | 52905 tok/s) step 3881/76294 | train loss 3.482866 | norm 0.1624 | lr 1.13e-03 | (9890.69 ms | 53008 tok/s) step 3882/76294 | train loss 3.424246 | norm 0.1637 | lr 1.13e-03 | (9889.30 ms | 53016 tok/s) step 3883/76294 | train loss 3.341463 | norm 0.1610 | lr 1.13e-03 | (9926.20 ms | 52819 tok/s) step 3884/76294 | train loss 3.391542 | norm 0.1531 | lr 1.13e-03 | (9891.68 ms | 53003 tok/s) step 3885/76294 | train loss 3.489298 | norm 0.1652 | lr 
1.13e-03 | (9908.77 ms | 52912 tok/s) step 3886/76294 | train loss 3.392091 | norm 0.1590 | lr 1.13e-03 | (9888.17 ms | 53022 tok/s) step 3887/76294 | train loss 3.420723 | norm 0.1641 | lr 1.13e-03 | (9908.28 ms | 52914 tok/s) step 3888/76294 | train loss 3.442868 | norm 0.1624 | lr 1.13e-03 | (9890.87 ms | 53007 tok/s) step 3889/76294 | train loss 3.397013 | norm 0.2059 | lr 1.13e-03 | (9917.83 ms | 52863 tok/s) step 3890/76294 | train loss 3.405566 | norm 0.1648 | lr 1.13e-03 | (9955.27 ms | 52664 tok/s) step 3891/76294 | train loss 3.380030 | norm 0.1959 | lr 1.13e-03 | (9895.74 ms | 52981 tok/s) step 3892/76294 | train loss 3.430601 | norm 0.1431 | lr 1.13e-03 | (9893.19 ms | 52995 tok/s) step 3893/76294 | train loss 3.412218 | norm 0.1706 | lr 1.13e-03 | (9912.32 ms | 52893 tok/s) step 3894/76294 | train loss 3.362504 | norm 0.1536 | lr 1.13e-03 | (9889.15 ms | 53017 tok/s) step 3895/76294 | train loss 3.423172 | norm 0.1617 | lr 1.13e-03 | (10962.42 ms | 47826 tok/s) step 3896/76294 | train loss 3.455037 | norm 0.1630 | lr 1.13e-03 | (9883.02 ms | 53049 tok/s) step 3897/76294 | train loss 3.417107 | norm 0.1588 | lr 1.13e-03 | (9892.81 ms | 52997 tok/s) step 3898/76294 | train loss 3.425638 | norm 0.1613 | lr 1.13e-03 | (9881.80 ms | 53056 tok/s) step 3899/76294 | train loss 3.459657 | norm 0.1576 | lr 1.13e-03 | (9894.22 ms | 52989 tok/s) step 3900/76294 | train loss 3.362193 | norm 0.1351 | lr 1.13e-03 | (9884.88 ms | 53039 tok/s) step 3901/76294 | train loss 3.453891 | norm 0.1739 | lr 1.13e-03 | (9898.06 ms | 52969 tok/s) step 3902/76294 | train loss 3.380282 | norm 0.1526 | lr 1.13e-03 | (9887.25 ms | 53027 tok/s) step 3903/76294 | train loss 3.461733 | norm 0.1444 | lr 1.13e-03 | (11259.51 ms | 46564 tok/s) step 3904/76294 | train loss 3.393810 | norm 0.1747 | lr 1.13e-03 | (9871.50 ms | 53111 tok/s) step 3905/76294 | train loss 3.383381 | norm 0.1705 | lr 1.13e-03 | (9905.82 ms | 52927 tok/s) step 3906/76294 | train loss 3.465001 | norm 0.1479 | lr 
1.13e-03 | (9908.14 ms | 52915 tok/s) step 3907/76294 | train loss 3.380779 | norm 0.1642 | lr 1.13e-03 | (9892.71 ms | 52997 tok/s) step 3908/76294 | train loss 3.418316 | norm 0.1492 | lr 1.13e-03 | (9885.20 ms | 53038 tok/s) step 3909/76294 | train loss 3.407089 | norm 0.1466 | lr 1.13e-03 | (9893.67 ms | 52992 tok/s) step 3910/76294 | train loss 3.406801 | norm 0.1384 | lr 1.13e-03 | (9887.86 ms | 53023 tok/s) step 3911/76294 | train loss 3.329962 | norm 0.1466 | lr 1.13e-03 | (9896.63 ms | 52976 tok/s) step 3912/76294 | train loss 3.446198 | norm 0.1503 | lr 1.13e-03 | (9940.34 ms | 52743 tok/s) step 3913/76294 | train loss 3.393952 | norm 0.1534 | lr 1.13e-03 | (9895.12 ms | 52985 tok/s) step 3914/76294 | train loss 3.430218 | norm 0.1836 | lr 1.13e-03 | (9975.64 ms | 52557 tok/s) step 3915/76294 | train loss 3.454528 | norm 0.1661 | lr 1.13e-03 | (9891.79 ms | 53002 tok/s) step 3916/76294 | train loss 3.380107 | norm 0.1780 | lr 1.12e-03 | (9895.55 ms | 52982 tok/s) step 3917/76294 | train loss 3.411324 | norm 0.1623 | lr 1.12e-03 | (9896.92 ms | 52975 tok/s) step 3918/76294 | train loss 3.416374 | norm 0.1812 | lr 1.12e-03 | (10449.58 ms | 50173 tok/s) step 3919/76294 | train loss 3.404000 | norm 0.3963 | lr 1.12e-03 | (9950.81 ms | 52688 tok/s) step 3920/76294 | train loss 3.445983 | norm 0.2498 | lr 1.12e-03 | (9888.84 ms | 53018 tok/s) step 3921/76294 | train loss 3.478278 | norm 0.2173 | lr 1.12e-03 | (10200.21 ms | 51400 tok/s) step 3922/76294 | train loss 3.474184 | norm 0.2084 | lr 1.12e-03 | (9895.68 ms | 52981 tok/s) step 3923/76294 | train loss 3.412595 | norm 0.1733 | lr 1.12e-03 | (9894.58 ms | 52987 tok/s) step 3924/76294 | train loss 3.524856 | norm 0.2464 | lr 1.12e-03 | (9931.33 ms | 52791 tok/s) step 3925/76294 | train loss 3.424391 | norm 0.2024 | lr 1.12e-03 | (10857.38 ms | 48289 tok/s) step 3926/76294 | train loss 3.462713 | norm 0.1825 | lr 1.12e-03 | (10991.89 ms | 47698 tok/s) step 3927/76294 | train loss 3.463222 | norm 0.1809 | lr 
1.12e-03 | (9952.08 ms | 52681 tok/s) step 3928/76294 | train loss 3.416824 | norm 0.1585 | lr 1.12e-03 | (9880.56 ms | 53063 tok/s) step 3929/76294 | train loss 3.422646 | norm 0.1647 | lr 1.12e-03 | (9882.36 ms | 53053 tok/s) step 3930/76294 | train loss 3.473680 | norm 0.1424 | lr 1.12e-03 | (9887.91 ms | 53023 tok/s) step 3931/76294 | train loss 3.469357 | norm 0.1641 | lr 1.12e-03 | (9890.97 ms | 53007 tok/s) step 3932/76294 | train loss 3.449525 | norm 0.1561 | lr 1.12e-03 | (9890.28 ms | 53010 tok/s) step 3933/76294 | train loss 3.405286 | norm 0.1455 | lr 1.12e-03 | (9889.73 ms | 53013 tok/s) step 3934/76294 | train loss 3.430670 | norm 0.1526 | lr 1.12e-03 | (9894.00 ms | 52991 tok/s) step 3935/76294 | train loss 3.469692 | norm 0.1499 | lr 1.12e-03 | (9896.95 ms | 52975 tok/s) step 3936/76294 | train loss 3.420795 | norm 0.1563 | lr 1.12e-03 | (9888.18 ms | 53022 tok/s) step 3937/76294 | train loss 3.439443 | norm 0.1524 | lr 1.12e-03 | (9911.51 ms | 52897 tok/s) step 3938/76294 | train loss 3.472162 | norm 0.1636 | lr 1.12e-03 | (9893.09 ms | 52995 tok/s) step 3939/76294 | train loss 3.422410 | norm 0.1677 | lr 1.12e-03 | (9905.38 ms | 52930 tok/s) step 3940/76294 | train loss 3.422404 | norm 0.1760 | lr 1.12e-03 | (9899.31 ms | 52962 tok/s) step 3941/76294 | train loss 3.414809 | norm 0.1804 | lr 1.12e-03 | (9893.54 ms | 52993 tok/s) step 3942/76294 | train loss 3.395209 | norm 0.1807 | lr 1.12e-03 | (9894.85 ms | 52986 tok/s) step 3943/76294 | train loss 3.378987 | norm 0.2054 | lr 1.12e-03 | (9889.19 ms | 53016 tok/s) step 3944/76294 | train loss 3.447884 | norm 0.1738 | lr 1.12e-03 | (9891.33 ms | 53005 tok/s) step 3945/76294 | train loss 3.458570 | norm 0.1520 | lr 1.12e-03 | (9889.94 ms | 53012 tok/s) step 3946/76294 | train loss 3.454133 | norm 0.1671 | lr 1.12e-03 | (9894.13 ms | 52990 tok/s) step 3947/76294 | train loss 3.443564 | norm 0.1533 | lr 1.12e-03 | (9891.85 ms | 53002 tok/s) step 3948/76294 | train loss 3.503269 | norm 0.1570 | lr 
1.12e-03 | (9890.18 ms | 53011 tok/s) step 3949/76294 | train loss 3.386543 | norm 0.1647 | lr 1.12e-03 | (9890.47 ms | 53009 tok/s) step 3950/76294 | train loss 3.416746 | norm 0.1553 | lr 1.12e-03 | (9887.69 ms | 53024 tok/s) step 3951/76294 | train loss 3.364018 | norm 0.1513 | lr 1.12e-03 | (9885.85 ms | 53034 tok/s) step 3952/76294 | train loss 3.428603 | norm 0.1569 | lr 1.12e-03 | (9916.08 ms | 52873 tok/s) step 3953/76294 | train loss 3.415211 | norm 0.1672 | lr 1.12e-03 | (9887.27 ms | 53027 tok/s) step 3954/76294 | train loss 3.441175 | norm 0.1483 | lr 1.12e-03 | (9891.06 ms | 53006 tok/s) step 3955/76294 | train loss 3.409922 | norm 0.1618 | lr 1.12e-03 | (9890.13 ms | 53011 tok/s) step 3956/76294 | train loss 3.430764 | norm 0.1670 | lr 1.12e-03 | (9928.89 ms | 52804 tok/s) step 3957/76294 | train loss 3.459080 | norm 0.1593 | lr 1.12e-03 | (9902.85 ms | 52943 tok/s) step 3958/76294 | train loss 3.417232 | norm 0.1583 | lr 1.12e-03 | (9890.97 ms | 53007 tok/s) step 3959/76294 | train loss 3.411277 | norm 0.1658 | lr 1.12e-03 | (9889.58 ms | 53014 tok/s) step 3960/76294 | train loss 3.452608 | norm 0.1447 | lr 1.12e-03 | (9894.73 ms | 52987 tok/s) step 3961/76294 | train loss 3.470922 | norm 0.1745 | lr 1.12e-03 | (9959.51 ms | 52642 tok/s) step 3962/76294 | train loss 3.415615 | norm 0.1883 | lr 1.12e-03 | (9888.48 ms | 53020 tok/s) step 3963/76294 | train loss 3.398044 | norm 0.1523 | lr 1.12e-03 | (9899.25 ms | 52962 tok/s) step 3964/76294 | train loss 3.439100 | norm 0.1687 | lr 1.12e-03 | (9959.54 ms | 52642 tok/s) step 3965/76294 | train loss 3.441664 | norm 0.1733 | lr 1.12e-03 | (9898.65 ms | 52966 tok/s) step 3966/76294 | train loss 3.414490 | norm 0.1608 | lr 1.12e-03 | (9895.69 ms | 52981 tok/s) step 3967/76294 | train loss 3.387249 | norm 0.1546 | lr 1.12e-03 | (9932.77 ms | 52784 tok/s) step 3968/76294 | train loss 3.419408 | norm 0.1645 | lr 1.12e-03 | (9888.24 ms | 53021 tok/s) step 3969/76294 | train loss 3.437116 | norm 0.1495 | lr 
1.12e-03 | (9893.41 ms | 52994 tok/s) step 3970/76294 | train loss 3.501874 | norm 0.1488 | lr 1.12e-03 | (9950.97 ms | 52687 tok/s) step 3971/76294 | train loss 3.513108 | norm 0.1458 | lr 1.12e-03 | (9893.27 ms | 52994 tok/s) step 3972/76294 | train loss 3.408014 | norm 0.1715 | lr 1.12e-03 | (9951.53 ms | 52684 tok/s) step 3973/76294 | train loss 3.386673 | norm 0.1675 | lr 1.12e-03 | (9888.60 ms | 53019 tok/s) step 3974/76294 | train loss 3.389710 | norm 0.1461 | lr 1.12e-03 | (9924.51 ms | 52828 tok/s) step 3975/76294 | train loss 3.460424 | norm 0.1910 | lr 1.12e-03 | (9961.41 ms | 52632 tok/s) step 3976/76294 | train loss 3.446235 | norm 0.1959 | lr 1.12e-03 | (9893.25 ms | 52995 tok/s) step 3977/76294 | train loss 3.529542 | norm 0.1574 | lr 1.12e-03 | (9894.35 ms | 52989 tok/s) step 3978/76294 | train loss 3.435398 | norm 0.1519 | lr 1.12e-03 | (9931.54 ms | 52790 tok/s) step 3979/76294 | train loss 3.405567 | norm 0.1562 | lr 1.12e-03 | (9892.97 ms | 52996 tok/s) step 3980/76294 | train loss 3.453298 | norm 0.1760 | lr 1.12e-03 | (9895.90 ms | 52980 tok/s) step 3981/76294 | train loss 3.442499 | norm 0.1729 | lr 1.12e-03 | (9891.97 ms | 53001 tok/s) step 3982/76294 | train loss 3.569622 | norm 0.1564 | lr 1.12e-03 | (9894.21 ms | 52989 tok/s) step 3983/76294 | train loss 3.450473 | norm 0.1764 | lr 1.12e-03 | (9897.40 ms | 52972 tok/s) step 3984/76294 | train loss 3.479196 | norm 0.1573 | lr 1.12e-03 | (9895.32 ms | 52983 tok/s) step 3985/76294 | train loss 3.456214 | norm 0.2054 | lr 1.12e-03 | (9954.51 ms | 52668 tok/s) step 3986/76294 | train loss 3.390359 | norm 0.1929 | lr 1.12e-03 | (9894.75 ms | 52986 tok/s) step 3987/76294 | train loss 3.419279 | norm 0.1514 | lr 1.12e-03 | (9894.40 ms | 52988 tok/s) step 3988/76294 | train loss 3.397191 | norm 0.1718 | lr 1.12e-03 | (9910.56 ms | 52902 tok/s) step 3989/76294 | train loss 3.464501 | norm 0.1397 | lr 1.12e-03 | (9897.29 ms | 52973 tok/s) step 3990/76294 | train loss 3.503211 | norm 0.1866 | lr 
1.12e-03 | (9893.50 ms | 52993 tok/s) step 3991/76294 | train loss 3.425681 | norm 0.1609 | lr 1.12e-03 | (9902.42 ms | 52945 tok/s) step 3992/76294 | train loss 3.389176 | norm 0.1699 | lr 1.12e-03 | (9932.09 ms | 52787 tok/s) step 3993/76294 | train loss 3.513257 | norm 0.1603 | lr 1.12e-03 | (9892.42 ms | 52999 tok/s) step 3994/76294 | train loss 3.437088 | norm 0.1722 | lr 1.12e-03 | (9916.99 ms | 52868 tok/s) step 3995/76294 | train loss 3.478109 | norm 0.1609 | lr 1.12e-03 | (9894.72 ms | 52987 tok/s) step 3996/76294 | train loss 3.448572 | norm 0.1672 | lr 1.12e-03 | (9891.97 ms | 53001 tok/s) step 3997/76294 | train loss 3.421927 | norm 0.1484 | lr 1.12e-03 | (9935.62 ms | 52769 tok/s) step 3998/76294 | train loss 3.449236 | norm 0.1589 | lr 1.12e-03 | (9886.49 ms | 53031 tok/s) step 3999/76294 | train loss 3.509459 | norm 0.1629 | lr 1.12e-03 | (9895.78 ms | 52981 tok/s) step 4000/76294 | train loss 3.413308 | norm 0.1515 | lr 1.12e-03 | (9888.57 ms | 53020 tok/s) val loss: 3.406347 saving model checkpoint to ./results/gpt2-350M-gqa/step_4000.pth step 4001/76294 | train loss 3.498927 | norm 0.1615 | lr 1.12e-03 | (11204.15 ms | 46794 tok/s) step 4002/76294 | train loss 3.384987 | norm 0.1442 | lr 1.12e-03 | (9864.89 ms | 53147 tok/s) step 4003/76294 | train loss 3.493695 | norm 0.1640 | lr 1.12e-03 | (9889.66 ms | 53014 tok/s) step 4004/76294 | train loss 3.446279 | norm 0.1714 | lr 1.12e-03 | (9871.93 ms | 53109 tok/s) step 4005/76294 | train loss 3.416518 | norm 0.1587 | lr 1.12e-03 | (9885.06 ms | 53038 tok/s) step 4006/76294 | train loss 3.420021 | norm 0.1511 | lr 1.12e-03 | (9877.08 ms | 53081 tok/s) step 4007/76294 | train loss 3.519018 | norm 0.1821 | lr 1.12e-03 | (9892.69 ms | 52998 tok/s) step 4008/76294 | train loss 3.319020 | norm 0.1457 | lr 1.12e-03 | (9892.33 ms | 52999 tok/s) step 4009/76294 | train loss 3.420403 | norm 0.1791 | lr 1.12e-03 | (9917.62 ms | 52864 tok/s) step 4010/76294 | train loss 3.460172 | norm 0.1773 | lr 1.12e-03 | 
(9923.84 ms | 52831 tok/s) step 4011/76294 | train loss 3.424579 | norm 0.1704 | lr 1.12e-03 | (9891.85 ms | 53002 tok/s) step 4012/76294 | train loss 3.427882 | norm 0.1792 | lr 1.12e-03 | (9889.26 ms | 53016 tok/s) step 4013/76294 | train loss 3.387630 | norm 0.1702 | lr 1.12e-03 | (9899.45 ms | 52961 tok/s) step 4014/76294 | train loss 3.423853 | norm 0.1751 | lr 1.12e-03 | (9976.38 ms | 52553 tok/s) step 4015/76294 | train loss 3.494663 | norm 0.1792 | lr 1.12e-03 | (9912.18 ms | 52893 tok/s) step 4016/76294 | train loss 3.329675 | norm 0.1548 | lr 1.12e-03 | (9958.61 ms | 52647 tok/s) step 4017/76294 | train loss 3.422112 | norm 0.1577 | lr 1.12e-03 | (9900.27 ms | 52957 tok/s) step 4018/76294 | train loss 3.368196 | norm 0.1496 | lr 1.12e-03 | (9896.62 ms | 52976 tok/s) step 4019/76294 | train loss 3.408834 | norm 0.1617 | lr 1.12e-03 | (9941.64 ms | 52737 tok/s) step 4020/76294 | train loss 3.411802 | norm 0.1525 | lr 1.12e-03 | (9899.25 ms | 52962 tok/s) step 4021/76294 | train loss 3.370573 | norm 0.1641 | lr 1.12e-03 | (9905.36 ms | 52930 tok/s) step 4022/76294 | train loss 3.472381 | norm 0.1510 | lr 1.12e-03 | (9892.24 ms | 53000 tok/s) step 4023/76294 | train loss 3.439412 | norm 0.1500 | lr 1.12e-03 | (9901.63 ms | 52950 tok/s) step 4024/76294 | train loss 3.439030 | norm 0.1458 | lr 1.12e-03 | (9897.86 ms | 52970 tok/s) step 4025/76294 | train loss 3.404397 | norm 0.1320 | lr 1.12e-03 | (9912.67 ms | 52891 tok/s) step 4026/76294 | train loss 3.382031 | norm 0.1444 | lr 1.12e-03 | (9972.13 ms | 52575 tok/s) step 4027/76294 | train loss 3.439071 | norm 0.1370 | lr 1.12e-03 | (9926.45 ms | 52817 tok/s) step 4028/76294 | train loss 3.405418 | norm 0.1479 | lr 1.12e-03 | (9943.68 ms | 52726 tok/s) step 4029/76294 | train loss 3.424472 | norm 0.1461 | lr 1.12e-03 | (9899.91 ms | 52959 tok/s) step 4030/76294 | train loss 3.412181 | norm 0.1507 | lr 1.12e-03 | (9927.01 ms | 52814 tok/s) step 4031/76294 | train loss 3.394201 | norm 0.1616 | lr 1.12e-03 | 
(9889.90 ms | 53012 tok/s) step 4032/76294 | train loss 3.441321 | norm 0.1582 | lr 1.12e-03 | (9891.96 ms | 53001 tok/s) step 4033/76294 | train loss 3.678853 | norm 0.1873 | lr 1.12e-03 | (9892.71 ms | 52997 tok/s) step 4034/76294 | train loss 3.415473 | norm 0.1911 | lr 1.12e-03 | (9896.19 ms | 52979 tok/s) step 4035/76294 | train loss 3.446146 | norm 0.1558 | lr 1.12e-03 | (9915.09 ms | 52878 tok/s) step 4036/76294 | train loss 3.431158 | norm 0.1665 | lr 1.12e-03 | (9895.74 ms | 52981 tok/s) step 4037/76294 | train loss 3.513096 | norm 0.1768 | lr 1.12e-03 | (9903.81 ms | 52938 tok/s) step 4038/76294 | train loss 3.435835 | norm 0.1569 | lr 1.12e-03 | (9907.36 ms | 52919 tok/s) step 4039/76294 | train loss 3.420717 | norm 0.1604 | lr 1.12e-03 | (9936.47 ms | 52764 tok/s) step 4040/76294 | train loss 3.385562 | norm 0.1513 | lr 1.12e-03 | (9890.32 ms | 53010 tok/s) step 4041/76294 | train loss 3.398393 | norm 0.1633 | lr 1.12e-03 | (9901.29 ms | 52951 tok/s) step 4042/76294 | train loss 3.448899 | norm 0.1527 | lr 1.12e-03 | (9891.43 ms | 53004 tok/s) step 4043/76294 | train loss 3.391189 | norm 0.1575 | lr 1.12e-03 | (9895.79 ms | 52981 tok/s) step 4044/76294 | train loss 3.399477 | norm 0.1536 | lr 1.12e-03 | (9890.80 ms | 53008 tok/s) step 4045/76294 | train loss 3.470926 | norm 0.1625 | lr 1.12e-03 | (9897.37 ms | 52972 tok/s) step 4046/76294 | train loss 3.471475 | norm 0.1718 | lr 1.12e-03 | (9951.56 ms | 52684 tok/s) step 4047/76294 | train loss 3.414864 | norm 0.1645 | lr 1.12e-03 | (9904.35 ms | 52935 tok/s) step 4048/76294 | train loss 3.423458 | norm 0.1455 | lr 1.12e-03 | (9898.49 ms | 52966 tok/s) step 4049/76294 | train loss 3.389528 | norm 0.1797 | lr 1.12e-03 | (9893.55 ms | 52993 tok/s) step 4050/76294 | train loss 3.476117 | norm 0.1580 | lr 1.12e-03 | (9894.00 ms | 52991 tok/s) step 4051/76294 | train loss 3.407746 | norm 0.1684 | lr 1.12e-03 | (9892.18 ms | 53000 tok/s) step 4052/76294 | train loss 3.480899 | norm 0.1535 | lr 1.12e-03 | 
(9895.92 ms | 52980 tok/s) step 4053/76294 | train loss 3.431808 | norm 0.1632 | lr 1.12e-03 | (9893.89 ms | 52991 tok/s) step 4054/76294 | train loss 3.469690 | norm 0.1516 | lr 1.12e-03 | (9894.47 ms | 52988 tok/s) step 4055/76294 | train loss 3.425071 | norm 0.1621 | lr 1.12e-03 | (9891.67 ms | 53003 tok/s) step 4056/76294 | train loss 3.456957 | norm 0.1554 | lr 1.12e-03 | (9890.54 ms | 53009 tok/s) step 4057/76294 | train loss 3.444155 | norm 0.1678 | lr 1.12e-03 | (9949.91 ms | 52693 tok/s) step 4058/76294 | train loss 3.402178 | norm 0.1467 | lr 1.12e-03 | (9892.29 ms | 53000 tok/s) step 4059/76294 | train loss 3.478335 | norm 0.1582 | lr 1.12e-03 | (9895.42 ms | 52983 tok/s) step 4060/76294 | train loss 3.394045 | norm 0.1415 | lr 1.12e-03 | (9934.61 ms | 52774 tok/s) step 4061/76294 | train loss 3.484571 | norm 0.1425 | lr 1.12e-03 | (9894.12 ms | 52990 tok/s) step 4062/76294 | train loss 3.378720 | norm 0.1396 | lr 1.12e-03 | (9896.02 ms | 52980 tok/s) step 4063/76294 | train loss 3.407435 | norm 0.1533 | lr 1.12e-03 | (9961.06 ms | 52634 tok/s) step 4064/76294 | train loss 3.427377 | norm 0.1561 | lr 1.12e-03 | (9888.45 ms | 53020 tok/s) step 4065/76294 | train loss 3.380504 | norm 0.1557 | lr 1.12e-03 | (9890.67 ms | 53008 tok/s) step 4066/76294 | train loss 3.557548 | norm 0.1889 | lr 1.12e-03 | (9908.62 ms | 52912 tok/s) step 4067/76294 | train loss 3.428741 | norm 0.1760 | lr 1.12e-03 | (9889.23 ms | 53016 tok/s) step 4068/76294 | train loss 3.423064 | norm 0.1763 | lr 1.12e-03 | (9888.33 ms | 53021 tok/s) step 4069/76294 | train loss 3.411453 | norm 0.1937 | lr 1.12e-03 | (9913.60 ms | 52886 tok/s) step 4070/76294 | train loss 3.342186 | norm 0.1970 | lr 1.12e-03 | (9889.98 ms | 53012 tok/s) step 4071/76294 | train loss 3.409864 | norm 0.1465 | lr 1.12e-03 | (9900.71 ms | 52955 tok/s) step 4072/76294 | train loss 3.437856 | norm 0.1858 | lr 1.12e-03 | (9888.50 ms | 53020 tok/s) step 4073/76294 | train loss 3.423211 | norm 0.1724 | lr 1.12e-03 | 
(9937.48 ms | 52759 tok/s) step 4074/76294 | train loss 3.405311 | norm 0.1578 | lr 1.12e-03 | (9890.44 ms | 53010 tok/s) step 4075/76294 | train loss 3.369973 | norm 0.1532 | lr 1.12e-03 | (9902.07 ms | 52947 tok/s) step 4076/76294 | train loss 3.420403 | norm 0.1574 | lr 1.12e-03 | (9908.17 ms | 52915 tok/s) step 4077/76294 | train loss 3.403884 | norm 0.1481 | lr 1.12e-03 | (9897.55 ms | 52971 tok/s) step 4078/76294 | train loss 3.425586 | norm 0.1707 | lr 1.12e-03 | (9956.61 ms | 52657 tok/s) step 4079/76294 | train loss 3.423256 | norm 0.1751 | lr 1.12e-03 | (9896.43 ms | 52978 tok/s) step 4080/76294 | train loss 3.390764 | norm 0.1669 | lr 1.12e-03 | (9892.60 ms | 52998 tok/s) step 4081/76294 | train loss 3.404698 | norm 0.1452 | lr 1.12e-03 | (9933.75 ms | 52778 tok/s) step 4082/76294 | train loss 3.472717 | norm 0.1673 | lr 1.12e-03 | (9890.07 ms | 53012 tok/s) step 4083/76294 | train loss 3.364615 | norm 0.1653 | lr 1.12e-03 | (9945.16 ms | 52718 tok/s) step 4084/76294 | train loss 3.384471 | norm 0.1663 | lr 1.12e-03 | (9899.10 ms | 52963 tok/s) step 4085/76294 | train loss 3.516273 | norm 0.1553 | lr 1.12e-03 | (9897.52 ms | 52972 tok/s) step 4086/76294 | train loss 3.433894 | norm 0.1480 | lr 1.12e-03 | (9910.04 ms | 52905 tok/s) step 4087/76294 | train loss 3.575158 | norm 0.1596 | lr 1.12e-03 | (9891.93 ms | 53002 tok/s) step 4088/76294 | train loss 3.441086 | norm 0.1572 | lr 1.12e-03 | (9906.76 ms | 52922 tok/s) step 4089/76294 | train loss 3.390152 | norm 0.1759 | lr 1.12e-03 | (9895.88 ms | 52980 tok/s) step 4090/76294 | train loss 3.401061 | norm 0.1521 | lr 1.12e-03 | (9890.12 ms | 53011 tok/s) step 4091/76294 | train loss 3.447737 | norm 0.1875 | lr 1.12e-03 | (9929.17 ms | 52803 tok/s) step 4092/76294 | train loss 3.406777 | norm 0.1817 | lr 1.12e-03 | (9884.70 ms | 53040 tok/s) step 4093/76294 | train loss 3.391992 | norm 0.1471 | lr 1.12e-03 | (9894.77 ms | 52986 tok/s) step 4094/76294 | train loss 3.399820 | norm 0.1438 | lr 1.12e-03 | 
(9887.41 ms | 53026 tok/s) step 4095/76294 | train loss 3.438288 | norm 0.1523 | lr 1.12e-03 | (9900.41 ms | 52956 tok/s) step 4096/76294 | train loss 3.400584 | norm 0.1556 | lr 1.12e-03 | (9889.34 ms | 53015 tok/s) step 4097/76294 | train loss 3.407642 | norm 0.1299 | lr 1.12e-03 | (9904.50 ms | 52934 tok/s) step 4098/76294 | train loss 3.369299 | norm 0.1487 | lr 1.12e-03 | (9885.15 ms | 53038 tok/s) step 4099/76294 | train loss 3.348467 | norm 0.1356 | lr 1.12e-03 | (11034.53 ms | 47513 tok/s) step 4100/76294 | train loss 3.460060 | norm 0.1374 | lr 1.12e-03 | (9880.93 ms | 53061 tok/s) step 4101/76294 | train loss 3.440628 | norm 0.1553 | lr 1.12e-03 | (9929.19 ms | 52803 tok/s) step 4102/76294 | train loss 3.418190 | norm 0.1385 | lr 1.12e-03 | (9883.11 ms | 53049 tok/s) step 4103/76294 | train loss 3.325370 | norm 0.1661 | lr 1.12e-03 | (9893.14 ms | 52995 tok/s) step 4104/76294 | train loss 3.384510 | norm 0.1455 | lr 1.12e-03 | (9885.31 ms | 53037 tok/s) step 4105/76294 | train loss 3.468348 | norm 0.1577 | lr 1.12e-03 | (9893.60 ms | 52993 tok/s) step 4106/76294 | train loss 3.349289 | norm 0.1782 | lr 1.12e-03 | (9884.20 ms | 53043 tok/s) step 4107/76294 | train loss 3.378187 | norm 0.1462 | lr 1.12e-03 | (9895.57 ms | 52982 tok/s) step 4108/76294 | train loss 3.429705 | norm 0.1558 | lr 1.12e-03 | (9884.49 ms | 53041 tok/s) step 4109/76294 | train loss 3.437287 | norm 0.1845 | lr 1.12e-03 | (9892.92 ms | 52996 tok/s) step 4110/76294 | train loss 3.378564 | norm 0.1685 | lr 1.12e-03 | (9890.30 ms | 53010 tok/s) step 4111/76294 | train loss 3.432973 | norm 0.1596 | lr 1.12e-03 | (10690.66 ms | 49042 tok/s) step 4112/76294 | train loss 3.438180 | norm 0.1661 | lr 1.12e-03 | (9876.17 ms | 53086 tok/s) step 4113/76294 | train loss 3.315686 | norm 0.1679 | lr 1.12e-03 | (9945.32 ms | 52717 tok/s) step 4114/76294 | train loss 3.419067 | norm 0.1645 | lr 1.12e-03 | (9886.81 ms | 53029 tok/s) step 4115/76294 | train loss 3.430831 | norm 0.1636 | lr 1.12e-03 | 
(9890.79 ms | 53008 tok/s) step 4116/76294 | train loss 3.370294 | norm 0.1688 | lr 1.12e-03 | (9949.97 ms | 52692 tok/s) step 4117/76294 | train loss 3.401543 | norm 0.1887 | lr 1.12e-03 | (9895.46 ms | 52983 tok/s) step 4118/76294 | train loss 3.432868 | norm 0.2038 | lr 1.12e-03 | (9896.85 ms | 52975 tok/s) step 4119/76294 | train loss 3.391602 | norm 0.1562 | lr 1.12e-03 | (9969.85 ms | 52587 tok/s) step 4120/76294 | train loss 3.403957 | norm 0.1975 | lr 1.12e-03 | (9892.20 ms | 53000 tok/s) step 4121/76294 | train loss 3.542205 | norm 0.1953 | lr 1.12e-03 | (9897.28 ms | 52973 tok/s) step 4122/76294 | train loss 3.413714 | norm 0.2062 | lr 1.12e-03 | (9895.46 ms | 52983 tok/s) step 4123/76294 | train loss 3.421261 | norm 0.1997 | lr 1.12e-03 | (9900.69 ms | 52955 tok/s) step 4124/76294 | train loss 3.377826 | norm 0.1606 | lr 1.12e-03 | (9906.41 ms | 52924 tok/s) step 4125/76294 | train loss 3.341344 | norm 0.1862 | lr 1.12e-03 | (9961.32 ms | 52632 tok/s) step 4126/76294 | train loss 3.428132 | norm 0.1622 | lr 1.12e-03 | (9885.71 ms | 53035 tok/s) step 4127/76294 | train loss 3.529274 | norm 0.1553 | lr 1.12e-03 | (10093.48 ms | 51943 tok/s) step 4128/76294 | train loss 3.357245 | norm 0.1569 | lr 1.11e-03 | (9908.98 ms | 52910 tok/s) step 4129/76294 | train loss 3.456364 | norm 0.1618 | lr 1.11e-03 | (9925.14 ms | 52824 tok/s) step 4130/76294 | train loss 3.372835 | norm 0.1506 | lr 1.11e-03 | (9912.82 ms | 52890 tok/s) step 4131/76294 | train loss 3.465480 | norm 0.1530 | lr 1.11e-03 | (9884.09 ms | 53044 tok/s) step 4132/76294 | train loss 3.383250 | norm 0.1460 | lr 1.11e-03 | (9929.13 ms | 52803 tok/s) step 4133/76294 | train loss 3.377533 | norm 0.1547 | lr 1.11e-03 | (9892.20 ms | 53000 tok/s) step 4134/76294 | train loss 3.380511 | norm 0.1509 | lr 1.11e-03 | (9906.72 ms | 52922 tok/s) step 4135/76294 | train loss 3.429172 | norm 0.1404 | lr 1.11e-03 | (9886.77 ms | 53029 tok/s) step 4136/76294 | train loss 3.357185 | norm 0.1468 | lr 1.11e-03 | 
(9903.71 ms | 52939 tok/s) step 4137/76294 | train loss 3.376512 | norm 0.1420 | lr 1.11e-03 | (9884.61 ms | 53041 tok/s) step 4138/76294 | train loss 3.406533 | norm 0.1454 | lr 1.11e-03 | (9903.41 ms | 52940 tok/s) step 4139/76294 | train loss 3.371626 | norm 0.1509 | lr 1.11e-03 | (9880.84 ms | 53061 tok/s) step 4140/76294 | train loss 3.423208 | norm 0.1636 | lr 1.11e-03 | (9972.97 ms | 52571 tok/s) step 4141/76294 | train loss 3.388459 | norm 0.1536 | lr 1.11e-03 | (9881.99 ms | 53055 tok/s) step 4142/76294 | train loss 3.349011 | norm 0.1526 | lr 1.11e-03 | (9900.43 ms | 52956 tok/s) step 4143/76294 | train loss 3.377707 | norm 0.1452 | lr 1.11e-03 | (11590.25 ms | 45235 tok/s) step 4144/76294 | train loss 3.349440 | norm 0.1497 | lr 1.11e-03 | (9893.84 ms | 52991 tok/s) step 4145/76294 | train loss 3.382269 | norm 0.1599 | lr 1.11e-03 | (9875.37 ms | 53090 tok/s) step 4146/76294 | train loss 3.480737 | norm 0.1542 | lr 1.11e-03 | (9872.50 ms | 53106 tok/s) step 4147/76294 | train loss 3.424099 | norm 0.1539 | lr 1.11e-03 | (9881.02 ms | 53060 tok/s) step 4148/76294 | train loss 3.381501 | norm 0.1651 | lr 1.11e-03 | (9877.36 ms | 53080 tok/s) step 4149/76294 | train loss 3.448953 | norm 0.1444 | lr 1.11e-03 | (9879.69 ms | 53067 tok/s) step 4150/76294 | train loss 3.371469 | norm 0.1777 | lr 1.11e-03 | (9881.66 ms | 53057 tok/s) step 4151/76294 | train loss 3.405922 | norm 0.1467 | lr 1.11e-03 | (9880.58 ms | 53062 tok/s) step 4152/76294 | train loss 3.443309 | norm 0.1432 | lr 1.11e-03 | (9890.69 ms | 53008 tok/s) step 4153/76294 | train loss 3.436418 | norm 0.1462 | lr 1.11e-03 | (9880.51 ms | 53063 tok/s) step 4154/76294 | train loss 3.389681 | norm 0.1733 | lr 1.11e-03 | (9894.47 ms | 52988 tok/s) step 4155/76294 | train loss 3.433521 | norm 0.1560 | lr 1.11e-03 | (9885.99 ms | 53033 tok/s) step 4156/76294 | train loss 3.461864 | norm 0.1692 | lr 1.11e-03 | (9899.04 ms | 52964 tok/s) step 4157/76294 | train loss 3.357928 | norm 0.1652 | lr 1.11e-03 | 
(9897.88 ms | 52970 tok/s) step 4158/76294 | train loss 3.404286 | norm 0.1666 | lr 1.11e-03 | (9920.96 ms | 52846 tok/s) step 4159/76294 | train loss 3.468038 | norm 0.1919 | lr 1.11e-03 | (9898.96 ms | 52964 tok/s) step 4160/76294 | train loss 3.392189 | norm 0.1602 | lr 1.11e-03 | (9878.32 ms | 53075 tok/s) step 4161/76294 | train loss 3.383125 | norm 0.1468 | lr 1.11e-03 | (9904.32 ms | 52935 tok/s) step 4162/76294 | train loss 3.418302 | norm 0.1594 | lr 1.11e-03 | (9941.96 ms | 52735 tok/s) step 4163/76294 | train loss 3.380929 | norm 0.1676 | lr 1.11e-03 | (9882.49 ms | 53052 tok/s) step 4164/76294 | train loss 3.391469 | norm 0.1658 | lr 1.11e-03 | (9883.51 ms | 53047 tok/s) step 4165/76294 | train loss 3.407999 | norm 0.1543 | lr 1.11e-03 | (9922.91 ms | 52836 tok/s) step 4166/76294 | train loss 3.473325 | norm 0.1551 | lr 1.11e-03 | (9924.87 ms | 52826 tok/s) step 4167/76294 | train loss 3.370744 | norm 0.1522 | lr 1.11e-03 | (9885.60 ms | 53036 tok/s) step 4168/76294 | train loss 3.379426 | norm 0.1425 | lr 1.11e-03 | (9896.79 ms | 52976 tok/s) step 4169/76294 | train loss 3.398180 | norm 0.1522 | lr 1.11e-03 | (9885.71 ms | 53035 tok/s) step 4170/76294 | train loss 3.391805 | norm 0.1408 | lr 1.11e-03 | (9923.65 ms | 52832 tok/s) step 4171/76294 | train loss 3.491336 | norm 0.1401 | lr 1.11e-03 | (9891.22 ms | 53005 tok/s) step 4172/76294 | train loss 3.449538 | norm 0.1689 | lr 1.11e-03 | (9917.27 ms | 52866 tok/s) step 4173/76294 | train loss 3.433016 | norm 0.1570 | lr 1.11e-03 | (9889.58 ms | 53014 tok/s) step 4174/76294 | train loss 3.397587 | norm 0.1471 | lr 1.11e-03 | (9887.66 ms | 53025 tok/s) step 4175/76294 | train loss 3.410397 | norm 0.1444 | lr 1.11e-03 | (9934.43 ms | 52775 tok/s) step 4176/76294 | train loss 3.446358 | norm 0.1713 | lr 1.11e-03 | (9887.12 ms | 53027 tok/s) step 4177/76294 | train loss 3.426400 | norm 0.1515 | lr 1.11e-03 | (9899.71 ms | 52960 tok/s) step 4178/76294 | train loss 3.351225 | norm 0.1505 | lr 1.11e-03 | 
(9885.30 ms | 53037 tok/s) step 4179/76294 | train loss 3.347483 | norm 0.1432 | lr 1.11e-03 | (9891.26 ms | 53005 tok/s) step 4180/76294 | train loss 3.470915 | norm 0.1763 | lr 1.11e-03 | (9881.32 ms | 53058 tok/s) step 4181/76294 | train loss 3.379759 | norm 0.1707 | lr 1.11e-03 | (9892.94 ms | 52996 tok/s) step 4182/76294 | train loss 3.457693 | norm 0.1452 | lr 1.11e-03 | (9886.75 ms | 53029 tok/s) step 4183/76294 | train loss 3.373305 | norm 0.1716 | lr 1.11e-03 | (9893.78 ms | 52992 tok/s) step 4184/76294 | train loss 3.370080 | norm 0.1806 | lr 1.11e-03 | (9926.89 ms | 52815 tok/s) step 4185/76294 | train loss 3.336861 | norm 0.1565 | lr 1.11e-03 | (9886.61 ms | 53030 tok/s) step 4186/76294 | train loss 3.389837 | norm 0.1736 | lr 1.11e-03 | (9891.47 ms | 53004 tok/s) step 4187/76294 | train loss 3.340225 | norm 0.1747 | lr 1.11e-03 | (9888.82 ms | 53018 tok/s) step 4188/76294 | train loss 3.478244 | norm 0.1784 | lr 1.11e-03 | (9892.97 ms | 52996 tok/s) step 4189/76294 | train loss 3.321764 | norm 0.1969 | lr 1.11e-03 | (9885.67 ms | 53035 tok/s) step 4190/76294 | train loss 3.402541 | norm 0.1561 | lr 1.11e-03 | (9898.55 ms | 52966 tok/s) step 4191/76294 | train loss 3.440709 | norm 0.2056 | lr 1.11e-03 | (9889.82 ms | 53013 tok/s) step 4192/76294 | train loss 3.419067 | norm 0.1707 | lr 1.11e-03 | (9926.30 ms | 52818 tok/s) step 4193/76294 | train loss 3.501175 | norm 0.2134 | lr 1.11e-03 | (9889.03 ms | 53017 tok/s) step 4194/76294 | train loss 3.297078 | norm 0.1648 | lr 1.11e-03 | (9925.59 ms | 52822 tok/s) step 4195/76294 | train loss 3.454747 | norm 0.1792 | lr 1.11e-03 | (9898.34 ms | 52967 tok/s) step 4196/76294 | train loss 3.375483 | norm 0.1693 | lr 1.11e-03 | (9882.64 ms | 53051 tok/s) step 4197/76294 | train loss 3.344825 | norm 0.1511 | lr 1.11e-03 | (10983.16 ms | 47736 tok/s) step 4198/76294 | train loss 3.391885 | norm 0.1544 | lr 1.11e-03 | (9909.71 ms | 52906 tok/s) step 4199/76294 | train loss 3.469462 | norm 0.1528 | lr 1.11e-03 | 
(9880.11 ms | 53065 tok/s) step 4200/76294 | train loss 3.415095 | norm 0.1604 | lr 1.11e-03 | (9879.76 ms | 53067 tok/s) step 4201/76294 | train loss 3.494188 | norm 0.1823 | lr 1.11e-03 | (9882.02 ms | 53055 tok/s) step 4202/76294 | train loss 3.372723 | norm 0.1362 | lr 1.11e-03 | (9891.68 ms | 53003 tok/s) step 4203/76294 | train loss 3.414586 | norm 0.1733 | lr 1.11e-03 | (9884.70 ms | 53040 tok/s) step 4204/76294 | train loss 3.367385 | norm 0.1617 | lr 1.11e-03 | (9888.06 ms | 53022 tok/s) step 4205/76294 | train loss 2.741539 | norm 0.9570 | lr 1.11e-03 | (9948.91 ms | 52698 tok/s) step 4206/76294 | train loss 3.423041 | norm 0.2851 | lr 1.11e-03 | (9882.61 ms | 53052 tok/s) step 4207/76294 | train loss 3.411706 | norm 0.2070 | lr 1.11e-03 | (9930.10 ms | 52798 tok/s) step 4208/76294 | train loss 3.339619 | norm 0.2189 | lr 1.11e-03 | (9883.24 ms | 53048 tok/s) step 4209/76294 | train loss 3.422277 | norm 0.2054 | lr 1.11e-03 | (9888.09 ms | 53022 tok/s) step 4210/76294 | train loss 3.375727 | norm 0.1818 | lr 1.11e-03 | (9920.54 ms | 52849 tok/s) step 4211/76294 | train loss 3.384014 | norm 0.1701 | lr 1.11e-03 | (9904.79 ms | 52933 tok/s) step 4212/76294 | train loss 3.426763 | norm 0.1802 | lr 1.11e-03 | (9890.81 ms | 53008 tok/s) step 4213/76294 | train loss 3.341304 | norm 0.1529 | lr 1.11e-03 | (9890.27 ms | 53011 tok/s) step 4214/76294 | train loss 3.385097 | norm 0.1477 | lr 1.11e-03 | (9948.33 ms | 52701 tok/s) step 4215/76294 | train loss 3.383502 | norm 0.1377 | lr 1.11e-03 | (9885.83 ms | 53034 tok/s) step 4216/76294 | train loss 3.385139 | norm 0.1496 | lr 1.11e-03 | (9896.64 ms | 52976 tok/s) step 4217/76294 | train loss 3.399815 | norm 0.1632 | lr 1.11e-03 | (9890.26 ms | 53011 tok/s) step 4218/76294 | train loss 3.373383 | norm 0.1556 | lr 1.11e-03 | (9921.40 ms | 52844 tok/s) step 4219/76294 | train loss 3.390131 | norm 0.1503 | lr 1.11e-03 | (9888.40 ms | 53020 tok/s) step 4220/76294 | train loss 3.430365 | norm 0.1441 | lr 1.11e-03 | 
(9888.36 ms | 53021 tok/s) step 4221/76294 | train loss 3.425691 | norm 0.1544 | lr 1.11e-03 | (9892.46 ms | 52999 tok/s) step 4222/76294 | train loss 3.377615 | norm 0.1349 | lr 1.11e-03 | (9884.95 ms | 53039 tok/s) step 4223/76294 | train loss 3.375692 | norm 0.1510 | lr 1.11e-03 | (9895.43 ms | 52983 tok/s) step 4224/76294 | train loss 3.437514 | norm 0.1510 | lr 1.11e-03 | (9887.94 ms | 53023 tok/s) step 4225/76294 | train loss 3.326438 | norm 0.1390 | lr 1.11e-03 | (9895.12 ms | 52985 tok/s) step 4226/76294 | train loss 3.515690 | norm 0.1622 | lr 1.11e-03 | (9887.77 ms | 53024 tok/s) step 4227/76294 | train loss 3.348145 | norm 0.1561 | lr 1.11e-03 | (9896.53 ms | 52977 tok/s) step 4228/76294 | train loss 3.462805 | norm 0.1550 | lr 1.11e-03 | (9920.78 ms | 52847 tok/s) step 4229/76294 | train loss 3.375664 | norm 0.1661 | lr 1.11e-03 | (9886.04 ms | 53033 tok/s) step 4230/76294 | train loss 3.355232 | norm 0.1400 | lr 1.11e-03 | (9893.75 ms | 52992 tok/s) step 4231/76294 | train loss 3.497501 | norm 0.1570 | lr 1.11e-03 | (9891.56 ms | 53004 tok/s) step 4232/76294 | train loss 3.416497 | norm 0.1490 | lr 1.11e-03 | (9887.52 ms | 53025 tok/s) step 4233/76294 | train loss 3.384120 | norm 0.1461 | lr 1.11e-03 | (9890.82 ms | 53008 tok/s) step 4234/76294 | train loss 3.392084 | norm 0.1363 | lr 1.11e-03 | (9886.02 ms | 53033 tok/s) step 4235/76294 | train loss 3.376805 | norm 0.1376 | lr 1.11e-03 | (9895.98 ms | 52980 tok/s) step 4236/76294 | train loss 3.423772 | norm 0.1400 | lr 1.11e-03 | (9884.45 ms | 53042 tok/s) step 4237/76294 | train loss 3.379419 | norm 0.1330 | lr 1.11e-03 | (9896.53 ms | 52977 tok/s) step 4238/76294 | train loss 3.378652 | norm 0.1481 | lr 1.11e-03 | (9892.23 ms | 53000 tok/s) step 4239/76294 | train loss 3.396572 | norm 0.1550 | lr 1.11e-03 | (9897.11 ms | 52974 tok/s) step 4240/76294 | train loss 3.328921 | norm 0.1649 | lr 1.11e-03 | (9887.66 ms | 53025 tok/s) step 4241/76294 | train loss 3.418998 | norm 0.1552 | lr 1.11e-03 | 
(9949.40 ms | 52695 tok/s) step 4242/76294 | train loss 3.465287 | norm 0.1573 | lr 1.11e-03 | (9895.93 ms | 52980 tok/s) step 4243/76294 | train loss 3.419572 | norm 0.1653 | lr 1.11e-03 | (9888.84 ms | 53018 tok/s) step 4244/76294 | train loss 3.384307 | norm 0.1562 | lr 1.11e-03 | (9955.10 ms | 52665 tok/s) step 4245/76294 | train loss 3.331030 | norm 0.1749 | lr 1.11e-03 | (9888.11 ms | 53022 tok/s) step 4246/76294 | train loss 3.384459 | norm 0.1431 | lr 1.11e-03 | (9921.98 ms | 52841 tok/s) step 4247/76294 | train loss 3.455093 | norm 0.1664 | lr 1.11e-03 | (9904.30 ms | 52935 tok/s) step 4248/76294 | train loss 3.301918 | norm 0.1690 | lr 1.11e-03 | (9893.72 ms | 52992 tok/s) step 4249/76294 | train loss 3.337526 | norm 0.1472 | lr 1.11e-03 | (9917.47 ms | 52865 tok/s) step 4250/76294 | train loss 3.410901 | norm 0.1685 | lr 1.11e-03 | (9888.88 ms | 53018 tok/s) val loss: 3.393991 saving model checkpoint to ./results/gpt2-350M-gqa/step_4250.pth step 4251/76294 | train loss 3.352084 | norm 0.1679 | lr 1.11e-03 | (9973.65 ms | 52567 tok/s) step 4252/76294 | train loss 3.395376 | norm 0.1454 | lr 1.11e-03 | (9868.74 ms | 53126 tok/s) step 4253/76294 | train loss 3.395909 | norm 0.1642 | lr 1.11e-03 | (9919.14 ms | 52856 tok/s) step 4254/76294 | train loss 3.376923 | norm 0.1567 | lr 1.11e-03 | (9889.78 ms | 53013 tok/s) step 4255/76294 | train loss 3.450118 | norm 0.1759 | lr 1.11e-03 | (9878.81 ms | 53072 tok/s) step 4256/76294 | train loss 3.412103 | norm 0.1637 | lr 1.11e-03 | (9877.34 ms | 53080 tok/s) step 4257/76294 | train loss 3.390625 | norm 0.1619 | lr 1.11e-03 | (9898.68 ms | 52965 tok/s) step 4258/76294 | train loss 3.401662 | norm 0.1595 | lr 1.11e-03 | (9888.50 ms | 53020 tok/s) step 4259/76294 | train loss 3.367651 | norm 0.1442 | lr 1.11e-03 | (9904.80 ms | 52933 tok/s) step 4260/76294 | train loss 3.371009 | norm 0.1566 | lr 1.11e-03 | (9887.27 ms | 53027 tok/s) step 4261/76294 | train loss 3.382604 | norm 0.1409 | lr 1.11e-03 | (9919.52 ms | 
52854 tok/s) step 4262/76294 | train loss 3.433521 | norm 0.1620 | lr 1.11e-03 | (9888.79 ms | 53018 tok/s) step 4263/76294 | train loss 3.354010 | norm 0.1401 | lr 1.11e-03 | (9900.84 ms | 52954 tok/s) step 4264/76294 | train loss 3.343153 | norm 0.1530 | lr 1.11e-03 | (9913.72 ms | 52885 tok/s) step 4265/76294 | train loss 3.394857 | norm 0.1590 | lr 1.11e-03 | (9957.75 ms | 52651 tok/s) step 4266/76294 | train loss 3.388033 | norm 0.1446 | lr 1.11e-03 | (9904.80 ms | 52933 tok/s) step 4267/76294 | train loss 3.367483 | norm 0.1832 | lr 1.11e-03 | (9957.95 ms | 52650 tok/s) step 4268/76294 | train loss 3.395767 | norm 0.1593 | lr 1.11e-03 | (9897.57 ms | 52971 tok/s) step 4269/76294 | train loss 3.326628 | norm 0.1413 | lr 1.11e-03 | (9900.45 ms | 52956 tok/s) step 4270/76294 | train loss 3.417651 | norm 0.1631 | lr 1.11e-03 | (9934.54 ms | 52774 tok/s) step 4271/76294 | train loss 3.405520 | norm 0.1577 | lr 1.11e-03 | (9894.43 ms | 52988 tok/s) step 4272/76294 | train loss 3.342825 | norm 0.1482 | lr 1.11e-03 | (9900.62 ms | 52955 tok/s) step 4273/76294 | train loss 3.392380 | norm 0.1487 | lr 1.11e-03 | (9896.94 ms | 52975 tok/s) step 4274/76294 | train loss 3.405241 | norm 0.1630 | lr 1.11e-03 | (9899.22 ms | 52963 tok/s) step 4275/76294 | train loss 3.378208 | norm 0.1492 | lr 1.11e-03 | (9894.76 ms | 52986 tok/s) step 4276/76294 | train loss 3.405553 | norm 0.1675 | lr 1.11e-03 | (9897.46 ms | 52972 tok/s) step 4277/76294 | train loss 3.409283 | norm 0.1633 | lr 1.11e-03 | (9893.09 ms | 52995 tok/s) step 4278/76294 | train loss 3.375628 | norm 0.1761 | lr 1.11e-03 | (9900.21 ms | 52957 tok/s) step 4279/76294 | train loss 3.452559 | norm 0.1702 | lr 1.11e-03 | (9895.88 ms | 52980 tok/s) step 4280/76294 | train loss 3.339758 | norm 0.1495 | lr 1.11e-03 | (9899.93 ms | 52959 tok/s) step 4281/76294 | train loss 3.342216 | norm 0.1662 | lr 1.11e-03 | (9899.96 ms | 52959 tok/s) step 4282/76294 | train loss 3.401934 | norm 0.1543 | lr 1.11e-03 | (9926.76 ms | 
52816 tok/s) step 4283/76294 | train loss 3.397375 | norm 0.1591 | lr 1.11e-03 | (9906.50 ms | 52924 tok/s) step 4284/76294 | train loss 3.348290 | norm 0.1845 | lr 1.11e-03 | (9898.30 ms | 52967 tok/s) step 4285/76294 | train loss 3.429598 | norm 0.1517 | lr 1.11e-03 | (9896.53 ms | 52977 tok/s) step 4286/76294 | train loss 3.452711 | norm 0.1856 | lr 1.11e-03 | (9896.37 ms | 52978 tok/s) step 4287/76294 | train loss 3.353356 | norm 0.1655 | lr 1.11e-03 | (9891.96 ms | 53001 tok/s) step 4288/76294 | train loss 3.434559 | norm 0.1482 | lr 1.11e-03 | (9897.21 ms | 52973 tok/s) step 4289/76294 | train loss 3.507803 | norm 0.1936 | lr 1.11e-03 | (9925.88 ms | 52820 tok/s) step 4290/76294 | train loss 3.332564 | norm 0.1868 | lr 1.11e-03 | (9892.83 ms | 52997 tok/s) step 4291/76294 | train loss 3.403960 | norm 0.1907 | lr 1.11e-03 | (9954.37 ms | 52669 tok/s) step 4292/76294 | train loss 3.429299 | norm 0.1435 | lr 1.11e-03 | (9896.98 ms | 52975 tok/s) step 4293/76294 | train loss 3.350960 | norm 0.1686 | lr 1.11e-03 | (9891.48 ms | 53004 tok/s) step 4294/76294 | train loss 3.443610 | norm 0.1627 | lr 1.11e-03 | (10840.19 ms | 48365 tok/s) step 4295/76294 | train loss 3.392299 | norm 0.1486 | lr 1.11e-03 | (9895.10 ms | 52985 tok/s) step 4296/76294 | train loss 3.402941 | norm 0.1411 | lr 1.11e-03 | (9884.84 ms | 53040 tok/s) step 4297/76294 | train loss 3.404941 | norm 0.1534 | lr 1.11e-03 | (9926.56 ms | 52817 tok/s) step 4298/76294 | train loss 3.446842 | norm 0.1562 | lr 1.11e-03 | (9885.51 ms | 53036 tok/s) step 4299/76294 | train loss 3.411875 | norm 0.1541 | lr 1.11e-03 | (9896.14 ms | 52979 tok/s) step 4300/76294 | train loss 3.379041 | norm 0.1649 | lr 1.11e-03 | (9915.06 ms | 52878 tok/s) step 4301/76294 | train loss 3.661593 | norm 0.2380 | lr 1.11e-03 | (9927.97 ms | 52809 tok/s) step 4302/76294 | train loss 3.423242 | norm 0.2692 | lr 1.11e-03 | (10226.53 ms | 51267 tok/s) step 4303/76294 | train loss 3.447505 | norm 0.2249 | lr 1.11e-03 | (9877.05 ms | 
53081 tok/s) step 4304/76294 | train loss 3.469563 | norm 0.2188 | lr 1.11e-03 | (9888.51 ms | 53020 tok/s) step 4305/76294 | train loss 3.403129 | norm 0.1879 | lr 1.11e-03 | (9893.94 ms | 52991 tok/s) step 4306/76294 | train loss 3.393582 | norm 0.1868 | lr 1.11e-03 | (9943.97 ms | 52724 tok/s) step 4307/76294 | train loss 3.418780 | norm 0.1606 | lr 1.11e-03 | (9891.78 ms | 53002 tok/s) step 4308/76294 | train loss 3.436705 | norm 0.1867 | lr 1.11e-03 | (9879.58 ms | 53068 tok/s) step 4309/76294 | train loss 3.460972 | norm 0.1557 | lr 1.11e-03 | (9896.67 ms | 52976 tok/s) step 4310/76294 | train loss 3.435283 | norm 0.1912 | lr 1.11e-03 | (11424.76 ms | 45891 tok/s) step 4311/76294 | train loss 3.398705 | norm 0.1575 | lr 1.11e-03 | (9873.09 ms | 53103 tok/s) step 4312/76294 | train loss 3.431426 | norm 0.1622 | lr 1.11e-03 | (9916.23 ms | 52872 tok/s) step 4313/76294 | train loss 3.419933 | norm 0.1662 | lr 1.11e-03 | (9877.23 ms | 53080 tok/s) step 4314/76294 | train loss 3.404182 | norm 0.1393 | lr 1.11e-03 | (9877.87 ms | 53077 tok/s) step 4315/76294 | train loss 3.502342 | norm 0.1672 | lr 1.11e-03 | (9877.44 ms | 53079 tok/s) step 4316/76294 | train loss 3.444138 | norm 0.1484 | lr 1.11e-03 | (9919.75 ms | 52853 tok/s) step 4317/76294 | train loss 3.519191 | norm 0.1619 | lr 1.11e-03 | (9884.88 ms | 53039 tok/s) step 4318/76294 | train loss 3.450449 | norm 0.1685 | lr 1.11e-03 | (12628.55 ms | 41516 tok/s) step 4319/76294 | train loss 3.463068 | norm 0.1513 | lr 1.11e-03 | (9868.41 ms | 53128 tok/s) step 4320/76294 | train loss 3.465042 | norm 0.1689 | lr 1.11e-03 | (9867.00 ms | 53136 tok/s) step 4321/76294 | train loss 3.449003 | norm 0.1780 | lr 1.11e-03 | (9886.43 ms | 53031 tok/s) step 4322/76294 | train loss 3.470789 | norm 0.1465 | lr 1.11e-03 | (9874.60 ms | 53095 tok/s) step 4323/76294 | train loss 3.528854 | norm 0.1734 | lr 1.11e-03 | (9918.58 ms | 52859 tok/s) step 4324/76294 | train loss 3.435468 | norm 0.1654 | lr 1.11e-03 | (9877.09 ms | 
53081 tok/s) step 4325/76294 | train loss 3.546531 | norm 0.1941 | lr 1.11e-03 | (9873.23 ms | 53102 tok/s) step 4326/76294 | train loss 3.427775 | norm 0.2192 | lr 1.11e-03 | (9875.31 ms | 53091 tok/s) step 4327/76294 | train loss 3.459574 | norm 0.1503 | lr 1.11e-03 | (9873.36 ms | 53101 tok/s) step 4328/76294 | train loss 3.365940 | norm 0.1606 | lr 1.11e-03 | (9881.91 ms | 53055 tok/s) step 4329/76294 | train loss 3.392073 | norm 0.1572 | lr 1.10e-03 | (9888.58 ms | 53020 tok/s) step 4330/76294 | train loss 3.380879 | norm 0.1501 | lr 1.10e-03 | (9875.34 ms | 53091 tok/s) step 4331/76294 | train loss 3.476013 | norm 0.1607 | lr 1.10e-03 | (9895.62 ms | 52982 tok/s) step 4332/76294 | train loss 3.389339 | norm 0.1786 | lr 1.10e-03 | (9875.14 ms | 53092 tok/s) step 4333/76294 | train loss 3.351843 | norm 0.1954 | lr 1.10e-03 | (9885.48 ms | 53036 tok/s) step 4334/76294 | train loss 3.389089 | norm 0.1534 | lr 1.10e-03 | (9872.90 ms | 53104 tok/s) step 4335/76294 | train loss 3.375294 | norm 0.1600 | lr 1.10e-03 | (9884.38 ms | 53042 tok/s) step 4336/76294 | train loss 3.450201 | norm 0.1548 | lr 1.10e-03 | (9904.29 ms | 52935 tok/s) step 4337/76294 | train loss 3.412676 | norm 0.1579 | lr 1.10e-03 | (9881.24 ms | 53059 tok/s) step 4338/76294 | train loss 3.418943 | norm 0.1690 | lr 1.10e-03 | (10629.04 ms | 49326 tok/s) step 4339/76294 | train loss 3.417247 | norm 0.1370 | lr 1.10e-03 | (9886.44 ms | 53031 tok/s) step 4340/76294 | train loss 3.398923 | norm 0.1568 | lr 1.10e-03 | (9878.30 ms | 53075 tok/s) step 4341/76294 | train loss 3.406892 | norm 0.1429 | lr 1.10e-03 | (9879.16 ms | 53070 tok/s) step 4342/76294 | train loss 3.395119 | norm 0.1571 | lr 1.10e-03 | (9879.65 ms | 53067 tok/s) step 4343/76294 | train loss 3.401265 | norm 0.1482 | lr 1.10e-03 | (9876.63 ms | 53084 tok/s) step 4344/76294 | train loss 3.393995 | norm 0.1501 | lr 1.10e-03 | (9911.77 ms | 52895 tok/s) step 4345/76294 | train loss 3.428899 | norm 0.1369 | lr 1.10e-03 | (9886.80 ms | 
53029 tok/s) step 4346/76294 | train loss 3.591843 | norm 0.1533 | lr 1.10e-03 | (9889.93 ms | 53012 tok/s) step 4347/76294 | train loss 3.442516 | norm 0.1382 | lr 1.10e-03 | (9891.94 ms | 53002 tok/s) step 4348/76294 | train loss 3.395543 | norm 0.1399 | lr 1.10e-03 | (9875.08 ms | 53092 tok/s) step 4349/76294 | train loss 3.385627 | norm 0.1484 | lr 1.10e-03 | (9917.93 ms | 52863 tok/s) step 4350/76294 | train loss 3.444815 | norm 0.1436 | lr 1.10e-03 | (9897.66 ms | 52971 tok/s) step 4351/76294 | train loss 3.441294 | norm 0.1368 | lr 1.10e-03 | (9888.02 ms | 53023 tok/s) step 4352/76294 | train loss 3.385839 | norm 0.1449 | lr 1.10e-03 | (9919.14 ms | 52856 tok/s) step 4353/76294 | train loss 3.443233 | norm 0.1622 | lr 1.10e-03 | (9883.27 ms | 53048 tok/s) step 4354/76294 | train loss 3.415387 | norm 0.1567 | lr 1.10e-03 | (9921.67 ms | 52843 tok/s) step 4355/76294 | train loss 3.445410 | norm 0.1421 | lr 1.10e-03 | (9885.91 ms | 53034 tok/s) step 4356/76294 | train loss 3.399077 | norm 0.1818 | lr 1.10e-03 | (9900.16 ms | 52958 tok/s) step 4357/76294 | train loss 3.389594 | norm 0.1446 | lr 1.10e-03 | (9883.68 ms | 53046 tok/s) step 4358/76294 | train loss 3.397252 | norm 0.1637 | lr 1.10e-03 | (9881.32 ms | 53059 tok/s) step 4359/76294 | train loss 3.355380 | norm 0.1729 | lr 1.10e-03 | (9883.85 ms | 53045 tok/s) step 4360/76294 | train loss 3.415093 | norm 0.1459 | lr 1.10e-03 | (9895.06 ms | 52985 tok/s) step 4361/76294 | train loss 3.488607 | norm 0.1687 | lr 1.10e-03 | (9900.44 ms | 52956 tok/s) step 4362/76294 | train loss 3.407491 | norm 0.1474 | lr 1.10e-03 | (9881.15 ms | 53059 tok/s) step 4363/76294 | train loss 3.417282 | norm 0.1499 | lr 1.10e-03 | (9890.33 ms | 53010 tok/s) step 4364/76294 | train loss 3.380232 | norm 0.1518 | lr 1.10e-03 | (9894.87 ms | 52986 tok/s) step 4365/76294 | train loss 3.450660 | norm 0.1756 | lr 1.10e-03 | (9922.83 ms | 52837 tok/s) step 4366/76294 | train loss 3.454076 | norm 0.1533 | lr 1.10e-03 | (9884.10 ms | 
53044 tok/s) step 4367/76294 | train loss 3.382462 | norm 0.1562 | lr 1.10e-03 | (9885.39 ms | 53037 tok/s) step 4368/76294 | train loss 3.412267 | norm 0.1721 | lr 1.10e-03 | (9897.07 ms | 52974 tok/s) step 4369/76294 | train loss 3.375320 | norm 0.1499 | lr 1.10e-03 | (9924.50 ms | 52828 tok/s) step 4370/76294 | train loss 3.376892 | norm 0.1901 | lr 1.10e-03 | (9884.74 ms | 53040 tok/s) step 4371/76294 | train loss 3.422693 | norm 0.1598 | lr 1.10e-03 | (9901.82 ms | 52949 tok/s) step 4372/76294 | train loss 3.458139 | norm 0.1694 | lr 1.10e-03 | (9883.19 ms | 53048 tok/s) step 4373/76294 | train loss 3.593299 | norm 0.1644 | lr 1.10e-03 | (9923.95 ms | 52831 tok/s) step 4374/76294 | train loss 3.364193 | norm 0.1475 | lr 1.10e-03 | (9886.60 ms | 53030 tok/s) step 4375/76294 | train loss 3.474504 | norm 0.1582 | lr 1.10e-03 | (9892.25 ms | 53000 tok/s) step 4376/76294 | train loss 3.434656 | norm 0.1447 | lr 1.10e-03 | (9886.94 ms | 53028 tok/s) step 4377/76294 | train loss 3.325818 | norm 0.1600 | lr 1.10e-03 | (9918.43 ms | 52860 tok/s) step 4378/76294 | train loss 3.418225 | norm 0.1506 | lr 1.10e-03 | (9885.86 ms | 53034 tok/s) step 4379/76294 | train loss 3.315187 | norm 0.1564 | lr 1.10e-03 | (9981.43 ms | 52526 tok/s) step 4380/76294 | train loss 3.424430 | norm 0.1465 | lr 1.10e-03 | (9884.63 ms | 53041 tok/s) step 4381/76294 | train loss 3.437977 | norm 0.1352 | lr 1.10e-03 | (9900.25 ms | 52957 tok/s) step 4382/76294 | train loss 3.414368 | norm 0.1482 | lr 1.10e-03 | (9883.83 ms | 53045 tok/s) step 4383/76294 | train loss 3.414958 | norm 0.1483 | lr 1.10e-03 | (9892.47 ms | 52999 tok/s) step 4384/76294 | train loss 3.452787 | norm 0.1698 | lr 1.10e-03 | (9883.99 ms | 53044 tok/s) step 4385/76294 | train loss 3.419615 | norm 0.1583 | lr 1.10e-03 | (9895.35 ms | 52983 tok/s) step 4386/76294 | train loss 3.352613 | norm 0.1557 | lr 1.10e-03 | (9885.71 ms | 53035 tok/s) step 4387/76294 | train loss 3.431436 | norm 0.1867 | lr 1.10e-03 | (9889.05 ms | 
53017 tok/s) step 4388/76294 | train loss 3.423846 | norm 0.1519 | lr 1.10e-03 | (9883.81 ms | 53045 tok/s) step 4389/76294 | train loss 3.324278 | norm 0.1653 | lr 1.10e-03 | (9927.08 ms | 52814 tok/s) step 4390/76294 | train loss 3.443765 | norm 0.1338 | lr 1.10e-03 | (9892.78 ms | 52997 tok/s) step 4391/76294 | train loss 3.423917 | norm 0.1583 | lr 1.10e-03 | (9919.69 ms | 52853 tok/s) step 4392/76294 | train loss 3.399075 | norm 0.1558 | lr 1.10e-03 | (11135.89 ms | 47081 tok/s) step 4393/76294 | train loss 3.420423 | norm 0.1535 | lr 1.10e-03 | (9942.06 ms | 52734 tok/s) step 4394/76294 | train loss 3.517743 | norm 0.1599 | lr 1.10e-03 | (9891.21 ms | 53005 tok/s) step 4395/76294 | train loss 3.418031 | norm 0.1644 | lr 1.10e-03 | (9973.03 ms | 52571 tok/s) step 4396/76294 | train loss 3.423695 | norm 0.1670 | lr 1.10e-03 | (9887.68 ms | 53024 tok/s) step 4397/76294 | train loss 3.400955 | norm 0.1555 | lr 1.10e-03 | (9894.12 ms | 52990 tok/s) step 4398/76294 | train loss 3.414012 | norm 0.1673 | lr 1.10e-03 | (9956.61 ms | 52657 tok/s) step 4399/76294 | train loss 3.366320 | norm 0.1555 | lr 1.10e-03 | (9905.48 ms | 52929 tok/s) step 4400/76294 | train loss 3.485176 | norm 0.1663 | lr 1.10e-03 | (9897.05 ms | 52974 tok/s) step 4401/76294 | train loss 3.471319 | norm 0.1619 | lr 1.10e-03 | (9932.90 ms | 52783 tok/s) step 4402/76294 | train loss 3.433797 | norm 0.1720 | lr 1.10e-03 | (9892.00 ms | 53001 tok/s) step 4403/76294 | train loss 3.342674 | norm 0.1454 | lr 1.10e-03 | (9906.55 ms | 52923 tok/s) step 4404/76294 | train loss 3.482177 | norm 0.1768 | lr 1.10e-03 | (9888.72 ms | 53019 tok/s) step 4405/76294 | train loss 3.430011 | norm 0.1904 | lr 1.10e-03 | (9896.05 ms | 52980 tok/s) step 4406/76294 | train loss 3.472358 | norm 0.1699 | lr 1.10e-03 | (9895.56 ms | 52982 tok/s) step 4407/76294 | train loss 3.512174 | norm 0.1966 | lr 1.10e-03 | (9894.45 ms | 52988 tok/s) step 4408/76294 | train loss 3.363975 | norm 0.2425 | lr 1.10e-03 | (9957.03 ms | 
52655 tok/s) step 4409/76294 | train loss 3.371751 | norm 0.1619 | lr 1.10e-03 | (9948.35 ms | 52701 tok/s) step 4410/76294 | train loss 3.434838 | norm 0.1797 | lr 1.10e-03 | (9889.80 ms | 53013 tok/s) step 4411/76294 | train loss 3.537011 | norm 0.1854 | lr 1.10e-03 | (9904.40 ms | 52935 tok/s) step 4412/76294 | train loss 3.416848 | norm 0.1795 | lr 1.10e-03 | (9889.50 ms | 53015 tok/s) step 4413/76294 | train loss 3.397909 | norm 0.1667 | lr 1.10e-03 | (9904.43 ms | 52935 tok/s) step 4414/76294 | train loss 3.408093 | norm 0.1640 | lr 1.10e-03 | (9885.78 ms | 53035 tok/s) step 4415/76294 | train loss 3.383679 | norm 0.1633 | lr 1.10e-03 | (9893.56 ms | 52993 tok/s) step 4416/76294 | train loss 3.406611 | norm 0.1517 | lr 1.10e-03 | (9893.25 ms | 52995 tok/s) step 4417/76294 | train loss 3.532969 | norm 0.1645 | lr 1.10e-03 | (9890.68 ms | 53008 tok/s) step 4418/76294 | train loss 3.411068 | norm 0.1499 | lr 1.10e-03 | (9952.72 ms | 52678 tok/s) step 4419/76294 | train loss 3.386896 | norm 0.1484 | lr 1.10e-03 | (9898.81 ms | 52965 tok/s) step 4420/76294 | train loss 3.422462 | norm 0.1698 | lr 1.10e-03 | (9895.81 ms | 52981 tok/s) step 4421/76294 | train loss 3.457045 | norm 0.1932 | lr 1.10e-03 | (9933.54 ms | 52780 tok/s) step 4422/76294 | train loss 3.384937 | norm 0.1559 | lr 1.10e-03 | (9891.67 ms | 53003 tok/s) step 4423/76294 | train loss 3.420328 | norm 0.2404 | lr 1.10e-03 | (9899.42 ms | 52962 tok/s) step 4424/76294 | train loss 3.419124 | norm 0.2473 | lr 1.10e-03 | (9889.59 ms | 53014 tok/s) step 4425/76294 | train loss 3.392663 | norm 0.2241 | lr 1.10e-03 | (9900.29 ms | 52957 tok/s) step 4426/76294 | train loss 3.397145 | norm 0.1766 | lr 1.10e-03 | (9890.66 ms | 53008 tok/s) step 4427/76294 | train loss 3.405998 | norm 0.1786 | lr 1.10e-03 | (9926.10 ms | 52819 tok/s) step 4428/76294 | train loss 3.436364 | norm 0.1670 | lr 1.10e-03 | (9886.76 ms | 53029 tok/s) step 4429/76294 | train loss 3.447150 | norm 0.1629 | lr 1.10e-03 | (9894.98 ms | 
52985 tok/s) step 4430/76294 | train loss 3.539380 | norm 0.1650 | lr 1.10e-03 | (9889.53 ms | 53014 tok/s) step 4431/76294 | train loss 3.441781 | norm 0.1581 | lr 1.10e-03 | (9930.73 ms | 52795 tok/s) step 4432/76294 | train loss 3.407362 | norm 0.1933 | lr 1.10e-03 | (9884.88 ms | 53039 tok/s) step 4433/76294 | train loss 3.360094 | norm 0.1604 | lr 1.10e-03 | (9954.26 ms | 52670 tok/s) step 4434/76294 | train loss 3.457053 | norm 0.1911 | lr 1.10e-03 | (9888.13 ms | 53022 tok/s) step 4435/76294 | train loss 3.575187 | norm 0.1507 | lr 1.10e-03 | (9894.26 ms | 52989 tok/s) step 4436/76294 | train loss 3.381516 | norm 0.1931 | lr 1.10e-03 | (9885.25 ms | 53037 tok/s) step 4437/76294 | train loss 3.411216 | norm 0.1481 | lr 1.10e-03 | (9893.09 ms | 52995 tok/s) step 4438/76294 | train loss 3.349373 | norm 0.1637 | lr 1.10e-03 | (9885.22 ms | 53038 tok/s) step 4439/76294 | train loss 3.386044 | norm 0.1662 | lr 1.10e-03 | (9905.62 ms | 52928 tok/s) step 4440/76294 | train loss 3.360490 | norm 0.1601 | lr 1.10e-03 | (9889.48 ms | 53015 tok/s) step 4441/76294 | train loss 3.408961 | norm 0.1610 | lr 1.10e-03 | (9899.06 ms | 52963 tok/s) step 4442/76294 | train loss 3.428002 | norm 0.1539 | lr 1.10e-03 | (9890.38 ms | 53010 tok/s) step 4443/76294 | train loss 3.337967 | norm 0.1507 | lr 1.10e-03 | (9900.10 ms | 52958 tok/s) step 4444/76294 | train loss 3.425587 | norm 0.1524 | lr 1.10e-03 | (9890.44 ms | 53010 tok/s) step 4445/76294 | train loss 3.360409 | norm 0.1495 | lr 1.10e-03 | (9929.95 ms | 52799 tok/s) step 4446/76294 | train loss 3.434933 | norm 0.1589 | lr 1.10e-03 | (9890.19 ms | 53011 tok/s) step 4447/76294 | train loss 3.397035 | norm 0.1462 | lr 1.10e-03 | (9957.64 ms | 52652 tok/s) step 4448/76294 | train loss 3.385276 | norm 0.1511 | lr 1.10e-03 | (9889.54 ms | 53014 tok/s) step 4449/76294 | train loss 3.717146 | norm 0.1734 | lr 1.10e-03 | (9887.11 ms | 53027 tok/s) step 4450/76294 | train loss 3.414692 | norm 0.1702 | lr 1.10e-03 | (9943.24 ms | 
52728 tok/s) step 4451/76294 | train loss 3.420805 | norm 0.1437 | lr 1.10e-03 | (9894.95 ms | 52985 tok/s) step 4452/76294 | train loss 3.414115 | norm 0.1712 | lr 1.10e-03 | (9904.47 ms | 52935 tok/s) step 4453/76294 | train loss 3.367868 | norm 0.1425 | lr 1.10e-03 | (9930.39 ms | 52796 tok/s) step 4454/76294 | train loss 3.410371 | norm 0.1609 | lr 1.10e-03 | (9889.38 ms | 53015 tok/s) step 4455/76294 | train loss 3.379352 | norm 0.1426 | lr 1.10e-03 | (9898.47 ms | 52967 tok/s) step 4456/76294 | train loss 3.419014 | norm 0.1431 | lr 1.10e-03 | (9939.22 ms | 52749 tok/s) step 4457/76294 | train loss 3.342787 | norm 0.1438 | lr 1.10e-03 | (9901.41 ms | 52951 tok/s) step 4458/76294 | train loss 3.360622 | norm 0.1370 | lr 1.10e-03 | (9904.61 ms | 52934 tok/s) step 4459/76294 | train loss 3.427350 | norm 0.1427 | lr 1.10e-03 | (9949.12 ms | 52697 tok/s) step 4460/76294 | train loss 3.398055 | norm 0.1320 | lr 1.10e-03 | (9957.03 ms | 52655 tok/s) step 4461/76294 | train loss 3.359877 | norm 0.1446 | lr 1.10e-03 | (9893.19 ms | 52995 tok/s) step 4462/76294 | train loss 3.366244 | norm 0.1532 | lr 1.10e-03 | (9883.37 ms | 53048 tok/s) step 4463/76294 | train loss 3.467267 | norm 0.1477 | lr 1.10e-03 | (9910.95 ms | 52900 tok/s) step 4464/76294 | train loss 3.390537 | norm 0.1384 | lr 1.10e-03 | (9886.57 ms | 53030 tok/s) step 4465/76294 | train loss 3.390322 | norm 0.1459 | lr 1.10e-03 | (9912.93 ms | 52889 tok/s) step 4466/76294 | train loss 3.424909 | norm 0.1295 | lr 1.10e-03 | (9890.46 ms | 53009 tok/s) step 4467/76294 | train loss 3.407320 | norm 0.1528 | lr 1.10e-03 | (9891.72 ms | 53003 tok/s) step 4468/76294 | train loss 3.348109 | norm 0.1609 | lr 1.10e-03 | (9904.42 ms | 52935 tok/s) step 4469/76294 | train loss 3.331896 | norm 0.1745 | lr 1.10e-03 | (9892.31 ms | 53000 tok/s) step 4470/76294 | train loss 3.422751 | norm 0.1614 | lr 1.10e-03 | (9886.49 ms | 53031 tok/s) step 4471/76294 | train loss 3.429706 | norm 0.1475 | lr 1.10e-03 | (9954.82 ms | 
52667 tok/s) step 4472/76294 | train loss 3.384048 | norm 0.1710 | lr 1.10e-03 | (9892.91 ms | 52996 tok/s) step 4473/76294 | train loss 3.420658 | norm 0.1465 | lr 1.10e-03 | (9898.23 ms | 52968 tok/s) step 4474/76294 | train loss 3.387447 | norm 0.1741 | lr 1.10e-03 | (9931.06 ms | 52793 tok/s) step 4475/76294 | train loss 3.428105 | norm 0.1655 | lr 1.10e-03 | (9894.30 ms | 52989 tok/s) step 4476/76294 | train loss 3.387555 | norm 0.1672 | lr 1.10e-03 | (9898.12 ms | 52968 tok/s) step 4477/76294 | train loss 3.429060 | norm 0.1642 | lr 1.10e-03 | (9889.83 ms | 53013 tok/s) step 4478/76294 | train loss 3.376280 | norm 0.1452 | lr 1.10e-03 | (9895.39 ms | 52983 tok/s) step 4479/76294 | train loss 3.410439 | norm 0.1584 | lr 1.10e-03 | (9899.58 ms | 52961 tok/s) step 4480/76294 | train loss 3.322730 | norm 0.1668 | lr 1.10e-03 | (9900.13 ms | 52958 tok/s) step 4481/76294 | train loss 3.386753 | norm 0.1589 | lr 1.10e-03 | (9893.27 ms | 52994 tok/s) step 4482/76294 | train loss 3.416375 | norm 0.1584 | lr 1.10e-03 | (9932.95 ms | 52783 tok/s) step 4483/76294 | train loss 3.436002 | norm 0.1390 | lr 1.10e-03 | (9933.55 ms | 52780 tok/s) step 4484/76294 | train loss 3.433979 | norm 0.1484 | lr 1.10e-03 | (9887.24 ms | 53027 tok/s) step 4485/76294 | train loss 3.433522 | norm 0.1684 | lr 1.10e-03 | (9903.97 ms | 52937 tok/s) step 4486/76294 | train loss 3.423102 | norm 0.1522 | lr 1.10e-03 | (9902.34 ms | 52946 tok/s) step 4487/76294 | train loss 3.467134 | norm 0.1660 | lr 1.10e-03 | (9898.70 ms | 52965 tok/s) step 4488/76294 | train loss 3.400236 | norm 0.1789 | lr 1.10e-03 | (9933.98 ms | 52777 tok/s) step 4489/76294 | train loss 3.427616 | norm 0.1447 | lr 1.10e-03 | (9897.96 ms | 52969 tok/s) step 4490/76294 | train loss 3.431105 | norm 0.1907 | lr 1.10e-03 | (10766.42 ms | 48697 tok/s) step 4491/76294 | train loss 3.381490 | norm 0.1570 | lr 1.10e-03 | (9886.05 ms | 53033 tok/s) step 4492/76294 | train loss 3.378340 | norm 0.1659 | lr 1.10e-03 | (9886.56 ms | 
53030 tok/s) step 4493/76294 | train loss 3.396926 | norm 0.1525 | lr 1.10e-03 | (10719.36 ms | 48910 tok/s) step 4494/76294 | train loss 3.369910 | norm 0.1455 | lr 1.10e-03 | (9883.29 ms | 53048 tok/s) step 4495/76294 | train loss 3.435500 | norm 0.1676 | lr 1.10e-03 | (9939.30 ms | 52749 tok/s) step 4496/76294 | train loss 3.326376 | norm 0.1603 | lr 1.10e-03 | (9888.40 ms | 53020 tok/s) step 4497/76294 | train loss 3.378317 | norm 0.1374 | lr 1.10e-03 | (9893.03 ms | 52996 tok/s) step 4498/76294 | train loss 3.389144 | norm 0.1657 | lr 1.10e-03 | (9889.74 ms | 53013 tok/s) step 4499/76294 | train loss 3.318259 | norm 0.1679 | lr 1.10e-03 | (9894.80 ms | 52986 tok/s) step 4500/76294 | train loss 3.444010 | norm 0.1708 | lr 1.10e-03 | (9951.56 ms | 52684 tok/s) val loss: 3.374764 saving model checkpoint to ./results/gpt2-350M-gqa/step_4500.pth step 4501/76294 | train loss 3.333516 | norm 0.1588 | lr 1.10e-03 | (9971.73 ms | 52577 tok/s) step 4502/76294 | train loss 3.380828 | norm 0.1502 | lr 1.10e-03 | (9871.17 ms | 53113 tok/s) step 4503/76294 | train loss 3.330438 | norm 0.1456 | lr 1.10e-03 | (9882.40 ms | 53053 tok/s) step 4504/76294 | train loss 3.361475 | norm 0.1418 | lr 1.10e-03 | (9915.93 ms | 52873 tok/s) step 4505/76294 | train loss 3.309536 | norm 0.1504 | lr 1.10e-03 | (9888.57 ms | 53020 tok/s) step 4506/76294 | train loss 3.420297 | norm 0.1571 | lr 1.10e-03 | (9893.68 ms | 52992 tok/s) step 4507/76294 | train loss 3.359754 | norm 0.1591 | lr 1.10e-03 | (9886.37 ms | 53031 tok/s) step 4508/76294 | train loss 3.340803 | norm 0.1642 | lr 1.10e-03 | (9945.97 ms | 52714 tok/s) step 4509/76294 | train loss 3.329222 | norm 0.1552 | lr 1.10e-03 | (9887.11 ms | 53027 tok/s) step 4510/76294 | train loss 3.417356 | norm 0.1990 | lr 1.10e-03 | (9884.35 ms | 53042 tok/s) step 4511/76294 | train loss 3.349677 | norm 0.1602 | lr 1.10e-03 | (9894.57 ms | 52987 tok/s) step 4512/76294 | train loss 3.342579 | norm 0.1706 | lr 1.10e-03 | (9926.76 ms | 52816 tok/s) 
step 4513/76294 | train loss 3.325578 | norm 0.1521 | lr 1.10e-03 | (9892.57 ms | 52998 tok/s) step 4514/76294 | train loss 3.314032 | norm 0.1651 | lr 1.10e-03 | (9898.58 ms | 52966 tok/s) step 4515/76294 | train loss 3.360452 | norm 0.1519 | lr 1.10e-03 | (9930.32 ms | 52797 tok/s) step 4516/76294 | train loss 3.354993 | norm 0.1616 | lr 1.10e-03 | (9903.12 ms | 52942 tok/s) step 4517/76294 | train loss 3.406947 | norm 0.1508 | lr 1.10e-03 | (9953.76 ms | 52672 tok/s) step 4518/76294 | train loss 3.330962 | norm 0.1517 | lr 1.10e-03 | (9918.29 ms | 52861 tok/s) step 4519/76294 | train loss 3.408597 | norm 0.1323 | lr 1.10e-03 | (9955.10 ms | 52665 tok/s) step 4520/76294 | train loss 3.324436 | norm 0.1411 | lr 1.10e-03 | (9900.56 ms | 52955 tok/s) step 4521/76294 | train loss 3.454388 | norm 0.1507 | lr 1.09e-03 | (9889.49 ms | 53015 tok/s) step 4522/76294 | train loss 3.343888 | norm 0.1385 | lr 1.09e-03 | (9908.04 ms | 52915 tok/s) step 4523/76294 | train loss 3.396806 | norm 0.1546 | lr 1.09e-03 | (9932.38 ms | 52786 tok/s) step 4524/76294 | train loss 3.336776 | norm 0.1507 | lr 1.09e-03 | (9887.88 ms | 53023 tok/s) step 4525/76294 | train loss 3.354720 | norm 0.1425 | lr 1.09e-03 | (9896.84 ms | 52975 tok/s) step 4526/76294 | train loss 3.315019 | norm 0.1737 | lr 1.09e-03 | (9889.99 ms | 53012 tok/s) step 4527/76294 | train loss 3.393557 | norm 0.1420 | lr 1.09e-03 | (9900.77 ms | 52954 tok/s) step 4528/76294 | train loss 3.360754 | norm 0.1386 | lr 1.09e-03 | (9892.09 ms | 53001 tok/s) step 4529/76294 | train loss 3.394957 | norm 0.1631 | lr 1.09e-03 | (9899.49 ms | 52961 tok/s) step 4530/76294 | train loss 3.353729 | norm 0.1359 | lr 1.09e-03 | (9914.43 ms | 52881 tok/s) step 4531/76294 | train loss 3.382809 | norm 0.1560 | lr 1.09e-03 | (9892.19 ms | 53000 tok/s) step 4532/76294 | train loss 3.331724 | norm 0.1434 | lr 1.09e-03 | (9892.63 ms | 52998 tok/s) step 4533/76294 | train loss 3.393227 | norm 0.1740 | lr 1.09e-03 | (9898.90 ms | 52964 tok/s) step 
4534/76294 | train loss 3.385811 | norm 0.1832 | lr 1.09e-03 | (9894.84 ms | 52986 tok/s) step 4535/76294 | train loss 3.352045 | norm 0.1649 | lr 1.09e-03 | (9900.45 ms | 52956 tok/s) step 4536/76294 | train loss 3.396020 | norm 0.1533 | lr 1.09e-03 | (9889.53 ms | 53014 tok/s) step 4537/76294 | train loss 3.306282 | norm 0.1344 | lr 1.09e-03 | (9912.23 ms | 52893 tok/s) step 4538/76294 | train loss 3.435543 | norm 0.1503 | lr 1.09e-03 | (9899.04 ms | 52964 tok/s) step 4539/76294 | train loss 3.250914 | norm 0.1369 | lr 1.09e-03 | (9895.38 ms | 52983 tok/s) step 4540/76294 | train loss 3.466855 | norm 0.1532 | lr 1.09e-03 | (9892.75 ms | 52997 tok/s) step 4541/76294 | train loss 3.346597 | norm 0.1558 | lr 1.09e-03 | (9891.84 ms | 53002 tok/s) step 4542/76294 | train loss 3.451218 | norm 0.1558 | lr 1.09e-03 | (9921.45 ms | 52844 tok/s) step 4543/76294 | train loss 3.341232 | norm 0.1923 | lr 1.09e-03 | (9893.62 ms | 52993 tok/s) step 4544/76294 | train loss 3.401171 | norm 0.1662 | lr 1.09e-03 | (9930.91 ms | 52794 tok/s) step 4545/76294 | train loss 3.343905 | norm 0.1674 | lr 1.09e-03 | (9924.00 ms | 52830 tok/s) step 4546/76294 | train loss 3.399020 | norm 0.1774 | lr 1.09e-03 | (9887.72 ms | 53024 tok/s) step 4547/76294 | train loss 3.326734 | norm 0.1683 | lr 1.09e-03 | (9892.87 ms | 52997 tok/s) step 4548/76294 | train loss 3.449715 | norm 0.1835 | lr 1.09e-03 | (9890.03 ms | 53012 tok/s) step 4549/76294 | train loss 3.396497 | norm 0.1705 | lr 1.09e-03 | (9901.49 ms | 52950 tok/s) step 4550/76294 | train loss 3.382401 | norm 0.1896 | lr 1.09e-03 | (9896.30 ms | 52978 tok/s) step 4551/76294 | train loss 3.432050 | norm 0.1689 | lr 1.09e-03 | (9897.84 ms | 52970 tok/s) step 4552/76294 | train loss 3.497955 | norm 0.1988 | lr 1.09e-03 | (9889.31 ms | 53016 tok/s) step 4553/76294 | train loss 3.346280 | norm 0.1965 | lr 1.09e-03 | (9895.71 ms | 52981 tok/s) step 4554/76294 | train loss 3.366645 | norm 0.1564 | lr 1.09e-03 | (9935.47 ms | 52769 tok/s) step 
4555/76294 | train loss 3.413035 | norm 0.1825 | lr 1.09e-03 | (9930.33 ms | 52797 tok/s) step 4556/76294 | train loss 3.358314 | norm 0.2145 | lr 1.09e-03 | (9918.87 ms | 52858 tok/s) step 4557/76294 | train loss 3.365408 | norm 0.1650 | lr 1.09e-03 | (9961.82 ms | 52630 tok/s) step 4558/76294 | train loss 3.363381 | norm 0.1856 | lr 1.09e-03 | (9898.45 ms | 52967 tok/s) step 4559/76294 | train loss 3.382442 | norm 0.1697 | lr 1.09e-03 | (9889.20 ms | 53016 tok/s) step 4560/76294 | train loss 3.367422 | norm 0.1725 | lr 1.09e-03 | (9946.52 ms | 52711 tok/s) step 4561/76294 | train loss 3.439696 | norm 0.1581 | lr 1.09e-03 | (9895.04 ms | 52985 tok/s) step 4562/76294 | train loss 3.324129 | norm 0.1560 | lr 1.09e-03 | (9892.21 ms | 53000 tok/s) step 4563/76294 | train loss 3.369854 | norm 0.1707 | lr 1.09e-03 | (9934.15 ms | 52776 tok/s) step 4564/76294 | train loss 3.338168 | norm 0.1667 | lr 1.09e-03 | (9891.19 ms | 53006 tok/s) step 4565/76294 | train loss 3.324442 | norm 0.1595 | lr 1.09e-03 | (9920.19 ms | 52851 tok/s) step 4566/76294 | train loss 3.362132 | norm 0.1538 | lr 1.09e-03 | (9886.32 ms | 53032 tok/s) step 4567/76294 | train loss 3.358769 | norm 0.1446 | lr 1.09e-03 | (9935.24 ms | 52771 tok/s) step 4568/76294 | train loss 3.393773 | norm 0.1554 | lr 1.09e-03 | (9892.63 ms | 52998 tok/s) step 4569/76294 | train loss 3.313056 | norm 0.1628 | lr 1.09e-03 | (9937.16 ms | 52760 tok/s) step 4570/76294 | train loss 3.386992 | norm 0.1361 | lr 1.09e-03 | (9891.36 ms | 53005 tok/s) step 4571/76294 | train loss 3.362311 | norm 0.1592 | lr 1.09e-03 | (9890.14 ms | 53011 tok/s) step 4572/76294 | train loss 3.399796 | norm 0.1648 | lr 1.09e-03 | (9921.74 ms | 52842 tok/s) step 4573/76294 | train loss 3.436853 | norm 0.1440 | lr 1.09e-03 | (9909.41 ms | 52908 tok/s) step 4574/76294 | train loss 3.364603 | norm 0.1570 | lr 1.09e-03 | (9896.17 ms | 52979 tok/s) step 4575/76294 | train loss 3.253269 | norm 0.1622 | lr 1.09e-03 | (9896.19 ms | 52979 tok/s) step 
4576/76294 | train loss 3.534874 | norm 0.1559 | lr 1.09e-03 | (9890.27 ms | 53011 tok/s) step 4577/76294 | train loss 3.338172 | norm 0.1720 | lr 1.09e-03 | (9894.98 ms | 52985 tok/s) step 4578/76294 | train loss 3.417486 | norm 0.1378 | lr 1.09e-03 | (9891.79 ms | 53002 tok/s) step 4579/76294 | train loss 3.243839 | norm 0.1715 | lr 1.09e-03 | (9895.20 ms | 52984 tok/s) step 4580/76294 | train loss 3.394851 | norm 0.1509 | lr 1.09e-03 | (9884.94 ms | 53039 tok/s) step 4581/76294 | train loss 3.356161 | norm 0.1666 | lr 1.09e-03 | (9905.02 ms | 52932 tok/s) step 4582/76294 | train loss 3.371603 | norm 0.1602 | lr 1.09e-03 | (9928.30 ms | 52807 tok/s) step 4583/76294 | train loss 3.326584 | norm 0.1870 | lr 1.09e-03 | (9887.97 ms | 53023 tok/s) step 4584/76294 | train loss 3.426436 | norm 0.1633 | lr 1.09e-03 | (9910.22 ms | 52904 tok/s) step 4585/76294 | train loss 3.350269 | norm 0.1607 | lr 1.09e-03 | (9889.21 ms | 53016 tok/s) step 4586/76294 | train loss 3.423413 | norm 0.1619 | lr 1.09e-03 | (9912.12 ms | 52894 tok/s) step 4587/76294 | train loss 3.360586 | norm 0.1421 | lr 1.09e-03 | (10722.91 ms | 48894 tok/s) step 4588/76294 | train loss 3.357824 | norm 0.1613 | lr 1.09e-03 | (9880.47 ms | 53063 tok/s) step 4589/76294 | train loss 3.318950 | norm 0.1426 | lr 1.09e-03 | (11342.22 ms | 46224 tok/s) step 4590/76294 | train loss 3.375419 | norm 0.1505 | lr 1.09e-03 | (9916.46 ms | 52870 tok/s) step 4591/76294 | train loss 3.363044 | norm 0.1669 | lr 1.09e-03 | (9879.80 ms | 53067 tok/s) step 4592/76294 | train loss 3.372057 | norm 0.1646 | lr 1.09e-03 | (9902.59 ms | 52945 tok/s) step 4593/76294 | train loss 3.443950 | norm 0.1512 | lr 1.09e-03 | (9941.45 ms | 52738 tok/s) step 4594/76294 | train loss 3.369854 | norm 0.1493 | lr 1.09e-03 | (9878.30 ms | 53075 tok/s) step 4595/76294 | train loss 3.383282 | norm 0.1624 | lr 1.09e-03 | (9882.07 ms | 53054 tok/s) step 4596/76294 | train loss 3.278076 | norm 0.1824 | lr 1.09e-03 | (9887.96 ms | 53023 tok/s) step 
4597/76294 | train loss 3.489820 | norm 0.2006 | lr 1.09e-03 | (9889.37 ms | 53015 tok/s) step 4598/76294 | train loss 3.274211 | norm 0.1772 | lr 1.09e-03 | (9875.55 ms | 53090 tok/s) step 4599/76294 | train loss 3.392787 | norm 0.1614 | lr 1.09e-03 | (9893.64 ms | 52992 tok/s) step 4600/76294 | train loss 3.349853 | norm 0.1526 | lr 1.09e-03 | (9881.74 ms | 53056 tok/s) step 4601/76294 | train loss 3.385047 | norm 0.1742 | lr 1.09e-03 | (9935.30 ms | 52770 tok/s) step 4602/76294 | train loss 3.353271 | norm 0.1625 | lr 1.09e-03 | (9887.39 ms | 53026 tok/s) step 4603/76294 | train loss 3.363334 | norm 0.1689 | lr 1.09e-03 | (9886.11 ms | 53033 tok/s) step 4604/76294 | train loss 3.320319 | norm 0.1699 | lr 1.09e-03 | (9889.40 ms | 53015 tok/s) step 4605/76294 | train loss 3.408832 | norm 0.1585 | lr 1.09e-03 | (9889.09 ms | 53017 tok/s) step 4606/76294 | train loss 3.341535 | norm 0.1855 | lr 1.09e-03 | (9888.97 ms | 53017 tok/s) step 4607/76294 | train loss 3.371595 | norm 0.1702 | lr 1.09e-03 | (9923.22 ms | 52834 tok/s) step 4608/76294 | train loss 3.337375 | norm 0.1556 | lr 1.09e-03 | (9889.16 ms | 53016 tok/s) step 4609/76294 | train loss 3.348243 | norm 0.1556 | lr 1.09e-03 | (9886.45 ms | 53031 tok/s) step 4610/76294 | train loss 3.331267 | norm 0.1616 | lr 1.09e-03 | (9971.43 ms | 52579 tok/s) step 4611/76294 | train loss 3.408970 | norm 0.1628 | lr 1.09e-03 | (9962.65 ms | 52625 tok/s) step 4612/76294 | train loss 3.341755 | norm 0.1326 | lr 1.09e-03 | (9885.10 ms | 53038 tok/s) step 4613/76294 | train loss 3.448336 | norm 0.1843 | lr 1.09e-03 | (9892.24 ms | 53000 tok/s) step 4614/76294 | train loss 3.329550 | norm 0.1610 | lr 1.09e-03 | (9891.14 ms | 53006 tok/s) step 4615/76294 | train loss 3.390769 | norm 0.1684 | lr 1.09e-03 | (9888.77 ms | 53019 tok/s) step 4616/76294 | train loss 3.355044 | norm 0.1869 | lr 1.09e-03 | (9891.12 ms | 53006 tok/s) step 4617/76294 | train loss 3.408497 | norm 0.1496 | lr 1.09e-03 | (9888.28 ms | 53021 tok/s) step 
4618/76294 | train loss 3.304795 | norm 0.1672 | lr 1.09e-03 | (9894.59 ms | 52987 tok/s) step 4619/76294 | train loss 3.395816 | norm 0.1486 | lr 1.09e-03 | (9889.90 ms | 53012 tok/s) step 4620/76294 | train loss 3.352068 | norm 0.1572 | lr 1.09e-03 | (9891.92 ms | 53002 tok/s) step 4621/76294 | train loss 3.411868 | norm 0.1740 | lr 1.09e-03 | (9891.67 ms | 53003 tok/s) step 4622/76294 | train loss 3.382501 | norm 0.1381 | lr 1.09e-03 | (9924.90 ms | 52826 tok/s) step 4623/76294 | train loss 3.388913 | norm 0.1491 | lr 1.09e-03 | (9893.56 ms | 52993 tok/s) step 4624/76294 | train loss 3.374985 | norm 0.1498 | lr 1.09e-03 | (9895.48 ms | 52983 tok/s) step 4625/76294 | train loss 3.388504 | norm 0.1265 | lr 1.09e-03 | (9894.15 ms | 52990 tok/s) step 4626/76294 | train loss 3.365590 | norm 0.1593 | lr 1.09e-03 | (9894.63 ms | 52987 tok/s) step 4627/76294 | train loss 3.448916 | norm 0.1573 | lr 1.09e-03 | (9955.22 ms | 52665 tok/s) step 4628/76294 | train loss 3.285195 | norm 0.1676 | lr 1.09e-03 | (9886.68 ms | 53030 tok/s) step 4629/76294 | train loss 3.311394 | norm 0.1548 | lr 1.09e-03 | (9900.42 ms | 52956 tok/s) step 4630/76294 | train loss 3.429039 | norm 0.1637 | lr 1.09e-03 | (9887.86 ms | 53023 tok/s) step 4631/76294 | train loss 3.451763 | norm 0.1703 | lr 1.09e-03 | (9901.14 ms | 52952 tok/s) step 4632/76294 | train loss 3.368444 | norm 0.1969 | lr 1.09e-03 | (9889.00 ms | 53017 tok/s) step 4633/76294 | train loss 3.422331 | norm 0.1515 | lr 1.09e-03 | (9891.96 ms | 53001 tok/s) step 4634/76294 | train loss 3.384058 | norm 0.1645 | lr 1.09e-03 | (9886.62 ms | 53030 tok/s) step 4635/76294 | train loss 3.421548 | norm 0.1595 | lr 1.09e-03 | (9909.84 ms | 52906 tok/s) step 4636/76294 | train loss 3.510560 | norm 0.1712 | lr 1.09e-03 | (9897.95 ms | 52969 tok/s) step 4637/76294 | train loss 3.488222 | norm 0.2392 | lr 1.09e-03 | (9890.49 ms | 53009 tok/s) step 4638/76294 | train loss 3.375383 | norm 0.1906 | lr 1.09e-03 | (9902.41 ms | 52946 tok/s) step 
4639/76294 | train loss 3.330912 | norm 0.1755 | lr 1.09e-03 | (9889.24 ms | 53016 tok/s) step 4640/76294 | train loss 3.382143 | norm 0.1602 | lr 1.09e-03 | (9924.30 ms | 52829 tok/s) step 4641/76294 | train loss 3.338947 | norm 0.1822 | lr 1.09e-03 | (9890.64 ms | 53008 tok/s) step 4642/76294 | train loss 3.403421 | norm 0.1613 | lr 1.09e-03 | (9891.29 ms | 53005 tok/s) step 4643/76294 | train loss 3.379008 | norm 0.1374 | lr 1.09e-03 | (9932.94 ms | 52783 tok/s) step 4644/76294 | train loss 3.528522 | norm 0.1723 | lr 1.09e-03 | (9891.29 ms | 53005 tok/s) step 4645/76294 | train loss 3.386034 | norm 0.1928 | lr 1.09e-03 | (9929.24 ms | 52802 tok/s) step 4646/76294 | train loss 3.481088 | norm 0.1558 | lr 1.09e-03 | (9960.18 ms | 52638 tok/s) step 4647/76294 | train loss 3.364587 | norm 0.1860 | lr 1.09e-03 | (9892.16 ms | 53000 tok/s) step 4648/76294 | train loss 3.329255 | norm 0.1720 | lr 1.09e-03 | (9884.81 ms | 53040 tok/s) step 4649/76294 | train loss 3.323677 | norm 0.1641 | lr 1.09e-03 | (9894.43 ms | 52988 tok/s) step 4650/76294 | train loss 3.387496 | norm 0.1707 | lr 1.09e-03 | (9924.96 ms | 52825 tok/s) step 4651/76294 | train loss 3.314993 | norm 0.1447 | lr 1.09e-03 | (9892.46 ms | 52999 tok/s) step 4652/76294 | train loss 3.406800 | norm 0.1579 | lr 1.09e-03 | (9899.75 ms | 52960 tok/s) step 4653/76294 | train loss 3.368454 | norm 0.1575 | lr 1.09e-03 | (9888.94 ms | 53018 tok/s) step 4654/76294 | train loss 3.428024 | norm 0.1552 | lr 1.09e-03 | (9891.01 ms | 53006 tok/s) step 4655/76294 | train loss 3.338549 | norm 0.1615 | lr 1.09e-03 | (9921.72 ms | 52842 tok/s) step 4656/76294 | train loss 3.390012 | norm 0.1645 | lr 1.09e-03 | (9891.49 ms | 53004 tok/s) step 4657/76294 | train loss 3.400106 | norm 0.1402 | lr 1.09e-03 | (9889.98 ms | 53012 tok/s) step 4658/76294 | train loss 3.345006 | norm 0.1429 | lr 1.09e-03 | (9894.07 ms | 52990 tok/s) step 4659/76294 | train loss 3.316733 | norm 0.1399 | lr 1.09e-03 | (9929.66 ms | 52800 tok/s) step 
4660/76294 | train loss 3.346915 | norm 0.1468 | lr 1.09e-03 | (9886.46 ms | 53031 tok/s) step 4661/76294 | train loss 3.367131 | norm 0.1294 | lr 1.09e-03 | (9895.11 ms | 52985 tok/s) step 4662/76294 | train loss 3.442356 | norm 0.1407 | lr 1.09e-03 | (9885.26 ms | 53037 tok/s) step 4663/76294 | train loss 3.325779 | norm 0.1598 | lr 1.09e-03 | (9937.08 ms | 52761 tok/s) step 4664/76294 | train loss 3.462661 | norm 0.1587 | lr 1.09e-03 | (9886.60 ms | 53030 tok/s) step 4665/76294 | train loss 3.323140 | norm 0.1790 | lr 1.09e-03 | (9893.20 ms | 52995 tok/s) step 4666/76294 | train loss 3.387826 | norm 0.1797 | lr 1.09e-03 | (9889.15 ms | 53017 tok/s) step 4667/76294 | train loss 3.429895 | norm 0.2094 | lr 1.09e-03 | (9897.99 ms | 52969 tok/s) step 4668/76294 | train loss 3.360438 | norm 0.2180 | lr 1.09e-03 | (9886.18 ms | 53032 tok/s) step 4669/76294 | train loss 3.426522 | norm 0.2018 | lr 1.09e-03 | (9907.68 ms | 52917 tok/s) step 4670/76294 | train loss 3.330088 | norm 0.1563 | lr 1.09e-03 | (9894.29 ms | 52989 tok/s) step 4671/76294 | train loss 3.398340 | norm 0.1796 | lr 1.09e-03 | (9891.68 ms | 53003 tok/s) step 4672/76294 | train loss 3.326036 | norm 0.1339 | lr 1.09e-03 | (9890.91 ms | 53007 tok/s) step 4673/76294 | train loss 3.356817 | norm 0.1560 | lr 1.09e-03 | (9887.19 ms | 53027 tok/s) step 4674/76294 | train loss 3.305351 | norm 0.1462 | lr 1.09e-03 | (9890.09 ms | 53011 tok/s) step 4675/76294 | train loss 3.353131 | norm 0.1314 | lr 1.09e-03 | (9887.22 ms | 53027 tok/s) step 4676/76294 | train loss 3.362493 | norm 0.1441 | lr 1.09e-03 | (9889.35 ms | 53015 tok/s) step 4677/76294 | train loss 3.367498 | norm 0.1459 | lr 1.09e-03 | (9890.82 ms | 53008 tok/s) step 4678/76294 | train loss 3.324055 | norm 0.1371 | lr 1.09e-03 | (9893.00 ms | 52996 tok/s) step 4679/76294 | train loss 3.414800 | norm 0.1340 | lr 1.09e-03 | (9893.99 ms | 52991 tok/s) step 4680/76294 | train loss 3.311035 | norm 0.1411 | lr 1.09e-03 | (9892.74 ms | 52997 tok/s) step 
4681/76294 | train loss 3.377411 | norm 0.1463 | lr 1.09e-03 | (9889.73 ms | 53013 tok/s) step 4682/76294 | train loss 3.351938 | norm 0.1498 | lr 1.09e-03 | (9955.43 ms | 52664 tok/s) step 4683/76294 | train loss 3.366692 | norm 0.1370 | lr 1.09e-03 | (9893.39 ms | 52994 tok/s) step 4684/76294 | train loss 3.321249 | norm 0.1444 | lr 1.09e-03 | (10187.34 ms | 51465 tok/s) step 4685/76294 | train loss 3.401533 | norm 0.1426 | lr 1.09e-03 | (10771.25 ms | 48675 tok/s) step 4686/76294 | train loss 3.363775 | norm 0.1490 | lr 1.09e-03 | (9876.79 ms | 53083 tok/s) step 4687/76294 | train loss 3.366358 | norm 0.1421 | lr 1.09e-03 | (9928.02 ms | 52809 tok/s) step 4688/76294 | train loss 3.433753 | norm 0.1457 | lr 1.09e-03 | (9881.55 ms | 53057 tok/s) step 4689/76294 | train loss 3.334035 | norm 0.1518 | lr 1.09e-03 | (9882.61 ms | 53052 tok/s) step 4690/76294 | train loss 3.443695 | norm 0.1728 | lr 1.09e-03 | (9890.03 ms | 53012 tok/s) step 4691/76294 | train loss 3.379269 | norm 0.1420 | lr 1.09e-03 | (9886.68 ms | 53030 tok/s) step 4692/76294 | train loss 3.369322 | norm 0.1562 | lr 1.09e-03 | (9892.62 ms | 52998 tok/s) step 4693/76294 | train loss 3.442505 | norm 0.1760 | lr 1.09e-03 | (9944.78 ms | 52720 tok/s) step 4694/76294 | train loss 3.422543 | norm 0.1575 | lr 1.09e-03 | (9880.24 ms | 53064 tok/s) step 4695/76294 | train loss 3.435255 | norm 0.1431 | lr 1.09e-03 | (9898.73 ms | 52965 tok/s) step 4696/76294 | train loss 3.418914 | norm 0.1499 | lr 1.09e-03 | (9904.04 ms | 52937 tok/s) step 4697/76294 | train loss 3.395542 | norm 0.1638 | lr 1.09e-03 | (9889.59 ms | 53014 tok/s) step 4698/76294 | train loss 3.422769 | norm 0.1588 | lr 1.09e-03 | (9887.31 ms | 53026 tok/s) step 4699/76294 | train loss 3.387074 | norm 0.1519 | lr 1.09e-03 | (9885.48 ms | 53036 tok/s) step 4700/76294 | train loss 3.397125 | norm 0.1365 | lr 1.09e-03 | (9891.14 ms | 53006 tok/s) step 4701/76294 | train loss 3.404397 | norm 0.1372 | lr 1.09e-03 | (9903.22 ms | 52941 tok/s) step 
4702/76294 | train loss 3.365201 | norm 0.1258 | lr 1.09e-03 | (10632.78 ms | 49309 tok/s) step 4703/76294 | train loss 3.382989 | norm 0.1381 | lr 1.09e-03 | (9881.07 ms | 53060 tok/s) step 4704/76294 | train loss 3.381197 | norm 0.1386 | lr 1.08e-03 | (9939.47 ms | 52748 tok/s) step 4705/76294 | train loss 3.421849 | norm 0.1583 | lr 1.08e-03 | (9893.14 ms | 52995 tok/s) step 4706/76294 | train loss 3.511405 | norm 0.1327 | lr 1.08e-03 | (9881.39 ms | 53058 tok/s) step 4707/76294 | train loss 3.368200 | norm 0.1465 | lr 1.08e-03 | (9883.10 ms | 53049 tok/s) step 4708/76294 | train loss 3.376534 | norm 0.1348 | lr 1.08e-03 | (9887.96 ms | 53023 tok/s) step 4709/76294 | train loss 3.439842 | norm 0.1425 | lr 1.08e-03 | (9927.32 ms | 52813 tok/s) step 4710/76294 | train loss 3.394285 | norm 0.1460 | lr 1.08e-03 | (9885.69 ms | 53035 tok/s) step 4711/76294 | train loss 3.341475 | norm 0.1634 | lr 1.08e-03 | (12589.43 ms | 41645 tok/s) step 4712/76294 | train loss 3.356538 | norm 0.1452 | lr 1.08e-03 | (9865.47 ms | 53144 tok/s) step 4713/76294 | train loss 3.405646 | norm 0.1530 | lr 1.08e-03 | (9922.85 ms | 52836 tok/s) step 4714/76294 | train loss 3.382856 | norm 0.1400 | lr 1.08e-03 | (9879.58 ms | 53068 tok/s) step 4715/76294 | train loss 3.370805 | norm 0.1313 | lr 1.08e-03 | (9884.69 ms | 53040 tok/s) step 4716/76294 | train loss 3.426343 | norm 0.1540 | lr 1.08e-03 | (9893.90 ms | 52991 tok/s) step 4717/76294 | train loss 3.360688 | norm 0.1393 | lr 1.08e-03 | (9890.31 ms | 53010 tok/s) step 4718/76294 | train loss 3.442509 | norm 0.1567 | lr 1.08e-03 | (9895.03 ms | 52985 tok/s) step 4719/76294 | train loss 3.344191 | norm 0.1572 | lr 1.08e-03 | (9896.88 ms | 52975 tok/s) step 4720/76294 | train loss 3.400742 | norm 0.1446 | lr 1.08e-03 | (9898.17 ms | 52968 tok/s) step 4721/76294 | train loss 3.430918 | norm 0.1355 | lr 1.08e-03 | (9895.29 ms | 52984 tok/s) step 4722/76294 | train loss 3.415129 | norm 0.1374 | lr 1.08e-03 | (9897.69 ms | 52971 tok/s) step 
4723/76294 | train loss 3.386837 | norm 0.1434 | lr 1.08e-03 | (9898.95 ms | 52964 tok/s) step 4724/76294 | train loss 3.371026 | norm 0.1479 | lr 1.08e-03 | (9894.10 ms | 52990 tok/s) step 4725/76294 | train loss 3.368411 | norm 0.1582 | lr 1.08e-03 | (9935.02 ms | 52772 tok/s) step 4726/76294 | train loss 3.380792 | norm 0.1572 | lr 1.08e-03 | (9890.74 ms | 53008 tok/s) step 4727/76294 | train loss 3.382889 | norm 0.1472 | lr 1.08e-03 | (9902.47 ms | 52945 tok/s) step 4728/76294 | train loss 3.356972 | norm 0.1652 | lr 1.08e-03 | (9887.41 ms | 53026 tok/s) step 4729/76294 | train loss 3.373164 | norm 0.1863 | lr 1.08e-03 | (9898.88 ms | 52964 tok/s) step 4730/76294 | train loss 3.435081 | norm 0.1386 | lr 1.08e-03 | (9893.22 ms | 52995 tok/s) step 4731/76294 | train loss 3.367649 | norm 0.1784 | lr 1.08e-03 | (9897.05 ms | 52974 tok/s) step 4732/76294 | train loss 3.400824 | norm 0.1666 | lr 1.08e-03 | (9893.38 ms | 52994 tok/s) step 4733/76294 | train loss 3.389817 | norm 0.1792 | lr 1.08e-03 | (9897.94 ms | 52969 tok/s) step 4734/76294 | train loss 3.427655 | norm 0.1958 | lr 1.08e-03 | (9892.15 ms | 53000 tok/s) step 4735/76294 | train loss 3.363062 | norm 0.1784 | lr 1.08e-03 | (9898.82 ms | 52965 tok/s) step 4736/76294 | train loss 3.344580 | norm 0.1670 | lr 1.08e-03 | (9925.21 ms | 52824 tok/s) step 4737/76294 | train loss 3.346267 | norm 0.1769 | lr 1.08e-03 | (9894.55 ms | 52988 tok/s) step 4738/76294 | train loss 3.404085 | norm 0.1730 | lr 1.08e-03 | (9920.92 ms | 52847 tok/s) step 4739/76294 | train loss 3.402966 | norm 0.1851 | lr 1.08e-03 | (9958.56 ms | 52647 tok/s) step 4740/76294 | train loss 3.318105 | norm 0.1346 | lr 1.08e-03 | (9894.79 ms | 52986 tok/s) step 4741/76294 | train loss 3.393844 | norm 0.1728 | lr 1.08e-03 | (9893.28 ms | 52994 tok/s) step 4742/76294 | train loss 3.418764 | norm 0.1447 | lr 1.08e-03 | (9943.67 ms | 52726 tok/s) step 4743/76294 | train loss 3.381029 | norm 0.1734 | lr 1.08e-03 | (9909.50 ms | 52908 tok/s) step 
4744/76294 | train loss 3.416544 | norm 0.1437 | lr 1.08e-03 | (9894.79 ms | 52986 tok/s) step 4745/76294 | train loss 3.441916 | norm 0.1426 | lr 1.08e-03 | (9931.20 ms | 52792 tok/s) step 4746/76294 | train loss 3.390611 | norm 0.1581 | lr 1.08e-03 | (9890.75 ms | 53008 tok/s) step 4747/76294 | train loss 3.414271 | norm 0.1485 | lr 1.08e-03 | (9898.26 ms | 52968 tok/s) step 4748/76294 | train loss 3.370261 | norm 0.1554 | lr 1.08e-03 | (9955.86 ms | 52661 tok/s) step 4749/76294 | train loss 3.337446 | norm 0.1543 | lr 1.08e-03 | (9901.20 ms | 52952 tok/s) step 4750/76294 | train loss 3.349731 | norm 0.1454 | lr 1.08e-03 | (9906.65 ms | 52923 tok/s) val loss: 3.363554 saving model checkpoint to ./results/gpt2-350M-gqa/step_4750.pth step 4751/76294 | train loss 3.454092 | norm 0.1607 | lr 1.08e-03 | (9947.33 ms | 52706 tok/s) step 4752/76294 | train loss 3.411433 | norm 0.1708 | lr 1.08e-03 | (9866.91 ms | 53136 tok/s) step 4753/76294 | train loss 3.413557 | norm 0.1323 | lr 1.08e-03 | (9907.78 ms | 52917 tok/s) step 4754/76294 | train loss 3.361310 | norm 0.1665 | lr 1.08e-03 | (9877.96 ms | 53077 tok/s) step 4755/76294 | train loss 3.362536 | norm 0.1353 | lr 1.08e-03 | (9909.76 ms | 52906 tok/s) step 4756/76294 | train loss 3.348272 | norm 0.1667 | lr 1.08e-03 | (9878.83 ms | 53072 tok/s) step 4757/76294 | train loss 3.377316 | norm 0.1471 | lr 1.08e-03 | (9890.43 ms | 53010 tok/s) step 4758/76294 | train loss 3.416653 | norm 0.1369 | lr 1.08e-03 | (9880.65 ms | 53062 tok/s) step 4759/76294 | train loss 3.340507 | norm 0.1462 | lr 1.08e-03 | (9889.05 ms | 53017 tok/s) step 4760/76294 | train loss 3.353754 | norm 0.1643 | lr 1.08e-03 | (9885.51 ms | 53036 tok/s) step 4761/76294 | train loss 3.365779 | norm 0.1418 | lr 1.08e-03 | (9884.19 ms | 53043 tok/s) step 4762/76294 | train loss 3.425283 | norm 0.1760 | lr 1.08e-03 | (9944.09 ms | 52724 tok/s) step 4763/76294 | train loss 3.385043 | norm 0.1628 | lr 1.08e-03 | (9891.01 ms | 53007 tok/s) step 4764/76294 | 
train loss 3.401302 | norm 0.1452 | lr 1.08e-03 | (9923.60 ms | 52832 tok/s) step 4765/76294 | train loss 3.359104 | norm 0.1584 | lr 1.08e-03 | (9925.09 ms | 52824 tok/s) step 4766/76294 | train loss 3.391576 | norm 0.1469 | lr 1.08e-03 | (9887.03 ms | 53028 tok/s) step 4767/76294 | train loss 3.402941 | norm 0.1753 | lr 1.08e-03 | (9893.44 ms | 52994 tok/s) step 4768/76294 | train loss 3.434727 | norm 0.1801 | lr 1.08e-03 | (9885.68 ms | 53035 tok/s) step 4769/76294 | train loss 3.361960 | norm 0.1613 | lr 1.08e-03 | (9902.13 ms | 52947 tok/s) step 4770/76294 | train loss 3.358864 | norm 0.1835 | lr 1.08e-03 | (9890.17 ms | 53011 tok/s) step 4771/76294 | train loss 3.358909 | norm 0.1446 | lr 1.08e-03 | (9899.41 ms | 52962 tok/s) step 4772/76294 | train loss 3.378456 | norm 0.1968 | lr 1.08e-03 | (9923.73 ms | 52832 tok/s) step 4773/76294 | train loss 3.406285 | norm 0.1665 | lr 1.08e-03 | (9912.45 ms | 52892 tok/s) step 4774/76294 | train loss 3.337557 | norm 0.1810 | lr 1.08e-03 | (9887.32 ms | 53026 tok/s) step 4775/76294 | train loss 3.397308 | norm 0.1880 | lr 1.08e-03 | (9927.04 ms | 52814 tok/s) step 4776/76294 | train loss 3.396019 | norm 0.1593 | lr 1.08e-03 | (9886.34 ms | 53032 tok/s) step 4777/76294 | train loss 3.384553 | norm 0.1515 | lr 1.08e-03 | (9899.82 ms | 52959 tok/s) step 4778/76294 | train loss 3.383596 | norm 0.1739 | lr 1.08e-03 | (9889.60 ms | 53014 tok/s) step 4779/76294 | train loss 3.463342 | norm 0.1615 | lr 1.08e-03 | (9902.42 ms | 52945 tok/s) step 4780/76294 | train loss 3.443360 | norm 0.1827 | lr 1.08e-03 | (9886.33 ms | 53032 tok/s) step 4781/76294 | train loss 3.451591 | norm 0.1685 | lr 1.08e-03 | (9898.89 ms | 52964 tok/s) step 4782/76294 | train loss 3.413336 | norm 0.1416 | lr 1.08e-03 | (11313.72 ms | 46341 tok/s) step 4783/76294 | train loss 3.408427 | norm 0.1754 | lr 1.08e-03 | (9884.99 ms | 53039 tok/s) step 4784/76294 | train loss 3.427998 | norm 0.1816 | lr 1.08e-03 | (9890.63 ms | 53009 tok/s) step 4785/76294 | 
train loss 3.377034 | norm 0.1918 | lr 1.08e-03 | (9889.90 ms | 53012 tok/s) step 4786/76294 | train loss 3.410428 | norm 0.1472 | lr 1.08e-03 | (9891.65 ms | 53003 tok/s) step 4787/76294 | train loss 3.368036 | norm 0.1603 | lr 1.08e-03 | (9897.75 ms | 52970 tok/s) step 4788/76294 | train loss 3.405643 | norm 0.1444 | lr 1.08e-03 | (9895.87 ms | 52981 tok/s) step 4789/76294 | train loss 3.428071 | norm 0.1642 | lr 1.08e-03 | (9893.13 ms | 52995 tok/s) step 4790/76294 | train loss 3.342464 | norm 0.1476 | lr 1.08e-03 | (9935.95 ms | 52767 tok/s) step 4791/76294 | train loss 3.362666 | norm 0.1501 | lr 1.08e-03 | (9893.94 ms | 52991 tok/s) step 4792/76294 | train loss 3.418551 | norm 0.1339 | lr 1.08e-03 | (9898.73 ms | 52965 tok/s) step 4793/76294 | train loss 3.507997 | norm 0.1466 | lr 1.08e-03 | (9899.50 ms | 52961 tok/s) step 4794/76294 | train loss 3.373714 | norm 0.1538 | lr 1.08e-03 | (9895.64 ms | 52982 tok/s) step 4795/76294 | train loss 3.347852 | norm 0.1500 | lr 1.08e-03 | (9892.74 ms | 52997 tok/s) step 4796/76294 | train loss 3.412204 | norm 0.1389 | lr 1.08e-03 | (9927.99 ms | 52809 tok/s) step 4797/76294 | train loss 3.391601 | norm 0.1565 | lr 1.08e-03 | (9961.83 ms | 52630 tok/s) step 4798/76294 | train loss 3.339196 | norm 0.1444 | lr 1.08e-03 | (9915.78 ms | 52874 tok/s) step 4799/76294 | train loss 3.421431 | norm 0.1427 | lr 1.08e-03 | (9936.05 ms | 52766 tok/s) step 4800/76294 | train loss 3.356463 | norm 0.1489 | lr 1.08e-03 | (9893.11 ms | 52995 tok/s) step 4801/76294 | train loss 3.389480 | norm 0.1568 | lr 1.08e-03 | (9900.56 ms | 52955 tok/s) step 4802/76294 | train loss 3.408689 | norm 0.1481 | lr 1.08e-03 | (9896.20 ms | 52979 tok/s) step 4803/76294 | train loss 3.405482 | norm 0.1545 | lr 1.08e-03 | (9895.96 ms | 52980 tok/s) step 4804/76294 | train loss 3.419477 | norm 0.1630 | lr 1.08e-03 | (9973.96 ms | 52566 tok/s) step 4805/76294 | train loss 3.509468 | norm 0.1517 | lr 1.08e-03 | (9895.52 ms | 52982 tok/s) step 4806/76294 | 
train loss 3.385827 | norm 0.1511 | lr 1.08e-03 | (9958.84 ms | 52646 tok/s) step 4807/76294 | train loss 3.347697 | norm 0.1548 | lr 1.08e-03 | (9894.99 ms | 52985 tok/s) step 4808/76294 | train loss 3.441786 | norm 0.1352 | lr 1.08e-03 | (9928.75 ms | 52805 tok/s) step 4809/76294 | train loss 3.384761 | norm 0.1555 | lr 1.08e-03 | (9911.09 ms | 52899 tok/s) step 4810/76294 | train loss 3.444257 | norm 0.1334 | lr 1.08e-03 | (9891.52 ms | 53004 tok/s) step 4811/76294 | train loss 3.329244 | norm 0.1348 | lr 1.08e-03 | (9901.24 ms | 52952 tok/s) step 4812/76294 | train loss 3.441333 | norm 0.1295 | lr 1.08e-03 | (9892.44 ms | 52999 tok/s) step 4813/76294 | train loss 3.472684 | norm 0.1502 | lr 1.08e-03 | (9900.68 ms | 52955 tok/s) step 4814/76294 | train loss 3.333629 | norm 0.1388 | lr 1.08e-03 | (9893.03 ms | 52996 tok/s) step 4815/76294 | train loss 3.405837 | norm 0.1435 | lr 1.08e-03 | (9906.30 ms | 52925 tok/s) step 4816/76294 | train loss 3.403156 | norm 0.1755 | lr 1.08e-03 | (9887.33 ms | 53026 tok/s) step 4817/76294 | train loss 3.431344 | norm 0.1335 | lr 1.08e-03 | (9893.34 ms | 52994 tok/s) step 4818/76294 | train loss 3.351067 | norm 0.1686 | lr 1.08e-03 | (9897.38 ms | 52972 tok/s) step 4819/76294 | train loss 3.417687 | norm 0.1643 | lr 1.08e-03 | (9935.58 ms | 52769 tok/s) step 4820/76294 | train loss 3.342099 | norm 0.1392 | lr 1.08e-03 | (9890.97 ms | 53007 tok/s) step 4821/76294 | train loss 3.374767 | norm 0.1498 | lr 1.08e-03 | (9898.51 ms | 52966 tok/s) step 4822/76294 | train loss 3.375085 | norm 0.1856 | lr 1.08e-03 | (9889.09 ms | 53017 tok/s) step 4823/76294 | train loss 3.396016 | norm 0.2323 | lr 1.08e-03 | (9897.29 ms | 52973 tok/s) step 4824/76294 | train loss 3.418653 | norm 0.1782 | lr 1.08e-03 | (9889.22 ms | 53016 tok/s) step 4825/76294 | train loss 3.379602 | norm 0.1787 | lr 1.08e-03 | (9893.70 ms | 52992 tok/s) step 4826/76294 | train loss 3.351968 | norm 0.1836 | lr 1.08e-03 | (9923.06 ms | 52835 tok/s) step 4827/76294 | 
train loss 3.424823 | norm 0.1599 | lr 1.08e-03 | (9909.96 ms | 52905 tok/s) step 4828/76294 | train loss 3.403911 | norm 0.1589 | lr 1.08e-03 | (9893.79 ms | 52992 tok/s) step 4829/76294 | train loss 3.421691 | norm 0.1793 | lr 1.08e-03 | (9906.35 ms | 52924 tok/s) step 4830/76294 | train loss 3.339896 | norm 0.1538 | lr 1.08e-03 | (9891.42 ms | 53004 tok/s) step 4831/76294 | train loss 3.392359 | norm 0.1662 | lr 1.08e-03 | (9889.47 ms | 53015 tok/s) step 4832/76294 | train loss 3.412222 | norm 0.1402 | lr 1.08e-03 | (9891.81 ms | 53002 tok/s) step 4833/76294 | train loss 3.375990 | norm 0.1472 | lr 1.08e-03 | (9893.83 ms | 52991 tok/s) step 4834/76294 | train loss 3.401926 | norm 0.1313 | lr 1.08e-03 | (9889.96 ms | 53012 tok/s) step 4835/76294 | train loss 3.391334 | norm 0.1359 | lr 1.08e-03 | (9890.29 ms | 53010 tok/s) step 4836/76294 | train loss 3.454031 | norm 0.1554 | lr 1.08e-03 | (9893.02 ms | 52996 tok/s) step 4837/76294 | train loss 3.417658 | norm 0.1448 | lr 1.08e-03 | (9893.07 ms | 52995 tok/s) step 4838/76294 | train loss 3.441676 | norm 0.1583 | lr 1.08e-03 | (9927.98 ms | 52809 tok/s) step 4839/76294 | train loss 3.312378 | norm 0.1386 | lr 1.08e-03 | (9892.62 ms | 52998 tok/s) step 4840/76294 | train loss 3.403794 | norm 0.1584 | lr 1.08e-03 | (9971.11 ms | 52581 tok/s) step 4841/76294 | train loss 3.387174 | norm 0.1439 | lr 1.08e-03 | (9894.18 ms | 52990 tok/s) step 4842/76294 | train loss 3.356641 | norm 0.1605 | lr 1.08e-03 | (9888.51 ms | 53020 tok/s) step 4843/76294 | train loss 3.298448 | norm 0.1787 | lr 1.08e-03 | (9888.81 ms | 53018 tok/s) step 4844/76294 | train loss 3.345783 | norm 0.1632 | lr 1.08e-03 | (9936.52 ms | 52764 tok/s) step 4845/76294 | train loss 3.384710 | norm 0.1653 | lr 1.08e-03 | (9895.19 ms | 52984 tok/s) step 4846/76294 | train loss 3.399538 | norm 0.1580 | lr 1.08e-03 | (9888.33 ms | 53021 tok/s) step 4847/76294 | train loss 3.368117 | norm 0.1497 | lr 1.08e-03 | (9895.01 ms | 52985 tok/s) step 4848/76294 | 
train loss 3.417270 | norm 0.1402 | lr 1.08e-03 | (9887.53 ms | 53025 tok/s) step 4849/76294 | train loss 3.364732 | norm 0.1459 | lr 1.08e-03 | (9898.84 ms | 52965 tok/s) step 4850/76294 | train loss 3.398209 | norm 0.1292 | lr 1.08e-03 | (9959.27 ms | 52643 tok/s) step 4851/76294 | train loss 3.462945 | norm 0.1441 | lr 1.08e-03 | (9897.92 ms | 52970 tok/s) step 4852/76294 | train loss 3.410861 | norm 0.1634 | lr 1.08e-03 | (9952.71 ms | 52678 tok/s) step 4853/76294 | train loss 3.307843 | norm 0.1458 | lr 1.08e-03 | (9893.64 ms | 52992 tok/s) step 4854/76294 | train loss 3.359871 | norm 0.1589 | lr 1.08e-03 | (9905.99 ms | 52926 tok/s) step 4855/76294 | train loss 3.400722 | norm 0.1431 | lr 1.08e-03 | (9886.03 ms | 53033 tok/s) step 4856/76294 | train loss 3.374585 | norm 0.1534 | lr 1.08e-03 | (9889.68 ms | 53014 tok/s) step 4857/76294 | train loss 3.379327 | norm 0.1495 | lr 1.08e-03 | (9893.14 ms | 52995 tok/s) step 4858/76294 | train loss 3.375053 | norm 0.1532 | lr 1.08e-03 | (9888.76 ms | 53019 tok/s) step 4859/76294 | train loss 3.349779 | norm 0.1560 | lr 1.08e-03 | (9899.66 ms | 52960 tok/s) step 4860/76294 | train loss 3.379005 | norm 0.1339 | lr 1.08e-03 | (9888.56 ms | 53020 tok/s) step 4861/76294 | train loss 3.380267 | norm 0.1409 | lr 1.08e-03 | (9902.27 ms | 52946 tok/s) step 4862/76294 | train loss 3.394265 | norm 0.1538 | lr 1.08e-03 | (9889.41 ms | 53015 tok/s) step 4863/76294 | train loss 3.437684 | norm 0.1556 | lr 1.08e-03 | (10101.93 ms | 51900 tok/s) step 4864/76294 | train loss 3.417915 | norm 0.1481 | lr 1.08e-03 | (9905.59 ms | 52929 tok/s) step 4865/76294 | train loss 3.368464 | norm 0.1420 | lr 1.08e-03 | (9919.01 ms | 52857 tok/s) step 4866/76294 | train loss 3.356729 | norm 0.1802 | lr 1.08e-03 | (9893.06 ms | 52996 tok/s) step 4867/76294 | train loss 3.399539 | norm 0.1629 | lr 1.08e-03 | (9893.74 ms | 52992 tok/s) step 4868/76294 | train loss 3.384526 | norm 0.1561 | lr 1.08e-03 | (9897.36 ms | 52973 tok/s) step 4869/76294 | 
train loss 3.364614 | norm 0.1437 | lr 1.08e-03 | (9892.14 ms | 53000 tok/s) step 4870/76294 | train loss 3.430236 | norm 0.1374 | lr 1.08e-03 | (9886.23 ms | 53032 tok/s) step 4871/76294 | train loss 3.351549 | norm 0.1443 | lr 1.08e-03 | (9888.96 ms | 53018 tok/s) step 4872/76294 | train loss 3.411710 | norm 0.1497 | lr 1.08e-03 | (9896.65 ms | 52976 tok/s) step 4873/76294 | train loss 3.436101 | norm 0.1332 | lr 1.08e-03 | (9888.87 ms | 53018 tok/s) step 4874/76294 | train loss 3.406063 | norm 0.1648 | lr 1.08e-03 | (10699.82 ms | 49000 tok/s) step 4875/76294 | train loss 3.294297 | norm 0.1440 | lr 1.08e-03 | (9883.19 ms | 53048 tok/s) step 4876/76294 | train loss 3.416580 | norm 0.1607 | lr 1.08e-03 | (9890.99 ms | 53007 tok/s) step 4877/76294 | train loss 3.333535 | norm 0.1459 | lr 1.08e-03 | (9888.18 ms | 53022 tok/s) step 4878/76294 | train loss 3.333866 | norm 0.1483 | lr 1.08e-03 | (9886.79 ms | 53029 tok/s) step 4879/76294 | train loss 3.360190 | norm 0.1426 | lr 1.08e-03 | (9888.95 ms | 53018 tok/s) step 4880/76294 | train loss 3.350170 | norm 0.1525 | lr 1.08e-03 | (10853.14 ms | 48307 tok/s) step 4881/76294 | train loss 3.418123 | norm 0.1500 | lr 1.07e-03 | (9899.59 ms | 52961 tok/s) step 4882/76294 | train loss 3.332217 | norm 0.1474 | lr 1.07e-03 | (9881.12 ms | 53060 tok/s) step 4883/76294 | train loss 3.352986 | norm 0.1415 | lr 1.07e-03 | (9890.69 ms | 53008 tok/s) step 4884/76294 | train loss 3.408158 | norm 0.1374 | lr 1.07e-03 | (9880.06 ms | 53065 tok/s) step 4885/76294 | train loss 3.367062 | norm 0.1413 | lr 1.07e-03 | (9884.97 ms | 53039 tok/s) step 4886/76294 | train loss 3.362179 | norm 0.1529 | lr 1.07e-03 | (9929.66 ms | 52800 tok/s) step 4887/76294 | train loss 3.331515 | norm 0.1547 | lr 1.07e-03 | (9896.85 ms | 52975 tok/s) step 4888/76294 | train loss 3.403280 | norm 0.1692 | lr 1.07e-03 | (9881.72 ms | 53056 tok/s) step 4889/76294 | train loss 3.335501 | norm 0.1771 | lr 1.07e-03 | (9882.66 ms | 53051 tok/s) step 4890/76294 | 
train loss 3.378467 | norm 0.1545 | lr 1.07e-03 | (9894.94 ms | 52985 tok/s) step 4891/76294 | train loss 3.309940 | norm 0.1637 | lr 1.07e-03 | (9887.54 ms | 53025 tok/s) step 4892/76294 | train loss 3.351382 | norm 0.1542 | lr 1.07e-03 | (9887.01 ms | 53028 tok/s) step 4893/76294 | train loss 3.412562 | norm 0.1605 | lr 1.07e-03 | (10827.02 ms | 48424 tok/s) step 4894/76294 | train loss 3.354516 | norm 0.1526 | lr 1.07e-03 | (9902.14 ms | 52947 tok/s) step 4895/76294 | train loss 3.403513 | norm 0.1455 | lr 1.07e-03 | (9950.04 ms | 52692 tok/s) step 4896/76294 | train loss 3.392563 | norm 0.1756 | lr 1.07e-03 | (9887.58 ms | 53025 tok/s) step 4897/76294 | train loss 3.350817 | norm 0.1454 | lr 1.07e-03 | (9893.98 ms | 52991 tok/s) step 4898/76294 | train loss 3.354504 | norm 0.1473 | lr 1.07e-03 | (9932.17 ms | 52787 tok/s) step 4899/76294 | train loss 3.375712 | norm 0.1436 | lr 1.07e-03 | (9949.63 ms | 52694 tok/s) step 4900/76294 | train loss 3.336390 | norm 0.1664 | lr 1.07e-03 | (9883.36 ms | 53048 tok/s) step 4901/76294 | train loss 3.287485 | norm 0.1574 | lr 1.07e-03 | (9889.00 ms | 53017 tok/s) step 4902/76294 | train loss 3.403529 | norm 0.1508 | lr 1.07e-03 | (9882.17 ms | 53054 tok/s) step 4903/76294 | train loss 3.381609 | norm 0.1377 | lr 1.07e-03 | (9896.59 ms | 52977 tok/s) step 4904/76294 | train loss 3.385093 | norm 0.1528 | lr 1.07e-03 | (9884.75 ms | 53040 tok/s) step 4905/76294 | train loss 3.395783 | norm 0.1558 | lr 1.07e-03 | (9931.10 ms | 52793 tok/s) step 4906/76294 | train loss 3.324700 | norm 0.1390 | lr 1.07e-03 | (9886.50 ms | 53031 tok/s) step 4907/76294 | train loss 3.381877 | norm 0.1617 | lr 1.07e-03 | (9917.63 ms | 52864 tok/s) step 4908/76294 | train loss 3.343192 | norm 0.1370 | lr 1.07e-03 | (9916.65 ms | 52869 tok/s) step 4909/76294 | train loss 3.405975 | norm 0.1501 | lr 1.07e-03 | (9888.97 ms | 53017 tok/s) step 4910/76294 | train loss 3.469310 | norm 0.1478 | lr 1.07e-03 | (9895.75 ms | 52981 tok/s) step 4911/76294 | 
train loss 3.367671 | norm 0.1500 | lr 1.07e-03 | (9889.57 ms | 53014 tok/s) step 4912/76294 | train loss 3.360284 | norm 0.1405 | lr 1.07e-03 | (9892.39 ms | 52999 tok/s) step 4913/76294 | train loss 3.422749 | norm 0.1435 | lr 1.07e-03 | (9894.81 ms | 52986 tok/s) step 4914/76294 | train loss 3.356193 | norm 0.1474 | lr 1.07e-03 | (9896.23 ms | 52979 tok/s) step 4915/76294 | train loss 3.383971 | norm 0.1493 | lr 1.07e-03 | (9892.25 ms | 53000 tok/s) step 4916/76294 | train loss 3.313115 | norm 0.1807 | lr 1.07e-03 | (9891.35 ms | 53005 tok/s) step 4917/76294 | train loss 3.320968 | norm 0.1673 | lr 1.07e-03 | (9930.40 ms | 52796 tok/s) step 4918/76294 | train loss 3.352975 | norm 0.1657 | lr 1.07e-03 | (9886.57 ms | 53030 tok/s) step 4919/76294 | train loss 3.328157 | norm 0.1782 | lr 1.07e-03 | (9947.90 ms | 52703 tok/s) step 4920/76294 | train loss 3.383101 | norm 0.1658 | lr 1.07e-03 | (9890.01 ms | 53012 tok/s) step 4921/76294 | train loss 3.363653 | norm 0.1418 | lr 1.07e-03 | (9905.67 ms | 52928 tok/s) step 4922/76294 | train loss 3.355594 | norm 0.1508 | lr 1.07e-03 | (9886.80 ms | 53029 tok/s) step 4923/76294 | train loss 3.430588 | norm 0.1454 | lr 1.07e-03 | (9948.41 ms | 52701 tok/s) step 4924/76294 | train loss 3.381905 | norm 0.1471 | lr 1.07e-03 | (9890.95 ms | 53007 tok/s) step 4925/76294 | train loss 3.347158 | norm 0.1404 | lr 1.07e-03 | (9890.58 ms | 53009 tok/s) step 4926/76294 | train loss 3.543356 | norm 0.1741 | lr 1.07e-03 | (9895.14 ms | 52984 tok/s) step 4927/76294 | train loss 3.493542 | norm 0.1890 | lr 1.07e-03 | (9964.71 ms | 52614 tok/s) step 4928/76294 | train loss 3.359592 | norm 0.1547 | lr 1.07e-03 | (9887.88 ms | 53023 tok/s) step 4929/76294 | train loss 3.425235 | norm 0.1383 | lr 1.07e-03 | (9891.60 ms | 53003 tok/s) step 4930/76294 | train loss 3.414898 | norm 0.1660 | lr 1.07e-03 | (9955.89 ms | 52661 tok/s) step 4931/76294 | train loss 3.414418 | norm 0.1637 | lr 1.07e-03 | (9890.49 ms | 53009 tok/s) step 4932/76294 | 
train loss 3.305724 | norm 0.1409 | lr 1.07e-03 | (9956.31 ms | 52659 tok/s) step 4933/76294 | train loss 3.344376 | norm 0.1642 | lr 1.07e-03 | (9898.59 ms | 52966 tok/s) step 4934/76294 | train loss 3.341812 | norm 0.1403 | lr 1.07e-03 | (9953.62 ms | 52673 tok/s) step 4935/76294 | train loss 3.430212 | norm 0.1619 | lr 1.07e-03 | (9984.71 ms | 52509 tok/s) step 4936/76294 | train loss 3.352154 | norm 0.1369 | lr 1.07e-03 | (9891.21 ms | 53005 tok/s) step 4937/76294 | train loss 3.396348 | norm 0.1470 | lr 1.07e-03 | (9887.32 ms | 53026 tok/s) step 4938/76294 | train loss 3.362777 | norm 0.1821 | lr 1.07e-03 | (9898.97 ms | 52964 tok/s) step 4939/76294 | train loss 3.365623 | norm 0.1178 | lr 1.07e-03 | (9953.10 ms | 52676 tok/s) step 4940/76294 | train loss 3.379429 | norm 0.1579 | lr 1.07e-03 | (9889.96 ms | 53012 tok/s) step 4941/76294 | train loss 3.405442 | norm 0.1457 | lr 1.07e-03 | (9900.59 ms | 52955 tok/s) step 4942/76294 | train loss 3.376039 | norm 0.1591 | lr 1.07e-03 | (9890.47 ms | 53009 tok/s) step 4943/76294 | train loss 3.446324 | norm 0.1319 | lr 1.07e-03 | (9899.38 ms | 52962 tok/s) step 4944/76294 | train loss 3.431459 | norm 0.1746 | lr 1.07e-03 | (9946.36 ms | 52712 tok/s) step 4945/76294 | train loss 3.613350 | norm 0.1722 | lr 1.07e-03 | (9891.21 ms | 53005 tok/s) step 4946/76294 | train loss 3.368695 | norm 0.1704 | lr 1.07e-03 | (9889.00 ms | 53017 tok/s) step 4947/76294 | train loss 3.380567 | norm 0.1873 | lr 1.07e-03 | (9969.13 ms | 52591 tok/s) step 4948/76294 | train loss 3.365477 | norm 0.1705 | lr 1.07e-03 | (9898.11 ms | 52968 tok/s) step 4949/76294 | train loss 3.356967 | norm 0.1665 | lr 1.07e-03 | (9892.15 ms | 53000 tok/s) step 4950/76294 | train loss 3.404647 | norm 0.1906 | lr 1.07e-03 | (9933.86 ms | 52778 tok/s) step 4951/76294 | train loss 3.348486 | norm 0.1760 | lr 1.07e-03 | (9892.04 ms | 53001 tok/s) step 4952/76294 | train loss 3.390040 | norm 0.1545 | lr 1.07e-03 | (9899.05 ms | 52963 tok/s) step 4953/76294 | 
train loss 3.289801 | norm 0.1674 | lr 1.07e-03 | (9930.83 ms | 52794 tok/s) step 4954/76294 | train loss 3.324421 | norm 0.1440 | lr 1.07e-03 | (9899.39 ms | 52962 tok/s) step 4955/76294 | train loss 3.386473 | norm 0.1778 | lr 1.07e-03 | (9901.31 ms | 52951 tok/s) step 4956/76294 | train loss 3.418071 | norm 0.1578 | lr 1.07e-03 | (9890.56 ms | 53009 tok/s) step 4957/76294 | train loss 3.427773 | norm 0.1746 | lr 1.07e-03 | (9902.28 ms | 52946 tok/s) step 4958/76294 | train loss 3.355373 | norm 0.1296 | lr 1.07e-03 | (9944.02 ms | 52724 tok/s) step 4959/76294 | train loss 3.381661 | norm 0.1628 | lr 1.07e-03 | (9895.99 ms | 52980 tok/s) step 4960/76294 | train loss 3.363534 | norm 0.1403 | lr 1.07e-03 | (9899.20 ms | 52963 tok/s) step 4961/76294 | train loss 3.368485 | norm 0.1548 | lr 1.07e-03 | (9917.60 ms | 52864 tok/s) step 4962/76294 | train loss 3.315909 | norm 0.1365 | lr 1.07e-03 | (9888.52 ms | 53020 tok/s) step 4963/76294 | train loss 3.355738 | norm 0.1696 | lr 1.07e-03 | (9910.65 ms | 52901 tok/s) step 4964/76294 | train loss 3.370627 | norm 0.1358 | lr 1.07e-03 | (9886.64 ms | 53030 tok/s) step 4965/76294 | train loss 3.360271 | norm 0.1739 | lr 1.07e-03 | (9891.17 ms | 53006 tok/s) step 4966/76294 | train loss 3.364868 | norm 0.1449 | lr 1.07e-03 | (9907.98 ms | 52916 tok/s) step 4967/76294 | train loss 3.342051 | norm 0.1541 | lr 1.07e-03 | (9893.21 ms | 52995 tok/s) step 4968/76294 | train loss 3.364182 | norm 0.1561 | lr 1.07e-03 | (9887.26 ms | 53027 tok/s) step 4969/76294 | train loss 3.349271 | norm 0.1437 | lr 1.07e-03 | (9896.67 ms | 52976 tok/s) step 4970/76294 | train loss 3.366272 | norm 0.1611 | lr 1.07e-03 | (9890.30 ms | 53010 tok/s) step 4971/76294 | train loss 3.392315 | norm 0.1847 | lr 1.07e-03 | (9928.94 ms | 52804 tok/s) step 4972/76294 | train loss 3.372694 | norm 0.1710 | lr 1.07e-03 | (9888.70 ms | 53019 tok/s) step 4973/76294 | train loss 3.372671 | norm 0.1675 | lr 1.07e-03 | (9913.94 ms | 52884 tok/s) step 4974/76294 | 
train loss 3.388789 | norm 0.1712 | lr 1.07e-03 | (9886.89 ms | 53029 tok/s) step 4975/76294 | train loss 3.415209 | norm 0.1694 | lr 1.07e-03 | (9890.43 ms | 53010 tok/s) step 4976/76294 | train loss 3.377528 | norm 0.1462 | lr 1.07e-03 | (9887.28 ms | 53026 tok/s) step 4977/76294 | train loss 3.402636 | norm 0.1493 | lr 1.07e-03 | (9893.37 ms | 52994 tok/s) step 4978/76294 | train loss 3.361020 | norm 0.1527 | lr 1.07e-03 | (11117.00 ms | 47161 tok/s) step 4979/76294 | train loss 3.285351 | norm 0.1328 | lr 1.07e-03 | (9878.69 ms | 53073 tok/s) step 4980/76294 | train loss 3.360997 | norm 0.1413 | lr 1.07e-03 | (9883.68 ms | 53046 tok/s) step 4981/76294 | train loss 3.377333 | norm 0.1593 | lr 1.07e-03 | (9890.37 ms | 53010 tok/s) step 4982/76294 | train loss 3.422251 | norm 0.1570 | lr 1.07e-03 | (9920.13 ms | 52851 tok/s) step 4983/76294 | train loss 3.381634 | norm 0.1668 | lr 1.07e-03 | (9882.64 ms | 53051 tok/s) step 4984/76294 | train loss 3.393679 | norm 0.1458 | lr 1.07e-03 | (9899.65 ms | 52960 tok/s) step 4985/76294 | train loss 3.326406 | norm 0.1568 | lr 1.07e-03 | (9949.82 ms | 52693 tok/s) step 4986/76294 | train loss 3.456980 | norm 0.1805 | lr 1.07e-03 | (9885.75 ms | 53035 tok/s) step 4987/76294 | train loss 3.485720 | norm 0.1499 | lr 1.07e-03 | (9908.96 ms | 52911 tok/s) step 4988/76294 | train loss 3.387046 | norm 0.1529 | lr 1.07e-03 | (9950.34 ms | 52690 tok/s) step 4989/76294 | train loss 3.393514 | norm 0.1536 | lr 1.07e-03 | (9885.05 ms | 53038 tok/s) step 4990/76294 | train loss 3.356518 | norm 0.1519 | lr 1.07e-03 | (9884.21 ms | 53043 tok/s) step 4991/76294 | train loss 3.366525 | norm 0.1656 | lr 1.07e-03 | (9906.30 ms | 52925 tok/s) step 4992/76294 | train loss 3.304131 | norm 0.1510 | lr 1.07e-03 | (9897.05 ms | 52974 tok/s) step 4993/76294 | train loss 3.411608 | norm 0.1593 | lr 1.07e-03 | (9935.19 ms | 52771 tok/s) step 4994/76294 | train loss 3.370186 | norm 0.1423 | lr 1.07e-03 | (9886.96 ms | 53028 tok/s) step 4995/76294 | 
train loss 3.365443 | norm 0.1528 | lr 1.07e-03 | (9894.17 ms | 52990 tok/s) step 4996/76294 | train loss 3.395054 | norm 0.1458 | lr 1.07e-03 | (9886.16 ms | 53033 tok/s) step 4997/76294 | train loss 3.390768 | norm 0.1635 | lr 1.07e-03 | (9927.61 ms | 52811 tok/s) step 4998/76294 | train loss 3.379878 | norm 0.1400 | lr 1.07e-03 | (9880.67 ms | 53062 tok/s) step 4999/76294 | train loss 3.391409 | norm 0.1534 | lr 1.07e-03 | (9964.33 ms | 52617 tok/s) step 5000/76294 | train loss 3.405902 | norm 0.1408 | lr 1.07e-03 | (9885.12 ms | 53038 tok/s) val loss: 3.352793 saving model checkpoint to ./results/gpt2-350M-gqa/step_5000.pth step 5001/76294 | train loss 3.353499 | norm 0.1537 | lr 1.07e-03 | (9949.40 ms | 52695 tok/s) step 5002/76294 | train loss 3.381565 | norm 0.1616 | lr 1.07e-03 | (9867.96 ms | 53130 tok/s) step 5003/76294 | train loss 3.377208 | norm 0.1564 | lr 1.07e-03 | (9885.34 ms | 53037 tok/s) step 5004/76294 | train loss 3.472360 | norm 0.1575 | lr 1.07e-03 | (9876.09 ms | 53087 tok/s) step 5005/76294 | train loss 3.379018 | norm 0.1685 | lr 1.07e-03 | (9900.36 ms | 52956 tok/s) step 5006/76294 | train loss 3.311073 | norm 0.1400 | lr 1.07e-03 | (9883.51 ms | 53047 tok/s) step 5007/76294 | train loss 3.350877 | norm 0.1615 | lr 1.07e-03 | (9929.88 ms | 52799 tok/s) step 5008/76294 | train loss 3.384348 | norm 0.1481 | lr 1.07e-03 | (9882.46 ms | 53052 tok/s) step 5009/76294 | train loss 3.380843 | norm 0.1394 | lr 1.07e-03 | (9902.72 ms | 52944 tok/s) step 5010/76294 | train loss 3.372877 | norm 0.1582 | lr 1.07e-03 | (9932.85 ms | 52783 tok/s) step 5011/76294 | train loss 3.357368 | norm 0.1323 | lr 1.07e-03 | (9890.94 ms | 53007 tok/s) step 5012/76294 | train loss 3.386451 | norm 0.1302 | lr 1.07e-03 | (10001.23 ms | 52422 tok/s) step 5013/76294 | train loss 3.402241 | norm 0.1682 | lr 1.07e-03 | (9891.34 ms | 53005 tok/s) step 5014/76294 | train loss 3.365887 | norm 0.1731 | lr 1.07e-03 | (9895.71 ms | 52981 tok/s) step 5015/76294 | train loss 
3.456266 | norm 0.1933 | lr 1.07e-03 | (9890.81 ms | 53008 tok/s) step 5016/76294 | train loss 3.348914 | norm 0.1516 | lr 1.07e-03 | (9920.61 ms | 52848 tok/s) step 5017/76294 | train loss 3.361318 | norm 0.1745 | lr 1.07e-03 | (9897.27 ms | 52973 tok/s) step 5018/76294 | train loss 3.324695 | norm 0.1459 | lr 1.07e-03 | (9894.74 ms | 52987 tok/s) step 5019/76294 | train loss 3.359606 | norm 0.1541 | lr 1.07e-03 | (9893.23 ms | 52995 tok/s) step 5020/76294 | train loss 3.311211 | norm 0.1762 | lr 1.07e-03 | (9899.05 ms | 52963 tok/s) step 5021/76294 | train loss 3.326879 | norm 0.1700 | lr 1.07e-03 | (9896.87 ms | 52975 tok/s) step 5022/76294 | train loss 3.419445 | norm 0.1537 | lr 1.07e-03 | (9898.96 ms | 52964 tok/s) step 5023/76294 | train loss 3.416656 | norm 0.1540 | lr 1.07e-03 | (9899.01 ms | 52964 tok/s) step 5024/76294 | train loss 3.368266 | norm 0.1355 | lr 1.07e-03 | (9901.11 ms | 52952 tok/s) step 5025/76294 | train loss 3.428647 | norm 0.1607 | lr 1.07e-03 | (9897.69 ms | 52971 tok/s) step 5026/76294 | train loss 3.345053 | norm 0.1609 | lr 1.07e-03 | (9934.39 ms | 52775 tok/s) step 5027/76294 | train loss 3.328486 | norm 0.1401 | lr 1.07e-03 | (9915.87 ms | 52874 tok/s) step 5028/76294 | train loss 3.342412 | norm 0.1688 | lr 1.07e-03 | (9897.08 ms | 52974 tok/s) step 5029/76294 | train loss 3.352185 | norm 0.1711 | lr 1.07e-03 | (9958.85 ms | 52645 tok/s) step 5030/76294 | train loss 3.282276 | norm 0.1391 | lr 1.07e-03 | (9906.70 ms | 52923 tok/s) step 5031/76294 | train loss 3.307462 | norm 0.1483 | lr 1.07e-03 | (9898.51 ms | 52966 tok/s) step 5032/76294 | train loss 3.371984 | norm 0.1403 | lr 1.07e-03 | (9925.34 ms | 52823 tok/s) step 5033/76294 | train loss 3.408057 | norm 0.1573 | lr 1.07e-03 | (9893.93 ms | 52991 tok/s) step 5034/76294 | train loss 3.348374 | norm 0.1408 | lr 1.07e-03 | (9901.91 ms | 52948 tok/s) step 5035/76294 | train loss 3.342803 | norm 0.1347 | lr 1.07e-03 | (9893.11 ms | 52995 tok/s) step 5036/76294 | train loss 
3.333840 | norm 0.1328 | lr 1.07e-03 | (9906.67 ms | 52923 tok/s) step 5037/76294 | train loss 3.364765 | norm 0.1371 | lr 1.07e-03 | (9899.44 ms | 52961 tok/s) step 5038/76294 | train loss 3.365744 | norm 0.1455 | lr 1.07e-03 | (9885.28 ms | 53037 tok/s) step 5039/76294 | train loss 3.346252 | norm 0.1401 | lr 1.07e-03 | (9891.82 ms | 53002 tok/s) step 5040/76294 | train loss 3.407243 | norm 0.1438 | lr 1.07e-03 | (9901.72 ms | 52949 tok/s) step 5041/76294 | train loss 3.347778 | norm 0.1386 | lr 1.07e-03 | (9929.41 ms | 52802 tok/s) step 5042/76294 | train loss 3.303947 | norm 0.1510 | lr 1.07e-03 | (9890.97 ms | 53007 tok/s) step 5043/76294 | train loss 3.416032 | norm 0.1421 | lr 1.07e-03 | (9900.04 ms | 52958 tok/s) step 5044/76294 | train loss 3.456942 | norm 0.1360 | lr 1.07e-03 | (9928.12 ms | 52808 tok/s) step 5045/76294 | train loss 3.354074 | norm 0.1464 | lr 1.07e-03 | (9890.34 ms | 53010 tok/s) step 5046/76294 | train loss 3.402011 | norm 0.1390 | lr 1.07e-03 | (9896.86 ms | 52975 tok/s) step 5047/76294 | train loss 3.334351 | norm 0.1494 | lr 1.07e-03 | (9901.93 ms | 52948 tok/s) step 5048/76294 | train loss 3.381124 | norm 0.1314 | lr 1.07e-03 | (9894.77 ms | 52986 tok/s) step 5049/76294 | train loss 3.397738 | norm 0.1629 | lr 1.07e-03 | (9894.96 ms | 52985 tok/s) step 5050/76294 | train loss 3.420906 | norm 0.1312 | lr 1.07e-03 | (9932.13 ms | 52787 tok/s) step 5051/76294 | train loss 3.304997 | norm 0.1683 | lr 1.07e-03 | (9891.94 ms | 53002 tok/s) step 5052/76294 | train loss 3.457237 | norm 0.1991 | lr 1.06e-03 | (9893.64 ms | 52992 tok/s) step 5053/76294 | train loss 3.374957 | norm 0.1566 | lr 1.06e-03 | (9894.44 ms | 52988 tok/s) step 5054/76294 | train loss 3.425284 | norm 0.1664 | lr 1.06e-03 | (9897.21 ms | 52973 tok/s) step 5055/76294 | train loss 3.459110 | norm 0.1401 | lr 1.06e-03 | (9896.52 ms | 52977 tok/s) step 5056/76294 | train loss 3.417381 | norm 0.1458 | lr 1.06e-03 | (9896.82 ms | 52975 tok/s) step 5057/76294 | train loss 
3.423973 | norm 0.1427 | lr 1.06e-03 | (9891.77 ms | 53002 tok/s) step 5058/76294 | train loss 3.370497 | norm 0.1566 | lr 1.06e-03 | (9896.86 ms | 52975 tok/s) step 5059/76294 | train loss 3.401741 | norm 0.1475 | lr 1.06e-03 | (9890.63 ms | 53009 tok/s) step 5060/76294 | train loss 3.434335 | norm 0.1500 | lr 1.06e-03 | (9916.50 ms | 52870 tok/s) step 5061/76294 | train loss 3.408741 | norm 0.1389 | lr 1.06e-03 | (9894.15 ms | 52990 tok/s) step 5062/76294 | train loss 3.356347 | norm 0.1565 | lr 1.06e-03 | (9924.53 ms | 52827 tok/s) step 5063/76294 | train loss 3.468012 | norm 0.1713 | lr 1.06e-03 | (9919.14 ms | 52856 tok/s) step 5064/76294 | train loss 3.365421 | norm 0.1505 | lr 1.06e-03 | (9897.20 ms | 52973 tok/s) step 5065/76294 | train loss 3.315986 | norm 0.1321 | lr 1.06e-03 | (10348.03 ms | 50666 tok/s) step 5066/76294 | train loss 3.406896 | norm 0.1475 | lr 1.06e-03 | (9903.12 ms | 52942 tok/s) step 5067/76294 | train loss 3.374649 | norm 0.1393 | lr 1.06e-03 | (9893.43 ms | 52994 tok/s) step 5068/76294 | train loss 3.393772 | norm 0.1466 | lr 1.06e-03 | (9885.96 ms | 53034 tok/s) step 5069/76294 | train loss 3.469741 | norm 0.1515 | lr 1.06e-03 | (9899.14 ms | 52963 tok/s) step 5070/76294 | train loss 3.402233 | norm 0.1556 | lr 1.06e-03 | (9885.78 ms | 53035 tok/s) step 5071/76294 | train loss 3.347110 | norm 0.1359 | lr 1.06e-03 | (9896.37 ms | 52978 tok/s) step 5072/76294 | train loss 3.451294 | norm 0.1694 | lr 1.06e-03 | (9885.67 ms | 53035 tok/s) step 5073/76294 | train loss 3.336005 | norm 0.1407 | lr 1.06e-03 | (9891.68 ms | 53003 tok/s) step 5074/76294 | train loss 3.379189 | norm 0.1990 | lr 1.06e-03 | (9892.03 ms | 53001 tok/s) step 5075/76294 | train loss 3.458125 | norm 0.1763 | lr 1.06e-03 | (12572.80 ms | 41700 tok/s) step 5076/76294 | train loss 3.421292 | norm 0.2127 | lr 1.06e-03 | (9930.80 ms | 52794 tok/s) step 5077/76294 | train loss 3.416044 | norm 0.1865 | lr 1.06e-03 | (9873.58 ms | 53100 tok/s) step 5078/76294 | train loss 
3.363507 | norm 0.1427 | lr 1.06e-03 | (9871.34 ms | 53112 tok/s) step 5079/76294 | train loss 3.403928 | norm 0.1783 | lr 1.06e-03 | (9894.45 ms | 52988 tok/s) step 5080/76294 | train loss 3.450114 | norm 0.1531 | lr 1.06e-03 | (9949.38 ms | 52696 tok/s) step 5081/76294 | train loss 3.366680 | norm 0.1584 | lr 1.06e-03 | (9880.64 ms | 53062 tok/s) step 5082/76294 | train loss 3.384475 | norm 0.1587 | lr 1.06e-03 | (9888.06 ms | 53022 tok/s) step 5083/76294 | train loss 3.453941 | norm 0.1566 | lr 1.06e-03 | (9949.30 ms | 52696 tok/s) step 5084/76294 | train loss 3.344449 | norm 0.1431 | lr 1.06e-03 | (9886.29 ms | 53032 tok/s) step 5085/76294 | train loss 3.371587 | norm 0.1759 | lr 1.06e-03 | (9892.64 ms | 52998 tok/s) step 5086/76294 | train loss 3.348778 | norm 0.1653 | lr 1.06e-03 | (9928.08 ms | 52809 tok/s) step 5087/76294 | train loss 3.436916 | norm 0.1386 | lr 1.06e-03 | (9887.69 ms | 53024 tok/s) step 5088/76294 | train loss 3.354125 | norm 0.1642 | lr 1.06e-03 | (9880.95 ms | 53060 tok/s) step 5089/76294 | train loss 3.363076 | norm 0.1388 | lr 1.06e-03 | (9882.57 ms | 53052 tok/s) step 5090/76294 | train loss 3.388963 | norm 0.1573 | lr 1.06e-03 | (9893.88 ms | 52991 tok/s) step 5091/76294 | train loss 3.354398 | norm 0.1321 | lr 1.06e-03 | (9888.50 ms | 53020 tok/s) step 5092/76294 | train loss 3.377982 | norm 0.1454 | lr 1.06e-03 | (9889.19 ms | 53016 tok/s) step 5093/76294 | train loss 3.335519 | norm 0.1328 | lr 1.06e-03 | (9883.63 ms | 53046 tok/s) step 5094/76294 | train loss 3.416924 | norm 0.1394 | lr 1.06e-03 | (10887.45 ms | 48155 tok/s) step 5095/76294 | train loss 3.475842 | norm 0.1357 | lr 1.06e-03 | (9877.81 ms | 53077 tok/s) step 5096/76294 | train loss 3.409551 | norm 0.1759 | lr 1.06e-03 | (9942.34 ms | 52733 tok/s) step 5097/76294 | train loss 3.395133 | norm 0.1490 | lr 1.06e-03 | (9882.82 ms | 53050 tok/s) step 5098/76294 | train loss 3.408901 | norm 0.1586 | lr 1.06e-03 | (9916.86 ms | 52868 tok/s) step 5099/76294 | train loss 
3.433535 | norm 0.1427 | lr 1.06e-03 | (9884.97 ms | 53039 tok/s) step 5100/76294 | train loss 3.384503 | norm 0.1572 | lr 1.06e-03 | (9877.46 ms | 53079 tok/s) step 5101/76294 | train loss 3.324785 | norm 0.1586 | lr 1.06e-03 | (9895.31 ms | 52983 tok/s) step 5102/76294 | train loss 3.386935 | norm 0.1427 | lr 1.06e-03 | (9961.75 ms | 52630 tok/s) step 5103/76294 | train loss 3.385298 | norm 0.1670 | lr 1.06e-03 | (12168.60 ms | 43085 tok/s) step 5104/76294 | train loss 3.391459 | norm 0.1449 | lr 1.06e-03 | (9904.61 ms | 52934 tok/s) step 5105/76294 | train loss 3.438215 | norm 0.1469 | lr 1.06e-03 | (9886.36 ms | 53031 tok/s) step 5106/76294 | train loss 3.448055 | norm 0.1529 | lr 1.06e-03 | (9887.54 ms | 53025 tok/s) step 5107/76294 | train loss 3.467598 | norm 0.1459 | lr 1.06e-03 | (9890.87 ms | 53007 tok/s) step 5108/76294 | train loss 3.484816 | norm 0.1466 | lr 1.06e-03 | (9890.91 ms | 53007 tok/s) step 5109/76294 | train loss 3.370320 | norm 0.1427 | lr 1.06e-03 | (9890.20 ms | 53011 tok/s) step 5110/76294 | train loss 3.458787 | norm 0.1392 | lr 1.06e-03 | (9904.01 ms | 52937 tok/s) step 5111/76294 | train loss 3.331947 | norm 0.1479 | lr 1.06e-03 | (9896.04 ms | 52980 tok/s) step 5112/76294 | train loss 3.420819 | norm 0.1733 | lr 1.06e-03 | (9893.74 ms | 52992 tok/s) step 5113/76294 | train loss 3.371394 | norm 0.1881 | lr 1.06e-03 | (9894.36 ms | 52989 tok/s) step 5114/76294 | train loss 3.326149 | norm 0.1420 | lr 1.06e-03 | (9897.02 ms | 52974 tok/s) step 5115/76294 | train loss 3.371049 | norm 0.1543 | lr 1.06e-03 | (9896.05 ms | 52980 tok/s) step 5116/76294 | train loss 3.311675 | norm 0.1348 | lr 1.06e-03 | (9942.56 ms | 52732 tok/s) step 5117/76294 | train loss 3.413557 | norm 0.1459 | lr 1.06e-03 | (9899.28 ms | 52962 tok/s) step 5118/76294 | train loss 3.393894 | norm 0.1283 | lr 1.06e-03 | (9899.06 ms | 52963 tok/s) step 5119/76294 | train loss 3.461981 | norm 0.1594 | lr 1.06e-03 | (9894.92 ms | 52986 tok/s) step 5120/76294 | train loss 
3.334651 | norm 0.1537 | lr 1.06e-03 | (9963.22 ms | 52622 tok/s) step 5121/76294 | train loss 3.453084 | norm 0.1552 | lr 1.06e-03 | (9898.69 ms | 52965 tok/s) step 5122/76294 | train loss 3.407156 | norm 0.1359 | lr 1.06e-03 | (9889.60 ms | 53014 tok/s) step 5123/76294 | train loss 3.347678 | norm 0.1506 | lr 1.06e-03 | (9897.78 ms | 52970 tok/s) step 5124/76294 | train loss 3.346331 | norm 0.1356 | lr 1.06e-03 | (9934.06 ms | 52777 tok/s) step 5125/76294 | train loss 3.384692 | norm 0.1374 | lr 1.06e-03 | (9893.53 ms | 52993 tok/s) step 5126/76294 | train loss 3.368380 | norm 0.1326 | lr 1.06e-03 | (9896.29 ms | 52978 tok/s) step 5127/76294 | train loss 3.314668 | norm 0.1462 | lr 1.06e-03 | (9902.78 ms | 52944 tok/s) step 5128/76294 | train loss 3.416009 | norm 0.1256 | lr 1.06e-03 | (9928.54 ms | 52806 tok/s) step 5129/76294 | train loss 3.487734 | norm 0.1508 | lr 1.06e-03 | (9898.22 ms | 52968 tok/s) step 5130/76294 | train loss 3.481902 | norm 0.1454 | lr 1.06e-03 | (9888.01 ms | 53023 tok/s) step 5131/76294 | train loss 3.318113 | norm 0.1290 | lr 1.06e-03 | (9913.93 ms | 52884 tok/s) step 5132/76294 | train loss 3.404759 | norm 0.1387 | lr 1.06e-03 | (9891.49 ms | 53004 tok/s) step 5133/76294 | train loss 3.350071 | norm 0.1254 | lr 1.06e-03 | (9932.13 ms | 52787 tok/s) step 5134/76294 | train loss 3.378086 | norm 0.1410 | lr 1.06e-03 | (9919.24 ms | 52856 tok/s) step 5135/76294 | train loss 3.348039 | norm 0.1454 | lr 1.06e-03 | (9905.71 ms | 52928 tok/s) step 5136/76294 | train loss 3.328421 | norm 0.1676 | lr 1.06e-03 | (9895.19 ms | 52984 tok/s) step 5137/76294 | train loss 3.351396 | norm 0.1357 | lr 1.06e-03 | (9895.41 ms | 52983 tok/s) step 5138/76294 | train loss 3.368850 | norm 0.1600 | lr 1.06e-03 | (9894.90 ms | 52986 tok/s) step 5139/76294 | train loss 3.338974 | norm 0.1526 | lr 1.06e-03 | (9960.09 ms | 52639 tok/s) step 5140/76294 | train loss 3.352408 | norm 0.1690 | lr 1.06e-03 | (9896.06 ms | 52979 tok/s) step 5141/76294 | train loss 
3.323344 | norm 0.1649 | lr 1.06e-03 | (9960.34 ms | 52638 tok/s) step 5142/76294 | train loss 3.416230 | norm 0.1431 | lr 1.06e-03 | (9892.96 ms | 52996 tok/s) step 5143/76294 | train loss 3.374672 | norm 0.1801 | lr 1.06e-03 | (9894.49 ms | 52988 tok/s) step 5144/76294 | train loss 3.402123 | norm 0.1319 | lr 1.06e-03 | (9893.50 ms | 52993 tok/s) step 5145/76294 | train loss 3.377872 | norm 0.1481 | lr 1.06e-03 | (9895.73 ms | 52981 tok/s) step 5146/76294 | train loss 3.415770 | norm 0.1287 | lr 1.06e-03 | (9893.05 ms | 52996 tok/s) step 5147/76294 | train loss 3.530061 | norm 0.1468 | lr 1.06e-03 | (9890.64 ms | 53008 tok/s) step 5148/76294 | train loss 3.493946 | norm 0.1741 | lr 1.06e-03 | (9894.70 ms | 52987 tok/s) step 5149/76294 | train loss 3.349014 | norm 0.1453 | lr 1.06e-03 | (9894.22 ms | 52989 tok/s) step 5150/76294 | train loss 3.355787 | norm 0.1489 | lr 1.06e-03 | (9917.00 ms | 52868 tok/s) step 5151/76294 | train loss 3.401009 | norm 0.1457 | lr 1.06e-03 | (9893.68 ms | 52992 tok/s) step 5152/76294 | train loss 3.397521 | norm 0.1514 | lr 1.06e-03 | (9934.89 ms | 52772 tok/s) step 5153/76294 | train loss 3.435428 | norm 0.1624 | lr 1.06e-03 | (9893.01 ms | 52996 tok/s) step 5154/76294 | train loss 3.385344 | norm 0.1769 | lr 1.06e-03 | (9915.64 ms | 52875 tok/s) step 5155/76294 | train loss 3.390908 | norm 0.1563 | lr 1.06e-03 | (9893.03 ms | 52996 tok/s) step 5156/76294 | train loss 3.342362 | norm 0.1444 | lr 1.06e-03 | (9896.12 ms | 52979 tok/s) step 5157/76294 | train loss 3.363986 | norm 0.1500 | lr 1.06e-03 | (9892.09 ms | 53001 tok/s) step 5158/76294 | train loss 3.356028 | norm 0.1354 | lr 1.06e-03 | (9900.60 ms | 52955 tok/s) step 5159/76294 | train loss 3.393997 | norm 0.1384 | lr 1.06e-03 | (9895.03 ms | 52985 tok/s) step 5160/76294 | train loss 3.461353 | norm 0.1459 | lr 1.06e-03 | (9896.03 ms | 52980 tok/s) step 5161/76294 | train loss 3.457919 | norm 0.1566 | lr 1.06e-03 | (9893.80 ms | 52992 tok/s) step 5162/76294 | train loss 
3.439454 | norm 0.1714 | lr 1.06e-03 | (9896.52 ms | 52977 tok/s) step 5163/76294 | train loss 3.524466 | norm 0.1976 | lr 1.06e-03 | (9892.33 ms | 52999 tok/s) step 5164/76294 | train loss 3.392287 | norm 0.1968 | lr 1.06e-03 | (9893.29 ms | 52994 tok/s) step 5165/76294 | train loss 3.534079 | norm 0.2157 | lr 1.06e-03 | (9892.73 ms | 52997 tok/s) step 5166/76294 | train loss 3.398215 | norm 0.1683 | lr 1.06e-03 | (9913.78 ms | 52885 tok/s) step 5167/76294 | train loss 3.358430 | norm 0.1541 | lr 1.06e-03 | (9893.25 ms | 52995 tok/s) step 5168/76294 | train loss 3.460259 | norm 0.1700 | lr 1.06e-03 | (9888.37 ms | 53021 tok/s) step 5169/76294 | train loss 3.389471 | norm 0.1666 | lr 1.06e-03 | (9958.93 ms | 52645 tok/s) step 5170/76294 | train loss 3.409371 | norm 0.1627 | lr 1.06e-03 | (9889.91 ms | 53012 tok/s) step 5171/76294 | train loss 3.400628 | norm 0.1617 | lr 1.06e-03 | (9925.82 ms | 52821 tok/s) step 5172/76294 | train loss 3.365139 | norm 0.1319 | lr 1.06e-03 | (9890.74 ms | 53008 tok/s) step 5173/76294 | train loss 3.379864 | norm 0.1423 | lr 1.06e-03 | (11137.34 ms | 47075 tok/s) step 5174/76294 | train loss 3.383089 | norm 0.1343 | lr 1.06e-03 | (9906.81 ms | 52922 tok/s) step 5175/76294 | train loss 3.392442 | norm 0.1344 | lr 1.06e-03 | (9889.24 ms | 53016 tok/s) step 5176/76294 | train loss 3.378070 | norm 0.1293 | lr 1.06e-03 | (9929.90 ms | 52799 tok/s) step 5177/76294 | train loss 3.375834 | norm 0.1253 | lr 1.06e-03 | (9891.53 ms | 53004 tok/s) step 5178/76294 | train loss 3.369046 | norm 0.1402 | lr 1.06e-03 | (9898.99 ms | 52964 tok/s) step 5179/76294 | train loss 3.425495 | norm 0.1381 | lr 1.06e-03 | (9895.71 ms | 52981 tok/s) step 5180/76294 | train loss 3.391167 | norm 0.1500 | lr 1.06e-03 | (9907.90 ms | 52916 tok/s) step 5181/76294 | train loss 3.383329 | norm 0.2035 | lr 1.06e-03 | (9964.07 ms | 52618 tok/s) step 5182/76294 | train loss 3.302707 | norm 0.1806 | lr 1.06e-03 | (9894.09 ms | 52990 tok/s) step 5183/76294 | train loss 
3.407599 | norm 0.1512 | lr 1.06e-03 | (9951.67 ms | 52683 tok/s) step 5184/76294 | train loss 3.415754 | norm 0.1703 | lr 1.06e-03 | (9890.16 ms | 53011 tok/s) step 5185/76294 | train loss 3.408459 | norm 0.1566 | lr 1.06e-03 | (9923.79 ms | 52831 tok/s) step 5186/76294 | train loss 3.357056 | norm 0.1402 | lr 1.06e-03 | (9935.43 ms | 52770 tok/s) step 5187/76294 | train loss 3.411268 | norm 0.1345 | lr 1.06e-03 | (9893.80 ms | 52992 tok/s) step 5188/76294 | train loss 3.325353 | norm 0.1584 | lr 1.06e-03 | (9894.09 ms | 52990 tok/s) step 5189/76294 | train loss 3.428542 | norm 0.1731 | lr 1.06e-03 | (9925.17 ms | 52824 tok/s) step 5190/76294 | train loss 3.374683 | norm 0.1478 | lr 1.06e-03 | (9890.34 ms | 53010 tok/s) step 5191/76294 | train loss 3.385408 | norm 0.1415 | lr 1.06e-03 | (9909.51 ms | 52908 tok/s) step 5192/76294 | train loss 3.399370 | norm 0.1376 | lr 1.06e-03 | (9891.05 ms | 53006 tok/s) step 5193/76294 | train loss 3.388802 | norm 0.1358 | lr 1.06e-03 | (9895.40 ms | 52983 tok/s) step 5194/76294 | train loss 3.392518 | norm 0.1554 | lr 1.06e-03 | (9885.79 ms | 53035 tok/s) step 5195/76294 | train loss 3.391214 | norm 0.1510 | lr 1.06e-03 | (9906.33 ms | 52925 tok/s) step 5196/76294 | train loss 3.503675 | norm 0.1551 | lr 1.06e-03 | (9928.66 ms | 52806 tok/s) step 5197/76294 | train loss 3.374771 | norm 0.1688 | lr 1.06e-03 | (9889.79 ms | 53013 tok/s) step 5198/76294 | train loss 3.396584 | norm 0.1655 | lr 1.06e-03 | (9899.33 ms | 52962 tok/s) step 5199/76294 | train loss 3.388590 | norm 0.1568 | lr 1.06e-03 | (9889.00 ms | 53017 tok/s) step 5200/76294 | train loss 3.413427 | norm 0.1814 | lr 1.06e-03 | (9894.99 ms | 52985 tok/s) step 5201/76294 | train loss 3.333056 | norm 0.1405 | lr 1.06e-03 | (9888.96 ms | 53018 tok/s) step 5202/76294 | train loss 3.349748 | norm 0.1525 | lr 1.06e-03 | (9892.49 ms | 52999 tok/s) step 5203/76294 | train loss 3.352389 | norm 0.1315 | lr 1.06e-03 | (9890.53 ms | 53009 tok/s) step 5204/76294 | train loss 
3.344368 | norm 0.1385 | lr 1.06e-03 | (9892.60 ms | 52998 tok/s) step 5205/76294 | train loss 3.406457 | norm 0.1525 | lr 1.06e-03 | (9890.35 ms | 53010 tok/s) step 5206/76294 | train loss 3.494242 | norm 0.1453 | lr 1.06e-03 | (9894.08 ms | 52990 tok/s) step 5207/76294 | train loss 3.305167 | norm 0.1595 | lr 1.06e-03 | (9920.80 ms | 52847 tok/s) step 5208/76294 | train loss 3.396162 | norm 0.1443 | lr 1.06e-03 | (9931.82 ms | 52789 tok/s) step 5209/76294 | train loss 3.388313 | norm 0.1527 | lr 1.06e-03 | (9888.32 ms | 53021 tok/s) step 5210/76294 | train loss 3.358736 | norm 0.1493 | lr 1.06e-03 | (9885.43 ms | 53036 tok/s) step 5211/76294 | train loss 3.373905 | norm 0.1416 | lr 1.06e-03 | (9954.24 ms | 52670 tok/s) step 5212/76294 | train loss 3.369459 | norm 0.1409 | lr 1.06e-03 | (9895.35 ms | 52983 tok/s) step 5213/76294 | train loss 3.301504 | norm 0.1357 | lr 1.06e-03 | (9901.79 ms | 52949 tok/s) step 5214/76294 | train loss 3.404315 | norm 0.1467 | lr 1.06e-03 | (9925.48 ms | 52822 tok/s) step 5215/76294 | train loss 3.400434 | norm 0.1432 | lr 1.06e-03 | (9887.43 ms | 53026 tok/s) step 5216/76294 | train loss 3.427176 | norm 0.1563 | lr 1.06e-03 | (9895.56 ms | 52982 tok/s) step 5217/76294 | train loss 3.376461 | norm 0.1526 | lr 1.05e-03 | (9890.76 ms | 53008 tok/s) step 5218/76294 | train loss 3.395873 | norm 0.1378 | lr 1.05e-03 | (9896.61 ms | 52977 tok/s) step 5219/76294 | train loss 3.343542 | norm 0.1527 | lr 1.05e-03 | (9952.51 ms | 52679 tok/s) step 5220/76294 | train loss 3.344300 | norm 0.1503 | lr 1.05e-03 | (9898.54 ms | 52966 tok/s) step 5221/76294 | train loss 3.387059 | norm 0.1344 | lr 1.05e-03 | (9887.48 ms | 53025 tok/s) step 5222/76294 | train loss 3.379784 | norm 0.1646 | lr 1.05e-03 | (9889.49 ms | 53015 tok/s) step 5223/76294 | train loss 3.303409 | norm 0.1370 | lr 1.05e-03 | (9945.96 ms | 52714 tok/s) step 5224/76294 | train loss 3.378680 | norm 0.1451 | lr 1.05e-03 | (9884.28 ms | 53043 tok/s) step 5225/76294 | train loss 
3.359244 | norm 0.1602 | lr 1.05e-03 | (9929.12 ms | 52803 tok/s) step 5226/76294 | train loss 3.359653 | norm 0.1422 | lr 1.05e-03 | (9885.51 ms | 53036 tok/s) step 5227/76294 | train loss 3.363068 | norm 0.1435 | lr 1.05e-03 | (9896.50 ms | 52977 tok/s) step 5228/76294 | train loss 3.403399 | norm 0.1492 | lr 1.05e-03 | (9886.67 ms | 53030 tok/s) step 5229/76294 | train loss 3.362241 | norm 0.1582 | lr 1.05e-03 | (9895.63 ms | 52982 tok/s) step 5230/76294 | train loss 3.336980 | norm 0.1379 | lr 1.05e-03 | (9896.28 ms | 52978 tok/s) step 5231/76294 | train loss 3.389520 | norm 0.1270 | lr 1.05e-03 | (9901.46 ms | 52951 tok/s) step 5232/76294 | train loss 3.336015 | norm 0.1999 | lr 1.05e-03 | (9895.20 ms | 52984 tok/s) step 5233/76294 | train loss 3.386575 | norm 0.2374 | lr 1.05e-03 | (9904.29 ms | 52935 tok/s) step 5234/76294 | train loss 3.370508 | norm 0.1551 | lr 1.05e-03 | (9885.13 ms | 53038 tok/s) step 5235/76294 | train loss 3.401329 | norm 0.1709 | lr 1.05e-03 | (9974.96 ms | 52560 tok/s) step 5236/76294 | train loss 3.379822 | norm 0.1642 | lr 1.05e-03 | (9886.63 ms | 53030 tok/s) step 5237/76294 | train loss 3.356317 | norm 0.1614 | lr 1.05e-03 | (9895.71 ms | 52981 tok/s) step 5238/76294 | train loss 3.363138 | norm 0.1506 | lr 1.05e-03 | (9885.69 ms | 53035 tok/s) step 5239/76294 | train loss 3.333769 | norm 0.1560 | lr 1.05e-03 | (9913.27 ms | 52887 tok/s) step 5240/76294 | train loss 3.355795 | norm 0.1648 | lr 1.05e-03 | (9888.67 ms | 53019 tok/s) step 5241/76294 | train loss 3.317462 | norm 0.1358 | lr 1.05e-03 | (9935.68 ms | 52768 tok/s) step 5242/76294 | train loss 3.414431 | norm 0.1551 | lr 1.05e-03 | (9884.64 ms | 53041 tok/s) step 5243/76294 | train loss 3.388095 | norm 0.1432 | lr 1.05e-03 | (9894.15 ms | 52990 tok/s) step 5244/76294 | train loss 3.459178 | norm 0.1951 | lr 1.05e-03 | (9910.42 ms | 52903 tok/s) step 5245/76294 | train loss 3.436808 | norm 0.1595 | lr 1.05e-03 | (9896.60 ms | 52977 tok/s) step 5246/76294 | train loss 
3.392777 | norm 0.1443 | lr 1.05e-03 | (9897.56 ms | 52971 tok/s) step 5247/76294 | train loss 3.352370 | norm 0.2863 | lr 1.05e-03 | (9889.64 ms | 53014 tok/s) step 5248/76294 | train loss 3.411746 | norm 0.1659 | lr 1.05e-03 | (9894.65 ms | 52987 tok/s) step 5249/76294 | train loss 3.336957 | norm 0.1686 | lr 1.05e-03 | (9890.18 ms | 53011 tok/s) step 5250/76294 | train loss 3.457479 | norm 0.1889 | lr 1.05e-03 | (9893.46 ms | 52993 tok/s) val loss: 3.350519 saving model checkpoint to ./results/gpt2-350M-gqa/step_5250.pth step 5251/76294 | train loss 3.371698 | norm 0.1596 | lr 1.05e-03 | (9942.99 ms | 52729 tok/s) step 5252/76294 | train loss 3.386646 | norm 0.1661 | lr 1.05e-03 | (9870.33 ms | 53118 tok/s) step 5253/76294 | train loss 3.337866 | norm 0.1587 | lr 1.05e-03 | (9892.86 ms | 52997 tok/s) step 5254/76294 | train loss 3.339972 | norm 0.1648 | lr 1.05e-03 | (9873.95 ms | 53098 tok/s) step 5255/76294 | train loss 3.340529 | norm 0.1489 | lr 1.05e-03 | (9892.28 ms | 53000 tok/s) step 5256/76294 | train loss 3.402706 | norm 0.1745 | lr 1.05e-03 | (10471.72 ms | 50067 tok/s) step 5257/76294 | train loss 3.380641 | norm 0.1482 | lr 1.05e-03 | (9883.44 ms | 53047 tok/s) step 5258/76294 | train loss 3.333998 | norm 0.1661 | lr 1.05e-03 | (9887.29 ms | 53026 tok/s) step 5259/76294 | train loss 3.486145 | norm 0.1490 | lr 1.05e-03 | (9891.45 ms | 53004 tok/s) step 5260/76294 | train loss 3.348371 | norm 0.1568 | lr 1.05e-03 | (9893.07 ms | 52995 tok/s) step 5261/76294 | train loss 3.477444 | norm 0.1428 | lr 1.05e-03 | (9924.56 ms | 52827 tok/s) step 5262/76294 | train loss 3.330912 | norm 0.1696 | lr 1.05e-03 | (9891.28 ms | 53005 tok/s) step 5263/76294 | train loss 3.383986 | norm 0.1819 | lr 1.05e-03 | (9898.73 ms | 52965 tok/s) step 5264/76294 | train loss 3.308527 | norm 0.1719 | lr 1.05e-03 | (9890.08 ms | 53011 tok/s) step 5265/76294 | train loss 3.350828 | norm 0.1799 | lr 1.05e-03 | (9888.45 ms | 53020 tok/s) step 5266/76294 | train loss 3.343992 | 
norm 0.1880 | lr 1.05e-03 | (9917.07 ms | 52867 tok/s) step 5267/76294 | train loss 3.351708 | norm 0.1881 | lr 1.05e-03 | (9955.09 ms | 52665 tok/s) step 5268/76294 | train loss 3.349381 | norm 0.1808 | lr 1.05e-03 | (9896.31 ms | 52978 tok/s) step 5269/76294 | train loss 3.365372 | norm 0.1411 | lr 1.05e-03 | (9894.46 ms | 52988 tok/s) step 5270/76294 | train loss 3.311439 | norm 0.1884 | lr 1.05e-03 | (9930.91 ms | 52794 tok/s) step 5271/76294 | train loss 3.370434 | norm 0.1593 | lr 1.05e-03 | (11053.04 ms | 47434 tok/s) step 5272/76294 | train loss 3.363081 | norm 0.1927 | lr 1.05e-03 | (9882.19 ms | 53054 tok/s) step 5273/76294 | train loss 3.400691 | norm 0.2644 | lr 1.05e-03 | (9892.07 ms | 53001 tok/s) step 5274/76294 | train loss 3.374749 | norm 0.3761 | lr 1.05e-03 | (9880.69 ms | 53062 tok/s) step 5275/76294 | train loss 3.333533 | norm 0.1956 | lr 1.05e-03 | (9887.88 ms | 53023 tok/s) step 5276/76294 | train loss 3.629810 | norm 0.2165 | lr 1.05e-03 | (9944.62 ms | 52721 tok/s) step 5277/76294 | train loss 3.362089 | norm 0.1787 | lr 1.05e-03 | (9885.86 ms | 53034 tok/s) step 5278/76294 | train loss 3.335807 | norm 0.1640 | lr 1.05e-03 | (9913.18 ms | 52888 tok/s) step 5279/76294 | train loss 3.339163 | norm 0.1468 | lr 1.05e-03 | (9889.44 ms | 53015 tok/s) step 5280/76294 | train loss 3.349003 | norm 0.1411 | lr 1.05e-03 | (9908.38 ms | 52914 tok/s) step 5281/76294 | train loss 3.344827 | norm 0.1587 | lr 1.05e-03 | (9897.78 ms | 52970 tok/s) step 5282/76294 | train loss 3.371123 | norm 0.1298 | lr 1.05e-03 | (9886.58 ms | 53030 tok/s) step 5283/76294 | train loss 3.446015 | norm 0.1484 | lr 1.05e-03 | (9898.17 ms | 52968 tok/s) step 5284/76294 | train loss 3.445761 | norm 0.1331 | lr 1.05e-03 | (9889.58 ms | 53014 tok/s) step 5285/76294 | train loss 3.369180 | norm 0.1414 | lr 1.05e-03 | (9894.25 ms | 52989 tok/s) step 5286/76294 | train loss 3.364221 | norm 0.1476 | lr 1.05e-03 | (9894.97 ms | 52985 tok/s) step 5287/76294 | train loss 3.405889 | 
norm 0.1386 | lr 1.05e-03 | (9889.30 ms | 53016 tok/s) step 5288/76294 | train loss 3.363994 | norm 0.1402 | lr 1.05e-03 | (9890.18 ms | 53011 tok/s) step 5289/76294 | train loss 3.320122 | norm 0.1582 | lr 1.05e-03 | (9931.89 ms | 52788 tok/s) step 5290/76294 | train loss 3.333298 | norm 0.1300 | lr 1.05e-03 | (9892.06 ms | 53001 tok/s) step 5291/76294 | train loss 3.405061 | norm 0.1720 | lr 1.05e-03 | (9900.26 ms | 52957 tok/s) step 5292/76294 | train loss 3.365507 | norm 0.1320 | lr 1.05e-03 | (9893.02 ms | 52996 tok/s) step 5293/76294 | train loss 3.327732 | norm 0.1792 | lr 1.05e-03 | (9899.27 ms | 52962 tok/s) step 5294/76294 | train loss 3.244766 | norm 0.1356 | lr 1.05e-03 | (9895.11 ms | 52985 tok/s) step 5295/76294 | train loss 3.395823 | norm 0.1526 | lr 1.05e-03 | (9899.90 ms | 52959 tok/s) step 5296/76294 | train loss 3.411637 | norm 0.1511 | lr 1.05e-03 | (9894.00 ms | 52991 tok/s) step 5297/76294 | train loss 3.407261 | norm 0.1888 | lr 1.05e-03 | (9946.18 ms | 52712 tok/s) step 5298/76294 | train loss 3.395250 | norm 0.1510 | lr 1.05e-03 | (9892.81 ms | 52997 tok/s) step 5299/76294 | train loss 3.367933 | norm 0.1685 | lr 1.05e-03 | (9917.27 ms | 52866 tok/s) step 5300/76294 | train loss 3.463061 | norm 0.1780 | lr 1.05e-03 | (9895.21 ms | 52984 tok/s) step 5301/76294 | train loss 3.362625 | norm 0.1600 | lr 1.05e-03 | (9895.61 ms | 52982 tok/s) step 5302/76294 | train loss 3.328555 | norm 0.1643 | lr 1.05e-03 | (9894.59 ms | 52987 tok/s) step 5303/76294 | train loss 3.364010 | norm 0.1522 | lr 1.05e-03 | (9890.37 ms | 53010 tok/s) step 5304/76294 | train loss 3.353548 | norm 0.1607 | lr 1.05e-03 | (9898.78 ms | 52965 tok/s) step 5305/76294 | train loss 3.348275 | norm 0.1698 | lr 1.05e-03 | (9892.40 ms | 52999 tok/s) step 5306/76294 | train loss 3.332000 | norm 0.1317 | lr 1.05e-03 | (9898.66 ms | 52966 tok/s) step 5307/76294 | train loss 3.321741 | norm 0.1565 | lr 1.05e-03 | (9890.73 ms | 53008 tok/s) step 5308/76294 | train loss 3.306137 | norm 
0.1345 | lr 1.05e-03 | (9894.83 ms | 52986 tok/s) step 5309/76294 | train loss 3.467006 | norm 0.2051 | lr 1.05e-03 | (9895.17 ms | 52984 tok/s) step 5310/76294 | train loss 3.352191 | norm 0.1571 | lr 1.05e-03 | (9889.85 ms | 53013 tok/s) step 5311/76294 | train loss 3.390041 | norm 0.1676 | lr 1.05e-03 | (10056.30 ms | 52135 tok/s) step 5312/76294 | train loss 3.362390 | norm 0.1413 | lr 1.05e-03 | (9948.40 ms | 52701 tok/s) step 5313/76294 | train loss 3.441952 | norm 0.2028 | lr 1.05e-03 | (9886.83 ms | 53029 tok/s) step 5314/76294 | train loss 3.371387 | norm 0.2192 | lr 1.05e-03 | (9952.07 ms | 52681 tok/s) step 5315/76294 | train loss 3.324619 | norm 0.1884 | lr 1.05e-03 | (9899.09 ms | 52963 tok/s) step 5316/76294 | train loss 3.354220 | norm 0.1743 | lr 1.05e-03 | (9915.32 ms | 52877 tok/s) step 5317/76294 | train loss 3.383507 | norm 0.1727 | lr 1.05e-03 | (9894.74 ms | 52987 tok/s) step 5318/76294 | train loss 3.333196 | norm 0.1912 | lr 1.05e-03 | (9885.20 ms | 53038 tok/s) step 5319/76294 | train loss 3.431304 | norm 0.1575 | lr 1.05e-03 | (9893.00 ms | 52996 tok/s) step 5320/76294 | train loss 3.304574 | norm 0.1428 | lr 1.05e-03 | (9928.24 ms | 52808 tok/s) step 5321/76294 | train loss 3.336571 | norm 0.1569 | lr 1.05e-03 | (9890.28 ms | 53010 tok/s) step 5322/76294 | train loss 3.352589 | norm 0.1318 | lr 1.05e-03 | (9893.72 ms | 52992 tok/s) step 5323/76294 | train loss 3.433598 | norm 0.1527 | lr 1.05e-03 | (9888.90 ms | 53018 tok/s) step 5324/76294 | train loss 3.540482 | norm 0.1696 | lr 1.05e-03 | (9896.14 ms | 52979 tok/s) step 5325/76294 | train loss 3.422935 | norm 0.1647 | lr 1.05e-03 | (9902.58 ms | 52945 tok/s) step 5326/76294 | train loss 3.366175 | norm 0.1552 | lr 1.05e-03 | (9907.06 ms | 52921 tok/s) step 5327/76294 | train loss 3.349309 | norm 0.1722 | lr 1.05e-03 | (9955.10 ms | 52665 tok/s) step 5328/76294 | train loss 3.401947 | norm 0.1403 | lr 1.05e-03 | (9900.46 ms | 52956 tok/s) step 5329/76294 | train loss 3.312268 | norm 
0.1442 | lr 1.05e-03 | (9900.51 ms | 52956 tok/s) step 5330/76294 | train loss 3.413349 | norm 0.1466 | lr 1.05e-03 | (9925.98 ms | 52820 tok/s) step 5331/76294 | train loss 3.311256 | norm 0.1675 | lr 1.05e-03 | (9888.37 ms | 53021 tok/s) step 5332/76294 | train loss 3.375674 | norm 0.1258 | lr 1.05e-03 | (9898.16 ms | 52968 tok/s) step 5333/76294 | train loss 3.311122 | norm 0.1563 | lr 1.05e-03 | (9890.66 ms | 53008 tok/s) step 5334/76294 | train loss 3.377136 | norm 0.1263 | lr 1.05e-03 | (9921.81 ms | 52842 tok/s) step 5335/76294 | train loss 3.352656 | norm 0.1446 | lr 1.05e-03 | (9889.47 ms | 53015 tok/s) step 5336/76294 | train loss 3.324802 | norm 0.1398 | lr 1.05e-03 | (9897.85 ms | 52970 tok/s) step 5337/76294 | train loss 3.353738 | norm 0.1422 | lr 1.05e-03 | (9887.33 ms | 53026 tok/s) step 5338/76294 | train loss 3.381594 | norm 0.1275 | lr 1.05e-03 | (9887.84 ms | 53024 tok/s) step 5339/76294 | train loss 3.350638 | norm 0.1653 | lr 1.05e-03 | (9983.07 ms | 52518 tok/s) step 5340/76294 | train loss 3.337632 | norm 0.1339 | lr 1.05e-03 | (9887.46 ms | 53026 tok/s) step 5341/76294 | train loss 3.365958 | norm 0.1435 | lr 1.05e-03 | (9928.95 ms | 52804 tok/s) step 5342/76294 | train loss 3.389218 | norm 0.1304 | lr 1.05e-03 | (9887.73 ms | 53024 tok/s) step 5343/76294 | train loss 3.372407 | norm 0.1377 | lr 1.05e-03 | (9899.24 ms | 52962 tok/s) step 5344/76294 | train loss 3.309174 | norm 0.1368 | lr 1.05e-03 | (9887.91 ms | 53023 tok/s) step 5345/76294 | train loss 3.431821 | norm 0.1540 | lr 1.05e-03 | (9894.36 ms | 52989 tok/s) step 5346/76294 | train loss 3.386747 | norm 0.1550 | lr 1.05e-03 | (9888.21 ms | 53022 tok/s) step 5347/76294 | train loss 3.366224 | norm 0.1482 | lr 1.05e-03 | (9895.55 ms | 52982 tok/s) step 5348/76294 | train loss 3.353088 | norm 0.1487 | lr 1.05e-03 | (9890.81 ms | 53008 tok/s) step 5349/76294 | train loss 3.369694 | norm 0.1479 | lr 1.05e-03 | (9918.42 ms | 52860 tok/s) step 5350/76294 | train loss 3.388527 | norm 
0.1425 | lr 1.05e-03 | (9899.56 ms | 52961 tok/s) step 5351/76294 | train loss 3.319406 | norm 0.1591 | lr 1.05e-03 | (9893.28 ms | 52994 tok/s) step 5352/76294 | train loss 3.388156 | norm 0.1570 | lr 1.05e-03 | (9889.07 ms | 53017 tok/s) step 5353/76294 | train loss 3.307144 | norm 0.1459 | lr 1.05e-03 | (9911.30 ms | 52898 tok/s) step 5354/76294 | train loss 3.344537 | norm 0.1388 | lr 1.05e-03 | (9891.26 ms | 53005 tok/s) step 5355/76294 | train loss 3.290834 | norm 0.1494 | lr 1.05e-03 | (9889.73 ms | 53013 tok/s) step 5356/76294 | train loss 3.332889 | norm 0.2002 | lr 1.05e-03 | (9892.00 ms | 53001 tok/s) step 5357/76294 | train loss 3.347333 | norm 0.1776 | lr 1.05e-03 | (9890.29 ms | 53010 tok/s) step 5358/76294 | train loss 3.371041 | norm 0.1446 | lr 1.05e-03 | (9892.77 ms | 52997 tok/s) step 5359/76294 | train loss 3.331970 | norm 0.1690 | lr 1.05e-03 | (9890.37 ms | 53010 tok/s) step 5360/76294 | train loss 3.338269 | norm 0.1496 | lr 1.05e-03 | (9893.73 ms | 52992 tok/s) step 5361/76294 | train loss 3.408536 | norm 0.1589 | lr 1.05e-03 | (9894.57 ms | 52987 tok/s) step 5362/76294 | train loss 3.318613 | norm 0.1354 | lr 1.05e-03 | (9910.90 ms | 52900 tok/s) step 5363/76294 | train loss 3.451890 | norm 0.1533 | lr 1.05e-03 | (9951.73 ms | 52683 tok/s) step 5364/76294 | train loss 3.312415 | norm 0.1363 | lr 1.05e-03 | (9883.93 ms | 53045 tok/s) step 5365/76294 | train loss 3.394891 | norm 0.1391 | lr 1.05e-03 | (9914.53 ms | 52881 tok/s) step 5366/76294 | train loss 3.390193 | norm 0.1385 | lr 1.05e-03 | (9952.71 ms | 52678 tok/s) step 5367/76294 | train loss 3.368518 | norm 0.1450 | lr 1.05e-03 | (9894.86 ms | 52986 tok/s) step 5368/76294 | train loss 3.323034 | norm 0.1235 | lr 1.05e-03 | (10883.22 ms | 48174 tok/s) step 5369/76294 | train loss 3.336688 | norm 0.1269 | lr 1.05e-03 | (9881.00 ms | 53060 tok/s) step 5370/76294 | train loss 3.352254 | norm 0.1394 | lr 1.05e-03 | (9924.28 ms | 52829 tok/s) step 5371/76294 | train loss 3.351817 | norm 
0.1509 | lr 1.05e-03 | (9903.72 ms | 52938 tok/s) step 5372/76294 | train loss 3.363282 | norm 0.1179 | lr 1.05e-03 | (9890.83 ms | 53007 tok/s) step 5373/76294 | train loss 3.363306 | norm 0.1520 | lr 1.05e-03 | (9953.44 ms | 52674 tok/s) step 5374/76294 | train loss 3.399884 | norm 0.1365 | lr 1.05e-03 | (9883.51 ms | 53047 tok/s) step 5375/76294 | train loss 3.372755 | norm 0.1384 | lr 1.05e-03 | (9886.30 ms | 53032 tok/s) step 5376/76294 | train loss 3.409316 | norm 0.1834 | lr 1.05e-03 | (9932.84 ms | 52783 tok/s) step 5377/76294 | train loss 3.309584 | norm 0.1575 | lr 1.05e-03 | (9891.98 ms | 53001 tok/s) step 5378/76294 | train loss 3.314713 | norm 0.1713 | lr 1.04e-03 | (9888.36 ms | 53021 tok/s) step 5379/76294 | train loss 3.376360 | norm 0.1800 | lr 1.04e-03 | (9899.54 ms | 52961 tok/s) step 5380/76294 | train loss 3.394868 | norm 0.1545 | lr 1.04e-03 | (9885.77 ms | 53035 tok/s) step 5381/76294 | train loss 3.432037 | norm 0.1533 | lr 1.04e-03 | (9894.66 ms | 52987 tok/s) step 5382/76294 | train loss 3.324633 | norm 0.1511 | lr 1.04e-03 | (9886.31 ms | 53032 tok/s) step 5383/76294 | train loss 3.399845 | norm 0.1733 | lr 1.04e-03 | (9894.45 ms | 52988 tok/s) step 5384/76294 | train loss 3.384553 | norm 0.2102 | lr 1.04e-03 | (9889.82 ms | 53013 tok/s) step 5385/76294 | train loss 3.438850 | norm 0.1792 | lr 1.04e-03 | (9891.22 ms | 53005 tok/s) step 5386/76294 | train loss 3.285699 | norm 0.1656 | lr 1.04e-03 | (9896.80 ms | 52976 tok/s) step 5387/76294 | train loss 3.357925 | norm 0.1852 | lr 1.04e-03 | (9932.42 ms | 52786 tok/s) step 5388/76294 | train loss 3.389894 | norm 0.1560 | lr 1.04e-03 | (9925.58 ms | 52822 tok/s) step 5389/76294 | train loss 3.329399 | norm 0.1652 | lr 1.04e-03 | (9915.12 ms | 52878 tok/s) step 5390/76294 | train loss 3.380998 | norm 0.1280 | lr 1.04e-03 | (9890.27 ms | 53010 tok/s) step 5391/76294 | train loss 3.339825 | norm 0.1726 | lr 1.04e-03 | (9910.60 ms | 52902 tok/s) step 5392/76294 | train loss 3.418848 | norm 
0.2069 | lr 1.04e-03 | (9892.28 ms | 53000 tok/s) step 5393/76294 | train loss 3.357905 | norm 0.1589 | lr 1.04e-03 | (10643.25 ms | 49260 tok/s) step 5394/76294 | train loss 3.319532 | norm 0.1708 | lr 1.04e-03 | (9895.50 ms | 52982 tok/s) step 5395/76294 | train loss 3.361033 | norm 0.1906 | lr 1.04e-03 | (9896.85 ms | 52975 tok/s) step 5396/76294 | train loss 3.322924 | norm 0.1517 | lr 1.04e-03 | (9904.02 ms | 52937 tok/s) step 5397/76294 | train loss 3.364223 | norm 0.1606 | lr 1.04e-03 | (9897.56 ms | 52971 tok/s) step 5398/76294 | train loss 3.332429 | norm 0.1562 | lr 1.04e-03 | (9884.08 ms | 53044 tok/s) step 5399/76294 | train loss 3.408323 | norm 0.1466 | lr 1.04e-03 | (9897.87 ms | 52970 tok/s) step 5400/76294 | train loss 3.337869 | norm 0.1512 | lr 1.04e-03 | (9887.28 ms | 53026 tok/s) step 5401/76294 | train loss 3.350051 | norm 0.1495 | lr 1.04e-03 | (9900.96 ms | 52953 tok/s) step 5402/76294 | train loss 3.332687 | norm 0.1424 | lr 1.04e-03 | (9950.44 ms | 52690 tok/s) step 5403/76294 | train loss 3.344753 | norm 0.1423 | lr 1.04e-03 | (9892.83 ms | 52997 tok/s) step 5404/76294 | train loss 3.397677 | norm 0.1526 | lr 1.04e-03 | (9885.94 ms | 53034 tok/s) step 5405/76294 | train loss 3.358358 | norm 0.1473 | lr 1.04e-03 | (9897.46 ms | 52972 tok/s) step 5406/76294 | train loss 3.379516 | norm 0.1434 | lr 1.04e-03 | (9885.59 ms | 53036 tok/s) step 5407/76294 | train loss 3.332487 | norm 0.1585 | lr 1.04e-03 | (9952.43 ms | 52679 tok/s) step 5408/76294 | train loss 3.365224 | norm 0.1428 | lr 1.04e-03 | (9910.57 ms | 52902 tok/s) step 5409/76294 | train loss 3.388286 | norm 0.1494 | lr 1.04e-03 | (9958.81 ms | 52646 tok/s) step 5410/76294 | train loss 3.352675 | norm 0.1537 | lr 1.04e-03 | (9887.91 ms | 53023 tok/s) step 5411/76294 | train loss 3.361643 | norm 0.1903 | lr 1.04e-03 | (9888.14 ms | 53022 tok/s) step 5412/76294 | train loss 3.341614 | norm 0.1453 | lr 1.04e-03 | (9886.72 ms | 53030 tok/s) step 5413/76294 | train loss 3.360750 | norm 
0.1785 | lr 1.04e-03 | (9897.47 ms | 52972 tok/s) step 5414/76294 | train loss 3.364672 | norm 0.1611 | lr 1.04e-03 | (9940.35 ms | 52743 tok/s) step 5415/76294 | train loss 3.334805 | norm 0.1558 | lr 1.04e-03 | (9894.46 ms | 52988 tok/s) step 5416/76294 | train loss 3.319281 | norm 0.1409 | lr 1.04e-03 | (9928.79 ms | 52805 tok/s) step 5417/76294 | train loss 3.361186 | norm 0.1440 | lr 1.04e-03 | (9891.15 ms | 53006 tok/s) step 5418/76294 | train loss 3.357868 | norm 0.1550 | lr 1.04e-03 | (9899.00 ms | 52964 tok/s) step 5419/76294 | train loss 3.405207 | norm 0.1583 | lr 1.04e-03 | (9892.60 ms | 52998 tok/s) step 5420/76294 | train loss 3.306105 | norm 0.1505 | lr 1.04e-03 | (9919.19 ms | 52856 tok/s) step 5421/76294 | train loss 3.298414 | norm 0.1599 | lr 1.04e-03 | (9896.56 ms | 52977 tok/s) step 5422/76294 | train loss 3.357189 | norm 0.1453 | lr 1.04e-03 | (9887.57 ms | 53025 tok/s) step 5423/76294 | train loss 3.312163 | norm 0.1372 | lr 1.04e-03 | (9889.07 ms | 53017 tok/s) step 5424/76294 | train loss 3.372551 | norm 0.1596 | lr 1.04e-03 | (9892.93 ms | 52996 tok/s) step 5425/76294 | train loss 3.382567 | norm 0.1529 | lr 1.04e-03 | (9929.42 ms | 52801 tok/s) step 5426/76294 | train loss 3.352176 | norm 0.1584 | lr 1.04e-03 | (9888.10 ms | 53022 tok/s) step 5427/76294 | train loss 3.414571 | norm 0.1452 | lr 1.04e-03 | (9919.65 ms | 52853 tok/s) step 5428/76294 | train loss 3.334133 | norm 0.1603 | lr 1.04e-03 | (9891.15 ms | 53006 tok/s) step 5429/76294 | train loss 3.338588 | norm 0.1405 | lr 1.04e-03 | (9899.10 ms | 52963 tok/s) step 5430/76294 | train loss 3.344117 | norm 0.1652 | lr 1.04e-03 | (9891.78 ms | 53002 tok/s) step 5431/76294 | train loss 3.367285 | norm 0.1292 | lr 1.04e-03 | (9899.54 ms | 52961 tok/s) step 5432/76294 | train loss 3.336161 | norm 0.1587 | lr 1.04e-03 | (9894.34 ms | 52989 tok/s) step 5433/76294 | train loss 3.342098 | norm 0.1484 | lr 1.04e-03 | (9980.09 ms | 52533 tok/s) step 5434/76294 | train loss 3.302410 | norm 
0.1576 | lr 1.04e-03 | (9890.24 ms | 53011 tok/s) step 5435/76294 | train loss 3.324008 | norm 0.1719 | lr 1.04e-03 | (9920.21 ms | 52850 tok/s) step 5436/76294 | train loss 3.337863 | norm 0.1380 | lr 1.04e-03 | (9890.29 ms | 53010 tok/s) step 5437/76294 | train loss 3.355122 | norm 0.1518 | lr 1.04e-03 | (9915.90 ms | 52873 tok/s) step 5438/76294 | train loss 3.342131 | norm 0.1588 | lr 1.04e-03 | (9930.70 ms | 52795 tok/s) step 5439/76294 | train loss 3.373296 | norm 0.1318 | lr 1.04e-03 | (9894.86 ms | 52986 tok/s) step 5440/76294 | train loss 3.281167 | norm 0.1458 | lr 1.04e-03 | (9890.60 ms | 53009 tok/s) step 5441/76294 | train loss 3.379018 | norm 0.1492 | lr 1.04e-03 | (9900.58 ms | 52955 tok/s) step 5442/76294 | train loss 3.346938 | norm 0.1577 | lr 1.04e-03 | (9908.57 ms | 52913 tok/s) step 5443/76294 | train loss 3.360036 | norm 0.1357 | lr 1.04e-03 | (9988.79 ms | 52488 tok/s) step 5444/76294 | train loss 3.390448 | norm 0.1389 | lr 1.04e-03 | (9890.42 ms | 53010 tok/s) step 5445/76294 | train loss 3.311720 | norm 0.1263 | lr 1.04e-03 | (9896.74 ms | 52976 tok/s) step 5446/76294 | train loss 3.369611 | norm 0.1519 | lr 1.04e-03 | (9892.37 ms | 52999 tok/s) step 5447/76294 | train loss 3.262467 | norm 0.1332 | lr 1.04e-03 | (10268.72 ms | 51057 tok/s) step 5448/76294 | train loss 3.377547 | norm 0.1560 | lr 1.04e-03 | (9916.07 ms | 52873 tok/s) step 5449/76294 | train loss 3.385146 | norm 0.2048 | lr 1.04e-03 | (9894.38 ms | 52988 tok/s) step 5450/76294 | train loss 3.371939 | norm 0.2721 | lr 1.04e-03 | (9897.70 ms | 52971 tok/s) step 5451/76294 | train loss 3.376660 | norm 0.1783 | lr 1.04e-03 | (9893.29 ms | 52994 tok/s) step 5452/76294 | train loss 3.409375 | norm 0.1834 | lr 1.04e-03 | (9957.51 ms | 52653 tok/s) step 5453/76294 | train loss 3.302744 | norm 0.1655 | lr 1.04e-03 | (9891.05 ms | 53006 tok/s) step 5454/76294 | train loss 3.371512 | norm 0.1716 | lr 1.04e-03 | (9977.69 ms | 52546 tok/s) step 5455/76294 | train loss 3.322878 | norm 
0.1514 | lr 1.04e-03 | (9902.84 ms | 52943 tok/s) step 5456/76294 | train loss 3.499116 | norm 0.1685 | lr 1.04e-03 | (9950.86 ms | 52688 tok/s) step 5457/76294 | train loss 3.330312 | norm 0.1736 | lr 1.04e-03 | (9899.67 ms | 52960 tok/s) step 5458/76294 | train loss 3.369049 | norm 0.1725 | lr 1.04e-03 | (9888.53 ms | 53020 tok/s) step 5459/76294 | train loss 3.407890 | norm 0.1708 | lr 1.04e-03 | (9897.15 ms | 52974 tok/s) step 5460/76294 | train loss 3.377672 | norm 0.1546 | lr 1.04e-03 | (9889.49 ms | 53015 tok/s) step 5461/76294 | train loss 3.363359 | norm 0.1648 | lr 1.04e-03 | (9938.01 ms | 52756 tok/s) step 5462/76294 | train loss 3.284876 | norm 0.1357 | lr 1.04e-03 | (9886.54 ms | 53030 tok/s) step 5463/76294 | train loss 3.389935 | norm 0.1736 | lr 1.04e-03 | (9893.90 ms | 52991 tok/s) step 5464/76294 | train loss 3.384068 | norm 0.1205 | lr 1.04e-03 | (9885.43 ms | 53036 tok/s) step 5465/76294 | train loss 3.301924 | norm 0.1540 | lr 1.04e-03 | (9891.98 ms | 53001 tok/s) step 5466/76294 | train loss 3.345292 | norm 0.1482 | lr 1.04e-03 | (10937.87 ms | 47933 tok/s) step 5467/76294 | train loss 3.345399 | norm 0.1548 | lr 1.04e-03 | (9890.63 ms | 53009 tok/s) step 5468/76294 | train loss 3.346434 | norm 0.1789 | lr 1.04e-03 | (11559.74 ms | 45355 tok/s) step 5469/76294 | train loss 3.327976 | norm 0.1396 | lr 1.04e-03 | (9882.65 ms | 53051 tok/s) step 5470/76294 | train loss 3.369354 | norm 0.1352 | lr 1.04e-03 | (9886.17 ms | 53032 tok/s) step 5471/76294 | train loss 3.342071 | norm 0.1422 | lr 1.04e-03 | (9878.03 ms | 53076 tok/s) step 5472/76294 | train loss 3.312936 | norm 0.1225 | lr 1.04e-03 | (9892.45 ms | 52999 tok/s) step 5473/76294 | train loss 3.373797 | norm 0.1421 | lr 1.04e-03 | (9886.96 ms | 53028 tok/s) step 5474/76294 | train loss 3.348740 | norm 0.1330 | lr 1.04e-03 | (9887.00 ms | 53028 tok/s) step 5475/76294 | train loss 3.345498 | norm 0.1404 | lr 1.04e-03 | (9891.07 ms | 53006 tok/s) step 5476/76294 | train loss 3.294177 | norm 
0.1353 | lr 1.04e-03 | (9894.54 ms | 52988 tok/s) step 5477/76294 | train loss 3.363070 | norm 0.1539 | lr 1.04e-03 | (9890.80 ms | 53008 tok/s) step 5478/76294 | train loss 3.306346 | norm 0.1474 | lr 1.04e-03 | (9930.05 ms | 52798 tok/s) step 5479/76294 | train loss 3.388406 | norm 0.1469 | lr 1.04e-03 | (9938.28 ms | 52754 tok/s) step 5480/76294 | train loss 3.384068 | norm 0.1508 | lr 1.04e-03 | (9898.88 ms | 52964 tok/s) step 5481/76294 | train loss 3.324655 | norm 0.1444 | lr 1.04e-03 | (9897.93 ms | 52969 tok/s) step 5482/76294 | train loss 3.315363 | norm 0.1496 | lr 1.04e-03 | (9888.49 ms | 53020 tok/s) step 5483/76294 | train loss 3.318470 | norm 0.1474 | lr 1.04e-03 | (9895.97 ms | 52980 tok/s) step 5484/76294 | train loss 3.382674 | norm 0.1333 | lr 1.04e-03 | (9885.04 ms | 53039 tok/s) step 5485/76294 | train loss 3.369525 | norm 0.1351 | lr 1.04e-03 | (9896.25 ms | 52978 tok/s) step 5486/76294 | train loss 3.307232 | norm 0.1371 | lr 1.04e-03 | (10533.08 ms | 49775 tok/s) step 5487/76294 | train loss 3.318590 | norm 0.1310 | lr 1.04e-03 | (9901.21 ms | 52952 tok/s) step 5488/76294 | train loss 3.342007 | norm 0.1479 | lr 1.04e-03 | (9937.00 ms | 52761 tok/s) step 5489/76294 | train loss 3.318837 | norm 0.1453 | lr 1.04e-03 | (9906.05 ms | 52926 tok/s) step 5490/76294 | train loss 3.409994 | norm 0.1735 | lr 1.04e-03 | (9882.30 ms | 53053 tok/s) step 5491/76294 | train loss 3.366808 | norm 0.1827 | lr 1.04e-03 | (9887.02 ms | 53028 tok/s) step 5492/76294 | train loss 3.398192 | norm 0.1447 | lr 1.04e-03 | (9890.81 ms | 53008 tok/s) step 5493/76294 | train loss 3.433428 | norm 0.1436 | lr 1.04e-03 | (9929.30 ms | 52802 tok/s) step 5494/76294 | train loss 3.368439 | norm 0.1820 | lr 1.04e-03 | (9886.77 ms | 53029 tok/s) step 5495/76294 | train loss 3.308534 | norm 0.1524 | lr 1.04e-03 | (11769.63 ms | 44546 tok/s) step 5496/76294 | train loss 3.322009 | norm 0.1509 | lr 1.04e-03 | (9877.27 ms | 53080 tok/s) step 5497/76294 | train loss 3.502147 | norm 
0.1397 | lr 1.04e-03 | (9919.33 ms | 52855 tok/s) step 5498/76294 | train loss 3.407089 | norm 0.1456 | lr 1.04e-03 | (9879.07 ms | 53071 tok/s) step 5499/76294 | train loss 3.341347 | norm 0.1515 | lr 1.04e-03 | (9880.01 ms | 53066 tok/s) step 5500/76294 | train loss 3.388801 | norm 0.1549 | lr 1.04e-03 | (9894.37 ms | 52988 tok/s) val loss: 3.326933 saving model checkpoint to ./results/gpt2-350M-gqa/step_5500.pth step 5501/76294 | train loss 3.314378 | norm 0.1474 | lr 1.04e-03 | (9978.93 ms | 52540 tok/s) step 5502/76294 | train loss 3.336818 | norm 0.1580 | lr 1.04e-03 | (9864.65 ms | 53148 tok/s) step 5503/76294 | train loss 3.293519 | norm 0.1338 | lr 1.04e-03 | (9871.06 ms | 53114 tok/s) step 5504/76294 | train loss 3.337407 | norm 0.1368 | lr 1.04e-03 | (9926.43 ms | 52817 tok/s) step 5505/76294 | train loss 3.313890 | norm 0.1357 | lr 1.04e-03 | (9879.83 ms | 53067 tok/s) step 5506/76294 | train loss 3.312490 | norm 0.1439 | lr 1.04e-03 | (9887.66 ms | 53024 tok/s) step 5507/76294 | train loss 3.371424 | norm 0.1411 | lr 1.04e-03 | (9890.32 ms | 53010 tok/s) step 5508/76294 | train loss 3.394278 | norm 0.1380 | lr 1.04e-03 | (9897.03 ms | 52974 tok/s) step 5509/76294 | train loss 3.345308 | norm 0.1272 | lr 1.04e-03 | (9898.28 ms | 52968 tok/s) step 5510/76294 | train loss 3.303860 | norm 0.1351 | lr 1.04e-03 | (9928.13 ms | 52808 tok/s) step 5511/76294 | train loss 3.396362 | norm 0.1444 | lr 1.04e-03 | (9901.11 ms | 52952 tok/s) step 5512/76294 | train loss 3.332907 | norm 0.1448 | lr 1.04e-03 | (9903.55 ms | 52939 tok/s) step 5513/76294 | train loss 3.344181 | norm 0.1396 | lr 1.04e-03 | (9938.71 ms | 52752 tok/s) step 5514/76294 | train loss 3.423599 | norm 0.1338 | lr 1.04e-03 | (9895.52 ms | 52982 tok/s) step 5515/76294 | train loss 3.341913 | norm 0.1465 | lr 1.04e-03 | (9934.29 ms | 52776 tok/s) step 5516/76294 | train loss 3.328393 | norm 0.1306 | lr 1.04e-03 | (9897.90 ms | 52970 tok/s) step 5517/76294 | train loss 3.289533 | norm 0.1634 | lr 
1.04e-03 | (9905.87 ms | 52927 tok/s) step 5518/76294 | train loss 3.381379 | norm 0.1615 | lr 1.04e-03 | (9891.49 ms | 53004 tok/s) step 5519/76294 | train loss 3.344435 | norm 0.1417 | lr 1.04e-03 | (9897.69 ms | 52971 tok/s) step 5520/76294 | train loss 3.412047 | norm 0.1596 | lr 1.04e-03 | (9893.79 ms | 52992 tok/s) step 5521/76294 | train loss 3.279017 | norm 0.1361 | lr 1.04e-03 | (9898.97 ms | 52964 tok/s) step 5522/76294 | train loss 3.378900 | norm 0.1451 | lr 1.04e-03 | (9893.80 ms | 52992 tok/s) step 5523/76294 | train loss 3.315518 | norm 0.1558 | lr 1.04e-03 | (9934.31 ms | 52776 tok/s) step 5524/76294 | train loss 3.306223 | norm 0.1629 | lr 1.04e-03 | (9890.80 ms | 53008 tok/s) step 5525/76294 | train loss 3.312805 | norm 0.1325 | lr 1.04e-03 | (9899.24 ms | 52962 tok/s) step 5526/76294 | train loss 3.376676 | norm 0.1555 | lr 1.04e-03 | (9888.86 ms | 53018 tok/s) step 5527/76294 | train loss 3.359776 | norm 0.1420 | lr 1.04e-03 | (9899.82 ms | 52959 tok/s) step 5528/76294 | train loss 3.367469 | norm 0.1435 | lr 1.04e-03 | (9893.16 ms | 52995 tok/s) step 5529/76294 | train loss 3.356519 | norm 0.1656 | lr 1.04e-03 | (9895.03 ms | 52985 tok/s) step 5530/76294 | train loss 3.304898 | norm 0.1593 | lr 1.04e-03 | (9890.91 ms | 53007 tok/s) step 5531/76294 | train loss 3.396586 | norm 0.1455 | lr 1.04e-03 | (9896.17 ms | 52979 tok/s) step 5532/76294 | train loss 3.402657 | norm 0.1547 | lr 1.04e-03 | (9887.65 ms | 53025 tok/s) step 5533/76294 | train loss 3.324421 | norm 0.1583 | lr 1.04e-03 | (9899.58 ms | 52961 tok/s) step 5534/76294 | train loss 3.353500 | norm 0.1338 | lr 1.03e-03 | (9906.52 ms | 52924 tok/s) step 5535/76294 | train loss 3.302685 | norm 0.1452 | lr 1.03e-03 | (9894.14 ms | 52990 tok/s) step 5536/76294 | train loss 3.436410 | norm 0.1636 | lr 1.03e-03 | (9890.19 ms | 53011 tok/s) step 5537/76294 | train loss 3.357110 | norm 0.1590 | lr 1.03e-03 | (9959.90 ms | 52640 tok/s) step 5538/76294 | train loss 3.367017 | norm 0.2832 | lr 
1.03e-03 | (9880.01 ms | 53066 tok/s) step 5539/76294 | train loss 3.334328 | norm 0.1935 | lr 1.03e-03 | (9888.54 ms | 53020 tok/s) step 5540/76294 | train loss 3.328618 | norm 0.1675 | lr 1.03e-03 | (9953.89 ms | 52672 tok/s) step 5541/76294 | train loss 3.392903 | norm 0.1869 | lr 1.03e-03 | (9894.04 ms | 52990 tok/s) step 5542/76294 | train loss 3.369728 | norm 0.1781 | lr 1.03e-03 | (9899.16 ms | 52963 tok/s) step 5543/76294 | train loss 3.344214 | norm 0.1485 | lr 1.03e-03 | (9929.25 ms | 52802 tok/s) step 5544/76294 | train loss 3.353106 | norm 0.1504 | lr 1.03e-03 | (9886.95 ms | 53028 tok/s) step 5545/76294 | train loss 3.346665 | norm 0.1330 | lr 1.03e-03 | (9894.21 ms | 52989 tok/s) step 5546/76294 | train loss 3.312386 | norm 0.1461 | lr 1.03e-03 | (9917.49 ms | 52865 tok/s) step 5547/76294 | train loss 3.322787 | norm 0.1367 | lr 1.03e-03 | (9896.34 ms | 52978 tok/s) step 5548/76294 | train loss 3.385593 | norm 0.1290 | lr 1.03e-03 | (9897.47 ms | 52972 tok/s) step 5549/76294 | train loss 3.319882 | norm 0.1344 | lr 1.03e-03 | (9888.49 ms | 53020 tok/s) step 5550/76294 | train loss 3.506549 | norm 0.1297 | lr 1.03e-03 | (9952.47 ms | 52679 tok/s) step 5551/76294 | train loss 3.317697 | norm 0.1577 | lr 1.03e-03 | (9933.89 ms | 52778 tok/s) step 5552/76294 | train loss 3.328344 | norm 0.1285 | lr 1.03e-03 | (9898.39 ms | 52967 tok/s) step 5553/76294 | train loss 3.322592 | norm 0.1477 | lr 1.03e-03 | (9894.44 ms | 52988 tok/s) step 5554/76294 | train loss 3.364914 | norm 0.1361 | lr 1.03e-03 | (9897.03 ms | 52974 tok/s) step 5555/76294 | train loss 3.312334 | norm 0.1605 | lr 1.03e-03 | (9893.29 ms | 52994 tok/s) step 5556/76294 | train loss 3.335726 | norm 0.1602 | lr 1.03e-03 | (9895.77 ms | 52981 tok/s) step 5557/76294 | train loss 3.378469 | norm 0.1521 | lr 1.03e-03 | (9891.54 ms | 53004 tok/s) step 5558/76294 | train loss 3.324363 | norm 0.1323 | lr 1.03e-03 | (9896.89 ms | 52975 tok/s) step 5559/76294 | train loss 3.360945 | norm 0.1464 | lr 
1.03e-03 | (9948.43 ms | 52701 tok/s) step 5560/76294 | train loss 3.299930 | norm 0.1444 | lr 1.03e-03 | (9895.56 ms | 52982 tok/s) step 5561/76294 | train loss 3.416568 | norm 0.1259 | lr 1.03e-03 | (9913.41 ms | 52887 tok/s) step 5562/76294 | train loss 3.343927 | norm 0.1393 | lr 1.03e-03 | (9894.36 ms | 52989 tok/s) step 5563/76294 | train loss 3.349919 | norm 0.1261 | lr 1.03e-03 | (9902.40 ms | 52946 tok/s) step 5564/76294 | train loss 3.373356 | norm 0.1236 | lr 1.03e-03 | (11117.23 ms | 47160 tok/s) step 5565/76294 | train loss 3.326662 | norm 0.1223 | lr 1.03e-03 | (9886.27 ms | 53032 tok/s) step 5566/76294 | train loss 3.413040 | norm 0.1497 | lr 1.03e-03 | (9891.00 ms | 53007 tok/s) step 5567/76294 | train loss 3.382411 | norm 0.1489 | lr 1.03e-03 | (9892.56 ms | 52998 tok/s) step 5568/76294 | train loss 3.345556 | norm 0.1250 | lr 1.03e-03 | (9892.92 ms | 52996 tok/s) step 5569/76294 | train loss 3.354496 | norm 0.1491 | lr 1.03e-03 | (9924.36 ms | 52828 tok/s) step 5570/76294 | train loss 3.326488 | norm 0.1281 | lr 1.03e-03 | (9910.33 ms | 52903 tok/s) step 5571/76294 | train loss 3.311555 | norm 0.1489 | lr 1.03e-03 | (9896.41 ms | 52978 tok/s) step 5572/76294 | train loss 3.301347 | norm 0.1460 | lr 1.03e-03 | (9898.33 ms | 52967 tok/s) step 5573/76294 | train loss 3.328336 | norm 0.1362 | lr 1.03e-03 | (9896.56 ms | 52977 tok/s) step 5574/76294 | train loss 3.301261 | norm 0.1550 | lr 1.03e-03 | (9896.33 ms | 52978 tok/s) step 5575/76294 | train loss 3.370480 | norm 0.1206 | lr 1.03e-03 | (9935.21 ms | 52771 tok/s) step 5576/76294 | train loss 3.389618 | norm 0.1465 | lr 1.03e-03 | (9892.14 ms | 53000 tok/s) step 5577/76294 | train loss 3.371364 | norm 0.1474 | lr 1.03e-03 | (9912.39 ms | 52892 tok/s) step 5578/76294 | train loss 3.361406 | norm 0.1657 | lr 1.03e-03 | (9894.40 ms | 52988 tok/s) step 5579/76294 | train loss 3.380992 | norm 0.1499 | lr 1.03e-03 | (9896.39 ms | 52978 tok/s) step 5580/76294 | train loss 3.308177 | norm 0.1512 | lr 
1.03e-03 | (9889.95 ms | 53012 tok/s) step 5581/76294 | train loss 3.398704 | norm 0.1814 | lr 1.03e-03 | (9901.58 ms | 52950 tok/s) step 5582/76294 | train loss 3.346568 | norm 0.1451 | lr 1.03e-03 | (9894.00 ms | 52990 tok/s) step 5583/76294 | train loss 3.495013 | norm 0.1606 | lr 1.03e-03 | (9901.93 ms | 52948 tok/s) step 5584/76294 | train loss 3.381114 | norm 0.1539 | lr 1.03e-03 | (9891.60 ms | 53003 tok/s) step 5585/76294 | train loss 3.392530 | norm 0.1751 | lr 1.03e-03 | (9928.69 ms | 52805 tok/s) step 5586/76294 | train loss 3.307955 | norm 0.2189 | lr 1.03e-03 | (9887.04 ms | 53028 tok/s) step 5587/76294 | train loss 3.381736 | norm 0.1341 | lr 1.03e-03 | (9898.06 ms | 52969 tok/s) step 5588/76294 | train loss 3.309836 | norm 0.1684 | lr 1.03e-03 | (9889.20 ms | 53016 tok/s) step 5589/76294 | train loss 3.300123 | norm 0.1444 | lr 1.03e-03 | (9920.82 ms | 52847 tok/s) step 5590/76294 | train loss 3.324924 | norm 0.1448 | lr 1.03e-03 | (9896.94 ms | 52975 tok/s) step 5591/76294 | train loss 3.360038 | norm 0.1636 | lr 1.03e-03 | (9889.38 ms | 53015 tok/s) step 5592/76294 | train loss 3.451524 | norm 0.1410 | lr 1.03e-03 | (9884.39 ms | 53042 tok/s) step 5593/76294 | train loss 3.637339 | norm 0.1927 | lr 1.03e-03 | (9905.77 ms | 52928 tok/s) step 5594/76294 | train loss 3.360796 | norm 0.1535 | lr 1.03e-03 | (9896.91 ms | 52975 tok/s) step 5595/76294 | train loss 3.317356 | norm 0.1640 | lr 1.03e-03 | (9895.77 ms | 52981 tok/s) step 5596/76294 | train loss 3.336946 | norm 0.1463 | lr 1.03e-03 | (9892.00 ms | 53001 tok/s) step 5597/76294 | train loss 3.334536 | norm 0.1654 | lr 1.03e-03 | (9923.43 ms | 52833 tok/s) step 5598/76294 | train loss 3.330163 | norm 0.1293 | lr 1.03e-03 | (9914.93 ms | 52879 tok/s) step 5599/76294 | train loss 3.417974 | norm 0.1545 | lr 1.03e-03 | (9889.52 ms | 53015 tok/s) step 5600/76294 | train loss 3.355595 | norm 0.1597 | lr 1.03e-03 | (9892.52 ms | 52998 tok/s) step 5601/76294 | train loss 3.343336 | norm 0.1513 | lr 
1.03e-03 | (9897.84 ms | 52970 tok/s) step 5602/76294 | train loss 3.309923 | norm 0.1580 | lr 1.03e-03 | (9924.81 ms | 52826 tok/s) step 5603/76294 | train loss 3.332714 | norm 0.1518 | lr 1.03e-03 | (9894.41 ms | 52988 tok/s) step 5604/76294 | train loss 3.350296 | norm 0.1546 | lr 1.03e-03 | (9896.83 ms | 52975 tok/s) step 5605/76294 | train loss 3.328570 | norm 0.1470 | lr 1.03e-03 | (9890.54 ms | 53009 tok/s) step 5606/76294 | train loss 3.296029 | norm 0.1404 | lr 1.03e-03 | (9896.03 ms | 52980 tok/s) step 5607/76294 | train loss 3.286503 | norm 0.1373 | lr 1.03e-03 | (9918.22 ms | 52861 tok/s) step 5608/76294 | train loss 3.383382 | norm 0.1500 | lr 1.03e-03 | (9893.65 ms | 52992 tok/s) step 5609/76294 | train loss 3.296166 | norm 0.1370 | lr 1.03e-03 | (9896.46 ms | 52977 tok/s) step 5610/76294 | train loss 3.308590 | norm 0.1565 | lr 1.03e-03 | (9888.36 ms | 53021 tok/s) step 5611/76294 | train loss 3.348429 | norm 0.1534 | lr 1.03e-03 | (9924.36 ms | 52828 tok/s) step 5612/76294 | train loss 3.297701 | norm 0.1462 | lr 1.03e-03 | (9924.44 ms | 52828 tok/s) step 5613/76294 | train loss 3.367748 | norm 0.1537 | lr 1.03e-03 | (9890.47 ms | 53009 tok/s) step 5614/76294 | train loss 3.351881 | norm 0.1515 | lr 1.03e-03 | (9896.71 ms | 52976 tok/s) step 5615/76294 | train loss 3.349380 | norm 0.1911 | lr 1.03e-03 | (9896.63 ms | 52976 tok/s) step 5616/76294 | train loss 3.213065 | norm 0.1519 | lr 1.03e-03 | (9891.87 ms | 53002 tok/s) step 5617/76294 | train loss 3.390469 | norm 0.1606 | lr 1.03e-03 | (9892.05 ms | 53001 tok/s) step 5618/76294 | train loss 3.363322 | norm 0.1694 | lr 1.03e-03 | (9894.74 ms | 52987 tok/s) step 5619/76294 | train loss 3.337439 | norm 0.1363 | lr 1.03e-03 | (9892.81 ms | 52997 tok/s) step 5620/76294 | train loss 3.306822 | norm 0.2385 | lr 1.03e-03 | (9898.92 ms | 52964 tok/s) step 5621/76294 | train loss 3.361982 | norm 0.1289 | lr 1.03e-03 | (9933.00 ms | 52782 tok/s) step 5622/76294 | train loss 3.349135 | norm 0.1620 | lr 
1.03e-03 | (9897.59 ms | 52971 tok/s) step 5623/76294 | train loss 3.352873 | norm 0.1248 | lr 1.03e-03 | (9892.93 ms | 52996 tok/s) step 5624/76294 | train loss 3.365282 | norm 0.1446 | lr 1.03e-03 | (9916.06 ms | 52873 tok/s) step 5625/76294 | train loss 3.350995 | norm 0.1400 | lr 1.03e-03 | (9911.23 ms | 52898 tok/s) step 5626/76294 | train loss 3.321471 | norm 0.1380 | lr 1.03e-03 | (9898.28 ms | 52968 tok/s) step 5627/76294 | train loss 3.327135 | norm 0.1492 | lr 1.03e-03 | (9888.36 ms | 53021 tok/s) step 5628/76294 | train loss 3.328268 | norm 0.1262 | lr 1.03e-03 | (9888.02 ms | 53023 tok/s) step 5629/76294 | train loss 3.360277 | norm 0.1443 | lr 1.03e-03 | (9898.80 ms | 52965 tok/s) step 5630/76294 | train loss 3.396587 | norm 0.1320 | lr 1.03e-03 | (9939.67 ms | 52747 tok/s) step 5631/76294 | train loss 3.366000 | norm 0.1513 | lr 1.03e-03 | (9888.60 ms | 53019 tok/s) step 5632/76294 | train loss 3.451830 | norm 0.1360 | lr 1.03e-03 | (9895.44 ms | 52983 tok/s) step 5633/76294 | train loss 3.286074 | norm 0.1841 | lr 1.03e-03 | (9892.61 ms | 52998 tok/s) step 5634/76294 | train loss 3.375938 | norm 0.1351 | lr 1.03e-03 | (9932.65 ms | 52784 tok/s) step 5635/76294 | train loss 3.334524 | norm 0.1598 | lr 1.03e-03 | (9901.34 ms | 52951 tok/s) step 5636/76294 | train loss 3.339148 | norm 0.1368 | lr 1.03e-03 | (9887.58 ms | 53025 tok/s) step 5637/76294 | train loss 3.304517 | norm 0.1493 | lr 1.03e-03 | (10623.51 ms | 49352 tok/s) step 5638/76294 | train loss 3.355134 | norm 0.1611 | lr 1.03e-03 | (9880.00 ms | 53066 tok/s) step 5639/76294 | train loss 3.341687 | norm 0.1312 | lr 1.03e-03 | (9894.76 ms | 52986 tok/s) step 5640/76294 | train loss 3.342736 | norm 0.1542 | lr 1.03e-03 | (9885.16 ms | 53038 tok/s) step 5641/76294 | train loss 3.322020 | norm 0.1265 | lr 1.03e-03 | (9893.56 ms | 52993 tok/s) step 5642/76294 | train loss 3.321279 | norm 0.1528 | lr 1.03e-03 | (9922.34 ms | 52839 tok/s) step 5643/76294 | train loss 3.316913 | norm 0.1287 | lr 
1.03e-03 | (9937.17 ms | 52760 tok/s) step 5644/76294 | train loss 3.342960 | norm 0.1552 | lr 1.03e-03 | (9890.52 ms | 53009 tok/s) step 5645/76294 | train loss 3.328788 | norm 0.1315 | lr 1.03e-03 | (9894.95 ms | 52985 tok/s) step 5646/76294 | train loss 3.321488 | norm 0.1460 | lr 1.03e-03 | (9890.55 ms | 53009 tok/s) step 5647/76294 | train loss 3.441768 | norm 0.1656 | lr 1.03e-03 | (9888.78 ms | 53018 tok/s) step 5648/76294 | train loss 3.278561 | norm 0.1577 | lr 1.03e-03 | (9895.94 ms | 52980 tok/s) step 5649/76294 | train loss 3.313441 | norm 0.1757 | lr 1.03e-03 | (9889.62 ms | 53014 tok/s) step 5650/76294 | train loss 3.392022 | norm 0.1410 | lr 1.03e-03 | (9896.15 ms | 52979 tok/s) step 5651/76294 | train loss 3.304801 | norm 0.1878 | lr 1.03e-03 | (9890.50 ms | 53009 tok/s) step 5652/76294 | train loss 3.240007 | norm 0.1293 | lr 1.03e-03 | (9892.76 ms | 52997 tok/s) step 5653/76294 | train loss 3.321224 | norm 0.1597 | lr 1.03e-03 | (9887.82 ms | 53024 tok/s) step 5654/76294 | train loss 3.238011 | norm 0.1490 | lr 1.03e-03 | (9904.60 ms | 52934 tok/s) step 5655/76294 | train loss 3.276729 | norm 0.1343 | lr 1.03e-03 | (9885.71 ms | 53035 tok/s) step 5656/76294 | train loss 3.392489 | norm 0.1501 | lr 1.03e-03 | (9891.91 ms | 53002 tok/s) step 5657/76294 | train loss 3.358816 | norm 0.1534 | lr 1.03e-03 | (9884.00 ms | 53044 tok/s) step 5658/76294 | train loss 3.350694 | norm 0.1458 | lr 1.03e-03 | (9923.42 ms | 52833 tok/s) step 5659/76294 | train loss 3.313722 | norm 0.1683 | lr 1.03e-03 | (9886.66 ms | 53030 tok/s) step 5660/76294 | train loss 3.273759 | norm 0.1397 | lr 1.03e-03 | (9926.94 ms | 52815 tok/s) step 5661/76294 | train loss 3.280979 | norm 0.1502 | lr 1.03e-03 | (10775.78 ms | 48654 tok/s) step 5662/76294 | train loss 3.336674 | norm 0.1628 | lr 1.03e-03 | (9892.54 ms | 52998 tok/s) step 5663/76294 | train loss 3.288578 | norm 0.1370 | lr 1.03e-03 | (9887.77 ms | 53024 tok/s) step 5664/76294 | train loss 3.300940 | norm 0.1421 | lr 
1.03e-03 | (9880.23 ms | 53064 tok/s) step 5665/76294 | train loss 3.396219 | norm 0.1239 | lr 1.03e-03 | (9889.66 ms | 53014 tok/s) step 5666/76294 | train loss 3.299148 | norm 0.1484 | lr 1.03e-03 | (9947.92 ms | 52703 tok/s) step 5667/76294 | train loss 3.297337 | norm 0.1229 | lr 1.03e-03 | (9891.69 ms | 53003 tok/s) step 5668/76294 | train loss 3.343535 | norm 0.1371 | lr 1.03e-03 | (9882.61 ms | 53052 tok/s) step 5669/76294 | train loss 3.272804 | norm 0.1258 | lr 1.03e-03 | (9887.69 ms | 53024 tok/s) step 5670/76294 | train loss 3.288341 | norm 0.1468 | lr 1.03e-03 | (9888.84 ms | 53018 tok/s) step 5671/76294 | train loss 3.366022 | norm 0.1293 | lr 1.03e-03 | (9934.35 ms | 52775 tok/s) step 5672/76294 | train loss 3.293490 | norm 0.1432 | lr 1.03e-03 | (9884.21 ms | 53043 tok/s) step 5673/76294 | train loss 3.297355 | norm 0.1435 | lr 1.03e-03 | (9894.51 ms | 52988 tok/s) step 5674/76294 | train loss 3.406120 | norm 0.1410 | lr 1.03e-03 | (9889.12 ms | 53017 tok/s) step 5675/76294 | train loss 3.229010 | norm 0.1579 | lr 1.03e-03 | (9902.50 ms | 52945 tok/s) step 5676/76294 | train loss 3.282759 | norm 0.1458 | lr 1.03e-03 | (9956.99 ms | 52655 tok/s) step 5677/76294 | train loss 3.356940 | norm 0.1497 | lr 1.03e-03 | (9900.57 ms | 52955 tok/s) step 5678/76294 | train loss 3.334003 | norm 0.1699 | lr 1.03e-03 | (9902.94 ms | 52943 tok/s) step 5679/76294 | train loss 3.401237 | norm 0.1794 | lr 1.03e-03 | (9905.03 ms | 52931 tok/s) step 5680/76294 | train loss 3.392018 | norm 0.1760 | lr 1.03e-03 | (9888.16 ms | 53022 tok/s) step 5681/76294 | train loss 3.316355 | norm 0.1390 | lr 1.03e-03 | (9898.42 ms | 52967 tok/s) step 5682/76294 | train loss 3.289234 | norm 0.1400 | lr 1.03e-03 | (9898.44 ms | 52967 tok/s) step 5683/76294 | train loss 3.331379 | norm 0.1559 | lr 1.03e-03 | (9899.21 ms | 52963 tok/s) step 5684/76294 | train loss 3.311899 | norm 0.1626 | lr 1.03e-03 | (9891.46 ms | 53004 tok/s) step 5685/76294 | train loss 3.331551 | norm 0.1471 | lr 
1.03e-03 | (9896.84 ms | 52975 tok/s) step 5686/76294 | train loss 3.349561 | norm 0.1670 | lr 1.03e-03 | (9953.54 ms | 52674 tok/s) step 5687/76294 | train loss 3.284879 | norm 0.1390 | lr 1.02e-03 | (9895.24 ms | 52984 tok/s) step 5688/76294 | train loss 3.324229 | norm 0.1599 | lr 1.02e-03 | (9891.96 ms | 53001 tok/s) step 5689/76294 | train loss 3.348549 | norm 0.1343 | lr 1.02e-03 | (9933.43 ms | 52780 tok/s) step 5690/76294 | train loss 3.269811 | norm 0.1547 | lr 1.02e-03 | (9889.16 ms | 53016 tok/s) step 5691/76294 | train loss 3.393032 | norm 0.1436 | lr 1.02e-03 | (9896.24 ms | 52979 tok/s) step 5692/76294 | train loss 3.342775 | norm 0.1460 | lr 1.02e-03 | (9892.27 ms | 53000 tok/s) step 5693/76294 | train loss 3.296675 | norm 0.1358 | lr 1.02e-03 | (9897.69 ms | 52971 tok/s) step 5694/76294 | train loss 3.260568 | norm 0.1595 | lr 1.02e-03 | (9894.26 ms | 52989 tok/s) step 5695/76294 | train loss 3.338408 | norm 0.1468 | lr 1.02e-03 | (9896.26 ms | 52978 tok/s) step 5696/76294 | train loss 3.279212 | norm 0.1493 | lr 1.02e-03 | (9886.26 ms | 53032 tok/s) step 5697/76294 | train loss 3.329777 | norm 0.1801 | lr 1.02e-03 | (9953.12 ms | 52676 tok/s) step 5698/76294 | train loss 3.348791 | norm 0.1280 | lr 1.02e-03 | (9894.36 ms | 52989 tok/s) step 5699/76294 | train loss 3.247001 | norm 0.1822 | lr 1.02e-03 | (9900.74 ms | 52954 tok/s) step 5700/76294 | train loss 3.361890 | norm 0.1439 | lr 1.02e-03 | (9888.50 ms | 53020 tok/s) step 5701/76294 | train loss 3.366616 | norm 0.1698 | lr 1.02e-03 | (9900.42 ms | 52956 tok/s) step 5702/76294 | train loss 3.315574 | norm 0.1930 | lr 1.02e-03 | (9890.03 ms | 53012 tok/s) step 5703/76294 | train loss 3.300444 | norm 0.1361 | lr 1.02e-03 | (9898.33 ms | 52967 tok/s) step 5704/76294 | train loss 3.303174 | norm 0.1475 | lr 1.02e-03 | (9890.63 ms | 53009 tok/s) step 5705/76294 | train loss 3.325454 | norm 0.1461 | lr 1.02e-03 | (9897.78 ms | 52970 tok/s) step 5706/76294 | train loss 3.350050 | norm 0.1381 | lr 
1.02e-03 | (9892.30 ms | 53000 tok/s) step 5707/76294 | train loss 3.296718 | norm 0.1517 | lr 1.02e-03 | (9902.02 ms | 52948 tok/s) step 5708/76294 | train loss 3.298208 | norm 0.1350 | lr 1.02e-03 | (9885.55 ms | 53036 tok/s) step 5709/76294 | train loss 3.349020 | norm 0.1609 | lr 1.02e-03 | (9897.00 ms | 52974 tok/s) step 5710/76294 | train loss 3.390394 | norm 0.1487 | lr 1.02e-03 | (9957.63 ms | 52652 tok/s) step 5711/76294 | train loss 3.284757 | norm 0.1563 | lr 1.02e-03 | (9895.10 ms | 52985 tok/s) step 5712/76294 | train loss 3.512478 | norm 0.1793 | lr 1.02e-03 | (9947.12 ms | 52708 tok/s) step 5713/76294 | train loss 3.314072 | norm 0.1776 | lr 1.02e-03 | (9889.16 ms | 53016 tok/s) step 5714/76294 | train loss 3.284350 | norm 0.1623 | lr 1.02e-03 | (9955.32 ms | 52664 tok/s) step 5715/76294 | train loss 3.370505 | norm 0.1430 | lr 1.02e-03 | (9939.94 ms | 52746 tok/s) step 5716/76294 | train loss 3.325602 | norm 0.1525 | lr 1.02e-03 | (9972.79 ms | 52572 tok/s) step 5717/76294 | train loss 3.299235 | norm 0.1433 | lr 1.02e-03 | (9888.49 ms | 53020 tok/s) step 5718/76294 | train loss 3.399324 | norm 0.1534 | lr 1.02e-03 | (9928.27 ms | 52808 tok/s) step 5719/76294 | train loss 3.298228 | norm 0.1307 | lr 1.02e-03 | (9896.10 ms | 52979 tok/s) step 5720/76294 | train loss 3.328322 | norm 0.1403 | lr 1.02e-03 | (9887.80 ms | 53024 tok/s) step 5721/76294 | train loss 3.349557 | norm 0.1367 | lr 1.02e-03 | (9893.34 ms | 52994 tok/s) step 5722/76294 | train loss 3.305840 | norm 0.1458 | lr 1.02e-03 | (9886.06 ms | 53033 tok/s) step 5723/76294 | train loss 3.294593 | norm 0.1500 | lr 1.02e-03 | (9895.01 ms | 52985 tok/s) step 5724/76294 | train loss 3.339581 | norm 0.1483 | lr 1.02e-03 | (9891.16 ms | 53006 tok/s) step 5725/76294 | train loss 3.517938 | norm 0.1515 | lr 1.02e-03 | (9893.89 ms | 52991 tok/s) step 5726/76294 | train loss 3.392450 | norm 0.1504 | lr 1.02e-03 | (9943.20 ms | 52728 tok/s) step 5727/76294 | train loss 3.334173 | norm 0.1428 | lr 
1.02e-03 | (9895.75 ms | 52981 tok/s) step 5728/76294 | train loss 3.322498 | norm 0.1416 | lr 1.02e-03 | (9885.42 ms | 53037 tok/s) step 5729/76294 | train loss 3.294782 | norm 0.1365 | lr 1.02e-03 | (9896.74 ms | 52976 tok/s) step 5730/76294 | train loss 3.452682 | norm 0.1648 | lr 1.02e-03 | (9892.17 ms | 53000 tok/s) step 5731/76294 | train loss 3.391013 | norm 0.1659 | lr 1.02e-03 | (9897.17 ms | 52974 tok/s) step 5732/76294 | train loss 3.289478 | norm 0.1687 | lr 1.02e-03 | (9892.15 ms | 53000 tok/s) step 5733/76294 | train loss 3.344552 | norm 0.1411 | lr 1.02e-03 | (9931.35 ms | 52791 tok/s) step 5734/76294 | train loss 3.328247 | norm 0.1658 | lr 1.02e-03 | (9889.78 ms | 53013 tok/s) step 5735/76294 | train loss 3.317162 | norm 0.1440 | lr 1.02e-03 | (9903.83 ms | 52938 tok/s) step 5736/76294 | train loss 3.360759 | norm 0.1758 | lr 1.02e-03 | (9906.54 ms | 52923 tok/s) step 5737/76294 | train loss 3.311771 | norm 0.1475 | lr 1.02e-03 | (9895.59 ms | 52982 tok/s) step 5738/76294 | train loss 3.358614 | norm 0.1899 | lr 1.02e-03 | (9934.71 ms | 52773 tok/s) step 5739/76294 | train loss 3.361465 | norm 0.1633 | lr 1.02e-03 | (9892.29 ms | 53000 tok/s) step 5740/76294 | train loss 3.325269 | norm 0.1801 | lr 1.02e-03 | (9894.50 ms | 52988 tok/s) step 5741/76294 | train loss 3.255417 | norm 0.1455 | lr 1.02e-03 | (9895.01 ms | 52985 tok/s) step 5742/76294 | train loss 3.267416 | norm 0.1782 | lr 1.02e-03 | (9894.17 ms | 52990 tok/s) step 5743/76294 | train loss 3.309674 | norm 0.1598 | lr 1.02e-03 | (9891.10 ms | 53006 tok/s) step 5744/76294 | train loss 3.277844 | norm 0.1611 | lr 1.02e-03 | (9897.73 ms | 52971 tok/s) step 5745/76294 | train loss 3.341541 | norm 0.1591 | lr 1.02e-03 | (9891.30 ms | 53005 tok/s) step 5746/76294 | train loss 3.292212 | norm 0.1552 | lr 1.02e-03 | (9891.84 ms | 53002 tok/s) step 5747/76294 | train loss 3.379518 | norm 0.1491 | lr 1.02e-03 | (9912.25 ms | 52893 tok/s) step 5748/76294 | train loss 3.334770 | norm 0.1428 | lr 
1.02e-03 | (9921.49 ms | 52844 tok/s) step 5749/76294 | train loss 3.349418 | norm 0.1462 | lr 1.02e-03 | (9921.52 ms | 52844 tok/s) step 5750/76294 | train loss 3.291800 | norm 0.1567 | lr 1.02e-03 | (9925.03 ms | 52825 tok/s) val loss: 3.322660 saving model checkpoint to ./results/gpt2-350M-gqa/step_5750.pth step 5751/76294 | train loss 3.320203 | norm 0.1537 | lr 1.02e-03 | (10005.56 ms | 52400 tok/s) step 5752/76294 | train loss 3.295680 | norm 0.1363 | lr 1.02e-03 | (9875.29 ms | 53091 tok/s) step 5753/76294 | train loss 3.303054 | norm 0.1655 | lr 1.02e-03 | (9877.91 ms | 53077 tok/s) step 5754/76294 | train loss 3.396149 | norm 0.1518 | lr 1.02e-03 | (9886.29 ms | 53032 tok/s) step 5755/76294 | train loss 3.322060 | norm 0.1639 | lr 1.02e-03 | (9901.14 ms | 52952 tok/s) step 5756/76294 | train loss 3.298577 | norm 0.1362 | lr 1.02e-03 | (9883.28 ms | 53048 tok/s) step 5757/76294 | train loss 3.304139 | norm 0.1295 | lr 1.02e-03 | (9912.26 ms | 52893 tok/s) step 5758/76294 | train loss 3.308244 | norm 0.1490 | lr 1.02e-03 | (9943.46 ms | 52727 tok/s) step 5759/76294 | train loss 3.273407 | norm 0.1359 | lr 1.02e-03 | (11110.14 ms | 47190 tok/s) step 5760/76294 | train loss 3.383774 | norm 0.1398 | lr 1.02e-03 | (9904.16 ms | 52936 tok/s) step 5761/76294 | train loss 3.471736 | norm 0.1404 | lr 1.02e-03 | (9890.22 ms | 53011 tok/s) step 5762/76294 | train loss 3.290997 | norm 0.1219 | lr 1.02e-03 | (9890.23 ms | 53011 tok/s) step 5763/76294 | train loss 3.348493 | norm 0.1358 | lr 1.02e-03 | (9909.31 ms | 52909 tok/s) step 5764/76294 | train loss 3.338894 | norm 0.1437 | lr 1.02e-03 | (9934.84 ms | 52773 tok/s) step 5765/76294 | train loss 3.326749 | norm 0.1327 | lr 1.02e-03 | (9904.93 ms | 52932 tok/s) step 5766/76294 | train loss 3.360988 | norm 0.1388 | lr 1.02e-03 | (9905.20 ms | 52931 tok/s) step 5767/76294 | train loss 3.369987 | norm 0.1328 | lr 1.02e-03 | (9899.30 ms | 52962 tok/s) step 5768/76294 | train loss 3.281788 | norm 0.1433 | lr 1.02e-03 | 
(9921.59 ms | 52843 tok/s) step 5769/76294 | train loss 3.310923 | norm 0.1240 | lr 1.02e-03 | (9951.65 ms | 52684 tok/s) step 5770/76294 | train loss 3.314332 | norm 0.1488 | lr 1.02e-03 | (9920.86 ms | 52847 tok/s) step 5771/76294 | train loss 3.300936 | norm 0.1339 | lr 1.02e-03 | (9956.09 ms | 52660 tok/s) step 5772/76294 | train loss 3.320912 | norm 0.1422 | lr 1.02e-03 | (9926.52 ms | 52817 tok/s) step 5773/76294 | train loss 3.309848 | norm 0.1295 | lr 1.02e-03 | (9895.23 ms | 52984 tok/s) step 5774/76294 | train loss 3.373605 | norm 0.1507 | lr 1.02e-03 | (9909.66 ms | 52907 tok/s) step 5775/76294 | train loss 3.319928 | norm 0.1364 | lr 1.02e-03 | (9902.46 ms | 52945 tok/s) step 5776/76294 | train loss 3.322199 | norm 0.1493 | lr 1.02e-03 | (9896.53 ms | 52977 tok/s) step 5777/76294 | train loss 3.275107 | norm 0.1168 | lr 1.02e-03 | (9896.39 ms | 52978 tok/s) step 5778/76294 | train loss 3.348420 | norm 0.1327 | lr 1.02e-03 | (9898.31 ms | 52967 tok/s) step 5779/76294 | train loss 3.312796 | norm 0.1302 | lr 1.02e-03 | (9896.12 ms | 52979 tok/s) step 5780/76294 | train loss 3.417654 | norm 0.1938 | lr 1.02e-03 | (9900.39 ms | 52956 tok/s) step 5781/76294 | train loss 3.295633 | norm 0.2042 | lr 1.02e-03 | (9893.32 ms | 52994 tok/s) step 5782/76294 | train loss 3.367444 | norm 0.1417 | lr 1.02e-03 | (9900.97 ms | 52953 tok/s) step 5783/76294 | train loss 3.278263 | norm 0.1550 | lr 1.02e-03 | (9894.53 ms | 52988 tok/s) step 5784/76294 | train loss 3.345459 | norm 0.1413 | lr 1.02e-03 | (9903.65 ms | 52939 tok/s) step 5785/76294 | train loss 3.334931 | norm 0.1430 | lr 1.02e-03 | (9896.83 ms | 52975 tok/s) step 5786/76294 | train loss 3.283553 | norm 0.1461 | lr 1.02e-03 | (9935.19 ms | 52771 tok/s) step 5787/76294 | train loss 3.274782 | norm 0.1425 | lr 1.02e-03 | (9942.49 ms | 52732 tok/s) step 5788/76294 | train loss 3.322890 | norm 0.1652 | lr 1.02e-03 | (9905.89 ms | 52927 tok/s) step 5789/76294 | train loss 3.348472 | norm 0.1496 | lr 1.02e-03 | 
(9895.13 ms | 52984 tok/s) step 5790/76294 | train loss 3.234044 | norm 0.1419 | lr 1.02e-03 | (9899.12 ms | 52963 tok/s) step 5791/76294 | train loss 3.264481 | norm 0.1404 | lr 1.02e-03 | (9899.24 ms | 52962 tok/s) step 5792/76294 | train loss 3.370611 | norm 0.1655 | lr 1.02e-03 | (9889.47 ms | 53015 tok/s) step 5793/76294 | train loss 3.258091 | norm 0.1514 | lr 1.02e-03 | (9903.56 ms | 52939 tok/s) step 5794/76294 | train loss 3.243996 | norm 0.1593 | lr 1.02e-03 | (9894.47 ms | 52988 tok/s) step 5795/76294 | train loss 3.355830 | norm 0.1305 | lr 1.02e-03 | (9934.78 ms | 52773 tok/s) step 5796/76294 | train loss 3.340158 | norm 0.1507 | lr 1.02e-03 | (9892.79 ms | 52997 tok/s) step 5797/76294 | train loss 3.273500 | norm 0.1332 | lr 1.02e-03 | (9936.38 ms | 52764 tok/s) step 5798/76294 | train loss 3.279870 | norm 0.1495 | lr 1.02e-03 | (9904.40 ms | 52935 tok/s) step 5799/76294 | train loss 3.310580 | norm 0.1431 | lr 1.02e-03 | (9903.63 ms | 52939 tok/s) step 5800/76294 | train loss 3.305937 | norm 0.1397 | lr 1.02e-03 | (9890.58 ms | 53009 tok/s) step 5801/76294 | train loss 3.365489 | norm 0.1430 | lr 1.02e-03 | (9955.25 ms | 52664 tok/s) step 5802/76294 | train loss 3.365380 | norm 0.1392 | lr 1.02e-03 | (9888.79 ms | 53018 tok/s) step 5803/76294 | train loss 3.282146 | norm 0.1532 | lr 1.02e-03 | (9890.15 ms | 53011 tok/s) step 5804/76294 | train loss 3.343826 | norm 0.1238 | lr 1.02e-03 | (9891.99 ms | 53001 tok/s) step 5805/76294 | train loss 3.457451 | norm 0.1649 | lr 1.02e-03 | (9930.44 ms | 52796 tok/s) step 5806/76294 | train loss 3.239098 | norm 0.1398 | lr 1.02e-03 | (9890.89 ms | 53007 tok/s) step 5807/76294 | train loss 3.358964 | norm 0.1388 | lr 1.02e-03 | (9919.38 ms | 52855 tok/s) step 5808/76294 | train loss 3.315486 | norm 0.1482 | lr 1.02e-03 | (9887.43 ms | 53026 tok/s) step 5809/76294 | train loss 3.282929 | norm 0.1405 | lr 1.02e-03 | (9915.04 ms | 52878 tok/s) step 5810/76294 | train loss 3.342938 | norm 0.1329 | lr 1.02e-03 | 
(9898.70 ms | 52965 tok/s) step 5811/76294 | train loss 3.327573 | norm 0.1398 | lr 1.02e-03 | (9889.30 ms | 53016 tok/s) step 5812/76294 | train loss 3.321626 | norm 0.1307 | lr 1.02e-03 | (9950.31 ms | 52691 tok/s) step 5813/76294 | train loss 3.356366 | norm 0.1427 | lr 1.02e-03 | (10007.98 ms | 52387 tok/s) step 5814/76294 | train loss 3.365445 | norm 0.1505 | lr 1.02e-03 | (9917.76 ms | 52864 tok/s) step 5815/76294 | train loss 3.335187 | norm 0.1329 | lr 1.02e-03 | (9903.82 ms | 52938 tok/s) step 5816/76294 | train loss 3.242268 | norm 0.1628 | lr 1.02e-03 | (9909.48 ms | 52908 tok/s) step 5817/76294 | train loss 3.372843 | norm 0.1394 | lr 1.02e-03 | (9898.73 ms | 52965 tok/s) step 5818/76294 | train loss 3.334779 | norm 0.1443 | lr 1.02e-03 | (9890.04 ms | 53012 tok/s) step 5819/76294 | train loss 3.247161 | norm 0.1210 | lr 1.02e-03 | (9899.44 ms | 52961 tok/s) step 5820/76294 | train loss 3.302866 | norm 0.1597 | lr 1.02e-03 | (9888.48 ms | 53020 tok/s) step 5821/76294 | train loss 3.293890 | norm 0.1376 | lr 1.02e-03 | (9899.44 ms | 52961 tok/s) step 5822/76294 | train loss 3.406078 | norm 0.1542 | lr 1.02e-03 | (9892.37 ms | 52999 tok/s) step 5823/76294 | train loss 3.295130 | norm 0.1465 | lr 1.02e-03 | (9901.40 ms | 52951 tok/s) step 5824/76294 | train loss 3.319813 | norm 0.1415 | lr 1.02e-03 | (9891.75 ms | 53003 tok/s) step 5825/76294 | train loss 3.281553 | norm 0.1398 | lr 1.02e-03 | (9915.50 ms | 52876 tok/s) step 5826/76294 | train loss 3.285702 | norm 0.1441 | lr 1.02e-03 | (9897.57 ms | 52971 tok/s) step 5827/76294 | train loss 3.336244 | norm 0.1348 | lr 1.02e-03 | (9899.58 ms | 52961 tok/s) step 5828/76294 | train loss 3.262287 | norm 0.1454 | lr 1.02e-03 | (10156.28 ms | 51622 tok/s) step 5829/76294 | train loss 3.340137 | norm 0.1634 | lr 1.02e-03 | (9891.12 ms | 53006 tok/s) step 5830/76294 | train loss 3.297487 | norm 0.1467 | lr 1.02e-03 | (9899.46 ms | 52961 tok/s) step 5831/76294 | train loss 3.362847 | norm 0.1406 | lr 1.02e-03 | 
(9890.15 ms | 53011 tok/s) step 5832/76294 | train loss 3.304176 | norm 0.1635 | lr 1.02e-03 | (9898.34 ms | 52967 tok/s) step 5833/76294 | train loss 3.370595 | norm 0.1589 | lr 1.02e-03 | (9891.92 ms | 53002 tok/s) step 5834/76294 | train loss 3.263392 | norm 0.1310 | lr 1.02e-03 | (9896.55 ms | 52977 tok/s) step 5835/76294 | train loss 3.369497 | norm 0.1634 | lr 1.02e-03 | (9894.28 ms | 52989 tok/s) step 5836/76294 | train loss 3.224413 | norm 0.1450 | lr 1.01e-03 | (9895.03 ms | 52985 tok/s) step 5837/76294 | train loss 3.260303 | norm 0.1536 | lr 1.01e-03 | (9891.63 ms | 53003 tok/s) step 5838/76294 | train loss 3.346470 | norm 0.1555 | lr 1.01e-03 | (10065.17 ms | 52089 tok/s) step 5839/76294 | train loss 3.281562 | norm 0.1491 | lr 1.01e-03 | (9895.61 ms | 52982 tok/s) step 5840/76294 | train loss 3.280370 | norm 0.1368 | lr 1.01e-03 | (9896.35 ms | 52978 tok/s) step 5841/76294 | train loss 3.397151 | norm 0.1485 | lr 1.01e-03 | (9947.91 ms | 52703 tok/s) step 5842/76294 | train loss 3.290459 | norm 0.1476 | lr 1.01e-03 | (9925.53 ms | 52822 tok/s) step 5843/76294 | train loss 3.354092 | norm 0.1559 | lr 1.01e-03 | (9967.90 ms | 52598 tok/s) step 5844/76294 | train loss 3.432601 | norm 0.1289 | lr 1.01e-03 | (9896.35 ms | 52978 tok/s) step 5845/76294 | train loss 3.281897 | norm 0.1335 | lr 1.01e-03 | (9956.97 ms | 52655 tok/s) step 5846/76294 | train loss 3.278421 | norm 0.1456 | lr 1.01e-03 | (9885.68 ms | 53035 tok/s) step 5847/76294 | train loss 3.330752 | norm 0.1390 | lr 1.01e-03 | (9889.03 ms | 53017 tok/s) step 5848/76294 | train loss 3.295800 | norm 0.1493 | lr 1.01e-03 | (9892.44 ms | 52999 tok/s) step 5849/76294 | train loss 3.363399 | norm 0.1756 | lr 1.01e-03 | (9927.06 ms | 52814 tok/s) step 5850/76294 | train loss 3.435827 | norm 0.1567 | lr 1.01e-03 | (9886.47 ms | 53031 tok/s) step 5851/76294 | train loss 3.296505 | norm 0.1512 | lr 1.01e-03 | (9911.73 ms | 52896 tok/s) step 5852/76294 | train loss 3.361959 | norm 0.1365 | lr 1.01e-03 | 
(9900.17 ms | 52957 tok/s) step 5853/76294 | train loss 3.446783 | norm 0.1521 | lr 1.01e-03 | (9892.68 ms | 52998 tok/s) step 5854/76294 | train loss 3.268508 | norm 0.1594 | lr 1.01e-03 | (9944.25 ms | 52723 tok/s) step 5855/76294 | train loss 3.301194 | norm 0.1652 | lr 1.01e-03 | (9891.33 ms | 53005 tok/s) step 5856/76294 | train loss 3.356747 | norm 0.1435 | lr 1.01e-03 | (10780.90 ms | 48631 tok/s) step 5857/76294 | train loss 3.331157 | norm 0.1856 | lr 1.01e-03 | (9890.07 ms | 53012 tok/s) step 5858/76294 | train loss 3.357644 | norm 0.1675 | lr 1.01e-03 | (9923.20 ms | 52835 tok/s) step 5859/76294 | train loss 3.333105 | norm 0.1693 | lr 1.01e-03 | (9887.73 ms | 53024 tok/s) step 5860/76294 | train loss 3.286641 | norm 0.1667 | lr 1.01e-03 | (9922.39 ms | 52839 tok/s) step 5861/76294 | train loss 3.357970 | norm 0.1496 | lr 1.01e-03 | (9890.97 ms | 53007 tok/s) step 5862/76294 | train loss 3.380475 | norm 0.1652 | lr 1.01e-03 | (9894.32 ms | 52989 tok/s) step 5863/76294 | train loss 3.301302 | norm 0.1468 | lr 1.01e-03 | (9925.28 ms | 52823 tok/s) step 5864/76294 | train loss 3.351723 | norm 0.1529 | lr 1.01e-03 | (9893.49 ms | 52993 tok/s) step 5865/76294 | train loss 3.316966 | norm 0.1514 | lr 1.01e-03 | (9899.91 ms | 52959 tok/s) step 5866/76294 | train loss 3.283162 | norm 0.1451 | lr 1.01e-03 | (9887.30 ms | 53026 tok/s) step 5867/76294 | train loss 3.297895 | norm 0.1440 | lr 1.01e-03 | (9894.14 ms | 52990 tok/s) step 5868/76294 | train loss 3.367796 | norm 0.1627 | lr 1.01e-03 | (9929.86 ms | 52799 tok/s) step 5869/76294 | train loss 3.233256 | norm 0.1501 | lr 1.01e-03 | (9890.40 ms | 53010 tok/s) step 5870/76294 | train loss 3.347542 | norm 0.1629 | lr 1.01e-03 | (9891.43 ms | 53004 tok/s) step 5871/76294 | train loss 3.273394 | norm 0.1642 | lr 1.01e-03 | (9890.66 ms | 53008 tok/s) step 5872/76294 | train loss 3.233143 | norm 0.1353 | lr 1.01e-03 | (9901.46 ms | 52951 tok/s) step 5873/76294 | train loss 3.340994 | norm 0.1483 | lr 1.01e-03 | 
(9891.77 ms | 53002 tok/s) step 5874/76294 | train loss 3.286697 | norm 0.1484 | lr 1.01e-03 | (9902.44 ms | 52945 tok/s) step 5875/76294 | train loss 3.325121 | norm 0.1457 | lr 1.01e-03 | (9900.50 ms | 52956 tok/s) step 5876/76294 | train loss 3.313273 | norm 0.1358 | lr 1.01e-03 | (9902.76 ms | 52944 tok/s) step 5877/76294 | train loss 3.288029 | norm 0.1516 | lr 1.01e-03 | (10747.12 ms | 48784 tok/s) step 5878/76294 | train loss 3.302663 | norm 0.1371 | lr 1.01e-03 | (10812.87 ms | 48487 tok/s) step 5879/76294 | train loss 3.338614 | norm 0.1360 | lr 1.01e-03 | (9885.36 ms | 53037 tok/s) step 5880/76294 | train loss 3.301422 | norm 0.1307 | lr 1.01e-03 | (9945.17 ms | 52718 tok/s) step 5881/76294 | train loss 3.363737 | norm 0.1518 | lr 1.01e-03 | (9890.40 ms | 53010 tok/s) step 5882/76294 | train loss 3.383451 | norm 0.1386 | lr 1.01e-03 | (9896.05 ms | 52980 tok/s) step 5883/76294 | train loss 3.315543 | norm 0.1673 | lr 1.01e-03 | (9893.51 ms | 52993 tok/s) step 5884/76294 | train loss 3.302505 | norm 0.1623 | lr 1.01e-03 | (9917.26 ms | 52866 tok/s) step 5885/76294 | train loss 3.367657 | norm 0.1415 | lr 1.01e-03 | (9909.26 ms | 52909 tok/s) step 5886/76294 | train loss 3.356574 | norm 0.1407 | lr 1.01e-03 | (9917.09 ms | 52867 tok/s) step 5887/76294 | train loss 3.241111 | norm 0.1349 | lr 1.01e-03 | (12326.47 ms | 42533 tok/s) step 5888/76294 | train loss 3.387331 | norm 0.1385 | lr 1.01e-03 | (9873.80 ms | 53099 tok/s) step 5889/76294 | train loss 3.314579 | norm 0.1392 | lr 1.01e-03 | (9888.51 ms | 53020 tok/s) step 5890/76294 | train loss 3.301839 | norm 0.1383 | lr 1.01e-03 | (9892.77 ms | 52997 tok/s) step 5891/76294 | train loss 3.318892 | norm 0.1367 | lr 1.01e-03 | (9891.62 ms | 53003 tok/s) step 5892/76294 | train loss 3.297036 | norm 0.1402 | lr 1.01e-03 | (9903.48 ms | 52940 tok/s) step 5893/76294 | train loss 3.267111 | norm 0.1369 | lr 1.01e-03 | (11492.44 ms | 45620 tok/s) step 5894/76294 | train loss 3.397017 | norm 0.1353 | lr 1.01e-03 | 
(9887.51 ms | 53025 tok/s) step 5895/76294 | train loss 3.256066 | norm 0.1467 | lr 1.01e-03 | (9897.85 ms | 52970 tok/s) step 5896/76294 | train loss 3.400120 | norm 0.1509 | lr 1.01e-03 | (9900.94 ms | 52953 tok/s) step 5897/76294 | train loss 3.382043 | norm 0.1292 | lr 1.01e-03 | (9901.65 ms | 52950 tok/s) step 5898/76294 | train loss 3.312330 | norm 0.1338 | lr 1.01e-03 | (9905.41 ms | 52929 tok/s) step 5899/76294 | train loss 3.500176 | norm 0.1652 | lr 1.01e-03 | (9903.83 ms | 52938 tok/s) step 5900/76294 | train loss 3.394204 | norm 0.1639 | lr 1.01e-03 | (9905.67 ms | 52928 tok/s) step 5901/76294 | train loss 3.314494 | norm 0.1675 | lr 1.01e-03 | (9965.10 ms | 52612 tok/s) step 5902/76294 | train loss 3.312975 | norm 0.1541 | lr 1.01e-03 | (9894.83 ms | 52986 tok/s) step 5903/76294 | train loss 3.298988 | norm 0.1434 | lr 1.01e-03 | (9893.07 ms | 52995 tok/s) step 5904/76294 | train loss 3.286725 | norm 0.1459 | lr 1.01e-03 | (9951.16 ms | 52686 tok/s) step 5905/76294 | train loss 3.275171 | norm 0.1281 | lr 1.01e-03 | (9897.95 ms | 52969 tok/s) step 5906/76294 | train loss 3.343045 | norm 0.1364 | lr 1.01e-03 | (9894.07 ms | 52990 tok/s) step 5907/76294 | train loss 3.275413 | norm 0.1199 | lr 1.01e-03 | (9900.52 ms | 52956 tok/s) step 5908/76294 | train loss 3.326093 | norm 0.1362 | lr 1.01e-03 | (9934.11 ms | 52777 tok/s) step 5909/76294 | train loss 3.269151 | norm 0.1279 | lr 1.01e-03 | (9894.99 ms | 52985 tok/s) step 5910/76294 | train loss 3.339226 | norm 0.1293 | lr 1.01e-03 | (9961.73 ms | 52630 tok/s) step 5911/76294 | train loss 3.426923 | norm 0.4861 | lr 1.01e-03 | (9898.11 ms | 52969 tok/s) step 5912/76294 | train loss 3.296707 | norm 0.2303 | lr 1.01e-03 | (9907.48 ms | 52918 tok/s) step 5913/76294 | train loss 3.217857 | norm 0.1833 | lr 1.01e-03 | (9922.04 ms | 52841 tok/s) step 5914/76294 | train loss 3.365141 | norm 0.1753 | lr 1.01e-03 | (9940.81 ms | 52741 tok/s) step 5915/76294 | train loss 3.337107 | norm 0.1542 | lr 1.01e-03 | 
(9898.21 ms | 52968 tok/s) step 5916/76294 | train loss 3.275849 | norm 0.1923 | lr 1.01e-03 | (9932.78 ms | 52784 tok/s) step 5917/76294 | train loss 3.307370 | norm 0.1772 | lr 1.01e-03 | (9891.53 ms | 53004 tok/s) step 5918/76294 | train loss 3.330224 | norm 0.1791 | lr 1.01e-03 | (9900.23 ms | 52957 tok/s) step 5919/76294 | train loss 3.337915 | norm 0.1624 | lr 1.01e-03 | (9904.35 ms | 52935 tok/s) step 5920/76294 | train loss 3.302135 | norm 0.1654 | lr 1.01e-03 | (9892.51 ms | 52998 tok/s) step 5921/76294 | train loss 3.281556 | norm 0.1570 | lr 1.01e-03 | (9910.73 ms | 52901 tok/s) step 5922/76294 | train loss 3.309447 | norm 0.1589 | lr 1.01e-03 | (9948.72 ms | 52699 tok/s) step 5923/76294 | train loss 3.352196 | norm 0.1618 | lr 1.01e-03 | (9897.07 ms | 52974 tok/s) step 5924/76294 | train loss 3.282974 | norm 0.1554 | lr 1.01e-03 | (9892.09 ms | 53001 tok/s) step 5925/76294 | train loss 3.329509 | norm 0.1426 | lr 1.01e-03 | (9900.07 ms | 52958 tok/s) step 5926/76294 | train loss 3.371564 | norm 0.1397 | lr 1.01e-03 | (9892.13 ms | 53001 tok/s) step 5927/76294 | train loss 3.444158 | norm 0.1391 | lr 1.01e-03 | (9904.87 ms | 52932 tok/s) step 5928/76294 | train loss 3.270423 | norm 0.1394 | lr 1.01e-03 | (9891.85 ms | 53002 tok/s) step 5929/76294 | train loss 3.319997 | norm 0.1370 | lr 1.01e-03 | (9940.68 ms | 52742 tok/s) step 5930/76294 | train loss 3.326844 | norm 0.1435 | lr 1.01e-03 | (9889.43 ms | 53015 tok/s) step 5931/76294 | train loss 3.333675 | norm 0.1419 | lr 1.01e-03 | (9897.77 ms | 52970 tok/s) step 5932/76294 | train loss 3.326965 | norm 0.1349 | lr 1.01e-03 | (9933.97 ms | 52777 tok/s) step 5933/76294 | train loss 3.307682 | norm 0.1505 | lr 1.01e-03 | (9898.34 ms | 52967 tok/s) step 5934/76294 | train loss 3.321139 | norm 0.1350 | lr 1.01e-03 | (9895.90 ms | 52980 tok/s) step 5935/76294 | train loss 3.340668 | norm 0.1315 | lr 1.01e-03 | (9901.11 ms | 52952 tok/s) step 5936/76294 | train loss 3.303349 | norm 0.1339 | lr 1.01e-03 | 
(9887.87 ms | 53023 tok/s) step 5937/76294 | train loss 3.290369 | norm 0.1366 | lr 1.01e-03 | (9895.75 ms | 52981 tok/s) step 5938/76294 | train loss 3.347507 | norm 0.1438 | lr 1.01e-03 | (9932.80 ms | 52783 tok/s) step 5939/76294 | train loss 3.347740 | norm 0.1592 | lr 1.01e-03 | (9893.90 ms | 52991 tok/s) step 5940/76294 | train loss 3.335140 | norm 0.1332 | lr 1.01e-03 | (9931.45 ms | 52791 tok/s) step 5941/76294 | train loss 3.315829 | norm 0.1568 | lr 1.01e-03 | (9958.43 ms | 52648 tok/s) step 5942/76294 | train loss 3.319602 | norm 0.1439 | lr 1.01e-03 | (9885.97 ms | 53034 tok/s) step 5943/76294 | train loss 3.302734 | norm 0.1275 | lr 1.01e-03 | (9892.29 ms | 53000 tok/s) step 5944/76294 | train loss 3.300605 | norm 0.1318 | lr 1.01e-03 | (9891.54 ms | 53004 tok/s) step 5945/76294 | train loss 3.242471 | norm 0.1380 | lr 1.01e-03 | (9953.62 ms | 52673 tok/s) step 5946/76294 | train loss 3.347883 | norm 0.1489 | lr 1.01e-03 | (9895.12 ms | 52984 tok/s) step 5947/76294 | train loss 3.293067 | norm 0.1488 | lr 1.01e-03 | (9895.36 ms | 52983 tok/s) step 5948/76294 | train loss 3.298224 | norm 0.1285 | lr 1.01e-03 | (9936.82 ms | 52762 tok/s) step 5949/76294 | train loss 3.383412 | norm 0.1502 | lr 1.01e-03 | (9891.46 ms | 53004 tok/s) step 5950/76294 | train loss 3.273431 | norm 0.1395 | lr 1.01e-03 | (9945.94 ms | 52714 tok/s) step 5951/76294 | train loss 3.260699 | norm 0.1499 | lr 1.01e-03 | (9890.66 ms | 53008 tok/s) step 5952/76294 | train loss 3.356338 | norm 0.1518 | lr 1.01e-03 | (9898.60 ms | 52966 tok/s) step 5953/76294 | train loss 3.330584 | norm 0.1304 | lr 1.01e-03 | (9892.49 ms | 52999 tok/s) step 5954/76294 | train loss 3.294868 | norm 0.1568 | lr 1.01e-03 | (11108.02 ms | 47199 tok/s) step 5955/76294 | train loss 3.317448 | norm 0.1557 | lr 1.01e-03 | (9888.32 ms | 53021 tok/s) step 5956/76294 | train loss 3.302474 | norm 0.1548 | lr 1.01e-03 | (9890.63 ms | 53009 tok/s) step 5957/76294 | train loss 3.288028 | norm 0.1525 | lr 1.01e-03 | 
(9888.74 ms | 53019 tok/s) step 5958/76294 | train loss 3.260782 | norm 0.1434 | lr 1.01e-03 | (9882.62 ms | 53052 tok/s) step 5959/76294 | train loss 3.410669 | norm 0.1524 | lr 1.01e-03 | (9894.29 ms | 52989 tok/s) step 5960/76294 | train loss 3.298366 | norm 0.1389 | lr 1.01e-03 | (9913.58 ms | 52886 tok/s) step 5961/76294 | train loss 3.330400 | norm 0.1454 | lr 1.01e-03 | (9891.20 ms | 53006 tok/s) step 5962/76294 | train loss 3.407714 | norm 0.1413 | lr 1.01e-03 | (9892.82 ms | 52997 tok/s) step 5963/76294 | train loss 3.249243 | norm 0.1330 | lr 1.01e-03 | (9897.84 ms | 52970 tok/s) step 5964/76294 | train loss 3.328830 | norm 0.1316 | lr 1.01e-03 | (9893.01 ms | 52996 tok/s) step 5965/76294 | train loss 3.325286 | norm 0.1244 | lr 1.01e-03 | (9893.48 ms | 52993 tok/s) step 5966/76294 | train loss 3.286042 | norm 0.1386 | lr 1.01e-03 | (9895.57 ms | 52982 tok/s) step 5967/76294 | train loss 3.228669 | norm 0.1281 | lr 1.01e-03 | (9895.48 ms | 52983 tok/s) step 5968/76294 | train loss 3.357908 | norm 0.1387 | lr 1.01e-03 | (9931.22 ms | 52792 tok/s) step 5969/76294 | train loss 3.360294 | norm 0.1324 | lr 1.01e-03 | (9896.42 ms | 52978 tok/s) step 5970/76294 | train loss 3.310848 | norm 0.1315 | lr 1.01e-03 | (9896.54 ms | 52977 tok/s) step 5971/76294 | train loss 3.378354 | norm 0.1265 | lr 1.01e-03 | (9900.30 ms | 52957 tok/s) step 5972/76294 | train loss 3.285987 | norm 0.1279 | lr 1.01e-03 | (9899.92 ms | 52959 tok/s) step 5973/76294 | train loss 3.244536 | norm 0.1218 | lr 1.01e-03 | (9891.21 ms | 53005 tok/s) step 5974/76294 | train loss 3.339369 | norm 0.1259 | lr 1.01e-03 | (9935.40 ms | 52770 tok/s) step 5975/76294 | train loss 3.319625 | norm 0.1302 | lr 1.01e-03 | (9895.50 ms | 52982 tok/s) step 5976/76294 | train loss 3.298480 | norm 0.1231 | lr 1.01e-03 | (9900.81 ms | 52954 tok/s) step 5977/76294 | train loss 3.546061 | norm 0.1793 | lr 1.01e-03 | (9894.83 ms | 52986 tok/s) step 5978/76294 | train loss 3.537349 | norm 0.2023 | lr 1.01e-03 | 
(9894.37 ms | 52988 tok/s) step 5979/76294 | train loss 3.291429 | norm 0.2173 | lr 1.01e-03 | (9901.89 ms | 52948 tok/s) step 5980/76294 | train loss 3.296597 | norm 0.1699 | lr 1.01e-03 | (9905.67 ms | 52928 tok/s) step 5981/76294 | train loss 3.353534 | norm 0.1596 | lr 1.01e-03 | (10003.98 ms | 52408 tok/s) step 5982/76294 | train loss 3.333303 | norm 0.1474 | lr 1.00e-03 | (9889.49 ms | 53015 tok/s) step 5983/76294 | train loss 3.300221 | norm 0.1526 | lr 1.00e-03 | (9961.19 ms | 52633 tok/s) step 5984/76294 | train loss 3.332999 | norm 0.1520 | lr 1.00e-03 | (9912.11 ms | 52894 tok/s) step 5985/76294 | train loss 3.245660 | norm 0.1396 | lr 1.00e-03 | (9959.33 ms | 52643 tok/s) step 5986/76294 | train loss 3.282650 | norm 0.1767 | lr 1.00e-03 | (9892.04 ms | 53001 tok/s) step 5987/76294 | train loss 3.273309 | norm 0.1433 | lr 1.00e-03 | (9958.08 ms | 52650 tok/s) step 5988/76294 | train loss 3.309185 | norm 0.1303 | lr 1.00e-03 | (9900.37 ms | 52956 tok/s) step 5989/76294 | train loss 3.351357 | norm 0.1387 | lr 1.00e-03 | (9899.89 ms | 52959 tok/s) step 5990/76294 | train loss 3.386761 | norm 0.1454 | lr 1.00e-03 | (9938.56 ms | 52753 tok/s) step 5991/76294 | train loss 3.277629 | norm 0.1427 | lr 1.00e-03 | (9895.41 ms | 52983 tok/s) step 5992/76294 | train loss 3.326345 | norm 0.1475 | lr 1.00e-03 | (9895.19 ms | 52984 tok/s) step 5993/76294 | train loss 3.319399 | norm 0.1473 | lr 1.00e-03 | (9895.21 ms | 52984 tok/s) step 5994/76294 | train loss 3.254179 | norm 0.1451 | lr 1.00e-03 | (9898.82 ms | 52965 tok/s) step 5995/76294 | train loss 3.334290 | norm 0.1542 | lr 1.00e-03 | (9902.11 ms | 52947 tok/s) step 5996/76294 | train loss 3.271506 | norm 0.1493 | lr 1.00e-03 | (9890.38 ms | 53010 tok/s) step 5997/76294 | train loss 3.277939 | norm 0.1327 | lr 1.00e-03 | (9902.91 ms | 52943 tok/s) step 5998/76294 | train loss 3.295816 | norm 0.1497 | lr 1.00e-03 | (9896.42 ms | 52978 tok/s) step 5999/76294 | train loss 3.266951 | norm 0.1320 | lr 1.00e-03 | 
(9930.66 ms | 52795 tok/s) step 6000/76294 | train loss 3.246728 | norm 0.1341 | lr 1.00e-03 | (9893.81 ms | 52992 tok/s) val loss: 3.311525 saving model checkpoint to ./results/gpt2-350M-gqa/step_6000.pth step 6001/76294 | train loss 3.320586 | norm 0.1410 | lr 1.00e-03 | (9921.58 ms | 52843 tok/s) step 6002/76294 | train loss 3.317002 | norm 0.1320 | lr 1.00e-03 | (9841.61 ms | 53273 tok/s) step 6003/76294 | train loss 3.252542 | norm 0.1428 | lr 1.00e-03 | (9890.30 ms | 53010 tok/s) step 6004/76294 | train loss 3.254759 | norm 0.1605 | lr 1.00e-03 | (9891.69 ms | 53003 tok/s) step 6005/76294 | train loss 3.299084 | norm 0.1887 | lr 1.00e-03 | (9856.88 ms | 53190 tok/s) step 6006/76294 | train loss 3.299464 | norm 0.1419 | lr 1.00e-03 | (9864.04 ms | 53151 tok/s) step 6007/76294 | train loss 3.286893 | norm 0.1545 | lr 1.00e-03 | (9866.95 ms | 53136 tok/s) step 6008/76294 | train loss 3.382002 | norm 0.1670 | lr 1.00e-03 | (9896.15 ms | 52979 tok/s) step 6009/76294 | train loss 3.274266 | norm 0.1572 | lr 1.00e-03 | (9869.48 ms | 53122 tok/s) step 6010/76294 | train loss 3.293416 | norm 0.1505 | lr 1.00e-03 | (9884.33 ms | 53042 tok/s) step 6011/76294 | train loss 3.438253 | norm 0.1695 | lr 1.00e-03 | (9867.09 ms | 53135 tok/s) step 6012/76294 | train loss 3.308982 | norm 0.1397 | lr 1.00e-03 | (9869.58 ms | 53122 tok/s) step 6013/76294 | train loss 3.219687 | norm 0.1562 | lr 1.00e-03 | (9867.80 ms | 53131 tok/s) step 6014/76294 | train loss 3.434836 | norm 0.1664 | lr 1.00e-03 | (9880.31 ms | 53064 tok/s) step 6015/76294 | train loss 3.351299 | norm 0.1885 | lr 1.00e-03 | (9870.40 ms | 53117 tok/s) step 6016/76294 | train loss 3.381131 | norm 0.1825 | lr 1.00e-03 | (9880.30 ms | 53064 tok/s) step 6017/76294 | train loss 3.331059 | norm 0.1531 | lr 1.00e-03 | (9869.85 ms | 53120 tok/s) step 6018/76294 | train loss 3.293714 | norm 0.1700 | lr 1.00e-03 | (9875.97 ms | 53087 tok/s) step 6019/76294 | train loss 3.375385 | norm 0.1345 | lr 1.00e-03 | (10741.28 ms | 
48811 tok/s) step 6020/76294 | train loss 3.300650 | norm 0.1568 | lr 1.00e-03 | (9866.92 ms | 53136 tok/s) step 6021/76294 | train loss 3.277267 | norm 0.1497 | lr 1.00e-03 | (9883.32 ms | 53048 tok/s) step 6022/76294 | train loss 3.350523 | norm 0.1667 | lr 1.00e-03 | (9906.30 ms | 52925 tok/s) step 6023/76294 | train loss 3.336795 | norm 0.1437 | lr 1.00e-03 | (9882.40 ms | 53053 tok/s) step 6024/76294 | train loss 3.401882 | norm 0.1578 | lr 1.00e-03 | (9875.73 ms | 53089 tok/s) step 6025/76294 | train loss 3.301078 | norm 0.1378 | lr 1.00e-03 | (9869.98 ms | 53119 tok/s) step 6026/76294 | train loss 3.352457 | norm 0.1517 | lr 1.00e-03 | (9867.13 ms | 53135 tok/s) step 6027/76294 | train loss 3.364691 | norm 0.1317 | lr 1.00e-03 | (9871.31 ms | 53112 tok/s) step 6028/76294 | train loss 3.299444 | norm 0.1429 | lr 1.00e-03 | (9866.44 ms | 53139 tok/s) step 6029/76294 | train loss 3.299511 | norm 0.1384 | lr 1.00e-03 | (9876.45 ms | 53085 tok/s) step 6030/76294 | train loss 3.294605 | norm 0.1441 | lr 1.00e-03 | (9869.27 ms | 53123 tok/s) step 6031/76294 | train loss 3.373930 | norm 0.1403 | lr 1.00e-03 | (9877.86 ms | 53077 tok/s) step 6032/76294 | train loss 3.285206 | norm 0.1445 | lr 1.00e-03 | (9865.87 ms | 53142 tok/s) step 6033/76294 | train loss 3.269917 | norm 0.1555 | lr 1.00e-03 | (9868.83 ms | 53126 tok/s) step 6034/76294 | train loss 3.389345 | norm 0.1524 | lr 1.00e-03 | (9863.91 ms | 53152 tok/s) step 6035/76294 | train loss 3.329808 | norm 0.1569 | lr 1.00e-03 | (9878.14 ms | 53076 tok/s) step 6036/76294 | train loss 3.293408 | norm 0.1556 | lr 1.00e-03 | (9870.71 ms | 53116 tok/s) step 6037/76294 | train loss 3.339937 | norm 0.1519 | lr 1.00e-03 | (9875.70 ms | 53089 tok/s) step 6038/76294 | train loss 3.444674 | norm 0.1571 | lr 1.00e-03 | (9888.60 ms | 53019 tok/s) step 6039/76294 | train loss 3.329351 | norm 0.1832 | lr 1.00e-03 | (9866.45 ms | 53138 tok/s) step 6040/76294 | train loss 3.344609 | norm 0.1559 | lr 1.00e-03 | (9914.23 ms | 
52882 tok/s) step 6041/76294 | train loss 3.399855 | norm 0.1662 | lr 1.00e-03 | (9872.66 ms | 53105 tok/s) step 6042/76294 | train loss 3.323053 | norm 0.1736 | lr 1.00e-03 | (9867.72 ms | 53132 tok/s) step 6043/76294 | train loss 3.358323 | norm 0.1340 | lr 1.00e-03 | (9876.12 ms | 53086 tok/s) step 6044/76294 | train loss 3.203176 | norm 0.1454 | lr 1.00e-03 | (9869.75 ms | 53121 tok/s) step 6045/76294 | train loss 3.251164 | norm 0.1497 | lr 1.00e-03 | (9869.89 ms | 53120 tok/s) step 6046/76294 | train loss 3.332438 | norm 0.1322 | lr 1.00e-03 | (9868.85 ms | 53126 tok/s) step 6047/76294 | train loss 3.258109 | norm 0.1805 | lr 1.00e-03 | (9874.41 ms | 53096 tok/s) step 6048/76294 | train loss 3.341014 | norm 0.1556 | lr 1.00e-03 | (9897.39 ms | 52972 tok/s) step 6049/76294 | train loss 3.321810 | norm 0.1628 | lr 1.00e-03 | (9871.33 ms | 53112 tok/s) step 6050/76294 | train loss 3.405160 | norm 0.1484 | lr 1.00e-03 | (10030.08 ms | 52272 tok/s) step 6051/76294 | train loss 3.313152 | norm 0.1559 | lr 1.00e-03 | (9868.24 ms | 53129 tok/s) step 6052/76294 | train loss 3.408536 | norm 0.1489 | lr 1.00e-03 | (11020.35 ms | 47575 tok/s) step 6053/76294 | train loss 3.335913 | norm 0.1739 | lr 1.00e-03 | (9863.97 ms | 53152 tok/s) step 6054/76294 | train loss 3.305487 | norm 0.1489 | lr 1.00e-03 | (9868.62 ms | 53127 tok/s) step 6055/76294 | train loss 3.323736 | norm 0.1423 | lr 1.00e-03 | (9867.25 ms | 53134 tok/s) step 6056/76294 | train loss 3.310812 | norm 0.1559 | lr 1.00e-03 | (9874.56 ms | 53095 tok/s) step 6057/76294 | train loss 3.264493 | norm 0.1372 | lr 1.00e-03 | (9865.14 ms | 53146 tok/s) step 6058/76294 | train loss 3.303140 | norm 0.1436 | lr 1.00e-03 | (9904.10 ms | 52936 tok/s) step 6059/76294 | train loss 3.408059 | norm 0.1412 | lr 1.00e-03 | (9888.55 ms | 53020 tok/s) step 6060/76294 | train loss 3.273511 | norm 0.1653 | lr 1.00e-03 | (9867.82 ms | 53131 tok/s) step 6061/76294 | train loss 3.296172 | norm 0.1377 | lr 9.99e-04 | (9888.86 ms | 
53018 tok/s) step 6062/76294 | train loss 3.303060 | norm 0.1388 | lr 9.99e-04 | (9872.65 ms | 53105 tok/s) step 6063/76294 | train loss 3.341614 | norm 0.1509 | lr 9.99e-04 | (9869.10 ms | 53124 tok/s) step 6064/76294 | train loss 3.359306 | norm 0.1600 | lr 9.99e-04 | (9874.35 ms | 53096 tok/s) step 6065/76294 | train loss 3.285444 | norm 0.1582 | lr 9.99e-04 | (9867.98 ms | 53130 tok/s) step 6066/76294 | train loss 3.317780 | norm 0.1452 | lr 9.99e-04 | (9867.91 ms | 53131 tok/s) step 6067/76294 | train loss 3.323423 | norm 0.1659 | lr 9.99e-04 | (9870.97 ms | 53114 tok/s) step 6068/76294 | train loss 3.388631 | norm 0.1324 | lr 9.99e-04 | (9873.22 ms | 53102 tok/s) step 6069/76294 | train loss 3.296209 | norm 0.1467 | lr 9.99e-04 | (9873.18 ms | 53102 tok/s) step 6070/76294 | train loss 3.287938 | norm 0.1479 | lr 9.99e-04 | (9900.79 ms | 52954 tok/s) step 6071/76294 | train loss 3.456403 | norm 0.1868 | lr 9.99e-04 | (9944.75 ms | 52720 tok/s) step 6072/76294 | train loss 3.356074 | norm 0.1785 | lr 9.99e-04 | (9870.37 ms | 53117 tok/s) step 6073/76294 | train loss 3.294304 | norm 0.1400 | lr 9.99e-04 | (9869.86 ms | 53120 tok/s) step 6074/76294 | train loss 3.346211 | norm 0.1589 | lr 9.99e-04 | (9870.00 ms | 53119 tok/s) step 6075/76294 | train loss 3.343475 | norm 0.1532 | lr 9.98e-04 | (9878.12 ms | 53076 tok/s) step 6076/76294 | train loss 3.311504 | norm 0.1437 | lr 9.98e-04 | (9911.22 ms | 52898 tok/s) step 6077/76294 | train loss 3.339149 | norm 0.1269 | lr 9.98e-04 | (9874.73 ms | 53094 tok/s) step 6078/76294 | train loss 3.273894 | norm 0.1343 | lr 9.98e-04 | (9879.36 ms | 53069 tok/s) step 6079/76294 | train loss 3.363803 | norm 0.1298 | lr 9.98e-04 | (9875.57 ms | 53089 tok/s) step 6080/76294 | train loss 3.312423 | norm 0.1306 | lr 9.98e-04 | (9879.52 ms | 53068 tok/s) step 6081/76294 | train loss 3.305527 | norm 0.1322 | lr 9.98e-04 | (9936.70 ms | 52763 tok/s) step 6082/76294 | train loss 3.281716 | norm 0.1406 | lr 9.98e-04 | (9878.24 ms | 
53075 tok/s) step 6083/76294 | train loss 3.261301 | norm 0.1203 | lr 9.98e-04 | (9876.06 ms | 53087 tok/s) step 6084/76294 | train loss 3.321477 | norm 0.1752 | lr 9.98e-04 | (9913.16 ms | 52888 tok/s) step 6085/76294 | train loss 3.287052 | norm 0.1364 | lr 9.98e-04 | (9879.88 ms | 53066 tok/s) step 6086/76294 | train loss 3.280563 | norm 0.1632 | lr 9.98e-04 | (9878.25 ms | 53075 tok/s) step 6087/76294 | train loss 3.333419 | norm 0.1508 | lr 9.98e-04 | (9877.66 ms | 53078 tok/s) step 6088/76294 | train loss 3.350235 | norm 0.1422 | lr 9.98e-04 | (9891.31 ms | 53005 tok/s) step 6089/76294 | train loss 3.330750 | norm 0.1443 | lr 9.97e-04 | (9877.36 ms | 53080 tok/s) step 6090/76294 | train loss 3.289165 | norm 0.1383 | lr 9.97e-04 | (9883.36 ms | 53048 tok/s) step 6091/76294 | train loss 3.323295 | norm 0.1297 | lr 9.97e-04 | (9886.45 ms | 53031 tok/s) step 6092/76294 | train loss 3.272283 | norm 0.1248 | lr 9.97e-04 | (9910.83 ms | 52901 tok/s) step 6093/76294 | train loss 3.351579 | norm 0.1589 | lr 9.97e-04 | (9877.24 ms | 53080 tok/s) step 6094/76294 | train loss 3.282122 | norm 0.1385 | lr 9.97e-04 | (9882.04 ms | 53055 tok/s) step 6095/76294 | train loss 3.267034 | norm 0.1383 | lr 9.97e-04 | (9874.83 ms | 53093 tok/s) step 6096/76294 | train loss 3.344227 | norm 0.1511 | lr 9.97e-04 | (9900.83 ms | 52954 tok/s) step 6097/76294 | train loss 3.365537 | norm 0.1463 | lr 9.97e-04 | (9880.56 ms | 53063 tok/s) step 6098/76294 | train loss 3.362591 | norm 0.1414 | lr 9.97e-04 | (9876.27 ms | 53086 tok/s) step 6099/76294 | train loss 3.305491 | norm 0.1438 | lr 9.97e-04 | (9883.48 ms | 53047 tok/s) step 6100/76294 | train loss 3.315760 | norm 0.1457 | lr 9.97e-04 | (9875.40 ms | 53090 tok/s) step 6101/76294 | train loss 3.349862 | norm 0.1528 | lr 9.97e-04 | (9884.38 ms | 53042 tok/s) step 6102/76294 | train loss 3.264190 | norm 0.1573 | lr 9.97e-04 | (9879.86 ms | 53066 tok/s) step 6103/76294 | train loss 3.313248 | norm 0.1428 | lr 9.96e-04 | (9876.52 ms | 
53084 tok/s) step 6104/76294 | train loss 3.383955 | norm 0.1426 | lr 9.96e-04 | (9872.52 ms | 53106 tok/s) step 6105/76294 | train loss 3.341091 | norm 0.1476 | lr 9.96e-04 | (9884.68 ms | 53040 tok/s) step 6106/76294 | train loss 3.280740 | norm 0.1222 | lr 9.96e-04 | (9875.27 ms | 53091 tok/s) step 6107/76294 | train loss 3.319883 | norm 0.1475 | lr 9.96e-04 | (9885.83 ms | 53034 tok/s) step 6108/76294 | train loss 3.305005 | norm 0.1284 | lr 9.96e-04 | (9967.59 ms | 52599 tok/s) step 6109/76294 | train loss 3.288622 | norm 0.1397 | lr 9.96e-04 | (9880.85 ms | 53061 tok/s) step 6110/76294 | train loss 3.459725 | norm 0.1747 | lr 9.96e-04 | (9882.39 ms | 53053 tok/s) step 6111/76294 | train loss 3.389221 | norm 0.1440 | lr 9.96e-04 | (9879.17 ms | 53070 tok/s) step 6112/76294 | train loss 3.331845 | norm 0.1777 | lr 9.96e-04 | (9875.32 ms | 53091 tok/s) step 6113/76294 | train loss 3.306588 | norm 0.1613 | lr 9.96e-04 | (9913.49 ms | 52886 tok/s) step 6114/76294 | train loss 3.369660 | norm 0.1506 | lr 9.96e-04 | (9898.49 ms | 52966 tok/s) step 6115/76294 | train loss 3.318235 | norm 0.1359 | lr 9.96e-04 | (9902.01 ms | 52948 tok/s) step 6116/76294 | train loss 3.359608 | norm 0.1516 | lr 9.96e-04 | (9881.54 ms | 53057 tok/s) step 6117/76294 | train loss 3.292412 | norm 0.1321 | lr 9.96e-04 | (9923.61 ms | 52832 tok/s) step 6118/76294 | train loss 3.399123 | norm 0.1499 | lr 9.95e-04 | (9873.96 ms | 53098 tok/s) step 6119/76294 | train loss 3.343922 | norm 0.1296 | lr 9.95e-04 | (9887.56 ms | 53025 tok/s) step 6120/76294 | train loss 3.349594 | norm 0.1517 | lr 9.95e-04 | (9876.43 ms | 53085 tok/s) step 6121/76294 | train loss 3.258858 | norm 0.1294 | lr 9.95e-04 | (9888.90 ms | 53018 tok/s) step 6122/76294 | train loss 3.313551 | norm 0.1327 | lr 9.95e-04 | (9880.80 ms | 53061 tok/s) step 6123/76294 | train loss 3.326843 | norm 0.1238 | lr 9.95e-04 | (9950.07 ms | 52692 tok/s) step 6124/76294 | train loss 3.312583 | norm 0.1370 | lr 9.95e-04 | (9883.21 ms | 
53048 tok/s) step 6125/76294 | train loss 3.315141 | norm 0.1395 | lr 9.95e-04 | (9882.84 ms | 53050 tok/s) step 6126/76294 | train loss 3.262466 | norm 0.1386 | lr 9.95e-04 | (9904.61 ms | 52934 tok/s) step 6127/76294 | train loss 3.321638 | norm 0.1348 | lr 9.95e-04 | (9879.41 ms | 53069 tok/s) step 6128/76294 | train loss 3.369236 | norm 0.1487 | lr 9.95e-04 | (9882.35 ms | 53053 tok/s) step 6129/76294 | train loss 3.338676 | norm 0.1270 | lr 9.95e-04 | (9879.81 ms | 53067 tok/s) step 6130/76294 | train loss 3.330195 | norm 0.1579 | lr 9.95e-04 | (9883.61 ms | 53046 tok/s) step 6131/76294 | train loss 3.281955 | norm 0.1412 | lr 9.95e-04 | (9885.87 ms | 53034 tok/s) step 6132/76294 | train loss 3.293817 | norm 0.1460 | lr 9.94e-04 | (9887.15 ms | 53027 tok/s) step 6133/76294 | train loss 3.374321 | norm 0.1353 | lr 9.94e-04 | (9888.66 ms | 53019 tok/s) step 6134/76294 | train loss 3.341967 | norm 0.1415 | lr 9.94e-04 | (9896.82 ms | 52975 tok/s) step 6135/76294 | train loss 3.320693 | norm 0.1334 | lr 9.94e-04 | (9884.75 ms | 53040 tok/s) step 6136/76294 | train loss 3.290494 | norm 0.1265 | lr 9.94e-04 | (9881.42 ms | 53058 tok/s) step 6137/76294 | train loss 3.245986 | norm 0.1498 | lr 9.94e-04 | (9887.39 ms | 53026 tok/s) step 6138/76294 | train loss 3.627980 | norm 0.1780 | lr 9.94e-04 | (9877.28 ms | 53080 tok/s) step 6139/76294 | train loss 3.330765 | norm 0.1579 | lr 9.94e-04 | (9912.53 ms | 52891 tok/s) step 6140/76294 | train loss 3.285501 | norm 0.1562 | lr 9.94e-04 | (9878.20 ms | 53075 tok/s) step 6141/76294 | train loss 3.261717 | norm 0.1647 | lr 9.94e-04 | (9885.66 ms | 53035 tok/s) step 6142/76294 | train loss 3.359261 | norm 0.1705 | lr 9.94e-04 | (9881.62 ms | 53057 tok/s) step 6143/76294 | train loss 3.344769 | norm 0.1499 | lr 9.94e-04 | (9886.01 ms | 53033 tok/s) step 6144/76294 | train loss 3.326754 | norm 0.1629 | lr 9.94e-04 | (9877.65 ms | 53078 tok/s) step 6145/76294 | train loss 3.297433 | norm 0.1365 | lr 9.94e-04 | (9881.40 ms | 
53058 tok/s) step 6146/76294 | train loss 3.300806 | norm 0.1434 | lr 9.93e-04 | (9877.03 ms | 53082 tok/s) step 6147/76294 | train loss 3.297574 | norm 0.1728 | lr 9.93e-04 | (9878.85 ms | 53072 tok/s) step 6148/76294 | train loss 3.302496 | norm 0.1706 | lr 9.93e-04 | (9877.06 ms | 53081 tok/s) step 6149/76294 | train loss 3.344776 | norm 0.1498 | lr 9.93e-04 | (10950.71 ms | 47877 tok/s) step 6150/76294 | train loss 3.313060 | norm 0.1555 | lr 9.93e-04 | (9874.64 ms | 53094 tok/s) step 6151/76294 | train loss 3.319989 | norm 0.1416 | lr 9.93e-04 | (9881.50 ms | 53058 tok/s) step 6152/76294 | train loss 3.308877 | norm 0.1427 | lr 9.93e-04 | (9871.17 ms | 53113 tok/s) step 6153/76294 | train loss 3.263346 | norm 0.1471 | lr 9.93e-04 | (9883.99 ms | 53044 tok/s) step 6154/76294 | train loss 3.396266 | norm 0.1465 | lr 9.93e-04 | (9879.75 ms | 53067 tok/s) step 6155/76294 | train loss 3.365995 | norm 0.1528 | lr 9.93e-04 | (9877.58 ms | 53079 tok/s) step 6156/76294 | train loss 3.292634 | norm 0.1457 | lr 9.93e-04 | (9875.19 ms | 53091 tok/s) step 6157/76294 | train loss 3.241821 | norm 0.1345 | lr 9.93e-04 | (9925.84 ms | 52821 tok/s) step 6158/76294 | train loss 3.295389 | norm 0.1315 | lr 9.93e-04 | (9916.13 ms | 52872 tok/s) step 6159/76294 | train loss 3.375709 | norm 0.1292 | lr 9.93e-04 | (9876.21 ms | 53086 tok/s) step 6160/76294 | train loss 3.381257 | norm 0.1279 | lr 9.92e-04 | (9908.91 ms | 52911 tok/s) step 6161/76294 | train loss 3.381080 | norm 0.1399 | lr 9.92e-04 | (9880.65 ms | 53062 tok/s) step 6162/76294 | train loss 3.307854 | norm 0.1235 | lr 9.92e-04 | (9883.01 ms | 53049 tok/s) step 6163/76294 | train loss 3.331110 | norm 0.1268 | lr 9.92e-04 | (9885.25 ms | 53037 tok/s) step 6164/76294 | train loss 3.323655 | norm 0.1265 | lr 9.92e-04 | (9912.39 ms | 52892 tok/s) step 6165/76294 | train loss 3.368606 | norm 0.1431 | lr 9.92e-04 | (9882.87 ms | 53050 tok/s) step 6166/76294 | train loss 3.319739 | norm 0.1204 | lr 9.92e-04 | (9879.22 ms | 
53070 tok/s) step 6167/76294 | train loss 3.337497 | norm 0.1298 | lr 9.92e-04 | (9928.41 ms | 52807 tok/s) step 6168/76294 | train loss 3.336915 | norm 0.1450 | lr 9.92e-04 | (9878.33 ms | 53075 tok/s) step 6169/76294 | train loss 3.302534 | norm 0.1329 | lr 9.92e-04 | (9888.92 ms | 53018 tok/s) step 6170/76294 | train loss 3.306789 | norm 0.1253 | lr 9.92e-04 | (9883.47 ms | 53047 tok/s) step 6171/76294 | train loss 3.292706 | norm 0.1298 | lr 9.92e-04 | (9886.33 ms | 53032 tok/s) step 6172/76294 | train loss 3.378512 | norm 0.1180 | lr 9.92e-04 | (9883.64 ms | 53046 tok/s) step 6173/76294 | train loss 3.287948 | norm 0.1401 | lr 9.92e-04 | (9901.48 ms | 52950 tok/s) step 6174/76294 | train loss 3.283889 | norm 0.1580 | lr 9.91e-04 | (9892.69 ms | 52998 tok/s) step 6175/76294 | train loss 3.318758 | norm 0.1480 | lr 9.91e-04 | (9886.65 ms | 53030 tok/s) step 6176/76294 | train loss 3.363455 | norm 0.1648 | lr 9.91e-04 | (9879.56 ms | 53068 tok/s) step 6177/76294 | train loss 3.345896 | norm 0.1555 | lr 9.91e-04 | (9892.76 ms | 52997 tok/s) step 6178/76294 | train loss 3.359843 | norm 0.1535 | lr 9.91e-04 | (9882.27 ms | 53053 tok/s) step 6179/76294 | train loss 3.489460 | norm 0.1474 | lr 9.91e-04 | (9910.52 ms | 52902 tok/s) step 6180/76294 | train loss 3.353036 | norm 0.1810 | lr 9.91e-04 | (9881.68 ms | 53057 tok/s) step 6181/76294 | train loss 3.336890 | norm 0.1403 | lr 9.91e-04 | (9909.82 ms | 52906 tok/s) step 6182/76294 | train loss 3.369746 | norm 0.1424 | lr 9.91e-04 | (9884.63 ms | 53041 tok/s) step 6183/76294 | train loss 3.340549 | norm 0.1423 | lr 9.91e-04 | (9925.05 ms | 52825 tok/s) step 6184/76294 | train loss 3.343576 | norm 0.1441 | lr 9.91e-04 | (9881.09 ms | 53060 tok/s) step 6185/76294 | train loss 3.376089 | norm 0.1405 | lr 9.91e-04 | (9887.85 ms | 53023 tok/s) step 6186/76294 | train loss 3.552020 | norm 0.1682 | lr 9.91e-04 | (9888.23 ms | 53021 tok/s) step 6187/76294 | train loss 3.311860 | norm 0.1430 | lr 9.91e-04 | (9941.65 ms | 
52737 tok/s) step 6188/76294 | train loss 3.379806 | norm 0.1351 | lr 9.90e-04 | (9880.18 ms | 53065 tok/s) step 6189/76294 | train loss 3.369043 | norm 0.1426 | lr 9.90e-04 | (9884.91 ms | 53039 tok/s) step 6190/76294 | train loss 3.321283 | norm 0.1259 | lr 9.90e-04 | (9879.30 ms | 53069 tok/s) step 6191/76294 | train loss 3.444915 | norm 0.1644 | lr 9.90e-04 | (9885.77 ms | 53035 tok/s) step 6192/76294 | train loss 3.353434 | norm 0.1393 | lr 9.90e-04 | (9944.98 ms | 52719 tok/s) step 6193/76294 | train loss 3.327371 | norm 0.1386 | lr 9.90e-04 | (9883.57 ms | 53046 tok/s) step 6194/76294 | train loss 3.327695 | norm 0.1352 | lr 9.90e-04 | (9887.43 ms | 53026 tok/s) step 6195/76294 | train loss 3.230124 | norm 0.1422 | lr 9.90e-04 | (9905.69 ms | 52928 tok/s) step 6196/76294 | train loss 3.312710 | norm 0.1561 | lr 9.90e-04 | (9886.98 ms | 53028 tok/s) step 6197/76294 | train loss 3.315025 | norm 0.1378 | lr 9.90e-04 | (9881.40 ms | 53058 tok/s) step 6198/76294 | train loss 3.443059 | norm 0.1746 | lr 9.90e-04 | (9879.09 ms | 53070 tok/s) step 6199/76294 | train loss 3.273372 | norm 0.1712 | lr 9.90e-04 | (9894.89 ms | 52986 tok/s) step 6200/76294 | train loss 3.294568 | norm 0.1637 | lr 9.90e-04 | (9891.27 ms | 53005 tok/s) step 6201/76294 | train loss 3.342992 | norm 0.1401 | lr 9.90e-04 | (9918.90 ms | 52857 tok/s) step 6202/76294 | train loss 3.298713 | norm 0.1470 | lr 9.89e-04 | (9883.71 ms | 53046 tok/s) step 6203/76294 | train loss 3.335771 | norm 0.1320 | lr 9.89e-04 | (9890.14 ms | 53011 tok/s) step 6204/76294 | train loss 3.288958 | norm 0.1634 | lr 9.89e-04 | (9924.17 ms | 52829 tok/s) step 6205/76294 | train loss 3.333259 | norm 0.1381 | lr 9.89e-04 | (9898.65 ms | 52966 tok/s) step 6206/76294 | train loss 3.349970 | norm 0.1358 | lr 9.89e-04 | (9890.76 ms | 53008 tok/s) step 6207/76294 | train loss 3.323423 | norm 0.1415 | lr 9.89e-04 | (9948.42 ms | 52701 tok/s) step 6208/76294 | train loss 3.308801 | norm 0.1240 | lr 9.89e-04 | (9878.88 ms | 
53072 tok/s) step 6209/76294 | train loss 3.321065 | norm 0.1240 | lr 9.89e-04 | (9893.23 ms | 52995 tok/s) step 6210/76294 | train loss 3.349142 | norm 0.1301 | lr 9.89e-04 | (10231.15 ms | 51244 tok/s) step 6211/76294 | train loss 3.299665 | norm 0.1196 | lr 9.89e-04 | (9883.99 ms | 53044 tok/s) step 6212/76294 | train loss 3.221463 | norm 0.1316 | lr 9.89e-04 | (9957.87 ms | 52651 tok/s) step 6213/76294 | train loss 3.340791 | norm 0.1279 | lr 9.89e-04 | (9887.48 ms | 53025 tok/s) step 6214/76294 | train loss 3.280606 | norm 0.1309 | lr 9.89e-04 | (9883.60 ms | 53046 tok/s) step 6215/76294 | train loss 3.262670 | norm 0.1277 | lr 9.89e-04 | (9886.17 ms | 53032 tok/s) step 6216/76294 | train loss 3.287750 | norm 0.1451 | lr 9.88e-04 | (9883.56 ms | 53046 tok/s) step 6217/76294 | train loss 3.302373 | norm 0.1249 | lr 9.88e-04 | (9887.91 ms | 53023 tok/s) step 6218/76294 | train loss 3.305243 | norm 0.1362 | lr 9.88e-04 | (9879.64 ms | 53068 tok/s) step 6219/76294 | train loss 3.317595 | norm 0.1509 | lr 9.88e-04 | (9888.66 ms | 53019 tok/s) step 6220/76294 | train loss 3.351973 | norm 0.1570 | lr 9.88e-04 | (9877.03 ms | 53082 tok/s) step 6221/76294 | train loss 3.340938 | norm 0.1445 | lr 9.88e-04 | (9920.03 ms | 52851 tok/s) step 6222/76294 | train loss 3.260856 | norm 0.1460 | lr 9.88e-04 | (9876.97 ms | 53082 tok/s) step 6223/76294 | train loss 3.281380 | norm 0.1414 | lr 9.88e-04 | (9894.01 ms | 52990 tok/s) step 6224/76294 | train loss 3.351706 | norm 0.1377 | lr 9.88e-04 | (9899.14 ms | 52963 tok/s) step 6225/76294 | train loss 3.251211 | norm 0.1284 | lr 9.88e-04 | (9908.71 ms | 52912 tok/s) step 6226/76294 | train loss 3.321415 | norm 0.1429 | lr 9.88e-04 | (9877.21 ms | 53081 tok/s) step 6227/76294 | train loss 3.325386 | norm 0.1339 | lr 9.88e-04 | (9883.42 ms | 53047 tok/s) step 6228/76294 | train loss 3.240512 | norm 0.1344 | lr 9.88e-04 | (9877.57 ms | 53079 tok/s) step 6229/76294 | train loss 3.293622 | norm 0.1380 | lr 9.88e-04 | (9879.19 ms | 
53070 tok/s) step 6230/76294 | train loss 3.383847 | norm 0.1239 | lr 9.87e-04 | (9871.52 ms | 53111 tok/s) step 6231/76294 | train loss 3.320882 | norm 0.1533 | lr 9.87e-04 | (9882.29 ms | 53053 tok/s) step 6232/76294 | train loss 3.310685 | norm 0.1433 | lr 9.87e-04 | (9874.75 ms | 53094 tok/s) step 6233/76294 | train loss 3.284997 | norm 0.1332 | lr 9.87e-04 | (9882.98 ms | 53050 tok/s) step 6234/76294 | train loss 3.299281 | norm 0.1258 | lr 9.87e-04 | (9877.82 ms | 53077 tok/s) step 6235/76294 | train loss 3.335840 | norm 0.1222 | lr 9.87e-04 | (9880.83 ms | 53061 tok/s) step 6236/76294 | train loss 3.341786 | norm 0.1218 | lr 9.87e-04 | (9871.63 ms | 53111 tok/s) step 6237/76294 | train loss 3.348264 | norm 0.1251 | lr 9.87e-04 | (9879.03 ms | 53071 tok/s) step 6238/76294 | train loss 3.391872 | norm 0.1407 | lr 9.87e-04 | (9874.16 ms | 53097 tok/s) step 6239/76294 | train loss 3.550315 | norm 0.1439 | lr 9.87e-04 | (9882.67 ms | 53051 tok/s) step 6240/76294 | train loss 3.301664 | norm 0.1352 | lr 9.87e-04 | (9931.95 ms | 52788 tok/s) step 6241/76294 | train loss 3.238579 | norm 0.1383 | lr 9.87e-04 | (9876.87 ms | 53082 tok/s) step 6242/76294 | train loss 3.337197 | norm 0.1293 | lr 9.87e-04 | (9880.21 ms | 53064 tok/s) step 6243/76294 | train loss 3.303075 | norm 0.1337 | lr 9.87e-04 | (9877.12 ms | 53081 tok/s) step 6244/76294 | train loss 3.274256 | norm 0.1320 | lr 9.86e-04 | (9870.45 ms | 53117 tok/s) step 6245/76294 | train loss 3.279077 | norm 0.1286 | lr 9.86e-04 | (9934.16 ms | 52776 tok/s) step 6246/76294 | train loss 3.270446 | norm 0.1380 | lr 9.86e-04 | (9873.54 ms | 53100 tok/s) step 6247/76294 | train loss 3.328199 | norm 0.1350 | lr 9.86e-04 | (11123.94 ms | 47131 tok/s) step 6248/76294 | train loss 3.328140 | norm 0.1527 | lr 9.86e-04 | (9891.89 ms | 53002 tok/s) step 6249/76294 | train loss 3.278789 | norm 0.1285 | lr 9.86e-04 | (9871.91 ms | 53109 tok/s) step 6250/76294 | train loss 3.282456 | norm 0.1412 | lr 9.86e-04 | (9867.45 ms | 
53133 tok/s) val loss: 3.305677 saving model checkpoint to ./results/gpt2-350M-gqa/step_6250.pth step 6251/76294 | train loss 3.307585 | norm 0.1249 | lr 9.86e-04 | (9919.42 ms | 52855 tok/s) step 6252/76294 | train loss 3.289866 | norm 0.1297 | lr 9.86e-04 | (9842.02 ms | 53270 tok/s) step 6253/76294 | train loss 3.269862 | norm 0.1370 | lr 9.86e-04 | (9845.29 ms | 53253 tok/s) step 6254/76294 | train loss 3.305762 | norm 0.1374 | lr 9.86e-04 | (9849.40 ms | 53230 tok/s) step 6255/76294 | train loss 3.312811 | norm 0.1210 | lr 9.86e-04 | (9886.67 ms | 53030 tok/s) step 6256/76294 | train loss 3.336723 | norm 0.1368 | lr 9.86e-04 | (9975.25 ms | 52559 tok/s) step 6257/76294 | train loss 3.246413 | norm 0.1460 | lr 9.86e-04 | (9864.09 ms | 53151 tok/s) step 6258/76294 | train loss 3.302376 | norm 0.1605 | lr 9.85e-04 | (9891.53 ms | 53004 tok/s) step 6259/76294 | train loss 3.240950 | norm 0.1426 | lr 9.85e-04 | (9870.07 ms | 53119 tok/s) step 6260/76294 | train loss 3.322235 | norm 0.1623 | lr 9.85e-04 | (9880.03 ms | 53065 tok/s) step 6261/76294 | train loss 3.225992 | norm 0.1465 | lr 9.85e-04 | (9880.76 ms | 53062 tok/s) step 6262/76294 | train loss 3.250469 | norm 0.1445 | lr 9.85e-04 | (9873.64 ms | 53100 tok/s) step 6263/76294 | train loss 3.347875 | norm 0.1411 | lr 9.85e-04 | (9879.05 ms | 53071 tok/s) step 6264/76294 | train loss 3.284370 | norm 0.1682 | lr 9.85e-04 | (9880.20 ms | 53065 tok/s) step 6265/76294 | train loss 3.298352 | norm 0.1585 | lr 9.85e-04 | (9913.88 ms | 52884 tok/s) step 6266/76294 | train loss 3.294735 | norm 0.1623 | lr 9.85e-04 | (9880.28 ms | 53064 tok/s) step 6267/76294 | train loss 3.332162 | norm 0.1312 | lr 9.85e-04 | (9890.09 ms | 53011 tok/s) step 6268/76294 | train loss 3.289151 | norm 0.1829 | lr 9.85e-04 | (9878.96 ms | 53071 tok/s) step 6269/76294 | train loss 3.318800 | norm 0.1530 | lr 9.85e-04 | (10763.27 ms | 48711 tok/s) step 6270/76294 | train loss 3.292444 | norm 0.1369 | lr 9.85e-04 | (9887.91 ms | 53023 tok/s) 
step 6271/76294 | train loss 3.295878 | norm 0.1542 | lr 9.85e-04 | (9938.24 ms | 52755 tok/s) step 6272/76294 | train loss 3.267905 | norm 0.1475 | lr 9.84e-04 | (9914.05 ms | 52883 tok/s) step 6273/76294 | train loss 3.337004 | norm 0.1332 | lr 9.84e-04 | (9882.51 ms | 53052 tok/s) step 6274/76294 | train loss 3.267627 | norm 0.1363 | lr 9.84e-04 | (9889.71 ms | 53013 tok/s) step 6275/76294 | train loss 3.336009 | norm 0.1313 | lr 9.84e-04 | (9912.26 ms | 52893 tok/s) step 6276/76294 | train loss 3.311401 | norm 0.1363 | lr 9.84e-04 | (9881.39 ms | 53058 tok/s) step 6277/76294 | train loss 3.326708 | norm 0.1470 | lr 9.84e-04 | (9906.03 ms | 52926 tok/s) step 6278/76294 | train loss 3.302118 | norm 0.1310 | lr 9.84e-04 | (12656.22 ms | 41425 tok/s) step 6279/76294 | train loss 3.303374 | norm 0.1765 | lr 9.84e-04 | (9886.44 ms | 53031 tok/s) step 6280/76294 | train loss 3.251632 | norm 0.1360 | lr 9.84e-04 | (9870.88 ms | 53115 tok/s) step 6281/76294 | train loss 3.273127 | norm 0.1594 | lr 9.84e-04 | (9884.46 ms | 53042 tok/s) step 6282/76294 | train loss 3.325724 | norm 0.1440 | lr 9.84e-04 | (9875.92 ms | 53088 tok/s) step 6283/76294 | train loss 3.421788 | norm 0.1712 | lr 9.84e-04 | (9890.08 ms | 53011 tok/s) step 6284/76294 | train loss 3.302493 | norm 0.1754 | lr 9.84e-04 | (9884.00 ms | 53044 tok/s) step 6285/76294 | train loss 3.323095 | norm 0.1436 | lr 9.84e-04 | (9894.62 ms | 52987 tok/s) step 6286/76294 | train loss 3.244932 | norm 0.1500 | lr 9.83e-04 | (9885.83 ms | 53034 tok/s) step 6287/76294 | train loss 3.311509 | norm 0.1470 | lr 9.83e-04 | (9895.02 ms | 52985 tok/s) step 6288/76294 | train loss 3.253503 | norm 0.1539 | lr 9.83e-04 | (9887.39 ms | 53026 tok/s) step 6289/76294 | train loss 3.324718 | norm 0.1461 | lr 9.83e-04 | (9898.06 ms | 52969 tok/s) step 6290/76294 | train loss 3.312263 | norm 0.1427 | lr 9.83e-04 | (9890.51 ms | 53009 tok/s) step 6291/76294 | train loss 3.309710 | norm 0.1519 | lr 9.83e-04 | (9900.40 ms | 52956 tok/s) 
step 6292/76294 | train loss 3.270333 | norm 0.1417 | lr 9.83e-04 | (9890.11 ms | 53011 tok/s) step 6293/76294 | train loss 3.292220 | norm 0.1477 | lr 9.83e-04 | (9908.22 ms | 52914 tok/s) step 6294/76294 | train loss 3.231699 | norm 0.1393 | lr 9.83e-04 | (9893.22 ms | 52995 tok/s) step 6295/76294 | train loss 3.320375 | norm 0.1345 | lr 9.83e-04 | (9910.54 ms | 52902 tok/s) step 6296/76294 | train loss 3.345905 | norm 0.1400 | lr 9.83e-04 | (9893.66 ms | 52992 tok/s) step 6297/76294 | train loss 3.303741 | norm 0.1329 | lr 9.83e-04 | (9890.94 ms | 53007 tok/s) step 6298/76294 | train loss 3.243556 | norm 0.1525 | lr 9.83e-04 | (10002.27 ms | 52417 tok/s) step 6299/76294 | train loss 3.232293 | norm 0.1506 | lr 9.83e-04 | (9961.67 ms | 52631 tok/s) step 6300/76294 | train loss 3.290802 | norm 0.1217 | lr 9.82e-04 | (9888.80 ms | 53018 tok/s) step 6301/76294 | train loss 3.298349 | norm 0.1571 | lr 9.82e-04 | (9931.99 ms | 52788 tok/s) step 6302/76294 | train loss 3.296529 | norm 0.1337 | lr 9.82e-04 | (9889.94 ms | 53012 tok/s) step 6303/76294 | train loss 3.292320 | norm 0.1333 | lr 9.82e-04 | (9893.04 ms | 52996 tok/s) step 6304/76294 | train loss 3.274867 | norm 0.1297 | lr 9.82e-04 | (9906.83 ms | 52922 tok/s) step 6305/76294 | train loss 3.298507 | norm 0.1336 | lr 9.82e-04 | (9939.13 ms | 52750 tok/s) step 6306/76294 | train loss 3.370281 | norm 0.1790 | lr 9.82e-04 | (9893.29 ms | 52994 tok/s) step 6307/76294 | train loss 3.338184 | norm 0.1883 | lr 9.82e-04 | (9903.11 ms | 52942 tok/s) step 6308/76294 | train loss 3.336523 | norm 0.1426 | lr 9.82e-04 | (9950.84 ms | 52688 tok/s) step 6309/76294 | train loss 3.358067 | norm 0.2065 | lr 9.82e-04 | (9905.43 ms | 52929 tok/s) step 6310/76294 | train loss 3.272028 | norm 0.2101 | lr 9.82e-04 | (9952.21 ms | 52681 tok/s) step 6311/76294 | train loss 3.318026 | norm 0.1630 | lr 9.82e-04 | (9896.63 ms | 52976 tok/s) step 6312/76294 | train loss 3.296466 | norm 0.1672 | lr 9.82e-04 | (9895.40 ms | 52983 tok/s) 
step 6313/76294 | train loss 3.279925 | norm 0.1392 | lr 9.82e-04 | (9924.10 ms | 52830 tok/s) step 6314/76294 | train loss 3.281020 | norm 0.1586 | lr 9.81e-04 | (9891.34 ms | 53005 tok/s) step 6315/76294 | train loss 3.251291 | norm 0.1444 | lr 9.81e-04 | (9901.90 ms | 52948 tok/s) step 6316/76294 | train loss 3.298011 | norm 0.1662 | lr 9.81e-04 | (9935.91 ms | 52767 tok/s) step 6317/76294 | train loss 3.289633 | norm 0.1428 | lr 9.81e-04 | (9896.55 ms | 52977 tok/s) step 6318/76294 | train loss 3.252247 | norm 0.1620 | lr 9.81e-04 | (9900.46 ms | 52956 tok/s) step 6319/76294 | train loss 3.403350 | norm 0.1503 | lr 9.81e-04 | (9897.94 ms | 52969 tok/s) step 6320/76294 | train loss 3.219619 | norm 0.2030 | lr 9.81e-04 | (9904.15 ms | 52936 tok/s) step 6321/76294 | train loss 3.302031 | norm 0.1500 | lr 9.81e-04 | (9964.26 ms | 52617 tok/s) step 6322/76294 | train loss 3.289256 | norm 0.1466 | lr 9.81e-04 | (9898.63 ms | 52966 tok/s) step 6323/76294 | train loss 3.319827 | norm 0.1570 | lr 9.81e-04 | (10012.26 ms | 52365 tok/s) step 6324/76294 | train loss 3.236001 | norm 0.1403 | lr 9.81e-04 | (9893.00 ms | 52996 tok/s) step 6325/76294 | train loss 3.381191 | norm 0.1752 | lr 9.81e-04 | (9959.14 ms | 52644 tok/s) step 6326/76294 | train loss 3.227624 | norm 0.1562 | lr 9.81e-04 | (9899.23 ms | 52962 tok/s) step 6327/76294 | train loss 3.297530 | norm 0.1640 | lr 9.80e-04 | (9901.62 ms | 52950 tok/s) step 6328/76294 | train loss 3.250533 | norm 0.1580 | lr 9.80e-04 | (9928.16 ms | 52808 tok/s) step 6329/76294 | train loss 3.331923 | norm 0.1431 | lr 9.80e-04 | (9894.55 ms | 52988 tok/s) step 6330/76294 | train loss 3.265324 | norm 0.1476 | lr 9.80e-04 | (9940.01 ms | 52745 tok/s) step 6331/76294 | train loss 3.337859 | norm 0.1475 | lr 9.80e-04 | (9895.29 ms | 52984 tok/s) step 6332/76294 | train loss 3.272565 | norm 0.1349 | lr 9.80e-04 | (9898.26 ms | 52968 tok/s) step 6333/76294 | train loss 3.329646 | norm 0.1487 | lr 9.80e-04 | (9897.73 ms | 52971 tok/s) 
step 6334/76294 | train loss 3.257020 | norm 0.1406 | lr 9.80e-04 | (9919.05 ms | 52857 tok/s) step 6335/76294 | train loss 3.280299 | norm 0.1200 | lr 9.80e-04 | (9903.09 ms | 52942 tok/s) step 6336/76294 | train loss 3.285132 | norm 0.1252 | lr 9.80e-04 | (9898.19 ms | 52968 tok/s) step 6337/76294 | train loss 3.297795 | norm 0.1226 | lr 9.80e-04 | (9895.14 ms | 52984 tok/s) step 6338/76294 | train loss 3.276897 | norm 0.1206 | lr 9.80e-04 | (9890.37 ms | 53010 tok/s) step 6339/76294 | train loss 3.283177 | norm 0.1247 | lr 9.80e-04 | (9901.40 ms | 52951 tok/s) step 6340/76294 | train loss 3.282232 | norm 0.1264 | lr 9.80e-04 | (9892.40 ms | 52999 tok/s) step 6341/76294 | train loss 3.255222 | norm 0.1312 | lr 9.79e-04 | (9899.54 ms | 52961 tok/s) step 6342/76294 | train loss 3.295072 | norm 0.1367 | lr 9.79e-04 | (9887.79 ms | 53024 tok/s) step 6343/76294 | train loss 3.294231 | norm 0.1403 | lr 9.79e-04 | (9899.76 ms | 52960 tok/s) step 6344/76294 | train loss 3.321673 | norm 0.1493 | lr 9.79e-04 | (9892.74 ms | 52997 tok/s) step 6345/76294 | train loss 3.274888 | norm 0.1358 | lr 9.79e-04 | (12588.62 ms | 41648 tok/s) step 6346/76294 | train loss 3.285612 | norm 0.1408 | lr 9.79e-04 | (9878.13 ms | 53076 tok/s) step 6347/76294 | train loss 3.238878 | norm 0.1292 | lr 9.79e-04 | (9889.73 ms | 53013 tok/s) step 6348/76294 | train loss 3.344327 | norm 0.1277 | lr 9.79e-04 | (9894.31 ms | 52989 tok/s) step 6349/76294 | train loss 3.278529 | norm 0.1418 | lr 9.79e-04 | (9885.53 ms | 53036 tok/s) step 6350/76294 | train loss 3.359709 | norm 0.1337 | lr 9.79e-04 | (9887.57 ms | 53025 tok/s) step 6351/76294 | train loss 3.241765 | norm 0.1333 | lr 9.79e-04 | (9902.81 ms | 52943 tok/s) step 6352/76294 | train loss 3.356169 | norm 0.1270 | lr 9.79e-04 | (9887.71 ms | 53024 tok/s) step 6353/76294 | train loss 3.214268 | norm 0.1296 | lr 9.79e-04 | (9898.96 ms | 52964 tok/s) step 6354/76294 | train loss 3.364203 | norm 0.1503 | lr 9.79e-04 | (9925.69 ms | 52821 tok/s) 
step 6355/76294 | train loss 3.186232 | norm 0.1325 | lr 9.78e-04 | (9891.95 ms | 53001 tok/s) step 6356/76294 | train loss 3.428689 | norm 0.1633 | lr 9.78e-04 | (9897.15 ms | 52974 tok/s) step 6357/76294 | train loss 3.335268 | norm 0.1525 | lr 9.78e-04 | (9924.41 ms | 52828 tok/s) step 6358/76294 | train loss 3.290590 | norm 0.1536 | lr 9.78e-04 | (9916.78 ms | 52869 tok/s) step 6359/76294 | train loss 3.289686 | norm 0.1664 | lr 9.78e-04 | (9942.89 ms | 52730 tok/s) step 6360/76294 | train loss 3.253147 | norm 0.1459 | lr 9.78e-04 | (9892.87 ms | 52997 tok/s) step 6361/76294 | train loss 3.225920 | norm 0.1742 | lr 9.78e-04 | (9896.38 ms | 52978 tok/s) step 6362/76294 | train loss 3.246743 | norm 0.1646 | lr 9.78e-04 | (9977.73 ms | 52546 tok/s) step 6363/76294 | train loss 3.323640 | norm 0.1626 | lr 9.78e-04 | (9964.10 ms | 52618 tok/s) step 6364/76294 | train loss 3.234234 | norm 0.1740 | lr 9.78e-04 | (9896.24 ms | 52979 tok/s) step 6365/76294 | train loss 3.379658 | norm 0.1739 | lr 9.78e-04 | (9925.00 ms | 52825 tok/s) step 6366/76294 | train loss 3.248062 | norm 0.1494 | lr 9.78e-04 | (9985.95 ms | 52503 tok/s) step 6367/76294 | train loss 3.355892 | norm 0.1629 | lr 9.78e-04 | (9898.43 ms | 52967 tok/s) step 6368/76294 | train loss 3.290616 | norm 0.1657 | lr 9.78e-04 | (9927.87 ms | 52810 tok/s) step 6369/76294 | train loss 3.298595 | norm 0.1466 | lr 9.77e-04 | (9898.36 ms | 52967 tok/s) step 6370/76294 | train loss 3.257935 | norm 0.1542 | lr 9.77e-04 | (9895.46 ms | 52983 tok/s) step 6371/76294 | train loss 3.316022 | norm 0.1358 | lr 9.77e-04 | (9932.73 ms | 52784 tok/s) step 6372/76294 | train loss 3.254296 | norm 0.1643 | lr 9.77e-04 | (9902.84 ms | 52943 tok/s) step 6373/76294 | train loss 3.267821 | norm 0.1687 | lr 9.77e-04 | (9905.90 ms | 52927 tok/s) step 6374/76294 | train loss 3.355526 | norm 0.1441 | lr 9.77e-04 | (9954.78 ms | 52667 tok/s) step 6375/76294 | train loss 3.193563 | norm 0.1653 | lr 9.77e-04 | (9902.04 ms | 52947 tok/s) step 
6376/76294 | train loss 3.382569 | norm 0.1492 | lr 9.77e-04 | (9892.78 ms | 52997 tok/s) step 6377/76294 | train loss 3.226825 | norm 0.1573 | lr 9.77e-04 | (9945.65 ms | 52715 tok/s) step 6378/76294 | train loss 3.341406 | norm 0.1432 | lr 9.77e-04 | (9894.80 ms | 52986 tok/s) step 6379/76294 | train loss 3.295524 | norm 0.1608 | lr 9.77e-04 | (9935.34 ms | 52770 tok/s) step 6380/76294 | train loss 3.399669 | norm 0.1352 | lr 9.77e-04 | (9894.27 ms | 52989 tok/s) step 6381/76294 | train loss 3.294446 | norm 0.1552 | lr 9.77e-04 | (9913.77 ms | 52885 tok/s) step 6382/76294 | train loss 3.284820 | norm 0.1388 | lr 9.76e-04 | (9893.76 ms | 52992 tok/s) step 6383/76294 | train loss 3.214039 | norm 0.2174 | lr 9.76e-04 | (9896.34 ms | 52978 tok/s) step 6384/76294 | train loss 3.398466 | norm 0.2065 | lr 9.76e-04 | (9920.74 ms | 52848 tok/s) step 6385/76294 | train loss 3.246312 | norm 0.1800 | lr 9.76e-04 | (9894.16 ms | 52990 tok/s) step 6386/76294 | train loss 3.288287 | norm 0.1766 | lr 9.76e-04 | (9900.54 ms | 52955 tok/s) step 6387/76294 | train loss 3.267597 | norm 0.1776 | lr 9.76e-04 | (9936.21 ms | 52765 tok/s) step 6388/76294 | train loss 3.247468 | norm 0.1757 | lr 9.76e-04 | (9891.45 ms | 53004 tok/s) step 6389/76294 | train loss 3.306059 | norm 0.1379 | lr 9.76e-04 | (9962.36 ms | 52627 tok/s) step 6390/76294 | train loss 3.293558 | norm 0.1551 | lr 9.76e-04 | (9889.97 ms | 53012 tok/s) step 6391/76294 | train loss 3.298966 | norm 0.1482 | lr 9.76e-04 | (9902.80 ms | 52943 tok/s) step 6392/76294 | train loss 3.379008 | norm 0.2003 | lr 9.76e-04 | (9896.74 ms | 52976 tok/s) step 6393/76294 | train loss 3.336004 | norm 0.1670 | lr 9.76e-04 | (9896.85 ms | 52975 tok/s) step 6394/76294 | train loss 3.252789 | norm 0.1619 | lr 9.76e-04 | (9972.20 ms | 52575 tok/s) step 6395/76294 | train loss 3.339489 | norm 0.1334 | lr 9.76e-04 | (10044.62 ms | 52196 tok/s) step 6396/76294 | train loss 3.233906 | norm 0.1463 | lr 9.75e-04 | (9888.54 ms | 53020 tok/s) step 
6397/76294 | train loss 3.475412 | norm 0.1409 | lr 9.75e-04 | (9896.02 ms | 52980 tok/s) step 6398/76294 | train loss 3.274909 | norm 0.1505 | lr 9.75e-04 | (9934.43 ms | 52775 tok/s) step 6399/76294 | train loss 3.331504 | norm 0.1369 | lr 9.75e-04 | (9894.56 ms | 52987 tok/s) step 6400/76294 | train loss 3.253224 | norm 0.1470 | lr 9.75e-04 | (10695.32 ms | 49020 tok/s) step 6401/76294 | train loss 3.377221 | norm 0.1595 | lr 9.75e-04 | (9885.00 ms | 53039 tok/s) step 6402/76294 | train loss 3.345863 | norm 0.1440 | lr 9.75e-04 | (9889.07 ms | 53017 tok/s) step 6403/76294 | train loss 3.325204 | norm 0.1549 | lr 9.75e-04 | (9890.47 ms | 53009 tok/s) step 6404/76294 | train loss 3.315166 | norm 0.1406 | lr 9.75e-04 | (9893.49 ms | 52993 tok/s) step 6405/76294 | train loss 3.461934 | norm 0.1434 | lr 9.75e-04 | (9938.51 ms | 52753 tok/s) step 6406/76294 | train loss 3.381408 | norm 0.1793 | lr 9.75e-04 | (9888.09 ms | 53022 tok/s) step 6407/76294 | train loss 3.321088 | norm 0.1239 | lr 9.75e-04 | (9900.61 ms | 52955 tok/s) step 6408/76294 | train loss 3.306632 | norm 0.1376 | lr 9.75e-04 | (9887.82 ms | 53024 tok/s) step 6409/76294 | train loss 3.356127 | norm 0.1321 | lr 9.75e-04 | (9894.86 ms | 52986 tok/s) step 6410/76294 | train loss 3.311747 | norm 0.1541 | lr 9.74e-04 | (9951.83 ms | 52683 tok/s) step 6411/76294 | train loss 3.336383 | norm 0.1578 | lr 9.74e-04 | (9887.83 ms | 53024 tok/s) step 6412/76294 | train loss 3.349355 | norm 0.1470 | lr 9.74e-04 | (9882.11 ms | 53054 tok/s) step 6413/76294 | train loss 3.334257 | norm 0.1447 | lr 9.74e-04 | (9891.93 ms | 53002 tok/s) step 6414/76294 | train loss 3.343657 | norm 0.1363 | lr 9.74e-04 | (9925.51 ms | 52822 tok/s) step 6415/76294 | train loss 3.348091 | norm 0.1386 | lr 9.74e-04 | (9888.30 ms | 53021 tok/s) step 6416/76294 | train loss 3.390463 | norm 0.1363 | lr 9.74e-04 | (9895.05 ms | 52985 tok/s) step 6417/76294 | train loss 3.326206 | norm 0.1369 | lr 9.74e-04 | (9887.24 ms | 53027 tok/s) step 
6418/76294 | train loss 3.309786 | norm 0.1396 | lr 9.74e-04 | (9892.15 ms | 53000 tok/s) step 6419/76294 | train loss 3.260110 | norm 0.1414 | lr 9.74e-04 | (9889.60 ms | 53014 tok/s) step 6420/76294 | train loss 3.303463 | norm 0.1325 | lr 9.74e-04 | (9892.31 ms | 53000 tok/s) step 6421/76294 | train loss 3.273582 | norm 0.1622 | lr 9.74e-04 | (9910.10 ms | 52904 tok/s) step 6422/76294 | train loss 3.347131 | norm 0.1274 | lr 9.74e-04 | (9890.65 ms | 53008 tok/s) step 6423/76294 | train loss 3.249507 | norm 0.1507 | lr 9.73e-04 | (9939.46 ms | 52748 tok/s) step 6424/76294 | train loss 3.243582 | norm 0.2421 | lr 9.73e-04 | (9883.36 ms | 53048 tok/s) step 6425/76294 | train loss 3.305572 | norm 0.1839 | lr 9.73e-04 | (9893.62 ms | 52993 tok/s) step 6426/76294 | train loss 3.257663 | norm 0.1514 | lr 9.73e-04 | (9883.33 ms | 53048 tok/s) step 6427/76294 | train loss 3.450624 | norm 0.1767 | lr 9.73e-04 | (9894.88 ms | 52986 tok/s) step 6428/76294 | train loss 3.289581 | norm 0.1627 | lr 9.73e-04 | (9887.82 ms | 53024 tok/s) step 6429/76294 | train loss 3.287116 | norm 0.1891 | lr 9.73e-04 | (9937.23 ms | 52760 tok/s) step 6430/76294 | train loss 3.297586 | norm 0.1916 | lr 9.73e-04 | (9883.29 ms | 53048 tok/s) step 6431/76294 | train loss 3.277448 | norm 0.1350 | lr 9.73e-04 | (9888.14 ms | 53022 tok/s) step 6432/76294 | train loss 3.278924 | norm 0.1565 | lr 9.73e-04 | (9930.86 ms | 52794 tok/s) step 6433/76294 | train loss 3.259649 | norm 0.1564 | lr 9.73e-04 | (9890.70 ms | 53008 tok/s) step 6434/76294 | train loss 3.301133 | norm 0.1449 | lr 9.73e-04 | (9899.36 ms | 52962 tok/s) step 6435/76294 | train loss 3.232048 | norm 0.1488 | lr 9.73e-04 | (9892.30 ms | 53000 tok/s) step 6436/76294 | train loss 3.253371 | norm 0.1493 | lr 9.73e-04 | (9897.83 ms | 52970 tok/s) step 6437/76294 | train loss 3.292283 | norm 0.1346 | lr 9.72e-04 | (9965.27 ms | 52612 tok/s) step 6438/76294 | train loss 3.282850 | norm 0.1481 | lr 9.72e-04 | (9885.68 ms | 53035 tok/s) step 
6439/76294 | train loss 3.314482 | norm 0.1467 | lr 9.72e-04 | (9929.50 ms | 52801 tok/s) step 6440/76294 | train loss 3.269316 | norm 0.1410 | lr 9.72e-04 | (9889.89 ms | 53012 tok/s) step 6441/76294 | train loss 3.285687 | norm 0.1339 | lr 9.72e-04 | (9902.67 ms | 52944 tok/s) step 6442/76294 | train loss 3.291905 | norm 0.1403 | lr 9.72e-04 | (11072.26 ms | 47351 tok/s) step 6443/76294 | train loss 3.295686 | norm 0.1429 | lr 9.72e-04 | (9878.25 ms | 53075 tok/s) step 6444/76294 | train loss 3.314969 | norm 0.1493 | lr 9.72e-04 | (9902.73 ms | 52944 tok/s) step 6445/76294 | train loss 3.253773 | norm 0.1407 | lr 9.72e-04 | (9891.36 ms | 53005 tok/s) step 6446/76294 | train loss 3.282212 | norm 0.1288 | lr 9.72e-04 | (9888.61 ms | 53019 tok/s) step 6447/76294 | train loss 3.348929 | norm 0.1368 | lr 9.72e-04 | (9887.81 ms | 53024 tok/s) step 6448/76294 | train loss 3.242242 | norm 0.1424 | lr 9.72e-04 | (9888.85 ms | 53018 tok/s) step 6449/76294 | train loss 3.301959 | norm 0.1435 | lr 9.72e-04 | (9976.70 ms | 52551 tok/s) step 6450/76294 | train loss 3.337906 | norm 0.1497 | lr 9.72e-04 | (9888.35 ms | 53021 tok/s) step 6451/76294 | train loss 3.297590 | norm 0.1626 | lr 9.71e-04 | (9888.17 ms | 53022 tok/s) step 6452/76294 | train loss 3.354715 | norm 0.1402 | lr 9.71e-04 | (9918.67 ms | 52859 tok/s) step 6453/76294 | train loss 3.271541 | norm 0.1377 | lr 9.71e-04 | (9890.13 ms | 53011 tok/s) step 6454/76294 | train loss 3.202796 | norm 0.1367 | lr 9.71e-04 | (9935.62 ms | 52769 tok/s) step 6455/76294 | train loss 3.259665 | norm 0.1459 | lr 9.71e-04 | (9894.30 ms | 52989 tok/s) step 6456/76294 | train loss 3.253676 | norm 0.1501 | lr 9.71e-04 | (9894.61 ms | 52987 tok/s) step 6457/76294 | train loss 3.234675 | norm 0.1568 | lr 9.71e-04 | (9927.22 ms | 52813 tok/s) step 6458/76294 | train loss 3.414283 | norm 0.1422 | lr 9.71e-04 | (9883.85 ms | 53045 tok/s) step 6459/76294 | train loss 3.279971 | norm 0.1309 | lr 9.71e-04 | (9897.03 ms | 52974 tok/s) step 
6460/76294 | train loss 3.286631 | norm 0.1370 | lr 9.71e-04 | (9888.46 ms | 53020 tok/s) step 6461/76294 | train loss 3.269658 | norm 0.1395 | lr 9.71e-04 | (9904.67 ms | 52933 tok/s) step 6462/76294 | train loss 3.320016 | norm 0.1437 | lr 9.71e-04 | (9915.04 ms | 52878 tok/s) step 6463/76294 | train loss 3.324043 | norm 0.1463 | lr 9.71e-04 | (9897.74 ms | 52970 tok/s) step 6464/76294 | train loss 3.269123 | norm 0.1278 | lr 9.70e-04 | (9891.18 ms | 53006 tok/s) step 6465/76294 | train loss 3.266778 | norm 0.1318 | lr 9.70e-04 | (9900.17 ms | 52957 tok/s) step 6466/76294 | train loss 3.263084 | norm 0.1410 | lr 9.70e-04 | (9892.42 ms | 52999 tok/s) step 6467/76294 | train loss 3.281095 | norm 0.1384 | lr 9.70e-04 | (9897.93 ms | 52969 tok/s) step 6468/76294 | train loss 3.325708 | norm 0.1420 | lr 9.70e-04 | (9895.18 ms | 52984 tok/s) step 6469/76294 | train loss 3.286130 | norm 0.1278 | lr 9.70e-04 | (9900.91 ms | 52954 tok/s) step 6470/76294 | train loss 3.304486 | norm 0.1389 | lr 9.70e-04 | (9894.58 ms | 52987 tok/s) step 6471/76294 | train loss 3.303808 | norm 0.1210 | lr 9.70e-04 | (9938.62 ms | 52753 tok/s) step 6472/76294 | train loss 3.279187 | norm 0.1360 | lr 9.70e-04 | (9887.53 ms | 53025 tok/s) step 6473/76294 | train loss 3.313975 | norm 0.1318 | lr 9.70e-04 | (9898.19 ms | 52968 tok/s) step 6474/76294 | train loss 3.277589 | norm 0.1211 | lr 9.70e-04 | (9962.97 ms | 52624 tok/s) step 6475/76294 | train loss 3.305231 | norm 0.1343 | lr 9.70e-04 | (9920.01 ms | 52852 tok/s) step 6476/76294 | train loss 3.345496 | norm 0.1245 | lr 9.70e-04 | (9887.48 ms | 53025 tok/s) step 6477/76294 | train loss 3.322464 | norm 0.1246 | lr 9.70e-04 | (9891.18 ms | 53006 tok/s) step 6478/76294 | train loss 3.318029 | norm 0.1204 | lr 9.69e-04 | (9951.09 ms | 52686 tok/s) step 6479/76294 | train loss 3.209637 | norm 0.1360 | lr 9.69e-04 | (9901.45 ms | 52951 tok/s) step 6480/76294 | train loss 3.361423 | norm 0.1337 | lr 9.69e-04 | (9895.73 ms | 52981 tok/s) step 
6481/76294 | train loss 3.279402 | norm 0.1597 | lr 9.69e-04 | (9926.26 ms | 52818 tok/s) step 6482/76294 | train loss 3.280607 | norm 0.1376 | lr 9.69e-04 | (9891.80 ms | 53002 tok/s) step 6483/76294 | train loss 3.266109 | norm 0.1400 | lr 9.69e-04 | (9901.97 ms | 52948 tok/s) step 6484/76294 | train loss 3.200947 | norm 0.1562 | lr 9.69e-04 | (9891.37 ms | 53005 tok/s) step 6485/76294 | train loss 3.353476 | norm 0.1344 | lr 9.69e-04 | (9896.61 ms | 52977 tok/s) step 6486/76294 | train loss 3.301121 | norm 0.1509 | lr 9.69e-04 | (9892.55 ms | 52998 tok/s) step 6487/76294 | train loss 3.279654 | norm 0.1686 | lr 9.69e-04 | (9893.38 ms | 52994 tok/s) step 6488/76294 | train loss 3.324263 | norm 0.1341 | lr 9.69e-04 | (9897.33 ms | 52973 tok/s) step 6489/76294 | train loss 3.338467 | norm 0.1575 | lr 9.69e-04 | (9905.33 ms | 52930 tok/s) step 6490/76294 | train loss 3.312000 | norm 0.1365 | lr 9.69e-04 | (9933.45 ms | 52780 tok/s) step 6491/76294 | train loss 3.244330 | norm 0.1450 | lr 9.68e-04 | (9893.98 ms | 52991 tok/s) step 6492/76294 | train loss 3.279319 | norm 0.1327 | lr 9.68e-04 | (9900.85 ms | 52954 tok/s) step 6493/76294 | train loss 3.247691 | norm 0.1455 | lr 9.68e-04 | (9896.74 ms | 52976 tok/s) step 6494/76294 | train loss 3.331500 | norm 0.1312 | lr 9.68e-04 | (9910.81 ms | 52901 tok/s) step 6495/76294 | train loss 3.283067 | norm 0.1619 | lr 9.68e-04 | (9904.76 ms | 52933 tok/s) step 6496/76294 | train loss 3.334254 | norm 0.1217 | lr 9.68e-04 | (9893.60 ms | 52993 tok/s) step 6497/76294 | train loss 3.287438 | norm 0.1403 | lr 9.68e-04 | (9902.04 ms | 52947 tok/s) step 6498/76294 | train loss 3.349453 | norm 0.1432 | lr 9.68e-04 | (9901.67 ms | 52949 tok/s) step 6499/76294 | train loss 3.322973 | norm 0.1262 | lr 9.68e-04 | (9893.48 ms | 52993 tok/s) step 6500/76294 | train loss 3.296137 | norm 0.1527 | lr 9.68e-04 | (9924.46 ms | 52828 tok/s) val loss: 3.288541 saving model checkpoint to ./results/gpt2-350M-gqa/step_6500.pth step 6501/76294 | 
train loss 3.299109 | norm 0.1379 | lr 9.68e-04 | (9961.00 ms | 52634 tok/s) step 6502/76294 | train loss 3.362510 | norm 0.1558 | lr 9.68e-04 | (9863.34 ms | 53155 tok/s) step 6503/76294 | train loss 3.353204 | norm 0.1400 | lr 9.68e-04 | (9880.60 ms | 53062 tok/s) step 6504/76294 | train loss 3.256090 | norm 0.1363 | lr 9.68e-04 | (9879.80 ms | 53067 tok/s) step 6505/76294 | train loss 3.275616 | norm 0.1282 | lr 9.67e-04 | (9888.12 ms | 53022 tok/s) step 6506/76294 | train loss 3.250612 | norm 0.1297 | lr 9.67e-04 | (9891.91 ms | 53002 tok/s) step 6507/76294 | train loss 3.311495 | norm 0.1373 | lr 9.67e-04 | (9900.25 ms | 52957 tok/s) step 6508/76294 | train loss 3.295855 | norm 0.1718 | lr 9.67e-04 | (9901.33 ms | 52951 tok/s) step 6509/76294 | train loss 3.323611 | norm 0.1384 | lr 9.67e-04 | (9915.27 ms | 52877 tok/s) step 6510/76294 | train loss 3.239729 | norm 0.1582 | lr 9.67e-04 | (9897.31 ms | 52973 tok/s) step 6511/76294 | train loss 3.302006 | norm 0.1539 | lr 9.67e-04 | (9945.61 ms | 52716 tok/s) step 6512/76294 | train loss 3.324072 | norm 0.1388 | lr 9.67e-04 | (9965.00 ms | 52613 tok/s) step 6513/76294 | train loss 3.311523 | norm 0.1536 | lr 9.67e-04 | (9908.85 ms | 52911 tok/s) step 6514/76294 | train loss 3.323667 | norm 0.1273 | lr 9.67e-04 | (10383.53 ms | 50492 tok/s) step 6515/76294 | train loss 3.304781 | norm 0.1568 | lr 9.67e-04 | (9911.51 ms | 52897 tok/s) step 6516/76294 | train loss 3.286861 | norm 0.1385 | lr 9.67e-04 | (9899.96 ms | 52959 tok/s) step 6517/76294 | train loss 3.263427 | norm 0.1573 | lr 9.67e-04 | (9916.21 ms | 52872 tok/s) step 6518/76294 | train loss 3.388689 | norm 0.1350 | lr 9.66e-04 | (9893.05 ms | 52996 tok/s) step 6519/76294 | train loss 3.332482 | norm 0.1424 | lr 9.66e-04 | (9961.30 ms | 52632 tok/s) step 6520/76294 | train loss 3.301147 | norm 0.1304 | lr 9.66e-04 | (9892.01 ms | 53001 tok/s) step 6521/76294 | train loss 3.288799 | norm 0.1588 | lr 9.66e-04 | (9951.17 ms | 52686 tok/s) step 6522/76294 | 
train loss 3.339567 | norm 0.1832 | lr 9.66e-04 | (9892.18 ms | 53000 tok/s) step 6523/76294 | train loss 3.340974 | norm 0.1373 | lr 9.66e-04 | (9891.85 ms | 53002 tok/s) step 6524/76294 | train loss 3.292962 | norm 0.1611 | lr 9.66e-04 | (9893.73 ms | 52992 tok/s) step 6525/76294 | train loss 3.328802 | norm 0.1388 | lr 9.66e-04 | (9938.53 ms | 52753 tok/s) step 6526/76294 | train loss 3.254203 | norm 0.1758 | lr 9.66e-04 | (9891.92 ms | 53002 tok/s) step 6527/76294 | train loss 3.309929 | norm 0.1469 | lr 9.66e-04 | (9902.15 ms | 52947 tok/s) step 6528/76294 | train loss 3.268226 | norm 0.1429 | lr 9.66e-04 | (9894.31 ms | 52989 tok/s) step 6529/76294 | train loss 3.320587 | norm 0.1460 | lr 9.66e-04 | (9968.34 ms | 52595 tok/s) step 6530/76294 | train loss 3.291073 | norm 0.1717 | lr 9.66e-04 | (9892.87 ms | 52997 tok/s) step 6531/76294 | train loss 3.205668 | norm 0.1479 | lr 9.66e-04 | (9952.59 ms | 52679 tok/s) step 6532/76294 | train loss 3.272213 | norm 0.1382 | lr 9.65e-04 | (9884.07 ms | 53044 tok/s) step 6533/76294 | train loss 3.262121 | norm 0.1377 | lr 9.65e-04 | (9895.38 ms | 52983 tok/s) step 6534/76294 | train loss 3.297916 | norm 0.1430 | lr 9.65e-04 | (9894.31 ms | 52989 tok/s) step 6535/76294 | train loss 3.263572 | norm 0.1397 | lr 9.65e-04 | (9934.85 ms | 52773 tok/s) step 6536/76294 | train loss 3.457853 | norm 0.1561 | lr 9.65e-04 | (9890.72 ms | 53008 tok/s) step 6537/76294 | train loss 3.271344 | norm 0.1291 | lr 9.65e-04 | (9902.29 ms | 52946 tok/s) step 6538/76294 | train loss 3.259501 | norm 0.1493 | lr 9.65e-04 | (9889.56 ms | 53014 tok/s) step 6539/76294 | train loss 3.263734 | norm 0.1327 | lr 9.65e-04 | (9901.33 ms | 52951 tok/s) step 6540/76294 | train loss 3.366035 | norm 0.1482 | lr 9.65e-04 | (11042.21 ms | 47480 tok/s) step 6541/76294 | train loss 3.235154 | norm 0.1321 | lr 9.65e-04 | (9888.31 ms | 53021 tok/s) step 6542/76294 | train loss 3.277071 | norm 0.1377 | lr 9.65e-04 | (9892.96 ms | 52996 tok/s) step 6543/76294 | 
train loss 3.326000 | norm 0.1432 | lr 9.65e-04 | (9889.50 ms | 53015 tok/s) step 6544/76294 | train loss 3.315602 | norm 0.1316 | lr 9.65e-04 | (9895.17 ms | 52984 tok/s) step 6545/76294 | train loss 3.267496 | norm 0.1442 | lr 9.64e-04 | (9889.62 ms | 53014 tok/s) step 6546/76294 | train loss 3.343717 | norm 0.1284 | lr 9.64e-04 | (9893.51 ms | 52993 tok/s) step 6547/76294 | train loss 3.285745 | norm 0.1439 | lr 9.64e-04 | (9932.24 ms | 52787 tok/s) step 6548/76294 | train loss 3.249008 | norm 0.1983 | lr 9.64e-04 | (9896.92 ms | 52975 tok/s) step 6549/76294 | train loss 3.325776 | norm 0.2982 | lr 9.64e-04 | (9899.29 ms | 52962 tok/s) step 6550/76294 | train loss 3.295512 | norm 0.2277 | lr 9.64e-04 | (9889.28 ms | 53016 tok/s) step 6551/76294 | train loss 3.197652 | norm 0.1778 | lr 9.64e-04 | (9897.60 ms | 52971 tok/s) step 6552/76294 | train loss 3.297751 | norm 0.1749 | lr 9.64e-04 | (9894.69 ms | 52987 tok/s) step 6553/76294 | train loss 3.285718 | norm 0.1730 | lr 9.64e-04 | (9900.40 ms | 52956 tok/s) step 6554/76294 | train loss 3.380512 | norm 0.1670 | lr 9.64e-04 | (9891.48 ms | 53004 tok/s) step 6555/76294 | train loss 3.279010 | norm 0.1722 | lr 9.64e-04 | (9904.23 ms | 52936 tok/s) step 6556/76294 | train loss 3.274169 | norm 0.1574 | lr 9.64e-04 | (9891.70 ms | 53003 tok/s) step 6557/76294 | train loss 3.286378 | norm 0.1614 | lr 9.64e-04 | (9901.43 ms | 52951 tok/s) step 6558/76294 | train loss 3.275817 | norm 0.1756 | lr 9.64e-04 | (9891.32 ms | 53005 tok/s) step 6559/76294 | train loss 3.283152 | norm 0.1598 | lr 9.63e-04 | (9918.78 ms | 52858 tok/s) step 6560/76294 | train loss 3.278275 | norm 0.1882 | lr 9.63e-04 | (9892.09 ms | 53001 tok/s) step 6561/76294 | train loss 3.350532 | norm 0.1360 | lr 9.63e-04 | (9894.05 ms | 52990 tok/s) step 6562/76294 | train loss 3.299002 | norm 0.1958 | lr 9.63e-04 | (9951.02 ms | 52687 tok/s) step 6563/76294 | train loss 3.325546 | norm 0.1810 | lr 9.63e-04 | (9898.39 ms | 52967 tok/s) step 6564/76294 | 
train loss 3.394507 | norm 0.1544 | lr 9.63e-04 | (9891.04 ms | 53006 tok/s) step 6565/76294 | train loss 3.310292 | norm 0.1894 | lr 9.63e-04 | (9971.44 ms | 52579 tok/s) step 6566/76294 | train loss 3.206242 | norm 0.1389 | lr 9.63e-04 | (9891.54 ms | 53004 tok/s) step 6567/76294 | train loss 3.286292 | norm 0.1962 | lr 9.63e-04 | (9897.38 ms | 52972 tok/s) step 6568/76294 | train loss 3.339048 | norm 0.1366 | lr 9.63e-04 | (9919.55 ms | 52854 tok/s) step 6569/76294 | train loss 3.310556 | norm 0.1719 | lr 9.63e-04 | (9912.13 ms | 52894 tok/s) step 6570/76294 | train loss 3.258873 | norm 0.1301 | lr 9.63e-04 | (10196.77 ms | 51417 tok/s) step 6571/76294 | train loss 3.287024 | norm 0.1571 | lr 9.63e-04 | (9888.10 ms | 53022 tok/s) step 6572/76294 | train loss 3.315139 | norm 0.1400 | lr 9.62e-04 | (9905.68 ms | 52928 tok/s) step 6573/76294 | train loss 3.260429 | norm 0.1369 | lr 9.62e-04 | (9891.52 ms | 53004 tok/s) step 6574/76294 | train loss 3.331597 | norm 0.1343 | lr 9.62e-04 | (9928.68 ms | 52805 tok/s) step 6575/76294 | train loss 3.269543 | norm 0.1309 | lr 9.62e-04 | (9901.39 ms | 52951 tok/s) step 6576/76294 | train loss 3.308559 | norm 0.1315 | lr 9.62e-04 | (9893.97 ms | 52991 tok/s) step 6577/76294 | train loss 3.292782 | norm 0.1237 | lr 9.62e-04 | (9894.28 ms | 52989 tok/s) step 6578/76294 | train loss 3.307737 | norm 0.1305 | lr 9.62e-04 | (9931.83 ms | 52789 tok/s) step 6579/76294 | train loss 3.262463 | norm 0.1263 | lr 9.62e-04 | (9889.94 ms | 53012 tok/s) step 6580/76294 | train loss 3.320726 | norm 0.1394 | lr 9.62e-04 | (9910.42 ms | 52903 tok/s) step 6581/76294 | train loss 3.229941 | norm 0.1283 | lr 9.62e-04 | (9892.74 ms | 52997 tok/s) step 6582/76294 | train loss 3.320071 | norm 0.1686 | lr 9.62e-04 | (9893.37 ms | 52994 tok/s) step 6583/76294 | train loss 3.257557 | norm 0.1404 | lr 9.62e-04 | (9945.12 ms | 52718 tok/s) step 6584/76294 | train loss 3.276800 | norm 0.1517 | lr 9.62e-04 | (9886.01 ms | 53033 tok/s) step 6585/76294 | 
train loss 3.291342 | norm 0.1430 | lr 9.62e-04 | (9915.43 ms | 52876 tok/s) step 6586/76294 | train loss 3.360944 | norm 0.1547 | lr 9.61e-04 | (9952.51 ms | 52679 tok/s) step 6587/76294 | train loss 3.352979 | norm 0.1593 | lr 9.61e-04 | (9897.03 ms | 52974 tok/s) step 6588/76294 | train loss 3.284952 | norm 0.1431 | lr 9.61e-04 | (9892.58 ms | 52998 tok/s) step 6589/76294 | train loss 3.301200 | norm 0.1489 | lr 9.61e-04 | (9936.89 ms | 52762 tok/s) step 6590/76294 | train loss 3.287277 | norm 0.1475 | lr 9.61e-04 | (9899.49 ms | 52961 tok/s) step 6591/76294 | train loss 3.413253 | norm 0.1377 | lr 9.61e-04 | (10138.43 ms | 51713 tok/s) step 6592/76294 | train loss 3.318507 | norm 0.1610 | lr 9.61e-04 | (9944.13 ms | 52723 tok/s) step 6593/76294 | train loss 3.403350 | norm 0.1649 | lr 9.61e-04 | (9889.99 ms | 53012 tok/s) step 6594/76294 | train loss 3.401827 | norm 0.1646 | lr 9.61e-04 | (9905.13 ms | 52931 tok/s) step 6595/76294 | train loss 3.343126 | norm 0.1351 | lr 9.61e-04 | (9890.63 ms | 53009 tok/s) step 6596/76294 | train loss 3.323416 | norm 0.1314 | lr 9.61e-04 | (9894.75 ms | 52986 tok/s) step 6597/76294 | train loss 3.334334 | norm 0.1530 | lr 9.61e-04 | (9890.16 ms | 53011 tok/s) step 6598/76294 | train loss 3.397847 | norm 0.1338 | lr 9.61e-04 | (10327.25 ms | 50767 tok/s) step 6599/76294 | train loss 3.377980 | norm 0.1549 | lr 9.60e-04 | (9886.26 ms | 53032 tok/s) step 6600/76294 | train loss 3.323006 | norm 0.1259 | lr 9.60e-04 | (9888.68 ms | 53019 tok/s) step 6601/76294 | train loss 3.381424 | norm 0.1599 | lr 9.60e-04 | (9890.88 ms | 53007 tok/s) step 6602/76294 | train loss 3.282673 | norm 0.1208 | lr 9.60e-04 | (9891.21 ms | 53005 tok/s) step 6603/76294 | train loss 3.381483 | norm 0.1409 | lr 9.60e-04 | (9891.77 ms | 53002 tok/s) step 6604/76294 | train loss 3.281876 | norm 0.1226 | lr 9.60e-04 | (9887.47 ms | 53025 tok/s) step 6605/76294 | train loss 3.355908 | norm 0.1518 | lr 9.60e-04 | (9883.92 ms | 53045 tok/s) step 6606/76294 | 
train loss 3.344496 | norm 0.1533 | lr 9.60e-04 | (9922.93 ms | 52836 tok/s) step 6607/76294 | train loss 3.417048 | norm 0.1216 | lr 9.60e-04 | (9883.33 ms | 53048 tok/s) step 6608/76294 | train loss 3.288729 | norm 0.1348 | lr 9.60e-04 | (9884.57 ms | 53041 tok/s) step 6609/76294 | train loss 3.507930 | norm 0.1373 | lr 9.60e-04 | (9885.44 ms | 53036 tok/s) step 6610/76294 | train loss 3.391749 | norm 0.1416 | lr 9.60e-04 | (9879.48 ms | 53068 tok/s) step 6611/76294 | train loss 3.325830 | norm 0.1633 | lr 9.60e-04 | (9936.95 ms | 52761 tok/s) step 6612/76294 | train loss 3.303499 | norm 0.1429 | lr 9.59e-04 | (9882.55 ms | 53052 tok/s) step 6613/76294 | train loss 3.352430 | norm 0.1871 | lr 9.59e-04 | (9887.56 ms | 53025 tok/s) step 6614/76294 | train loss 3.308452 | norm 0.1407 | lr 9.59e-04 | (9887.42 ms | 53026 tok/s) step 6615/76294 | train loss 3.284030 | norm 0.1667 | lr 9.59e-04 | (9888.66 ms | 53019 tok/s) step 6616/76294 | train loss 3.327914 | norm 0.1455 | lr 9.59e-04 | (9888.61 ms | 53019 tok/s) step 6617/76294 | train loss 3.260580 | norm 0.1933 | lr 9.59e-04 | (9951.98 ms | 52682 tok/s) step 6618/76294 | train loss 3.345371 | norm 0.2107 | lr 9.59e-04 | (9887.87 ms | 53023 tok/s) step 6619/76294 | train loss 3.411689 | norm 0.1528 | lr 9.59e-04 | (9884.59 ms | 53041 tok/s) step 6620/76294 | train loss 3.301721 | norm 0.1507 | lr 9.59e-04 | (9882.43 ms | 53053 tok/s) step 6621/76294 | train loss 3.272134 | norm 0.1338 | lr 9.59e-04 | (9891.68 ms | 53003 tok/s) step 6622/76294 | train loss 3.330609 | norm 0.1726 | lr 9.59e-04 | (9914.47 ms | 52881 tok/s) step 6623/76294 | train loss 3.329248 | norm 0.1800 | lr 9.59e-04 | (9892.09 ms | 53001 tok/s) step 6624/76294 | train loss 3.374109 | norm 0.1502 | lr 9.59e-04 | (9890.98 ms | 53007 tok/s) step 6625/76294 | train loss 3.416023 | norm 0.1579 | lr 9.59e-04 | (9888.67 ms | 53019 tok/s) step 6626/76294 | train loss 3.368575 | norm 0.1525 | lr 9.58e-04 | (11390.79 ms | 46027 tok/s) step 6627/76294 | 
train loss 3.325572 | norm 0.1610 | lr 9.58e-04 | (9886.99 ms | 53028 tok/s) step 6628/76294 | train loss 3.383474 | norm 0.1588 | lr 9.58e-04 | (9882.61 ms | 53052 tok/s) step 6629/76294 | train loss 3.296858 | norm 0.1526 | lr 9.58e-04 | (9882.97 ms | 53050 tok/s) step 6630/76294 | train loss 3.295627 | norm 0.1752 | lr 9.58e-04 | (9911.81 ms | 52895 tok/s) step 6631/76294 | train loss 3.366615 | norm 0.1529 | lr 9.58e-04 | (9884.72 ms | 53040 tok/s) step 6632/76294 | train loss 3.339422 | norm 0.1489 | lr 9.58e-04 | (9891.04 ms | 53006 tok/s) step 6633/76294 | train loss 3.284844 | norm 0.1278 | lr 9.58e-04 | (9889.45 ms | 53015 tok/s) step 6634/76294 | train loss 3.627852 | norm 0.1794 | lr 9.58e-04 | (9899.71 ms | 52960 tok/s) step 6635/76294 | train loss 3.366511 | norm 0.1759 | lr 9.58e-04 | (9891.36 ms | 53005 tok/s) step 6636/76294 | train loss 3.556916 | norm 0.2364 | lr 9.58e-04 | (9889.39 ms | 53015 tok/s) step 6637/76294 | train loss 3.329300 | norm 0.1542 | lr 9.58e-04 | (9895.79 ms | 52981 tok/s) step 6638/76294 | train loss 3.329906 | norm 0.1532 | lr 9.58e-04 | (11029.16 ms | 47537 tok/s) step 6639/76294 | train loss 3.381404 | norm 0.1447 | lr 9.57e-04 | (9883.96 ms | 53044 tok/s) step 6640/76294 | train loss 3.296777 | norm 0.1465 | lr 9.57e-04 | (9889.82 ms | 53013 tok/s) step 6641/76294 | train loss 3.376899 | norm 0.1355 | lr 9.57e-04 | (9895.04 ms | 52985 tok/s) step 6642/76294 | train loss 3.319209 | norm 0.1810 | lr 9.57e-04 | (9899.97 ms | 52959 tok/s) step 6643/76294 | train loss 3.433355 | norm 0.1424 | lr 9.57e-04 | (9899.37 ms | 52962 tok/s) step 6644/76294 | train loss 3.341533 | norm 0.1491 | lr 9.57e-04 | (9898.11 ms | 52969 tok/s) step 6645/76294 | train loss 3.426665 | norm 0.1660 | lr 9.57e-04 | (9993.66 ms | 52462 tok/s) step 6646/76294 | train loss 3.339730 | norm 0.1458 | lr 9.57e-04 | (9891.83 ms | 53002 tok/s) step 6647/76294 | train loss 3.366628 | norm 0.1435 | lr 9.57e-04 | (9964.00 ms | 52618 tok/s) step 6648/76294 | 
train loss 3.297263 | norm 0.1346 | lr 9.57e-04 | (9890.99 ms | 53007 tok/s) step 6649/76294 | train loss 3.474192 | norm 0.1481 | lr 9.57e-04 | (9894.00 ms | 52990 tok/s) step 6650/76294 | train loss 3.316796 | norm 0.1330 | lr 9.57e-04 | (9890.56 ms | 53009 tok/s) step 6651/76294 | train loss 3.293750 | norm 0.1390 | lr 9.57e-04 | (9961.19 ms | 52633 tok/s) step 6652/76294 | train loss 3.368599 | norm 0.1248 | lr 9.56e-04 | (9892.78 ms | 52997 tok/s) step 6653/76294 | train loss 3.349422 | norm 0.1392 | lr 9.56e-04 | (9900.60 ms | 52955 tok/s) step 6654/76294 | train loss 3.326294 | norm 0.1540 | lr 9.56e-04 | (9925.64 ms | 52822 tok/s) step 6655/76294 | train loss 3.297132 | norm 0.1613 | lr 9.56e-04 | (9897.18 ms | 52973 tok/s) step 6656/76294 | train loss 3.392680 | norm 0.1487 | lr 9.56e-04 | (9942.34 ms | 52733 tok/s) step 6657/76294 | train loss 3.370721 | norm 0.1663 | lr 9.56e-04 | (9904.02 ms | 52937 tok/s) step 6658/76294 | train loss 3.265961 | norm 0.1410 | lr 9.56e-04 | (9896.47 ms | 52977 tok/s) step 6659/76294 | train loss 3.411613 | norm 0.1543 | lr 9.56e-04 | (9888.76 ms | 53019 tok/s) step 6660/76294 | train loss 3.313633 | norm 0.1314 | lr 9.56e-04 | (11893.53 ms | 44082 tok/s) step 6661/76294 | train loss 3.310992 | norm 0.1408 | lr 9.56e-04 | (9941.20 ms | 52739 tok/s) step 6662/76294 | train loss 3.289968 | norm 0.1532 | lr 9.56e-04 | (9912.70 ms | 52891 tok/s) step 6663/76294 | train loss 3.294892 | norm 0.1378 | lr 9.56e-04 | (9883.94 ms | 53044 tok/s) step 6664/76294 | train loss 3.294959 | norm 0.1477 | lr 9.56e-04 | (9917.31 ms | 52866 tok/s) step 6665/76294 | train loss 3.278229 | norm 0.1337 | lr 9.56e-04 | (9885.03 ms | 53039 tok/s) step 6666/76294 | train loss 3.341905 | norm 0.1431 | lr 9.55e-04 | (9885.09 ms | 53038 tok/s) step 6667/76294 | train loss 3.317442 | norm 0.1199 | lr 9.55e-04 | (9889.62 ms | 53014 tok/s) step 6668/76294 | train loss 3.342901 | norm 0.1350 | lr 9.55e-04 | (9888.85 ms | 53018 tok/s) step 6669/76294 | 
train loss 3.281358 | norm 0.1411 | lr 9.55e-04 | (11074.55 ms | 47342 tok/s) step 6670/76294 | train loss 3.310583 | norm 0.1534 | lr 9.55e-04 | (9906.50 ms | 52924 tok/s) step 6671/76294 | train loss 3.351081 | norm 0.1273 | lr 9.55e-04 | (10492.45 ms | 49968 tok/s) step 6672/76294 | train loss 3.234343 | norm 0.1436 | lr 9.55e-04 | (9902.05 ms | 52947 tok/s) step 6673/76294 | train loss 3.308553 | norm 0.1344 | lr 9.55e-04 | (9878.82 ms | 53072 tok/s) step 6674/76294 | train loss 3.286369 | norm 0.1527 | lr 9.55e-04 | (9900.91 ms | 52954 tok/s) step 6675/76294 | train loss 3.315490 | norm 0.1372 | lr 9.55e-04 | (9881.31 ms | 53059 tok/s) step 6676/76294 | train loss 3.336957 | norm 0.1483 | lr 9.55e-04 | (9884.31 ms | 53042 tok/s) step 6677/76294 | train loss 3.306155 | norm 0.1341 | lr 9.55e-04 | (9880.49 ms | 53063 tok/s) step 6678/76294 | train loss 3.302598 | norm 0.1654 | lr 9.55e-04 | (9888.04 ms | 53022 tok/s) step 6679/76294 | train loss 3.296319 | norm 0.1266 | lr 9.54e-04 | (9886.50 ms | 53031 tok/s) step 6680/76294 | train loss 3.327772 | norm 0.1584 | lr 9.54e-04 | (9924.02 ms | 52830 tok/s) step 6681/76294 | train loss 3.298390 | norm 0.1244 | lr 9.54e-04 | (9888.37 ms | 53021 tok/s) step 6682/76294 | train loss 3.327305 | norm 0.1401 | lr 9.54e-04 | (9890.13 ms | 53011 tok/s) step 6683/76294 | train loss 3.317997 | norm 0.1186 | lr 9.54e-04 | (9889.14 ms | 53017 tok/s) step 6684/76294 | train loss 3.375334 | norm 0.1318 | lr 9.54e-04 | (9902.23 ms | 52946 tok/s) step 6685/76294 | train loss 3.330969 | norm 0.1210 | lr 9.54e-04 | (9961.72 ms | 52630 tok/s) step 6686/76294 | train loss 3.330090 | norm 0.1453 | lr 9.54e-04 | (9886.56 ms | 53030 tok/s) step 6687/76294 | train loss 3.325298 | norm 0.1219 | lr 9.54e-04 | (9886.08 ms | 53033 tok/s) step 6688/76294 | train loss 3.295396 | norm 0.1372 | lr 9.54e-04 | (9902.39 ms | 52946 tok/s) step 6689/76294 | train loss 3.352210 | norm 0.1479 | lr 9.54e-04 | (9931.07 ms | 52793 tok/s) step 6690/76294 | 
train loss 3.357234 | norm 0.1418 | lr 9.54e-04 | (9885.58 ms | 53036 tok/s) step 6691/76294 | train loss 3.397094 | norm 0.1733 | lr 9.54e-04 | (9899.49 ms | 52961 tok/s) step 6692/76294 | train loss 3.355766 | norm 0.1460 | lr 9.53e-04 | (9886.26 ms | 53032 tok/s) step 6693/76294 | train loss 3.385297 | norm 0.1283 | lr 9.53e-04 | (9917.43 ms | 52865 tok/s) step 6694/76294 | train loss 3.368452 | norm 0.1386 | lr 9.53e-04 | (9885.52 ms | 53036 tok/s) step 6695/76294 | train loss 3.289722 | norm 0.1209 | lr 9.53e-04 | (9890.80 ms | 53008 tok/s) step 6696/76294 | train loss 3.298499 | norm 0.1323 | lr 9.53e-04 | (9885.82 ms | 53034 tok/s) step 6697/76294 | train loss 3.416823 | norm 0.1300 | lr 9.53e-04 | (9895.85 ms | 52981 tok/s) step 6698/76294 | train loss 3.338444 | norm 0.1504 | lr 9.53e-04 | (9933.18 ms | 52781 tok/s) step 6699/76294 | train loss 3.379942 | norm 0.1411 | lr 9.53e-04 | (9888.86 ms | 53018 tok/s) step 6700/76294 | train loss 3.341820 | norm 0.1370 | lr 9.53e-04 | (9892.13 ms | 53001 tok/s) step 6701/76294 | train loss 3.272126 | norm 0.1303 | lr 9.53e-04 | (9890.44 ms | 53010 tok/s) step 6702/76294 | train loss 3.296586 | norm 0.1237 | lr 9.53e-04 | (9900.05 ms | 52958 tok/s) step 6703/76294 | train loss 3.291847 | norm 0.1451 | lr 9.53e-04 | (9891.67 ms | 53003 tok/s) step 6704/76294 | train loss 3.371100 | norm 0.1326 | lr 9.53e-04 | (9955.49 ms | 52663 tok/s) step 6705/76294 | train loss 3.432869 | norm 0.1566 | lr 9.52e-04 | (9886.38 ms | 53031 tok/s) step 6706/76294 | train loss 3.295332 | norm 0.1289 | lr 9.52e-04 | (9885.20 ms | 53038 tok/s) step 6707/76294 | train loss 3.408986 | norm 0.1293 | lr 9.52e-04 | (9893.81 ms | 52992 tok/s) step 6708/76294 | train loss 3.323434 | norm 0.1388 | lr 9.52e-04 | (9929.57 ms | 52801 tok/s) step 6709/76294 | train loss 3.396254 | norm 0.1285 | lr 9.52e-04 | (9893.08 ms | 52995 tok/s) step 6710/76294 | train loss 3.298493 | norm 0.1382 | lr 9.52e-04 | (9931.66 ms | 52790 tok/s) step 6711/76294 | 
train loss 3.570915 | norm 0.1902 | lr 9.52e-04 | (9889.69 ms | 53014 tok/s) step 6712/76294 | train loss 3.285276 | norm 0.1887 | lr 9.52e-04 | (9895.63 ms | 52982 tok/s) step 6713/76294 | train loss 3.288904 | norm 0.1678 | lr 9.52e-04 | (9889.80 ms | 53013 tok/s) step 6714/76294 | train loss 3.343246 | norm 0.1553 | lr 9.52e-04 | (9896.92 ms | 52975 tok/s) step 6715/76294 | train loss 3.331836 | norm 0.1544 | lr 9.52e-04 | (9899.26 ms | 52962 tok/s) step 6716/76294 | train loss 3.290284 | norm 0.1314 | lr 9.52e-04 | (9891.75 ms | 53003 tok/s) step 6717/76294 | train loss 3.293014 | norm 0.1508 | lr 9.52e-04 | (9893.39 ms | 52994 tok/s) step 6718/76294 | train loss 3.337993 | norm 0.1384 | lr 9.52e-04 | (9921.59 ms | 52843 tok/s) step 6719/76294 | train loss 3.342527 | norm 0.1452 | lr 9.51e-04 | (9891.21 ms | 53005 tok/s) step 6720/76294 | train loss 3.390653 | norm 0.1319 | lr 9.51e-04 | (9899.16 ms | 52963 tok/s) step 6721/76294 | train loss 3.344812 | norm 0.1403 | lr 9.51e-04 | (9887.33 ms | 53026 tok/s) step 6722/76294 | train loss 3.236132 | norm 0.1545 | lr 9.51e-04 | (9894.92 ms | 52986 tok/s) step 6723/76294 | train loss 3.339359 | norm 0.1391 | lr 9.51e-04 | (9898.20 ms | 52968 tok/s) step 6724/76294 | train loss 3.348774 | norm 0.1392 | lr 9.51e-04 | (9897.35 ms | 52973 tok/s) step 6725/76294 | train loss 3.269653 | norm 0.1315 | lr 9.51e-04 | (9893.07 ms | 52995 tok/s) step 6726/76294 | train loss 3.277381 | norm 0.1452 | lr 9.51e-04 | (9891.36 ms | 53005 tok/s) step 6727/76294 | train loss 3.330108 | norm 0.1451 | lr 9.51e-04 | (9894.16 ms | 52990 tok/s) step 6728/76294 | train loss 3.327569 | norm 0.1411 | lr 9.51e-04 | (9979.45 ms | 52537 tok/s) step 6729/76294 | train loss 3.317056 | norm 0.1397 | lr 9.51e-04 | (9891.20 ms | 53005 tok/s) step 6730/76294 | train loss 3.331823 | norm 0.1416 | lr 9.51e-04 | (9910.07 ms | 52905 tok/s) step 6731/76294 | train loss 3.297716 | norm 0.1470 | lr 9.51e-04 | (9890.63 ms | 53009 tok/s) step 6732/76294 | 
train loss 3.396857 | norm 0.1332 | lr 9.50e-04 | (9889.55 ms | 53014 tok/s) step 6733/76294 | train loss 3.446607 | norm 0.1345 | lr 9.50e-04 | (9890.74 ms | 53008 tok/s) step 6734/76294 | train loss 3.327420 | norm 0.1487 | lr 9.50e-04 | (9916.20 ms | 52872 tok/s) step 6735/76294 | train loss 3.305476 | norm 0.1313 | lr 9.50e-04 | (11348.57 ms | 46199 tok/s) step 6736/76294 | train loss 3.343182 | norm 0.1317 | lr 9.50e-04 | (9897.42 ms | 52972 tok/s) step 6737/76294 | train loss 3.260167 | norm 0.1285 | lr 9.50e-04 | (9881.68 ms | 53057 tok/s) step 6738/76294 | train loss 3.252694 | norm 0.1437 | lr 9.50e-04 | (9877.48 ms | 53079 tok/s) step 6739/76294 | train loss 3.219543 | norm 0.1267 | lr 9.50e-04 | (9947.69 ms | 52704 tok/s) step 6740/76294 | train loss 3.298002 | norm 0.1395 | lr 9.50e-04 | (9885.15 ms | 53038 tok/s) step 6741/76294 | train loss 3.294084 | norm 0.1213 | lr 9.50e-04 | (9886.60 ms | 53030 tok/s) step 6742/76294 | train loss 3.312190 | norm 0.1426 | lr 9.50e-04 | (9889.98 ms | 53012 tok/s) step 6743/76294 | train loss 3.331249 | norm 0.1221 | lr 9.50e-04 | (9891.21 ms | 53005 tok/s) step 6744/76294 | train loss 3.336680 | norm 0.1462 | lr 9.50e-04 | (9911.45 ms | 52897 tok/s) step 6745/76294 | train loss 3.308064 | norm 0.1256 | lr 9.49e-04 | (9920.96 ms | 52847 tok/s) step 6746/76294 | train loss 3.324086 | norm 0.1301 | lr 9.49e-04 | (9920.92 ms | 52847 tok/s) step 6747/76294 | train loss 3.361050 | norm 0.1349 | lr 9.49e-04 | (9891.69 ms | 53003 tok/s) step 6748/76294 | train loss 3.350884 | norm 0.1321 | lr 9.49e-04 | (9905.93 ms | 52927 tok/s) step 6749/76294 | train loss 3.295116 | norm 0.1256 | lr 9.49e-04 | (9898.47 ms | 52967 tok/s) step 6750/76294 | train loss 3.390060 | norm 0.1172 | lr 9.49e-04 | (9896.26 ms | 52978 tok/s) val loss: 3.283884 saving model checkpoint to ./results/gpt2-350M-gqa/step_6750.pth step 6751/76294 | train loss 3.338629 | norm 0.1270 | lr 9.49e-04 | (9923.49 ms | 52833 tok/s) step 6752/76294 | train loss 
3.308538 | norm 0.1345 | lr 9.49e-04 | (9844.46 ms | 53257 tok/s) step 6753/76294 | train loss 3.221347 | norm 0.1256 | lr 9.49e-04 | (9893.29 ms | 52994 tok/s) step 6754/76294 | train loss 3.376019 | norm 0.1351 | lr 9.49e-04 | (9856.90 ms | 53190 tok/s) step 6755/76294 | train loss 3.511106 | norm 0.1387 | lr 9.49e-04 | (9870.75 ms | 53115 tok/s) step 6756/76294 | train loss 3.315637 | norm 0.1343 | lr 9.49e-04 | (9867.35 ms | 53134 tok/s) step 6757/76294 | train loss 3.347702 | norm 0.1353 | lr 9.49e-04 | (9903.18 ms | 52941 tok/s) step 6758/76294 | train loss 3.329054 | norm 0.1491 | lr 9.48e-04 | (9873.62 ms | 53100 tok/s) step 6759/76294 | train loss 3.290943 | norm 0.1461 | lr 9.48e-04 | (9901.35 ms | 52951 tok/s) step 6760/76294 | train loss 3.292840 | norm 0.1348 | lr 9.48e-04 | (9880.32 ms | 53064 tok/s) step 6761/76294 | train loss 3.320153 | norm 0.1298 | lr 9.48e-04 | (9924.47 ms | 52828 tok/s) step 6762/76294 | train loss 3.535573 | norm 0.1479 | lr 9.48e-04 | (9879.37 ms | 53069 tok/s) step 6763/76294 | train loss 3.304487 | norm 0.1377 | lr 9.48e-04 | (9889.57 ms | 53014 tok/s) step 6764/76294 | train loss 3.356673 | norm 0.1297 | lr 9.48e-04 | (9939.98 ms | 52745 tok/s) step 6765/76294 | train loss 3.249672 | norm 0.1398 | lr 9.48e-04 | (9883.79 ms | 53045 tok/s) step 6766/76294 | train loss 3.331751 | norm 0.1339 | lr 9.48e-04 | (9884.94 ms | 53039 tok/s) step 6767/76294 | train loss 3.313165 | norm 0.1409 | lr 9.48e-04 | (9888.26 ms | 53021 tok/s) step 6768/76294 | train loss 3.275358 | norm 0.1450 | lr 9.48e-04 | (9888.83 ms | 53018 tok/s) step 6769/76294 | train loss 3.323503 | norm 0.1292 | lr 9.48e-04 | (9889.81 ms | 53013 tok/s) step 6770/76294 | train loss 3.309767 | norm 0.1569 | lr 9.48e-04 | (9891.95 ms | 53001 tok/s) step 6771/76294 | train loss 3.307655 | norm 0.1218 | lr 9.47e-04 | (9887.66 ms | 53024 tok/s) step 6772/76294 | train loss 3.288478 | norm 0.1314 | lr 9.47e-04 | (9956.69 ms | 52657 tok/s) step 6773/76294 | train loss 
3.373072 | norm 0.1344 | lr 9.47e-04 | (9886.98 ms | 53028 tok/s) step 6774/76294 | train loss 3.329396 | norm 0.1416 | lr 9.47e-04 | (9882.08 ms | 53054 tok/s) step 6775/76294 | train loss 3.299197 | norm 0.1399 | lr 9.47e-04 | (9940.57 ms | 52742 tok/s) step 6776/76294 | train loss 3.326974 | norm 0.1537 | lr 9.47e-04 | (9888.89 ms | 53018 tok/s) step 6777/76294 | train loss 3.369947 | norm 0.1411 | lr 9.47e-04 | (9889.63 ms | 53014 tok/s) step 6778/76294 | train loss 3.320471 | norm 0.1618 | lr 9.47e-04 | (9911.43 ms | 52897 tok/s) step 6779/76294 | train loss 3.279303 | norm 0.1336 | lr 9.47e-04 | (9915.42 ms | 52876 tok/s) step 6780/76294 | train loss 3.287023 | norm 0.1783 | lr 9.47e-04 | (9890.44 ms | 53010 tok/s) step 6781/76294 | train loss 3.306280 | norm 0.1271 | lr 9.47e-04 | (10834.73 ms | 48390 tok/s) step 6782/76294 | train loss 3.256560 | norm 0.1432 | lr 9.47e-04 | (10684.88 ms | 49068 tok/s) step 6783/76294 | train loss 3.283777 | norm 0.1379 | lr 9.47e-04 | (9943.71 ms | 52726 tok/s) step 6784/76294 | train loss 3.270748 | norm 0.1252 | lr 9.46e-04 | (9907.99 ms | 52916 tok/s) step 6785/76294 | train loss 3.414914 | norm 0.1306 | lr 9.46e-04 | (9901.66 ms | 52949 tok/s) step 6786/76294 | train loss 3.267164 | norm 0.1517 | lr 9.46e-04 | (10178.07 ms | 51512 tok/s) step 6787/76294 | train loss 3.213357 | norm 0.1475 | lr 9.46e-04 | (9911.76 ms | 52896 tok/s) step 6788/76294 | train loss 3.326354 | norm 0.1441 | lr 9.46e-04 | (9881.91 ms | 53055 tok/s) step 6789/76294 | train loss 3.261783 | norm 0.1509 | lr 9.46e-04 | (9900.43 ms | 52956 tok/s) step 6790/76294 | train loss 3.302268 | norm 0.1746 | lr 9.46e-04 | (9886.62 ms | 53030 tok/s) step 6791/76294 | train loss 3.298512 | norm 0.1465 | lr 9.46e-04 | (9914.30 ms | 52882 tok/s) step 6792/76294 | train loss 3.259478 | norm 0.1504 | lr 9.46e-04 | (9887.15 ms | 53027 tok/s) step 6793/76294 | train loss 3.310663 | norm 0.1543 | lr 9.46e-04 | (9896.65 ms | 52976 tok/s) step 6794/76294 | train loss 
3.260493 | norm 0.1321 | lr 9.46e-04 | (9885.29 ms | 53037 tok/s) step 6795/76294 | train loss 3.298365 | norm 0.1274 | lr 9.46e-04 | (9891.07 ms | 53006 tok/s) step 6796/76294 | train loss 3.286736 | norm 0.1440 | lr 9.46e-04 | (9884.88 ms | 53039 tok/s) step 6797/76294 | train loss 3.389789 | norm 0.1393 | lr 9.45e-04 | (9901.78 ms | 52949 tok/s) step 6798/76294 | train loss 3.255599 | norm 0.1515 | lr 9.45e-04 | (9927.53 ms | 52812 tok/s) step 6799/76294 | train loss 3.288497 | norm 0.1341 | lr 9.45e-04 | (9887.62 ms | 53025 tok/s) step 6800/76294 | train loss 3.264659 | norm 0.1426 | lr 9.45e-04 | (9893.17 ms | 52995 tok/s) step 6801/76294 | train loss 3.301754 | norm 0.1295 | lr 9.45e-04 | (9887.13 ms | 53027 tok/s) step 6802/76294 | train loss 3.338621 | norm 0.1467 | lr 9.45e-04 | (9891.01 ms | 53007 tok/s) step 6803/76294 | train loss 3.280513 | norm 0.1313 | lr 9.45e-04 | (9892.28 ms | 53000 tok/s) step 6804/76294 | train loss 3.273196 | norm 0.1367 | lr 9.45e-04 | (9920.62 ms | 52848 tok/s) step 6805/76294 | train loss 3.329628 | norm 0.1242 | lr 9.45e-04 | (9892.79 ms | 52997 tok/s) step 6806/76294 | train loss 3.297359 | norm 0.2056 | lr 9.45e-04 | (9895.01 ms | 52985 tok/s) step 6807/76294 | train loss 3.357867 | norm 0.1636 | lr 9.45e-04 | (9893.14 ms | 52995 tok/s) step 6808/76294 | train loss 3.269241 | norm 0.1521 | lr 9.45e-04 | (9933.65 ms | 52779 tok/s) step 6809/76294 | train loss 3.385725 | norm 0.1547 | lr 9.45e-04 | (9892.49 ms | 52999 tok/s) step 6810/76294 | train loss 3.336809 | norm 0.1608 | lr 9.44e-04 | (9935.67 ms | 52768 tok/s) step 6811/76294 | train loss 3.311596 | norm 0.1875 | lr 9.44e-04 | (9896.11 ms | 52979 tok/s) step 6812/76294 | train loss 3.300440 | norm 0.2106 | lr 9.44e-04 | (9905.41 ms | 52929 tok/s) step 6813/76294 | train loss 3.255456 | norm 0.1797 | lr 9.44e-04 | (9894.01 ms | 52990 tok/s) step 6814/76294 | train loss 3.295149 | norm 0.1606 | lr 9.44e-04 | (9895.32 ms | 52983 tok/s) step 6815/76294 | train loss 
3.186063 | norm 0.1857 | lr 9.44e-04 | (9895.29 ms | 52984 tok/s) step 6816/76294 | train loss 3.255875 | norm 0.1756 | lr 9.44e-04 | (9895.50 ms | 52982 tok/s) step 6817/76294 | train loss 3.258260 | norm 0.1528 | lr 9.44e-04 | (9892.58 ms | 52998 tok/s) step 6818/76294 | train loss 3.340122 | norm 0.2213 | lr 9.44e-04 | (9934.89 ms | 52772 tok/s) step 6819/76294 | train loss 3.304020 | norm 0.2013 | lr 9.44e-04 | (9896.66 ms | 52976 tok/s) step 6820/76294 | train loss 3.308467 | norm 0.1724 | lr 9.44e-04 | (9936.11 ms | 52766 tok/s) step 6821/76294 | train loss 3.356612 | norm 0.1834 | lr 9.44e-04 | (9898.59 ms | 52966 tok/s) step 6822/76294 | train loss 3.271492 | norm 0.1901 | lr 9.44e-04 | (9891.11 ms | 53006 tok/s) step 6823/76294 | train loss 3.322108 | norm 0.1696 | lr 9.44e-04 | (9890.15 ms | 53011 tok/s) step 6824/76294 | train loss 3.308797 | norm 0.1959 | lr 9.43e-04 | (9899.74 ms | 52960 tok/s) step 6825/76294 | train loss 3.342897 | norm 0.1675 | lr 9.43e-04 | (9956.86 ms | 52656 tok/s) step 6826/76294 | train loss 3.221220 | norm 0.1593 | lr 9.43e-04 | (9896.79 ms | 52976 tok/s) step 6827/76294 | train loss 3.304002 | norm 0.1669 | lr 9.43e-04 | (9895.71 ms | 52981 tok/s) step 6828/76294 | train loss 3.257221 | norm 0.1397 | lr 9.43e-04 | (9930.38 ms | 52796 tok/s) step 6829/76294 | train loss 3.301351 | norm 0.1427 | lr 9.43e-04 | (9893.13 ms | 52995 tok/s) step 6830/76294 | train loss 3.266902 | norm 0.1468 | lr 9.43e-04 | (9903.33 ms | 52941 tok/s) step 6831/76294 | train loss 3.304380 | norm 0.1428 | lr 9.43e-04 | (9896.92 ms | 52975 tok/s) step 6832/76294 | train loss 3.295303 | norm 0.1493 | lr 9.43e-04 | (10339.01 ms | 50710 tok/s) step 6833/76294 | train loss 3.313047 | norm 0.1364 | lr 9.43e-04 | (11872.85 ms | 44159 tok/s) step 6834/76294 | train loss 3.316688 | norm 0.2043 | lr 9.43e-04 | (9889.95 ms | 53012 tok/s) step 6835/76294 | train loss 3.227322 | norm 0.1444 | lr 9.43e-04 | (9885.90 ms | 53034 tok/s) step 6836/76294 | train loss 
3.288644 | norm 0.1698 | lr 9.43e-04 | (9883.64 ms | 53046 tok/s) step 6837/76294 | train loss 3.312171 | norm 0.1460 | lr 9.42e-04 | (9937.91 ms | 52756 tok/s) step 6838/76294 | train loss 3.303024 | norm 0.1695 | lr 9.42e-04 | (9879.41 ms | 53069 tok/s) step 6839/76294 | train loss 3.322057 | norm 0.1532 | lr 9.42e-04 | (9881.74 ms | 53056 tok/s) step 6840/76294 | train loss 3.323456 | norm 0.1386 | lr 9.42e-04 | (9891.66 ms | 53003 tok/s) step 6841/76294 | train loss 3.264444 | norm 0.1438 | lr 9.42e-04 | (9905.04 ms | 52931 tok/s) step 6842/76294 | train loss 3.238093 | norm 0.1393 | lr 9.42e-04 | (9887.33 ms | 53026 tok/s) step 6843/76294 | train loss 3.247446 | norm 0.1322 | lr 9.42e-04 | (9888.74 ms | 53019 tok/s) step 6844/76294 | train loss 3.276267 | norm 0.1334 | lr 9.42e-04 | (9885.74 ms | 53035 tok/s) step 6845/76294 | train loss 3.362854 | norm 0.1237 | lr 9.42e-04 | (9912.30 ms | 52893 tok/s) step 6846/76294 | train loss 3.275637 | norm 0.1356 | lr 9.42e-04 | (9886.22 ms | 53032 tok/s) step 6847/76294 | train loss 3.297102 | norm 0.1202 | lr 9.42e-04 | (9896.03 ms | 52980 tok/s) step 6848/76294 | train loss 3.283392 | norm 0.1256 | lr 9.42e-04 | (9888.25 ms | 53021 tok/s) step 6849/76294 | train loss 3.294776 | norm 0.1366 | lr 9.42e-04 | (9903.67 ms | 52939 tok/s) step 6850/76294 | train loss 3.227071 | norm 0.1262 | lr 9.41e-04 | (9900.25 ms | 52957 tok/s) step 6851/76294 | train loss 3.272181 | norm 0.1188 | lr 9.41e-04 | (9892.94 ms | 52996 tok/s) step 6852/76294 | train loss 3.264953 | norm 0.1260 | lr 9.41e-04 | (9892.81 ms | 52997 tok/s) step 6853/76294 | train loss 3.304783 | norm 0.1403 | lr 9.41e-04 | (9892.39 ms | 52999 tok/s) step 6854/76294 | train loss 3.345827 | norm 0.1344 | lr 9.41e-04 | (9904.63 ms | 52934 tok/s) step 6855/76294 | train loss 3.287432 | norm 0.1177 | lr 9.41e-04 | (9906.19 ms | 52925 tok/s) step 6856/76294 | train loss 3.283152 | norm 0.1281 | lr 9.41e-04 | (9903.21 ms | 52941 tok/s) step 6857/76294 | train loss 
3.420937 | norm 0.1612 | lr 9.41e-04 | (9895.08 ms | 52985 tok/s) step 6858/76294 | train loss 3.211108 | norm 0.1886 | lr 9.41e-04 | (9896.47 ms | 52977 tok/s) step 6859/76294 | train loss 3.292118 | norm 0.1572 | lr 9.41e-04 | (9893.27 ms | 52994 tok/s) step 6860/76294 | train loss 3.291865 | norm 0.1257 | lr 9.41e-04 | (9900.68 ms | 52955 tok/s) step 6861/76294 | train loss 3.444847 | norm 0.1445 | lr 9.41e-04 | (9909.05 ms | 52910 tok/s) step 6862/76294 | train loss 3.320836 | norm 0.1556 | lr 9.41e-04 | (9952.29 ms | 52680 tok/s) step 6863/76294 | train loss 3.288401 | norm 0.1606 | lr 9.40e-04 | (9960.35 ms | 52637 tok/s) step 6864/76294 | train loss 3.312406 | norm 0.1765 | lr 9.40e-04 | (9890.52 ms | 53009 tok/s) step 6865/76294 | train loss 3.308074 | norm 0.1583 | lr 9.40e-04 | (9929.42 ms | 52801 tok/s) step 6866/76294 | train loss 3.508036 | norm 0.1620 | lr 9.40e-04 | (9958.09 ms | 52649 tok/s) step 6867/76294 | train loss 3.326079 | norm 0.1368 | lr 9.40e-04 | (9952.05 ms | 52681 tok/s) step 6868/76294 | train loss 3.324639 | norm 0.1473 | lr 9.40e-04 | (9893.49 ms | 52993 tok/s) step 6869/76294 | train loss 3.273077 | norm 0.1420 | lr 9.40e-04 | (9918.36 ms | 52860 tok/s) step 6870/76294 | train loss 3.279495 | norm 0.1349 | lr 9.40e-04 | (9917.58 ms | 52864 tok/s) step 6871/76294 | train loss 3.293174 | norm 0.1269 | lr 9.40e-04 | (9894.81 ms | 52986 tok/s) step 6872/76294 | train loss 3.322552 | norm 0.1229 | lr 9.40e-04 | (9942.27 ms | 52733 tok/s) step 6873/76294 | train loss 3.300726 | norm 0.1244 | lr 9.40e-04 | (9899.58 ms | 52961 tok/s) step 6874/76294 | train loss 3.273906 | norm 0.1460 | lr 9.40e-04 | (9915.29 ms | 52877 tok/s) step 6875/76294 | train loss 3.398129 | norm 0.1368 | lr 9.40e-04 | (9954.34 ms | 52669 tok/s) step 6876/76294 | train loss 3.320001 | norm 0.1459 | lr 9.39e-04 | (9891.27 ms | 53005 tok/s) step 6877/76294 | train loss 3.314217 | norm 0.1459 | lr 9.39e-04 | (9901.53 ms | 52950 tok/s) step 6878/76294 | train loss 
3.411231 | norm 0.1273 | lr 9.39e-04 | (9886.50 ms | 53031 tok/s) step 6879/76294 | train loss 3.292508 | norm 0.1438 | lr 9.39e-04 | (9895.27 ms | 52984 tok/s) step 6880/76294 | train loss 3.344851 | norm 0.1349 | lr 9.39e-04 | (9926.78 ms | 52816 tok/s) step 6881/76294 | train loss 3.278748 | norm 0.1251 | lr 9.39e-04 | (9894.21 ms | 52989 tok/s) step 6882/76294 | train loss 3.278094 | norm 0.1359 | lr 9.39e-04 | (9954.11 ms | 52670 tok/s) step 6883/76294 | train loss 3.269765 | norm 0.1461 | lr 9.39e-04 | (9894.10 ms | 52990 tok/s) step 6884/76294 | train loss 3.351456 | norm 0.1501 | lr 9.39e-04 | (9900.37 ms | 52956 tok/s) step 6885/76294 | train loss 3.278449 | norm 0.1476 | lr 9.39e-04 | (9895.54 ms | 52982 tok/s) step 6886/76294 | train loss 3.299546 | norm 0.1444 | lr 9.39e-04 | (9895.59 ms | 52982 tok/s) step 6887/76294 | train loss 3.252906 | norm 0.1477 | lr 9.39e-04 | (9895.95 ms | 52980 tok/s) step 6888/76294 | train loss 3.290017 | norm 0.1461 | lr 9.39e-04 | (9891.48 ms | 53004 tok/s) step 6889/76294 | train loss 3.319287 | norm 0.1318 | lr 9.38e-04 | (9890.37 ms | 53010 tok/s) step 6890/76294 | train loss 3.260237 | norm 0.1426 | lr 9.38e-04 | (9895.20 ms | 52984 tok/s) step 6891/76294 | train loss 3.277968 | norm 0.1421 | lr 9.38e-04 | (9895.43 ms | 52983 tok/s) step 6892/76294 | train loss 3.319455 | norm 0.1290 | lr 9.38e-04 | (9910.02 ms | 52905 tok/s) step 6893/76294 | train loss 3.264631 | norm 0.1291 | lr 9.38e-04 | (9901.49 ms | 52950 tok/s) step 6894/76294 | train loss 3.278665 | norm 0.1413 | lr 9.38e-04 | (9890.92 ms | 53007 tok/s) step 6895/76294 | train loss 3.253702 | norm 0.1361 | lr 9.38e-04 | (9900.46 ms | 52956 tok/s) step 6896/76294 | train loss 3.286331 | norm 0.1398 | lr 9.38e-04 | (9888.78 ms | 53018 tok/s) step 6897/76294 | train loss 3.315536 | norm 0.1339 | lr 9.38e-04 | (9898.39 ms | 52967 tok/s) step 6898/76294 | train loss 3.227568 | norm 0.1480 | lr 9.38e-04 | (9891.90 ms | 53002 tok/s) step 6899/76294 | train loss 
3.365586 | norm 0.1784 | lr 9.38e-04 | (9910.80 ms | 52901 tok/s) step 6900/76294 | train loss 3.274651 | norm 0.1693 | lr 9.38e-04 | (9941.29 ms | 52738 tok/s) step 6901/76294 | train loss 3.255631 | norm 0.1425 | lr 9.38e-04 | (9919.54 ms | 52854 tok/s) step 6902/76294 | train loss 3.285694 | norm 0.1550 | lr 9.37e-04 | (9915.02 ms | 52878 tok/s) step 6903/76294 | train loss 3.217109 | norm 0.1662 | lr 9.37e-04 | (9905.28 ms | 52930 tok/s) step 6904/76294 | train loss 3.298847 | norm 0.1656 | lr 9.37e-04 | (9892.06 ms | 53001 tok/s) step 6905/76294 | train loss 3.275772 | norm 0.1695 | lr 9.37e-04 | (9902.60 ms | 52945 tok/s) step 6906/76294 | train loss 3.227068 | norm 0.1480 | lr 9.37e-04 | (9889.24 ms | 53016 tok/s) step 6907/76294 | train loss 3.361190 | norm 0.1525 | lr 9.37e-04 | (9900.90 ms | 52954 tok/s) step 6908/76294 | train loss 3.244454 | norm 0.1400 | lr 9.37e-04 | (9890.50 ms | 53009 tok/s) step 6909/76294 | train loss 3.245905 | norm 0.1590 | lr 9.37e-04 | (9996.75 ms | 52446 tok/s) step 6910/76294 | train loss 3.246033 | norm 0.1369 | lr 9.37e-04 | (9892.27 ms | 53000 tok/s) step 6911/76294 | train loss 3.273076 | norm 0.1548 | lr 9.37e-04 | (9904.86 ms | 52932 tok/s) step 6912/76294 | train loss 3.381997 | norm 0.1471 | lr 9.37e-04 | (9887.26 ms | 53027 tok/s) step 6913/76294 | train loss 3.268842 | norm 0.1375 | lr 9.37e-04 | (9897.75 ms | 52970 tok/s) step 6914/76294 | train loss 3.282475 | norm 0.1416 | lr 9.36e-04 | (9900.05 ms | 52958 tok/s) step 6915/76294 | train loss 3.302018 | norm 0.1266 | lr 9.36e-04 | (9894.14 ms | 52990 tok/s) step 6916/76294 | train loss 3.286616 | norm 0.1405 | lr 9.36e-04 | (9956.34 ms | 52659 tok/s) step 6917/76294 | train loss 3.282106 | norm 0.1342 | lr 9.36e-04 | (9906.92 ms | 52921 tok/s) step 6918/76294 | train loss 3.259790 | norm 0.1308 | lr 9.36e-04 | (9973.93 ms | 52566 tok/s) step 6919/76294 | train loss 3.279411 | norm 0.1344 | lr 9.36e-04 | (9893.72 ms | 52992 tok/s) step 6920/76294 | train loss 
3.225152 | norm 0.1289 | lr 9.36e-04 | (9961.04 ms | 52634 tok/s) step 6921/76294 | train loss 3.389183 | norm 0.1529 | lr 9.36e-04 | (9903.43 ms | 52940 tok/s) step 6922/76294 | train loss 3.227151 | norm 0.1238 | lr 9.36e-04 | (9917.28 ms | 52866 tok/s) step 6923/76294 | train loss 3.287050 | norm 0.1312 | lr 9.36e-04 | (9965.80 ms | 52609 tok/s) step 6924/76294 | train loss 3.329071 | norm 0.1605 | lr 9.36e-04 | (9887.23 ms | 53027 tok/s) step 6925/76294 | train loss 3.322711 | norm 0.1435 | lr 9.36e-04 | (9897.69 ms | 52971 tok/s) step 6926/76294 | train loss 3.231865 | norm 0.1597 | lr 9.36e-04 | (9889.90 ms | 53012 tok/s) step 6927/76294 | train loss 3.331218 | norm 0.1438 | lr 9.35e-04 | (9906.68 ms | 52923 tok/s) step 6928/76294 | train loss 3.251954 | norm 0.1675 | lr 9.35e-04 | (9913.26 ms | 52888 tok/s) step 6929/76294 | train loss 3.360645 | norm 0.1822 | lr 9.35e-04 | (9890.16 ms | 53011 tok/s) step 6930/76294 | train loss 3.247143 | norm 0.1340 | lr 9.35e-04 | (9904.42 ms | 52935 tok/s) step 6931/76294 | train loss 3.304700 | norm 0.1639 | lr 9.35e-04 | (10981.29 ms | 47744 tok/s) step 6932/76294 | train loss 3.251966 | norm 0.1347 | lr 9.35e-04 | (9970.12 ms | 52586 tok/s) step 6933/76294 | train loss 3.418037 | norm 0.1687 | lr 9.35e-04 | (9903.67 ms | 52939 tok/s) step 6934/76294 | train loss 3.228082 | norm 0.1433 | lr 9.35e-04 | (9890.34 ms | 53010 tok/s) step 6935/76294 | train loss 3.322582 | norm 0.1688 | lr 9.35e-04 | (9928.38 ms | 52807 tok/s) step 6936/76294 | train loss 3.224436 | norm 0.1439 | lr 9.35e-04 | (9886.52 ms | 53031 tok/s) step 6937/76294 | train loss 3.263755 | norm 0.1677 | lr 9.35e-04 | (9898.70 ms | 52965 tok/s) step 6938/76294 | train loss 3.306318 | norm 0.1435 | lr 9.35e-04 | (9887.80 ms | 53024 tok/s) step 6939/76294 | train loss 3.270995 | norm 0.1472 | lr 9.35e-04 | (9898.61 ms | 52966 tok/s) step 6940/76294 | train loss 3.277403 | norm 0.1489 | lr 9.34e-04 | (9884.46 ms | 53042 tok/s) step 6941/76294 | train loss 
3.217575 | norm 0.1353 | lr 9.34e-04 | (9895.95 ms | 52980 tok/s) step 6942/76294 | train loss 3.257753 | norm 0.1476 | lr 9.34e-04 | (9896.29 ms | 52978 tok/s) step 6943/76294 | train loss 3.259722 | norm 0.1320 | lr 9.34e-04 | (9911.84 ms | 52895 tok/s) step 6944/76294 | train loss 3.293768 | norm 0.1480 | lr 9.34e-04 | (9916.13 ms | 52872 tok/s) step 6945/76294 | train loss 3.353453 | norm 0.1324 | lr 9.34e-04 | (9934.39 ms | 52775 tok/s) step 6946/76294 | train loss 3.243560 | norm 0.1396 | lr 9.34e-04 | (9888.85 ms | 53018 tok/s) step 6947/76294 | train loss 3.223757 | norm 0.1210 | lr 9.34e-04 | (9919.32 ms | 52855 tok/s) step 6948/76294 | train loss 3.293464 | norm 0.1319 | lr 9.34e-04 | (9888.11 ms | 53022 tok/s) step 6949/76294 | train loss 3.233299 | norm 0.1223 | lr 9.34e-04 | (9896.44 ms | 52977 tok/s) step 6950/76294 | train loss 3.260042 | norm 0.1355 | lr 9.34e-04 | (9914.95 ms | 52879 tok/s) step 6951/76294 | train loss 3.226953 | norm 0.1282 | lr 9.34e-04 | (9896.05 ms | 52980 tok/s) step 6952/76294 | train loss 3.320403 | norm 0.1266 | lr 9.34e-04 | (9891.75 ms | 53003 tok/s) step 6953/76294 | train loss 3.292512 | norm 0.1528 | lr 9.33e-04 | (9896.62 ms | 52976 tok/s) step 6954/76294 | train loss 3.268229 | norm 0.1703 | lr 9.33e-04 | (9891.99 ms | 53001 tok/s) step 6955/76294 | train loss 3.261417 | norm 0.1539 | lr 9.33e-04 | (9921.48 ms | 52844 tok/s) step 6956/76294 | train loss 3.327233 | norm 0.1706 | lr 9.33e-04 | (9887.48 ms | 53025 tok/s) step 6957/76294 | train loss 3.213872 | norm 0.1480 | lr 9.33e-04 | (9897.34 ms | 52973 tok/s) step 6958/76294 | train loss 3.258113 | norm 0.1563 | lr 9.33e-04 | (9890.67 ms | 53008 tok/s) step 6959/76294 | train loss 3.280269 | norm 0.1432 | lr 9.33e-04 | (9925.42 ms | 52823 tok/s) step 6960/76294 | train loss 3.303037 | norm 0.1476 | lr 9.33e-04 | (9962.88 ms | 52624 tok/s) step 6961/76294 | train loss 3.333933 | norm 0.1297 | lr 9.33e-04 | (9904.03 ms | 52937 tok/s) step 6962/76294 | train loss 
3.311306 | norm 0.1551 | lr 9.33e-04 | (9957.53 ms | 52652 tok/s) step 6963/76294 | train loss 3.258594 | norm 0.1484 | lr 9.33e-04 | (9927.28 ms | 52813 tok/s) step 6964/76294 | train loss 3.367249 | norm 0.1489 | lr 9.33e-04 | (9901.68 ms | 52949 tok/s) step 6965/76294 | train loss 3.310482 | norm 0.1444 | lr 9.33e-04 | (9897.10 ms | 52974 tok/s) step 6966/76294 | train loss 3.257375 | norm 0.1405 | lr 9.32e-04 | (9895.69 ms | 52981 tok/s) step 6967/76294 | train loss 3.311735 | norm 0.1907 | lr 9.32e-04 | (9894.46 ms | 52988 tok/s) step 6968/76294 | train loss 3.309840 | norm 0.1367 | lr 9.32e-04 | (9895.54 ms | 52982 tok/s) step 6969/76294 | train loss 3.309132 | norm 0.1607 | lr 9.32e-04 | (9950.29 ms | 52691 tok/s) step 6970/76294 | train loss 3.324912 | norm 0.1305 | lr 9.32e-04 | (9886.38 ms | 53031 tok/s) step 6971/76294 | train loss 3.366594 | norm 0.1411 | lr 9.32e-04 | (9899.74 ms | 52960 tok/s) step 6972/76294 | train loss 3.285437 | norm 0.1372 | lr 9.32e-04 | (10209.33 ms | 51354 tok/s) step 6973/76294 | train loss 3.322310 | norm 0.1188 | lr 9.32e-04 | (9954.70 ms | 52667 tok/s) step 6974/76294 | train loss 3.303524 | norm 0.1538 | lr 9.32e-04 | (9885.75 ms | 53035 tok/s) step 6975/76294 | train loss 3.280019 | norm 0.1196 | lr 9.32e-04 | (9976.22 ms | 52554 tok/s) step 6976/76294 | train loss 3.270707 | norm 0.1406 | lr 9.32e-04 | (10032.09 ms | 52261 tok/s) step 6977/76294 | train loss 3.320002 | norm 0.1400 | lr 9.32e-04 | (9888.60 ms | 53019 tok/s) step 6978/76294 | train loss 3.344591 | norm 0.1380 | lr 9.32e-04 | (9890.07 ms | 53012 tok/s) step 6979/76294 | train loss 3.324366 | norm 0.1470 | lr 9.31e-04 | (9930.26 ms | 52797 tok/s) step 6980/76294 | train loss 3.293148 | norm 0.1375 | lr 9.31e-04 | (9883.96 ms | 53044 tok/s) step 6981/76294 | train loss 3.305030 | norm 0.1394 | lr 9.31e-04 | (9951.14 ms | 52686 tok/s) step 6982/76294 | train loss 3.278667 | norm 0.1318 | lr 9.31e-04 | (9879.75 ms | 53067 tok/s) step 6983/76294 | train loss 
3.270825 | norm 0.1495 | lr 9.31e-04 | (9901.31 ms | 52951 tok/s) step 6984/76294 | train loss 3.252273 | norm 0.1327 | lr 9.31e-04 | (9912.97 ms | 52889 tok/s) step 6985/76294 | train loss 3.276824 | norm 0.1446 | lr 9.31e-04 | (9886.31 ms | 53032 tok/s) step 6986/76294 | train loss 3.302359 | norm 0.1300 | lr 9.31e-04 | (9896.83 ms | 52975 tok/s) step 6987/76294 | train loss 3.345186 | norm 0.1514 | lr 9.31e-04 | (9889.99 ms | 53012 tok/s) step 6988/76294 | train loss 3.268318 | norm 0.1473 | lr 9.31e-04 | (9885.97 ms | 53034 tok/s) step 6989/76294 | train loss 3.221714 | norm 0.1528 | lr 9.31e-04 | (9923.83 ms | 52831 tok/s) step 6990/76294 | train loss 3.337300 | norm 0.1388 | lr 9.31e-04 | (9882.33 ms | 53053 tok/s) step 6991/76294 | train loss 3.347303 | norm 0.1486 | lr 9.31e-04 | (9897.85 ms | 52970 tok/s) step 6992/76294 | train loss 3.251467 | norm 0.1292 | lr 9.30e-04 | (9883.34 ms | 53048 tok/s) step 6993/76294 | train loss 3.300776 | norm 0.1394 | lr 9.30e-04 | (9891.01 ms | 53007 tok/s) step 6994/76294 | train loss 3.347117 | norm 0.1239 | lr 9.30e-04 | (9883.60 ms | 53046 tok/s) step 6995/76294 | train loss 3.338689 | norm 0.1223 | lr 9.30e-04 | (9895.95 ms | 52980 tok/s) step 6996/76294 | train loss 3.378072 | norm 0.1174 | lr 9.30e-04 | (9884.48 ms | 53042 tok/s) step 6997/76294 | train loss 3.321175 | norm 0.1220 | lr 9.30e-04 | (9895.28 ms | 52984 tok/s) step 6998/76294 | train loss 3.295514 | norm 0.1325 | lr 9.30e-04 | (9883.50 ms | 53047 tok/s) step 6999/76294 | train loss 3.318023 | norm 0.1279 | lr 9.30e-04 | (9931.20 ms | 52792 tok/s) step 7000/76294 | train loss 3.281953 | norm 0.1225 | lr 9.30e-04 | (9882.42 ms | 53053 tok/s) val loss: 3.273243 saving model checkpoint to ./results/gpt2-350M-gqa/step_7000.pth step 7001/76294 | train loss 3.341462 | norm 0.1241 | lr 9.30e-04 | (9942.71 ms | 52731 tok/s) step 7002/76294 | train loss 3.317400 | norm 0.1239 | lr 9.30e-04 | (9855.87 ms | 53196 tok/s) step 7003/76294 | train loss 3.295655 | norm 
0.1395 | lr 9.30e-04 | (9883.97 ms | 53044 tok/s) step 7004/76294 | train loss 3.382154 | norm 0.1317 | lr 9.30e-04 | (9859.71 ms | 53175 tok/s) step 7005/76294 | train loss 3.311315 | norm 0.1259 | lr 9.29e-04 | (9878.95 ms | 53071 tok/s) step 7006/76294 | train loss 3.281986 | norm 0.1479 | lr 9.29e-04 | (9866.83 ms | 53136 tok/s) step 7007/76294 | train loss 3.248921 | norm 0.1245 | lr 9.29e-04 | (9874.88 ms | 53093 tok/s) step 7008/76294 | train loss 3.419615 | norm 0.2185 | lr 9.29e-04 | (9864.55 ms | 53149 tok/s) step 7009/76294 | train loss 3.298410 | norm 0.1827 | lr 9.29e-04 | (9868.16 ms | 53129 tok/s) step 7010/76294 | train loss 3.219992 | norm 0.1605 | lr 9.29e-04 | (9873.25 ms | 53102 tok/s) step 7011/76294 | train loss 3.288412 | norm 0.1547 | lr 9.29e-04 | (9875.08 ms | 53092 tok/s) step 7012/76294 | train loss 3.309571 | norm 0.1492 | lr 9.29e-04 | (9870.69 ms | 53116 tok/s) step 7013/76294 | train loss 3.289907 | norm 0.1421 | lr 9.29e-04 | (9873.67 ms | 53100 tok/s) step 7014/76294 | train loss 3.299057 | norm 0.1548 | lr 9.29e-04 | (9943.52 ms | 52727 tok/s) step 7015/76294 | train loss 3.353758 | norm 0.1479 | lr 9.29e-04 | (9880.64 ms | 53062 tok/s) step 7016/76294 | train loss 3.301515 | norm 0.1446 | lr 9.29e-04 | (9865.68 ms | 53143 tok/s) step 7017/76294 | train loss 3.298418 | norm 0.1512 | lr 9.28e-04 | (9917.23 ms | 52866 tok/s) step 7018/76294 | train loss 3.312470 | norm 0.1384 | lr 9.28e-04 | (9908.18 ms | 52915 tok/s) step 7019/76294 | train loss 3.323006 | norm 0.1465 | lr 9.28e-04 | (9881.95 ms | 53055 tok/s) step 7020/76294 | train loss 3.278876 | norm 0.1366 | lr 9.28e-04 | (9875.76 ms | 53088 tok/s) step 7021/76294 | train loss 3.335600 | norm 0.1543 | lr 9.28e-04 | (9892.67 ms | 52998 tok/s) step 7022/76294 | train loss 3.267092 | norm 0.1350 | lr 9.28e-04 | (9883.84 ms | 53045 tok/s) step 7023/76294 | train loss 3.352388 | norm 0.1508 | lr 9.28e-04 | (9876.81 ms | 53083 tok/s) step 7024/76294 | train loss 3.325219 | norm 
0.1664 | lr 9.28e-04 | (9877.66 ms | 53078 tok/s) step 7025/76294 | train loss 3.271428 | norm 0.1442 | lr 9.28e-04 | (9876.55 ms | 53084 tok/s) step 7026/76294 | train loss 3.293088 | norm 0.1475 | lr 9.28e-04 | (9901.11 ms | 52952 tok/s) step 7027/76294 | train loss 3.278406 | norm 0.1532 | lr 9.28e-04 | (9875.33 ms | 53091 tok/s) step 7028/76294 | train loss 3.260221 | norm 0.1606 | lr 9.28e-04 | (10918.95 ms | 48016 tok/s) step 7029/76294 | train loss 3.281115 | norm 0.1319 | lr 9.28e-04 | (9900.53 ms | 52956 tok/s) step 7030/76294 | train loss 3.340497 | norm 0.1488 | lr 9.27e-04 | (9868.87 ms | 53125 tok/s) step 7031/76294 | train loss 3.301722 | norm 0.1421 | lr 9.27e-04 | (9884.00 ms | 53044 tok/s) step 7032/76294 | train loss 3.306878 | norm 0.1433 | lr 9.27e-04 | (9876.56 ms | 53084 tok/s) step 7033/76294 | train loss 3.335829 | norm 0.1466 | lr 9.27e-04 | (9880.18 ms | 53065 tok/s) step 7034/76294 | train loss 3.429611 | norm 0.1374 | lr 9.27e-04 | (9876.26 ms | 53086 tok/s) step 7035/76294 | train loss 3.287917 | norm 0.1533 | lr 9.27e-04 | (9921.87 ms | 52842 tok/s) step 7036/76294 | train loss 3.296943 | norm 0.1337 | lr 9.27e-04 | (9909.27 ms | 52909 tok/s) step 7037/76294 | train loss 3.347638 | norm 0.1483 | lr 9.27e-04 | (9884.97 ms | 53039 tok/s) step 7038/76294 | train loss 3.302248 | norm 0.1362 | lr 9.27e-04 | (9883.61 ms | 53046 tok/s) step 7039/76294 | train loss 3.280547 | norm 0.1425 | lr 9.27e-04 | (9888.27 ms | 53021 tok/s) step 7040/76294 | train loss 3.304566 | norm 0.1326 | lr 9.27e-04 | (9882.00 ms | 53055 tok/s) step 7041/76294 | train loss 3.335762 | norm 0.1351 | lr 9.27e-04 | (9889.22 ms | 53016 tok/s) step 7042/76294 | train loss 3.343080 | norm 0.1338 | lr 9.27e-04 | (9880.71 ms | 53062 tok/s) step 7043/76294 | train loss 3.276073 | norm 0.1283 | lr 9.26e-04 | (9907.34 ms | 52919 tok/s) step 7044/76294 | train loss 3.256015 | norm 0.1298 | lr 9.26e-04 | (9884.18 ms | 53043 tok/s) step 7045/76294 | train loss 3.237168 | norm 
0.1607 | lr 9.26e-04 | (9926.98 ms | 52814 tok/s) step 7046/76294 | train loss 3.286255 | norm 0.1500 | lr 9.26e-04 | (9879.21 ms | 53070 tok/s) step 7047/76294 | train loss 3.295469 | norm 0.1439 | lr 9.26e-04 | (9902.98 ms | 52942 tok/s) step 7048/76294 | train loss 3.292068 | norm 0.1534 | lr 9.26e-04 | (9887.70 ms | 53024 tok/s) step 7049/76294 | train loss 3.325719 | norm 0.1324 | lr 9.26e-04 | (9880.77 ms | 53061 tok/s) step 7050/76294 | train loss 3.257949 | norm 0.1453 | lr 9.26e-04 | (9884.99 ms | 53039 tok/s) step 7051/76294 | train loss 3.237492 | norm 0.1355 | lr 9.26e-04 | (9921.17 ms | 52845 tok/s) step 7052/76294 | train loss 3.307282 | norm 0.1481 | lr 9.26e-04 | (11430.67 ms | 45867 tok/s) step 7053/76294 | train loss 3.311814 | norm 0.1272 | lr 9.26e-04 | (9910.50 ms | 52902 tok/s) step 7054/76294 | train loss 3.325979 | norm 0.1282 | lr 9.26e-04 | (9869.58 ms | 53122 tok/s) step 7055/76294 | train loss 3.323816 | norm 0.1324 | lr 9.26e-04 | (9876.33 ms | 53085 tok/s) step 7056/76294 | train loss 3.346103 | norm 0.1333 | lr 9.25e-04 | (9876.65 ms | 53084 tok/s) step 7057/76294 | train loss 3.261151 | norm 0.1226 | lr 9.25e-04 | (9923.85 ms | 52831 tok/s) step 7058/76294 | train loss 3.307226 | norm 0.1249 | lr 9.25e-04 | (9877.27 ms | 53080 tok/s) step 7059/76294 | train loss 3.281772 | norm 0.1258 | lr 9.25e-04 | (9891.87 ms | 53002 tok/s) step 7060/76294 | train loss 3.321385 | norm 0.1293 | lr 9.25e-04 | (11013.53 ms | 47604 tok/s) step 7061/76294 | train loss 3.311729 | norm 0.1326 | lr 9.25e-04 | (9875.82 ms | 53088 tok/s) step 7062/76294 | train loss 3.318654 | norm 0.1318 | lr 9.25e-04 | (11017.95 ms | 47585 tok/s) step 7063/76294 | train loss 3.281069 | norm 0.1132 | lr 9.25e-04 | (9881.64 ms | 53057 tok/s) step 7064/76294 | train loss 3.245296 | norm 0.1375 | lr 9.25e-04 | (9878.00 ms | 53076 tok/s) step 7065/76294 | train loss 3.270216 | norm 0.1278 | lr 9.25e-04 | (9942.83 ms | 52730 tok/s) step 7066/76294 | train loss 3.273746 | norm 
0.1308 | lr 9.25e-04 | (9887.34 ms | 53026 tok/s) step 7067/76294 | train loss 3.223493 | norm 0.1258 | lr 9.25e-04 | (9897.75 ms | 52970 tok/s) step 7068/76294 | train loss 3.261465 | norm 0.1339 | lr 9.24e-04 | (9927.76 ms | 52810 tok/s) step 7069/76294 | train loss 3.382546 | norm 0.1303 | lr 9.24e-04 | (9885.43 ms | 53036 tok/s) step 7070/76294 | train loss 3.256650 | norm 0.1373 | lr 9.24e-04 | (9891.78 ms | 53002 tok/s) step 7071/76294 | train loss 3.286226 | norm 0.1328 | lr 9.24e-04 | (9906.24 ms | 52925 tok/s) step 7072/76294 | train loss 3.307526 | norm 0.1354 | lr 9.24e-04 | (9911.10 ms | 52899 tok/s) step 7073/76294 | train loss 3.309779 | norm 0.1264 | lr 9.24e-04 | (9892.49 ms | 52999 tok/s) step 7074/76294 | train loss 3.260552 | norm 0.1368 | lr 9.24e-04 | (9927.67 ms | 52811 tok/s) step 7075/76294 | train loss 3.302475 | norm 0.1595 | lr 9.24e-04 | (9893.07 ms | 52995 tok/s) step 7076/76294 | train loss 3.316208 | norm 0.1346 | lr 9.24e-04 | (9897.01 ms | 52974 tok/s) step 7077/76294 | train loss 3.287866 | norm 0.1189 | lr 9.24e-04 | (9896.80 ms | 52976 tok/s) step 7078/76294 | train loss 3.266104 | norm 0.1247 | lr 9.24e-04 | (9916.25 ms | 52872 tok/s) step 7079/76294 | train loss 3.269848 | norm 0.1348 | lr 9.24e-04 | (9893.72 ms | 52992 tok/s) step 7080/76294 | train loss 3.312555 | norm 0.1168 | lr 9.24e-04 | (9897.26 ms | 52973 tok/s) step 7081/76294 | train loss 3.268048 | norm 0.1231 | lr 9.23e-04 | (9897.11 ms | 52974 tok/s) step 7082/76294 | train loss 3.205168 | norm 0.1304 | lr 9.23e-04 | (9896.53 ms | 52977 tok/s) step 7083/76294 | train loss 3.401377 | norm 0.1362 | lr 9.23e-04 | (9891.44 ms | 53004 tok/s) step 7084/76294 | train loss 3.283951 | norm 0.1499 | lr 9.23e-04 | (9914.15 ms | 52883 tok/s) step 7085/76294 | train loss 3.269131 | norm 0.1179 | lr 9.23e-04 | (9894.74 ms | 52987 tok/s) step 7086/76294 | train loss 3.268642 | norm 0.1467 | lr 9.23e-04 | (9897.46 ms | 52972 tok/s) step 7087/76294 | train loss 3.309831 | norm 
0.1232 | lr 9.23e-04 | (9897.59 ms | 52971 tok/s) step 7088/76294 | train loss 3.354736 | norm 0.1373 | lr 9.23e-04 | (9997.69 ms | 52441 tok/s) step 7089/76294 | train loss 3.344723 | norm 0.1263 | lr 9.23e-04 | (9895.63 ms | 52982 tok/s) step 7090/76294 | train loss 3.330450 | norm 0.1368 | lr 9.23e-04 | (9888.27 ms | 53021 tok/s) step 7091/76294 | train loss 3.297463 | norm 0.1453 | lr 9.23e-04 | (9912.71 ms | 52890 tok/s) step 7092/76294 | train loss 3.303807 | norm 0.1491 | lr 9.23e-04 | (9888.34 ms | 53021 tok/s) step 7093/76294 | train loss 3.273538 | norm 0.1511 | lr 9.23e-04 | (9892.75 ms | 52997 tok/s) step 7094/76294 | train loss 3.298756 | norm 0.1555 | lr 9.22e-04 | (9929.28 ms | 52802 tok/s) step 7095/76294 | train loss 3.312606 | norm 0.1649 | lr 9.22e-04 | (9893.30 ms | 52994 tok/s) step 7096/76294 | train loss 3.301631 | norm 0.1623 | lr 9.22e-04 | (9887.33 ms | 53026 tok/s) step 7097/76294 | train loss 3.272724 | norm 0.1530 | lr 9.22e-04 | (9900.61 ms | 52955 tok/s) step 7098/76294 | train loss 3.313014 | norm 0.1429 | lr 9.22e-04 | (9889.03 ms | 53017 tok/s) step 7099/76294 | train loss 3.322240 | norm 0.1554 | lr 9.22e-04 | (9898.79 ms | 52965 tok/s) step 7100/76294 | train loss 3.327735 | norm 0.1350 | lr 9.22e-04 | (9957.05 ms | 52655 tok/s) step 7101/76294 | train loss 3.336675 | norm 0.1468 | lr 9.22e-04 | (9893.87 ms | 52991 tok/s) step 7102/76294 | train loss 3.277391 | norm 0.1379 | lr 9.22e-04 | (9944.98 ms | 52719 tok/s) step 7103/76294 | train loss 3.263604 | norm 0.1490 | lr 9.22e-04 | (9888.75 ms | 53019 tok/s) step 7104/76294 | train loss 3.302084 | norm 0.1497 | lr 9.22e-04 | (9886.14 ms | 53033 tok/s) step 7105/76294 | train loss 3.360246 | norm 0.1510 | lr 9.22e-04 | (9927.79 ms | 52810 tok/s) step 7106/76294 | train loss 3.272316 | norm 0.1437 | lr 9.22e-04 | (9934.08 ms | 52777 tok/s) step 7107/76294 | train loss 3.357701 | norm 0.1443 | lr 9.21e-04 | (9894.58 ms | 52987 tok/s) step 7108/76294 | train loss 3.270616 | norm 
0.1218 | lr 9.21e-04 | (9977.76 ms | 52546 tok/s) step 7109/76294 | train loss 3.300563 | norm 0.1362 | lr 9.21e-04 | (9894.95 ms | 52985 tok/s) step 7110/76294 | train loss 3.300069 | norm 0.1313 | lr 9.21e-04 | (9893.68 ms | 52992 tok/s) step 7111/76294 | train loss 3.302273 | norm 0.1315 | lr 9.21e-04 | (9891.89 ms | 53002 tok/s) step 7112/76294 | train loss 3.352116 | norm 0.1390 | lr 9.21e-04 | (9898.73 ms | 52965 tok/s) step 7113/76294 | train loss 3.303935 | norm 0.1214 | lr 9.21e-04 | (9896.54 ms | 52977 tok/s) step 7114/76294 | train loss 3.258817 | norm 0.1300 | lr 9.21e-04 | (9905.55 ms | 52929 tok/s) step 7115/76294 | train loss 3.340616 | norm 0.1302 | lr 9.21e-04 | (9891.17 ms | 53006 tok/s) step 7116/76294 | train loss 3.394857 | norm 0.1144 | lr 9.21e-04 | (9893.35 ms | 52994 tok/s) step 7117/76294 | train loss 3.297661 | norm 0.1257 | lr 9.21e-04 | (9972.48 ms | 52573 tok/s) step 7118/76294 | train loss 3.361432 | norm 0.1225 | lr 9.21e-04 | (9890.85 ms | 53007 tok/s) step 7119/76294 | train loss 3.335827 | norm 0.1317 | lr 9.20e-04 | (9903.63 ms | 52939 tok/s) step 7120/76294 | train loss 3.300395 | norm 0.1149 | lr 9.20e-04 | (9892.36 ms | 52999 tok/s) step 7121/76294 | train loss 3.324131 | norm 0.1435 | lr 9.20e-04 | (9897.68 ms | 52971 tok/s) step 7122/76294 | train loss 3.301540 | norm 0.1439 | lr 9.20e-04 | (9889.20 ms | 53016 tok/s) step 7123/76294 | train loss 3.319414 | norm 0.1439 | lr 9.20e-04 | (9915.89 ms | 52874 tok/s) step 7124/76294 | train loss 3.262940 | norm 0.1455 | lr 9.20e-04 | (9887.54 ms | 53025 tok/s) step 7125/76294 | train loss 3.333270 | norm 0.1411 | lr 9.20e-04 | (9898.46 ms | 52967 tok/s) step 7126/76294 | train loss 3.281555 | norm 0.1394 | lr 9.20e-04 | (11245.59 ms | 46622 tok/s) step 7127/76294 | train loss 3.187513 | norm 0.1770 | lr 9.20e-04 | (9883.21 ms | 53048 tok/s) step 7128/76294 | train loss 3.275611 | norm 0.1574 | lr 9.20e-04 | (9888.46 ms | 53020 tok/s) step 7129/76294 | train loss 3.306870 | norm 
0.1551 | lr 9.20e-04 | (9891.26 ms | 53005 tok/s) step 7130/76294 | train loss 3.320816 | norm 0.1429 | lr 9.20e-04 | (9887.66 ms | 53024 tok/s) step 7131/76294 | train loss 3.301173 | norm 0.1513 | lr 9.20e-04 | (9889.75 ms | 53013 tok/s) step 7132/76294 | train loss 3.335091 | norm 0.1471 | lr 9.19e-04 | (9892.20 ms | 53000 tok/s) step 7133/76294 | train loss 3.353228 | norm 0.1456 | lr 9.19e-04 | (9895.49 ms | 52983 tok/s) step 7134/76294 | train loss 3.309232 | norm 0.1421 | lr 9.19e-04 | (9884.97 ms | 53039 tok/s) step 7135/76294 | train loss 3.266285 | norm 0.1461 | lr 9.19e-04 | (9889.29 ms | 53016 tok/s) step 7136/76294 | train loss 3.253439 | norm 0.1338 | lr 9.19e-04 | (9888.99 ms | 53017 tok/s) step 7137/76294 | train loss 3.325201 | norm 0.1359 | lr 9.19e-04 | (9894.97 ms | 52985 tok/s) step 7138/76294 | train loss 3.265989 | norm 0.1298 | lr 9.19e-04 | (9931.20 ms | 52792 tok/s) step 7139/76294 | train loss 3.315390 | norm 0.1423 | lr 9.19e-04 | (9891.52 ms | 53004 tok/s) step 7140/76294 | train loss 3.290034 | norm 0.1490 | lr 9.19e-04 | (9932.60 ms | 52785 tok/s) step 7141/76294 | train loss 3.295077 | norm 0.1511 | lr 9.19e-04 | (9890.66 ms | 53008 tok/s) step 7142/76294 | train loss 3.332429 | norm 0.1473 | lr 9.19e-04 | (9894.62 ms | 52987 tok/s) step 7143/76294 | train loss 3.300944 | norm 0.1390 | lr 9.19e-04 | (9897.67 ms | 52971 tok/s) step 7144/76294 | train loss 3.323485 | norm 0.1456 | lr 9.19e-04 | (9925.52 ms | 52822 tok/s) step 7145/76294 | train loss 3.391969 | norm 0.1609 | lr 9.18e-04 | (9901.97 ms | 52948 tok/s) step 7146/76294 | train loss 3.326525 | norm 0.1345 | lr 9.18e-04 | (9886.77 ms | 53029 tok/s) step 7147/76294 | train loss 3.335438 | norm 0.1511 | lr 9.18e-04 | (9894.00 ms | 52990 tok/s) step 7148/76294 | train loss 3.281538 | norm 0.1219 | lr 9.18e-04 | (9902.22 ms | 52947 tok/s) step 7149/76294 | train loss 3.290366 | norm 0.1602 | lr 9.18e-04 | (9895.16 ms | 52984 tok/s) step 7150/76294 | train loss 3.332943 | norm 
0.1516 | lr 9.18e-04 | (9900.66 ms | 52955 tok/s) step 7151/76294 | train loss 3.311034 | norm 0.1374 | lr 9.18e-04 | (9892.06 ms | 53001 tok/s) step 7152/76294 | train loss 3.277640 | norm 0.1538 | lr 9.18e-04 | (9918.82 ms | 52858 tok/s) step 7153/76294 | train loss 3.332338 | norm 0.1422 | lr 9.18e-04 | (9895.33 ms | 52983 tok/s) step 7154/76294 | train loss 3.312180 | norm 0.1776 | lr 9.18e-04 | (9908.60 ms | 52912 tok/s) step 7155/76294 | train loss 3.298820 | norm 0.1206 | lr 9.18e-04 | (9893.40 ms | 52994 tok/s) step 7156/76294 | train loss 3.388723 | norm 0.1418 | lr 9.18e-04 | (9891.77 ms | 53002 tok/s) step 7157/76294 | train loss 3.269925 | norm 0.1526 | lr 9.17e-04 | (9898.69 ms | 52965 tok/s) step 7158/76294 | train loss 3.301418 | norm 0.1980 | lr 9.17e-04 | (9887.27 ms | 53027 tok/s) step 7159/76294 | train loss 3.265513 | norm 0.1367 | lr 9.17e-04 | (9898.26 ms | 52968 tok/s) step 7160/76294 | train loss 3.318354 | norm 0.1962 | lr 9.17e-04 | (9893.16 ms | 52995 tok/s) step 7161/76294 | train loss 3.286365 | norm 0.1434 | lr 9.17e-04 | (9904.67 ms | 52933 tok/s) step 7162/76294 | train loss 3.307425 | norm 0.2115 | lr 9.17e-04 | (9905.18 ms | 52931 tok/s) step 7163/76294 | train loss 3.279024 | norm 0.1637 | lr 9.17e-04 | (10807.29 ms | 48512 tok/s) step 7164/76294 | train loss 3.290808 | norm 0.1624 | lr 9.17e-04 | (9881.65 ms | 53057 tok/s) step 7165/76294 | train loss 3.370306 | norm 0.1547 | lr 9.17e-04 | (9900.73 ms | 52954 tok/s) step 7166/76294 | train loss 3.295521 | norm 0.1458 | lr 9.17e-04 | (9880.77 ms | 53061 tok/s) step 7167/76294 | train loss 3.355666 | norm 0.1496 | lr 9.17e-04 | (9893.06 ms | 52996 tok/s) step 7168/76294 | train loss 3.289831 | norm 0.1428 | lr 9.17e-04 | (9887.04 ms | 53028 tok/s) step 7169/76294 | train loss 3.314661 | norm 0.1386 | lr 9.17e-04 | (9892.97 ms | 52996 tok/s) step 7170/76294 | train loss 3.270456 | norm 0.1462 | lr 9.16e-04 | (9885.44 ms | 53036 tok/s) step 7171/76294 | train loss 3.273653 | norm 
0.1303 | lr 9.16e-04 | (9885.20 ms | 53038 tok/s) step 7172/76294 | train loss 3.380665 | norm 0.1445 | lr 9.16e-04 | (9890.45 ms | 53009 tok/s) step 7173/76294 | train loss 3.322044 | norm 0.1672 | lr 9.16e-04 | (9897.35 ms | 52973 tok/s) step 7174/76294 | train loss 3.234183 | norm 0.1684 | lr 9.16e-04 | (9894.37 ms | 52989 tok/s) step 7175/76294 | train loss 3.267396 | norm 0.1478 | lr 9.16e-04 | (9887.71 ms | 53024 tok/s) step 7176/76294 | train loss 3.381061 | norm 0.1558 | lr 9.16e-04 | (9957.08 ms | 52655 tok/s) step 7177/76294 | train loss 3.302641 | norm 0.1527 | lr 9.16e-04 | (9893.44 ms | 52993 tok/s) step 7178/76294 | train loss 3.300288 | norm 0.1396 | lr 9.16e-04 | (9911.16 ms | 52899 tok/s) step 7179/76294 | train loss 3.338736 | norm 0.1467 | lr 9.16e-04 | (9956.51 ms | 52658 tok/s) step 7180/76294 | train loss 3.283250 | norm 0.1311 | lr 9.16e-04 | (9922.72 ms | 52837 tok/s) step 7181/76294 | train loss 3.289354 | norm 0.1303 | lr 9.16e-04 | (9908.40 ms | 52913 tok/s) step 7182/76294 | train loss 3.315251 | norm 0.1557 | lr 9.16e-04 | (9886.56 ms | 53030 tok/s) step 7183/76294 | train loss 3.288968 | norm 0.1252 | lr 9.15e-04 | (9894.18 ms | 52990 tok/s) step 7184/76294 | train loss 3.402830 | norm 0.1374 | lr 9.15e-04 | (9885.51 ms | 53036 tok/s) step 7185/76294 | train loss 3.299373 | norm 0.1239 | lr 9.15e-04 | (9895.12 ms | 52985 tok/s) step 7186/76294 | train loss 3.267707 | norm 0.1318 | lr 9.15e-04 | (9884.38 ms | 53042 tok/s) step 7187/76294 | train loss 3.326853 | norm 0.1439 | lr 9.15e-04 | (9929.00 ms | 52804 tok/s) step 7188/76294 | train loss 3.281196 | norm 0.1599 | lr 9.15e-04 | (9886.63 ms | 53030 tok/s) step 7189/76294 | train loss 3.258592 | norm 0.1428 | lr 9.15e-04 | (9912.30 ms | 52893 tok/s) step 7190/76294 | train loss 3.286875 | norm 0.1570 | lr 9.15e-04 | (9886.98 ms | 53028 tok/s) step 7191/76294 | train loss 3.351943 | norm 0.1392 | lr 9.15e-04 | (9905.48 ms | 52929 tok/s) step 7192/76294 | train loss 3.279582 | norm 
0.1503 | lr 9.15e-04 | (9891.48 ms | 53004 tok/s) step 7193/76294 | train loss 3.344104 | norm 0.1551 | lr 9.15e-04 | (9885.08 ms | 53038 tok/s) step 7194/76294 | train loss 3.299653 | norm 0.1567 | lr 9.15e-04 | (9891.27 ms | 53005 tok/s) step 7195/76294 | train loss 3.321160 | norm 0.1497 | lr 9.14e-04 | (9888.77 ms | 53019 tok/s) step 7196/76294 | train loss 3.272023 | norm 0.1597 | lr 9.14e-04 | (9895.26 ms | 52984 tok/s) step 7197/76294 | train loss 3.326783 | norm 0.1389 | lr 9.14e-04 | (9890.83 ms | 53007 tok/s) step 7198/76294 | train loss 3.380544 | norm 0.1568 | lr 9.14e-04 | (9925.00 ms | 52825 tok/s) step 7199/76294 | train loss 3.351446 | norm 0.1635 | lr 9.14e-04 | (9886.04 ms | 53033 tok/s) step 7200/76294 | train loss 3.327750 | norm 0.1357 | lr 9.14e-04 | (9886.21 ms | 53032 tok/s) step 7201/76294 | train loss 3.375696 | norm 0.1530 | lr 9.14e-04 | (9892.16 ms | 53000 tok/s) step 7202/76294 | train loss 3.294089 | norm 0.1333 | lr 9.14e-04 | (9887.57 ms | 53025 tok/s) step 7203/76294 | train loss 3.223698 | norm 0.1343 | lr 9.14e-04 | (9896.08 ms | 52979 tok/s) step 7204/76294 | train loss 3.309247 | norm 0.1284 | lr 9.14e-04 | (9888.53 ms | 53020 tok/s) step 7205/76294 | train loss 3.343273 | norm 0.1317 | lr 9.14e-04 | (9899.27 ms | 52962 tok/s) step 7206/76294 | train loss 3.265939 | norm 0.1385 | lr 9.14e-04 | (9951.59 ms | 52684 tok/s) step 7207/76294 | train loss 3.286228 | norm 0.1200 | lr 9.14e-04 | (9898.43 ms | 52967 tok/s) step 7208/76294 | train loss 3.312589 | norm 0.1322 | lr 9.13e-04 | (9902.89 ms | 52943 tok/s) step 7209/76294 | train loss 3.265840 | norm 0.1292 | lr 9.13e-04 | (9931.70 ms | 52789 tok/s) step 7210/76294 | train loss 3.268392 | norm 0.1235 | lr 9.13e-04 | (9889.77 ms | 53013 tok/s) step 7211/76294 | train loss 3.382627 | norm 0.1439 | lr 9.13e-04 | (9896.83 ms | 52975 tok/s) step 7212/76294 | train loss 3.285421 | norm 0.1219 | lr 9.13e-04 | (9898.36 ms | 52967 tok/s) step 7213/76294 | train loss 3.318944 | norm 
0.1353 | lr 9.13e-04 | (9884.74 ms | 53040 tok/s) step 7214/76294 | train loss 3.295020 | norm 0.1339 | lr 9.13e-04 | (9947.90 ms | 52703 tok/s) step 7215/76294 | train loss 3.282959 | norm 0.1232 | lr 9.13e-04 | (9891.83 ms | 53002 tok/s) step 7216/76294 | train loss 3.302507 | norm 0.1336 | lr 9.13e-04 | (9889.29 ms | 53016 tok/s) step 7217/76294 | train loss 3.371644 | norm 0.1476 | lr 9.13e-04 | (9927.66 ms | 52811 tok/s) step 7218/76294 | train loss 3.400464 | norm 0.1166 | lr 9.13e-04 | (9909.11 ms | 52910 tok/s) step 7219/76294 | train loss 3.307032 | norm 0.1487 | lr 9.13e-04 | (9900.91 ms | 52954 tok/s) step 7220/76294 | train loss 3.292219 | norm 0.1174 | lr 9.12e-04 | (9918.14 ms | 52862 tok/s) step 7221/76294 | train loss 3.324088 | norm 0.1374 | lr 9.12e-04 | (9899.01 ms | 52964 tok/s) step 7222/76294 | train loss 3.348573 | norm 0.1394 | lr 9.12e-04 | (9908.76 ms | 52912 tok/s) step 7223/76294 | train loss 3.331298 | norm 0.1350 | lr 9.12e-04 | (9899.93 ms | 52959 tok/s) step 7224/76294 | train loss 3.296258 | norm 0.1326 | lr 9.12e-04 | (11052.83 ms | 47435 tok/s) step 7225/76294 | train loss 3.292229 | norm 0.1471 | lr 9.12e-04 | (9889.55 ms | 53014 tok/s) step 7226/76294 | train loss 3.274945 | norm 0.1244 | lr 9.12e-04 | (9912.53 ms | 52891 tok/s) step 7227/76294 | train loss 3.290082 | norm 0.1287 | lr 9.12e-04 | (9950.71 ms | 52689 tok/s) step 7228/76294 | train loss 3.284500 | norm 0.1399 | lr 9.12e-04 | (9928.56 ms | 52806 tok/s) step 7229/76294 | train loss 3.311878 | norm 0.1255 | lr 9.12e-04 | (9891.83 ms | 53002 tok/s) step 7230/76294 | train loss 3.330244 | norm 0.1239 | lr 9.12e-04 | (9906.11 ms | 52926 tok/s) step 7231/76294 | train loss 3.283735 | norm 0.1314 | lr 9.12e-04 | (9895.89 ms | 52980 tok/s) step 7232/76294 | train loss 3.350710 | norm 0.1258 | lr 9.12e-04 | (9902.92 ms | 52943 tok/s) step 7233/76294 | train loss 3.271600 | norm 0.1366 | lr 9.11e-04 | (9895.51 ms | 52982 tok/s) step 7234/76294 | train loss 3.269465 | norm 
0.1263 | lr 9.11e-04 | (9897.55 ms | 52971 tok/s) step 7235/76294 | train loss 3.308743 | norm 0.1256 | lr 9.11e-04 | (9927.70 ms | 52811 tok/s) step 7236/76294 | train loss 3.365678 | norm 0.1210 | lr 9.11e-04 | (9914.13 ms | 52883 tok/s) step 7237/76294 | train loss 3.299454 | norm 0.1195 | lr 9.11e-04 | (9899.82 ms | 52959 tok/s) step 7238/76294 | train loss 3.329653 | norm 0.1340 | lr 9.11e-04 | (9900.15 ms | 52958 tok/s) step 7239/76294 | train loss 3.325379 | norm 0.1253 | lr 9.11e-04 | (9901.65 ms | 52950 tok/s) step 7240/76294 | train loss 3.317713 | norm 0.1250 | lr 9.11e-04 | (9935.69 ms | 52768 tok/s) step 7241/76294 | train loss 3.375454 | norm 0.1286 | lr 9.11e-04 | (9895.27 ms | 52984 tok/s) step 7242/76294 | train loss 3.272236 | norm 0.1270 | lr 9.11e-04 | (9908.56 ms | 52913 tok/s) step 7243/76294 | train loss 3.301263 | norm 0.1463 | lr 9.11e-04 | (9899.80 ms | 52959 tok/s) step 7244/76294 | train loss 3.305297 | norm 0.1370 | lr 9.11e-04 | (9895.36 ms | 52983 tok/s) step 7245/76294 | train loss 3.312858 | norm 0.1401 | lr 9.10e-04 | (9902.23 ms | 52946 tok/s) step 7246/76294 | train loss 3.314564 | norm 0.1451 | lr 9.10e-04 | (9911.90 ms | 52895 tok/s) step 7247/76294 | train loss 3.365044 | norm 0.1496 | lr 9.10e-04 | (9901.05 ms | 52953 tok/s) step 7248/76294 | train loss 3.247541 | norm 0.1483 | lr 9.10e-04 | (9900.10 ms | 52958 tok/s) step 7249/76294 | train loss 3.321488 | norm 0.1490 | lr 9.10e-04 | (9917.35 ms | 52866 tok/s) step 7250/76294 | train loss 3.310913 | norm 0.2340 | lr 9.10e-04 | (9889.12 ms | 53017 tok/s) val loss: 3.268165 saving model checkpoint to ./results/gpt2-350M-gqa/step_7250.pth step 7251/76294 | train loss 3.351744 | norm 0.1358 | lr 9.10e-04 | (9957.36 ms | 52653 tok/s) step 7252/76294 | train loss 3.360640 | norm 0.1373 | lr 9.10e-04 | (9857.42 ms | 53187 tok/s) step 7253/76294 | train loss 3.368536 | norm 0.1369 | lr 9.10e-04 | (9863.34 ms | 53155 tok/s) step 7254/76294 | train loss 3.326253 | norm 0.1317 | lr 
9.10e-04 | (9859.49 ms | 53176 tok/s) step 7255/76294 | train loss 3.315626 | norm 0.1292 | lr 9.10e-04 | (9866.10 ms | 53140 tok/s) step 7256/76294 | train loss 3.253579 | norm 0.1293 | lr 9.10e-04 | (9871.35 ms | 53112 tok/s) step 7257/76294 | train loss 3.389536 | norm 0.1375 | lr 9.10e-04 | (9898.22 ms | 52968 tok/s) step 7258/76294 | train loss 3.332862 | norm 0.4216 | lr 9.09e-04 | (9931.50 ms | 52790 tok/s) step 7259/76294 | train loss 3.243716 | norm 0.1552 | lr 9.09e-04 | (9887.58 ms | 53025 tok/s) step 7260/76294 | train loss 3.245620 | norm 0.1884 | lr 9.09e-04 | (9910.14 ms | 52904 tok/s) step 7261/76294 | train loss 3.375672 | norm 0.1380 | lr 9.09e-04 | (9956.00 ms | 52661 tok/s) step 7262/76294 | train loss 3.289364 | norm 0.1849 | lr 9.09e-04 | (9892.65 ms | 52998 tok/s) step 7263/76294 | train loss 3.257485 | norm 0.1528 | lr 9.09e-04 | (9942.95 ms | 52730 tok/s) step 7264/76294 | train loss 3.252663 | norm 0.1480 | lr 9.09e-04 | (9915.80 ms | 52874 tok/s) step 7265/76294 | train loss 3.211321 | norm 0.1504 | lr 9.09e-04 | (9895.68 ms | 52982 tok/s) step 7266/76294 | train loss 3.266292 | norm 0.1362 | lr 9.09e-04 | (9965.11 ms | 52612 tok/s) step 7267/76294 | train loss 3.287812 | norm 0.1396 | lr 9.09e-04 | (9900.20 ms | 52957 tok/s) step 7268/76294 | train loss 3.297506 | norm 0.1417 | lr 9.09e-04 | (9909.34 ms | 52908 tok/s) step 7269/76294 | train loss 3.250633 | norm 0.1253 | lr 9.09e-04 | (9939.41 ms | 52748 tok/s) step 7270/76294 | train loss 3.253767 | norm 0.1372 | lr 9.08e-04 | (9935.20 ms | 52771 tok/s) step 7271/76294 | train loss 3.292327 | norm 0.1589 | lr 9.08e-04 | (9913.00 ms | 52889 tok/s) step 7272/76294 | train loss 3.346296 | norm 0.1646 | lr 9.08e-04 | (9901.54 ms | 52950 tok/s) step 7273/76294 | train loss 3.243343 | norm 0.1323 | lr 9.08e-04 | (9933.81 ms | 52778 tok/s) step 7274/76294 | train loss 3.350696 | norm 0.1617 | lr 9.08e-04 | (9896.25 ms | 52978 tok/s) step 7275/76294 | train loss 3.298674 | norm 0.1408 | lr 
9.08e-04 | (9898.75 ms | 52965 tok/s) step 7276/76294 | train loss 3.298710 | norm 0.1535 | lr 9.08e-04 | (9908.71 ms | 52912 tok/s) step 7277/76294 | train loss 3.308622 | norm 0.1262 | lr 9.08e-04 | (9914.40 ms | 52881 tok/s) step 7278/76294 | train loss 3.286238 | norm 0.1392 | lr 9.08e-04 | (9898.51 ms | 52966 tok/s) step 7279/76294 | train loss 3.284229 | norm 0.1457 | lr 9.08e-04 | (9901.95 ms | 52948 tok/s) step 7280/76294 | train loss 3.302708 | norm 0.1443 | lr 9.08e-04 | (9922.51 ms | 52838 tok/s) step 7281/76294 | train loss 3.273029 | norm 0.1542 | lr 9.08e-04 | (9900.33 ms | 52957 tok/s) step 7282/76294 | train loss 3.374543 | norm 0.1512 | lr 9.08e-04 | (9895.73 ms | 52981 tok/s) step 7283/76294 | train loss 3.303997 | norm 0.1514 | lr 9.07e-04 | (9895.54 ms | 52982 tok/s) step 7284/76294 | train loss 3.267262 | norm 0.1589 | lr 9.07e-04 | (9898.61 ms | 52966 tok/s) step 7285/76294 | train loss 3.221434 | norm 0.1486 | lr 9.07e-04 | (9894.88 ms | 52986 tok/s) step 7286/76294 | train loss 3.327838 | norm 0.1635 | lr 9.07e-04 | (9933.78 ms | 52778 tok/s) step 7287/76294 | train loss 3.272907 | norm 0.1293 | lr 9.07e-04 | (9890.59 ms | 53009 tok/s) step 7288/76294 | train loss 3.251104 | norm 0.1543 | lr 9.07e-04 | (9922.49 ms | 52838 tok/s) step 7289/76294 | train loss 3.304932 | norm 0.1320 | lr 9.07e-04 | (9912.64 ms | 52891 tok/s) step 7290/76294 | train loss 3.307229 | norm 0.1435 | lr 9.07e-04 | (9919.07 ms | 52857 tok/s) step 7291/76294 | train loss 3.327505 | norm 0.1205 | lr 9.07e-04 | (9891.98 ms | 53001 tok/s) step 7292/76294 | train loss 3.215985 | norm 0.1402 | lr 9.07e-04 | (9894.46 ms | 52988 tok/s) step 7293/76294 | train loss 3.271811 | norm 0.1402 | lr 9.07e-04 | (9894.02 ms | 52990 tok/s) step 7294/76294 | train loss 3.271919 | norm 0.1361 | lr 9.07e-04 | (9889.95 ms | 53012 tok/s) step 7295/76294 | train loss 3.272655 | norm 0.1347 | lr 9.06e-04 | (9892.94 ms | 52996 tok/s) step 7296/76294 | train loss 3.282727 | norm 0.1367 | lr 
9.06e-04 | (9924.70 ms | 52827 tok/s) step 7297/76294 | train loss 3.286912 | norm 0.1230 | lr 9.06e-04 | (9899.57 ms | 52961 tok/s) step 7298/76294 | train loss 3.303918 | norm 0.1384 | lr 9.06e-04 | (9893.81 ms | 52992 tok/s) step 7299/76294 | train loss 3.263724 | norm 0.1278 | lr 9.06e-04 | (9898.05 ms | 52969 tok/s) step 7300/76294 | train loss 3.319528 | norm 0.1278 | lr 9.06e-04 | (9885.63 ms | 53035 tok/s) step 7301/76294 | train loss 3.265531 | norm 0.1290 | lr 9.06e-04 | (9962.65 ms | 52625 tok/s) step 7302/76294 | train loss 3.260209 | norm 0.1281 | lr 9.06e-04 | (9889.24 ms | 53016 tok/s) step 7303/76294 | train loss 3.334098 | norm 0.1395 | lr 9.06e-04 | (9893.38 ms | 52994 tok/s) step 7304/76294 | train loss 3.218857 | norm 0.1551 | lr 9.06e-04 | (9951.59 ms | 52684 tok/s) step 7305/76294 | train loss 3.329677 | norm 0.1238 | lr 9.06e-04 | (9897.69 ms | 52971 tok/s) step 7306/76294 | train loss 3.238151 | norm 0.1288 | lr 9.06e-04 | (9884.82 ms | 53040 tok/s) step 7307/76294 | train loss 3.326135 | norm 0.1340 | lr 9.06e-04 | (9895.07 ms | 52985 tok/s) step 7308/76294 | train loss 3.318204 | norm 0.1261 | lr 9.05e-04 | (9931.50 ms | 52790 tok/s) step 7309/76294 | train loss 3.341312 | norm 0.1310 | lr 9.05e-04 | (9894.78 ms | 52986 tok/s) step 7310/76294 | train loss 3.241651 | norm 0.1332 | lr 9.05e-04 | (9894.51 ms | 52988 tok/s) step 7311/76294 | train loss 3.297639 | norm 0.1316 | lr 9.05e-04 | (9893.76 ms | 52992 tok/s) step 7312/76294 | train loss 3.250954 | norm 0.1328 | lr 9.05e-04 | (9896.64 ms | 52976 tok/s) step 7313/76294 | train loss 3.262354 | norm 0.1321 | lr 9.05e-04 | (9904.67 ms | 52933 tok/s) step 7314/76294 | train loss 3.305456 | norm 0.1243 | lr 9.05e-04 | (9885.24 ms | 53037 tok/s) step 7315/76294 | train loss 3.296635 | norm 0.1311 | lr 9.05e-04 | (9952.21 ms | 52681 tok/s) step 7316/76294 | train loss 3.302190 | norm 0.1253 | lr 9.05e-04 | (9921.60 ms | 52843 tok/s) step 7317/76294 | train loss 3.279038 | norm 0.1419 | lr 
9.05e-04 | (9901.83 ms | 52949 tok/s) step 7318/76294 | train loss 3.233916 | norm 0.1422 | lr 9.05e-04 | (9943.85 ms | 52725 tok/s) step 7319/76294 | train loss 3.256430 | norm 0.1241 | lr 9.05e-04 | (9891.32 ms | 53005 tok/s) step 7320/76294 | train loss 3.189650 | norm 0.1714 | lr 9.04e-04 | (9894.83 ms | 52986 tok/s) step 7321/76294 | train loss 3.271701 | norm 0.1518 | lr 9.04e-04 | (13082.45 ms | 40076 tok/s) step 7322/76294 | train loss 3.258567 | norm 0.1419 | lr 9.04e-04 | (9869.29 ms | 53123 tok/s) step 7323/76294 | train loss 3.281476 | norm 0.1420 | lr 9.04e-04 | (9886.38 ms | 53031 tok/s) step 7324/76294 | train loss 3.214281 | norm 0.1363 | lr 9.04e-04 | (9917.85 ms | 52863 tok/s) step 7325/76294 | train loss 3.322398 | norm 0.1401 | lr 9.04e-04 | (9880.28 ms | 53064 tok/s) step 7326/76294 | train loss 3.212230 | norm 0.1339 | lr 9.04e-04 | (9892.93 ms | 52996 tok/s) step 7327/76294 | train loss 3.332899 | norm 0.1273 | lr 9.04e-04 | (9931.36 ms | 52791 tok/s) step 7328/76294 | train loss 3.269501 | norm 0.1378 | lr 9.04e-04 | (9885.17 ms | 53038 tok/s) step 7329/76294 | train loss 3.240863 | norm 0.1460 | lr 9.04e-04 | (9892.13 ms | 53001 tok/s) step 7330/76294 | train loss 3.265398 | norm 0.1478 | lr 9.04e-04 | (9935.66 ms | 52768 tok/s) step 7331/76294 | train loss 3.254825 | norm 0.1240 | lr 9.04e-04 | (9895.13 ms | 52984 tok/s) step 7332/76294 | train loss 3.314024 | norm 0.1326 | lr 9.04e-04 | (9893.60 ms | 52993 tok/s) step 7333/76294 | train loss 3.352718 | norm 0.1410 | lr 9.03e-04 | (9895.91 ms | 52980 tok/s) step 7334/76294 | train loss 3.295854 | norm 0.1190 | lr 9.03e-04 | (9897.85 ms | 52970 tok/s) step 7335/76294 | train loss 3.214982 | norm 0.1281 | lr 9.03e-04 | (9899.14 ms | 52963 tok/s) step 7336/76294 | train loss 3.266256 | norm 0.1202 | lr 9.03e-04 | (9898.54 ms | 52966 tok/s) step 7337/76294 | train loss 3.261506 | norm 0.1230 | lr 9.03e-04 | (9898.32 ms | 52967 tok/s) step 7338/76294 | train loss 3.294306 | norm 0.1259 | lr 
9.03e-04 | (9901.19 ms | 52952 tok/s) step 7339/76294 | train loss 3.266855 | norm 0.1256 | lr 9.03e-04 | (9891.74 ms | 53003 tok/s) step 7340/76294 | train loss 3.328440 | norm 0.1301 | lr 9.03e-04 | (9896.53 ms | 52977 tok/s) step 7341/76294 | train loss 3.239125 | norm 0.1297 | lr 9.03e-04 | (9894.48 ms | 52988 tok/s) step 7342/76294 | train loss 3.241800 | norm 0.1344 | lr 9.03e-04 | (9916.28 ms | 52871 tok/s) step 7343/76294 | train loss 3.354864 | norm 0.1370 | lr 9.03e-04 | (9908.77 ms | 52911 tok/s) step 7344/76294 | train loss 3.262415 | norm 0.1399 | lr 9.03e-04 | (9894.18 ms | 52990 tok/s) step 7345/76294 | train loss 3.308504 | norm 0.1285 | lr 9.02e-04 | (9907.18 ms | 52920 tok/s) step 7346/76294 | train loss 3.273253 | norm 0.1434 | lr 9.02e-04 | (9889.29 ms | 53016 tok/s) step 7347/76294 | train loss 3.284248 | norm 0.1218 | lr 9.02e-04 | (9902.59 ms | 52945 tok/s) step 7348/76294 | train loss 3.254224 | norm 0.1194 | lr 9.02e-04 | (9890.35 ms | 53010 tok/s) step 7349/76294 | train loss 3.268291 | norm 0.1370 | lr 9.02e-04 | (9902.74 ms | 52944 tok/s) step 7350/76294 | train loss 3.286493 | norm 0.1339 | lr 9.02e-04 | (9890.54 ms | 53009 tok/s) step 7351/76294 | train loss 3.316101 | norm 0.1301 | lr 9.02e-04 | (9898.60 ms | 52966 tok/s) step 7352/76294 | train loss 3.281614 | norm 0.1356 | lr 9.02e-04 | (9890.93 ms | 53007 tok/s) step 7353/76294 | train loss 3.234225 | norm 0.1496 | lr 9.02e-04 | (9898.33 ms | 52967 tok/s) step 7354/76294 | train loss 3.296101 | norm 0.1575 | lr 9.02e-04 | (10411.35 ms | 50357 tok/s) step 7355/76294 | train loss 3.249434 | norm 0.1334 | lr 9.02e-04 | (9886.90 ms | 53029 tok/s) step 7356/76294 | train loss 3.287794 | norm 0.1574 | lr 9.02e-04 | (9893.83 ms | 52991 tok/s) step 7357/76294 | train loss 3.288240 | norm 0.1349 | lr 9.02e-04 | (9948.70 ms | 52699 tok/s) step 7358/76294 | train loss 3.232363 | norm 0.1508 | lr 9.01e-04 | (9887.40 ms | 53026 tok/s) step 7359/76294 | train loss 3.222316 | norm 0.1255 | lr 
9.01e-04 | (9884.28 ms | 53043 tok/s) step 7360/76294 | train loss 3.262144 | norm 0.1355 | lr 9.01e-04 | (9909.69 ms | 52907 tok/s) step 7361/76294 | train loss 3.313860 | norm 0.1427 | lr 9.01e-04 | (9917.80 ms | 52863 tok/s) step 7362/76294 | train loss 3.229950 | norm 0.1277 | lr 9.01e-04 | (9903.65 ms | 52939 tok/s) step 7363/76294 | train loss 3.295439 | norm 0.1383 | lr 9.01e-04 | (9902.32 ms | 52946 tok/s) step 7364/76294 | train loss 3.298457 | norm 0.1302 | lr 9.01e-04 | (9926.08 ms | 52819 tok/s) step 7365/76294 | train loss 3.233711 | norm 0.1338 | lr 9.01e-04 | (9887.95 ms | 53023 tok/s) step 7366/76294 | train loss 3.275802 | norm 0.1383 | lr 9.01e-04 | (9913.61 ms | 52886 tok/s) step 7367/76294 | train loss 3.224876 | norm 0.1216 | lr 9.01e-04 | (9884.05 ms | 53044 tok/s) step 7368/76294 | train loss 3.282918 | norm 0.1434 | lr 9.01e-04 | (9883.45 ms | 53047 tok/s) step 7369/76294 | train loss 3.307228 | norm 0.1390 | lr 9.01e-04 | (9896.84 ms | 52975 tok/s) step 7370/76294 | train loss 3.298923 | norm 0.1806 | lr 9.00e-04 | (9882.88 ms | 53050 tok/s) step 7371/76294 | train loss 3.243723 | norm 0.1544 | lr 9.00e-04 | (9886.03 ms | 53033 tok/s) step 7372/76294 | train loss 3.281053 | norm 0.1664 | lr 9.00e-04 | (9881.88 ms | 53056 tok/s) step 7373/76294 | train loss 3.235304 | norm 0.1367 | lr 9.00e-04 | (9892.62 ms | 52998 tok/s) step 7374/76294 | train loss 3.321404 | norm 0.1414 | lr 9.00e-04 | (9883.01 ms | 53049 tok/s) step 7375/76294 | train loss 3.225598 | norm 0.1226 | lr 9.00e-04 | (9895.86 ms | 52981 tok/s) step 7376/76294 | train loss 3.222347 | norm 0.1383 | lr 9.00e-04 | (9886.65 ms | 53030 tok/s) step 7377/76294 | train loss 3.303334 | norm 0.1357 | lr 9.00e-04 | (9910.96 ms | 52900 tok/s) step 7378/76294 | train loss 3.288816 | norm 0.1471 | lr 9.00e-04 | (9888.25 ms | 53021 tok/s) step 7379/76294 | train loss 3.274296 | norm 0.1354 | lr 9.00e-04 | (9940.51 ms | 52743 tok/s) step 7380/76294 | train loss 3.329362 | norm 0.1325 | lr 
9.00e-04 | (9949.89 ms | 52693 tok/s) step 7381/76294 | train loss 3.321423 | norm 0.1550 | lr 9.00e-04 | (9890.29 ms | 53010 tok/s) step 7382/76294 | train loss 3.315277 | norm 0.1415 | lr 9.00e-04 | (9890.50 ms | 53009 tok/s) step 7383/76294 | train loss 3.298783 | norm 0.1494 | lr 8.99e-04 | (9894.47 ms | 52988 tok/s) step 7384/76294 | train loss 3.276820 | norm 0.1234 | lr 8.99e-04 | (9887.11 ms | 53027 tok/s) step 7385/76294 | train loss 3.316462 | norm 0.1417 | lr 8.99e-04 | (9896.84 ms | 52975 tok/s) step 7386/76294 | train loss 3.216232 | norm 0.1495 | lr 8.99e-04 | (9891.64 ms | 53003 tok/s) step 7387/76294 | train loss 3.222157 | norm 0.1361 | lr 8.99e-04 | (9894.51 ms | 52988 tok/s) step 7388/76294 | train loss 3.298583 | norm 0.1373 | lr 8.99e-04 | (9885.64 ms | 53035 tok/s) step 7389/76294 | train loss 3.320789 | norm 0.1145 | lr 8.99e-04 | (9979.52 ms | 52536 tok/s) step 7390/76294 | train loss 3.246717 | norm 0.1374 | lr 8.99e-04 | (9951.74 ms | 52683 tok/s) step 7391/76294 | train loss 3.312719 | norm 0.1141 | lr 8.99e-04 | (9870.23 ms | 53118 tok/s) step 7392/76294 | train loss 3.284002 | norm 0.1329 | lr 8.99e-04 | (9877.78 ms | 53078 tok/s) step 7393/76294 | train loss 3.237550 | norm 0.1767 | lr 8.99e-04 | (9879.63 ms | 53068 tok/s) step 7394/76294 | train loss 3.264037 | norm 0.1349 | lr 8.99e-04 | (9883.47 ms | 53047 tok/s) step 7395/76294 | train loss 3.258497 | norm 0.1422 | lr 8.98e-04 | (10521.25 ms | 49831 tok/s) step 7396/76294 | train loss 3.236647 | norm 0.1340 | lr 8.98e-04 | (9918.74 ms | 52858 tok/s) step 7397/76294 | train loss 3.333650 | norm 0.1538 | lr 8.98e-04 | (9924.14 ms | 52830 tok/s) step 7398/76294 | train loss 3.272018 | norm 0.1491 | lr 8.98e-04 | (9903.29 ms | 52941 tok/s) step 7399/76294 | train loss 3.242547 | norm 0.1290 | lr 8.98e-04 | (9897.25 ms | 52973 tok/s) step 7400/76294 | train loss 3.307569 | norm 0.1538 | lr 8.98e-04 | (9890.67 ms | 53008 tok/s) step 7401/76294 | train loss 3.260830 | norm 0.1172 | lr 
8.98e-04 | (9892.00 ms | 53001 tok/s) step 7402/76294 | train loss 3.242142 | norm 0.1564 | lr 8.98e-04 | (9898.12 ms | 52968 tok/s) step 7403/76294 | train loss 3.264240 | norm 0.1267 | lr 8.98e-04 | (9943.22 ms | 52728 tok/s) step 7404/76294 | train loss 3.299057 | norm 0.1356 | lr 8.98e-04 | (9896.69 ms | 52976 tok/s) step 7405/76294 | train loss 3.355783 | norm 0.1289 | lr 8.98e-04 | (9905.04 ms | 52931 tok/s) step 7406/76294 | train loss 3.201692 | norm 0.1245 | lr 8.98e-04 | (9962.50 ms | 52626 tok/s) step 7407/76294 | train loss 3.197516 | norm 0.1273 | lr 8.97e-04 | (9897.17 ms | 52974 tok/s) step 7408/76294 | train loss 3.317572 | norm 0.1270 | lr 8.97e-04 | (9962.18 ms | 52628 tok/s) step 7409/76294 | train loss 3.249712 | norm 0.1364 | lr 8.97e-04 | (9892.04 ms | 53001 tok/s) step 7410/76294 | train loss 3.264084 | norm 0.1302 | lr 8.97e-04 | (9892.66 ms | 52998 tok/s) step 7411/76294 | train loss 3.328979 | norm 0.1329 | lr 8.97e-04 | (9969.82 ms | 52587 tok/s) step 7412/76294 | train loss 3.272104 | norm 0.1504 | lr 8.97e-04 | (9896.78 ms | 52976 tok/s) step 7413/76294 | train loss 3.281230 | norm 0.1292 | lr 8.97e-04 | (9891.48 ms | 53004 tok/s) step 7414/76294 | train loss 3.239376 | norm 0.1473 | lr 8.97e-04 | (9926.54 ms | 52817 tok/s) step 7415/76294 | train loss 3.263175 | norm 0.1341 | lr 8.97e-04 | (9930.54 ms | 52796 tok/s) step 7416/76294 | train loss 3.272092 | norm 0.1421 | lr 8.97e-04 | (9907.05 ms | 52921 tok/s) step 7417/76294 | train loss 3.247635 | norm 0.1580 | lr 8.97e-04 | (9901.62 ms | 52950 tok/s) step 7418/76294 | train loss 3.264927 | norm 0.1417 | lr 8.97e-04 | (9894.35 ms | 52989 tok/s) step 7419/76294 | train loss 3.214399 | norm 0.1546 | lr 8.97e-04 | (11428.37 ms | 45876 tok/s) step 7420/76294 | train loss 3.246914 | norm 0.1296 | lr 8.96e-04 | (9882.63 ms | 53051 tok/s) step 7421/76294 | train loss 3.261302 | norm 0.1503 | lr 8.96e-04 | (9893.10 ms | 52995 tok/s) step 7422/76294 | train loss 3.292915 | norm 0.1220 | lr 
8.96e-04 | (9886.47 ms | 53031 tok/s) step 7423/76294 | train loss 3.245528 | norm 0.1537 | lr 8.96e-04 | (9896.99 ms | 52974 tok/s) step 7424/76294 | train loss 3.265151 | norm 0.1198 | lr 8.96e-04 | (9888.87 ms | 53018 tok/s) step 7425/76294 | train loss 3.365218 | norm 0.1217 | lr 8.96e-04 | (9925.07 ms | 52825 tok/s) step 7426/76294 | train loss 3.292564 | norm 0.1254 | lr 8.96e-04 | (9893.74 ms | 52992 tok/s) step 7427/76294 | train loss 3.207828 | norm 0.1209 | lr 8.96e-04 | (9900.15 ms | 52958 tok/s) step 7428/76294 | train loss 3.242173 | norm 0.1512 | lr 8.96e-04 | (9893.49 ms | 52993 tok/s) step 7429/76294 | train loss 3.288594 | norm 0.1314 | lr 8.96e-04 | (9899.52 ms | 52961 tok/s) step 7430/76294 | train loss 3.305616 | norm 0.1428 | lr 8.96e-04 | (9916.24 ms | 52872 tok/s) step 7431/76294 | train loss 3.289413 | norm 0.1193 | lr 8.96e-04 | (9897.41 ms | 52972 tok/s) step 7432/76294 | train loss 3.271020 | norm 0.1313 | lr 8.95e-04 | (9898.49 ms | 52966 tok/s) step 7433/76294 | train loss 3.275674 | norm 0.1357 | lr 8.95e-04 | (9924.67 ms | 52827 tok/s) step 7434/76294 | train loss 3.311682 | norm 0.1253 | lr 8.95e-04 | (9905.01 ms | 52932 tok/s) step 7435/76294 | train loss 3.319897 | norm 0.1454 | lr 8.95e-04 | (9899.97 ms | 52959 tok/s) step 7436/76294 | train loss 3.236286 | norm 0.1419 | lr 8.95e-04 | (9893.32 ms | 52994 tok/s) step 7437/76294 | train loss 3.328083 | norm 0.1674 | lr 8.95e-04 | (9897.48 ms | 52972 tok/s) step 7438/76294 | train loss 3.243570 | norm 0.1376 | lr 8.95e-04 | (9894.35 ms | 52989 tok/s) step 7439/76294 | train loss 3.363811 | norm 0.1610 | lr 8.95e-04 | (9903.40 ms | 52940 tok/s) step 7440/76294 | train loss 3.288809 | norm 0.1539 | lr 8.95e-04 | (9913.61 ms | 52886 tok/s) step 7441/76294 | train loss 3.272002 | norm 0.1438 | lr 8.95e-04 | (9927.41 ms | 52812 tok/s) step 7442/76294 | train loss 3.344632 | norm 0.1581 | lr 8.95e-04 | (9892.89 ms | 52996 tok/s) step 7443/76294 | train loss 3.302403 | norm 0.1396 | lr 
8.95e-04 | (11049.69 ms | 47448 tok/s) step 7444/76294 | train loss 3.162302 | norm 0.1632 | lr 8.94e-04 | (9881.67 ms | 53057 tok/s) step 7445/76294 | train loss 3.265449 | norm 0.1526 | lr 8.94e-04 | (9896.61 ms | 52977 tok/s) step 7446/76294 | train loss 3.253462 | norm 0.1479 | lr 8.94e-04 | (9892.48 ms | 52999 tok/s) step 7447/76294 | train loss 3.275213 | norm 0.1683 | lr 8.94e-04 | (9933.33 ms | 52781 tok/s) step 7448/76294 | train loss 3.316450 | norm 0.1239 | lr 8.94e-04 | (9892.29 ms | 53000 tok/s) step 7449/76294 | train loss 3.220524 | norm 0.1601 | lr 8.94e-04 | (9901.75 ms | 52949 tok/s) step 7450/76294 | train loss 3.342807 | norm 0.1408 | lr 8.94e-04 | (11057.52 ms | 47415 tok/s) step 7451/76294 | train loss 3.238648 | norm 0.1436 | lr 8.94e-04 | (9887.91 ms | 53023 tok/s) step 7452/76294 | train loss 3.234747 | norm 0.1327 | lr 8.94e-04 | (9969.82 ms | 52588 tok/s) step 7453/76294 | train loss 3.325632 | norm 0.1512 | lr 8.94e-04 | (10991.58 ms | 47699 tok/s) step 7454/76294 | train loss 3.269842 | norm 0.1310 | lr 8.94e-04 | (9883.64 ms | 53046 tok/s) step 7455/76294 | train loss 3.232690 | norm 0.1387 | lr 8.94e-04 | (9971.88 ms | 52577 tok/s) step 7456/76294 | train loss 3.358757 | norm 0.1353 | lr 8.94e-04 | (9886.24 ms | 53032 tok/s) step 7457/76294 | train loss 3.442313 | norm 0.1459 | lr 8.93e-04 | (9955.61 ms | 52663 tok/s) step 7458/76294 | train loss 3.223933 | norm 0.1606 | lr 8.93e-04 | (9888.60 ms | 53019 tok/s) step 7459/76294 | train loss 3.318951 | norm 0.1416 | lr 8.93e-04 | (9905.36 ms | 52930 tok/s) step 7460/76294 | train loss 3.293212 | norm 0.1381 | lr 8.93e-04 | (9898.71 ms | 52965 tok/s) step 7461/76294 | train loss 3.266264 | norm 0.1343 | lr 8.93e-04 | (9898.88 ms | 52964 tok/s) step 7462/76294 | train loss 3.251441 | norm 0.1352 | lr 8.93e-04 | (9932.80 ms | 52783 tok/s) step 7463/76294 | train loss 3.233598 | norm 0.1223 | lr 8.93e-04 | (9921.84 ms | 52842 tok/s) step 7464/76294 | train loss 3.229913 | norm 0.1238 | lr 
8.93e-04 | (9898.34 ms | 52967 tok/s) step 7465/76294 | train loss 3.216384 | norm 0.1263 | lr 8.93e-04 | (9900.08 ms | 52958 tok/s) step 7466/76294 | train loss 3.224153 | norm 0.1401 | lr 8.93e-04 | (9934.96 ms | 52772 tok/s) step 7467/76294 | train loss 3.220312 | norm 0.1237 | lr 8.93e-04 | (9917.31 ms | 52866 tok/s) step 7468/76294 | train loss 3.223095 | norm 0.1369 | lr 8.93e-04 | (9970.07 ms | 52586 tok/s) step 7469/76294 | train loss 3.253178 | norm 0.1297 | lr 8.92e-04 | (9894.83 ms | 52986 tok/s) step 7470/76294 | train loss 3.345283 | norm 0.1389 | lr 8.92e-04 | (9932.72 ms | 52784 tok/s) step 7471/76294 | train loss 3.259495 | norm 0.1640 | lr 8.92e-04 | (9897.78 ms | 52970 tok/s) step 7472/76294 | train loss 3.199968 | norm 0.1236 | lr 8.92e-04 | (9921.82 ms | 52842 tok/s) step 7473/76294 | train loss 3.360153 | norm 0.1429 | lr 8.92e-04 | (9939.41 ms | 52748 tok/s) step 7474/76294 | train loss 3.256057 | norm 0.1358 | lr 8.92e-04 | (9888.60 ms | 53019 tok/s) step 7475/76294 | train loss 3.365873 | norm 0.1345 | lr 8.92e-04 | (9898.82 ms | 52965 tok/s) step 7476/76294 | train loss 3.237684 | norm 0.1558 | lr 8.92e-04 | (9892.09 ms | 53001 tok/s) step 7477/76294 | train loss 3.215955 | norm 0.1348 | lr 8.92e-04 | (9900.41 ms | 52956 tok/s) step 7478/76294 | train loss 3.267199 | norm 0.1351 | lr 8.92e-04 | (9892.89 ms | 52996 tok/s) step 7479/76294 | train loss 3.337087 | norm 0.1258 | lr 8.92e-04 | (9931.27 ms | 52792 tok/s) step 7480/76294 | train loss 3.180954 | norm 0.1577 | lr 8.92e-04 | (9954.74 ms | 52667 tok/s) step 7481/76294 | train loss 3.327529 | norm 0.1332 | lr 8.91e-04 | (9894.84 ms | 52986 tok/s) step 7482/76294 | train loss 3.330254 | norm 0.1521 | lr 8.91e-04 | (9912.45 ms | 52892 tok/s) step 7483/76294 | train loss 3.278965 | norm 0.1235 | lr 8.91e-04 | (9896.51 ms | 52977 tok/s) step 7484/76294 | train loss 3.288677 | norm 0.1358 | lr 8.91e-04 | (9893.75 ms | 52992 tok/s) step 7485/76294 | train loss 3.244588 | norm 0.1272 | lr 
8.91e-04 | (9896.12 ms | 52979 tok/s) step 7486/76294 | train loss 3.222341 | norm 0.1401 | lr 8.91e-04 | (9899.57 ms | 52961 tok/s) step 7487/76294 | train loss 3.282761 | norm 0.1365 | lr 8.91e-04 | (9933.16 ms | 52782 tok/s) step 7488/76294 | train loss 3.263409 | norm 0.1375 | lr 8.91e-04 | (9889.72 ms | 53013 tok/s) step 7489/76294 | train loss 3.258945 | norm 0.1303 | lr 8.91e-04 | (9901.03 ms | 52953 tok/s) step 7490/76294 | train loss 3.226398 | norm 0.1381 | lr 8.91e-04 | (9890.74 ms | 53008 tok/s) step 7491/76294 | train loss 3.272898 | norm 0.1256 | lr 8.91e-04 | (9899.61 ms | 52960 tok/s) step 7492/76294 | train loss 3.266588 | norm 0.1308 | lr 8.91e-04 | (9892.87 ms | 52997 tok/s) step 7493/76294 | train loss 3.281939 | norm 0.1332 | lr 8.91e-04 | (9913.85 ms | 52884 tok/s) step 7494/76294 | train loss 3.301572 | norm 0.1368 | lr 8.90e-04 | (9924.78 ms | 52826 tok/s) step 7495/76294 | train loss 3.249621 | norm 0.1380 | lr 8.90e-04 | (9940.75 ms | 52741 tok/s) step 7496/76294 | train loss 3.295283 | norm 0.1164 | lr 8.90e-04 | (9926.63 ms | 52816 tok/s) step 7497/76294 | train loss 3.285155 | norm 0.1399 | lr 8.90e-04 | (9898.58 ms | 52966 tok/s) step 7498/76294 | train loss 3.288872 | norm 0.1307 | lr 8.90e-04 | (9898.61 ms | 52966 tok/s) step 7499/76294 | train loss 3.312505 | norm 0.1421 | lr 8.90e-04 | (9955.63 ms | 52662 tok/s) step 7500/76294 | train loss 3.299493 | norm 0.1248 | lr 8.90e-04 | (9892.89 ms | 52996 tok/s) val loss: 3.259662 saving model checkpoint to ./results/gpt2-350M-gqa/step_7500.pth step 7501/76294 | train loss 3.260103 | norm 0.1310 | lr 8.90e-04 | (9993.15 ms | 52465 tok/s) step 7502/76294 | train loss 3.283345 | norm 0.1254 | lr 8.90e-04 | (9872.95 ms | 53103 tok/s) step 7503/76294 | train loss 3.314971 | norm 0.1252 | lr 8.90e-04 | (9878.82 ms | 53072 tok/s) step 7504/76294 | train loss 3.262970 | norm 0.1327 | lr 8.90e-04 | (9870.24 ms | 53118 tok/s) step 7505/76294 | train loss 3.233717 | norm 0.1249 | lr 8.90e-04 | 
(9945.09 ms | 52718 tok/s) step 7506/76294 | train loss 3.205676 | norm 0.1203 | lr 8.89e-04 | (9884.04 ms | 53044 tok/s) step 7507/76294 | train loss 3.218966 | norm 0.1455 | lr 8.89e-04 | (9890.13 ms | 53011 tok/s) step 7508/76294 | train loss 3.307922 | norm 0.1910 | lr 8.89e-04 | (9907.88 ms | 52916 tok/s) step 7509/76294 | train loss 3.278713 | norm 0.1408 | lr 8.89e-04 | (9902.62 ms | 52944 tok/s) step 7510/76294 | train loss 3.272139 | norm 0.1418 | lr 8.89e-04 | (9909.48 ms | 52908 tok/s) step 7511/76294 | train loss 3.206128 | norm 0.1501 | lr 8.89e-04 | (9894.86 ms | 52986 tok/s) step 7512/76294 | train loss 3.299784 | norm 0.1613 | lr 8.89e-04 | (9953.03 ms | 52676 tok/s) step 7513/76294 | train loss 3.220956 | norm 0.1201 | lr 8.89e-04 | (9890.99 ms | 53007 tok/s) step 7514/76294 | train loss 3.277126 | norm 0.1539 | lr 8.89e-04 | (9930.44 ms | 52796 tok/s) step 7515/76294 | train loss 3.292740 | norm 0.1287 | lr 8.89e-04 | (9917.74 ms | 52864 tok/s) step 7516/76294 | train loss 3.186422 | norm 0.1299 | lr 8.89e-04 | (9894.07 ms | 52990 tok/s) step 7517/76294 | train loss 3.266595 | norm 0.1430 | lr 8.89e-04 | (11302.64 ms | 46386 tok/s) step 7518/76294 | train loss 3.233906 | norm 0.1251 | lr 8.88e-04 | (9945.28 ms | 52717 tok/s) step 7519/76294 | train loss 3.269246 | norm 0.1322 | lr 8.88e-04 | (9897.35 ms | 52973 tok/s) step 7520/76294 | train loss 3.279601 | norm 0.1650 | lr 8.88e-04 | (9893.76 ms | 52992 tok/s) step 7521/76294 | train loss 3.237744 | norm 0.1463 | lr 8.88e-04 | (9909.89 ms | 52906 tok/s) step 7522/76294 | train loss 3.252145 | norm 0.1400 | lr 8.88e-04 | (9897.03 ms | 52974 tok/s) step 7523/76294 | train loss 3.294918 | norm 0.1283 | lr 8.88e-04 | (9892.49 ms | 52999 tok/s) step 7524/76294 | train loss 3.244871 | norm 0.1468 | lr 8.88e-04 | (9920.82 ms | 52847 tok/s) step 7525/76294 | train loss 3.262545 | norm 0.1296 | lr 8.88e-04 | (9904.15 ms | 52936 tok/s) step 7526/76294 | train loss 3.279885 | norm 0.1307 | lr 8.88e-04 | 
(9896.32 ms | 52978 tok/s) step 7527/76294 | train loss 3.249566 | norm 0.1198 | lr 8.88e-04 | (9901.80 ms | 52949 tok/s) step 7528/76294 | train loss 3.286469 | norm 0.1219 | lr 8.88e-04 | (9891.63 ms | 53003 tok/s) step 7529/76294 | train loss 3.272641 | norm 0.1365 | lr 8.88e-04 | (9910.38 ms | 52903 tok/s) step 7530/76294 | train loss 3.251245 | norm 0.1217 | lr 8.87e-04 | (9894.12 ms | 52990 tok/s) step 7531/76294 | train loss 3.271800 | norm 0.1318 | lr 8.87e-04 | (9901.19 ms | 52952 tok/s) step 7532/76294 | train loss 3.237816 | norm 0.1218 | lr 8.87e-04 | (9893.19 ms | 52995 tok/s) step 7533/76294 | train loss 3.238986 | norm 0.1468 | lr 8.87e-04 | (9900.79 ms | 52954 tok/s) step 7534/76294 | train loss 3.318821 | norm 0.1232 | lr 8.87e-04 | (9892.26 ms | 53000 tok/s) step 7535/76294 | train loss 3.241808 | norm 0.1383 | lr 8.87e-04 | (9903.76 ms | 52938 tok/s) step 7536/76294 | train loss 3.233139 | norm 0.1330 | lr 8.87e-04 | (9896.64 ms | 52976 tok/s) step 7537/76294 | train loss 3.269487 | norm 0.1287 | lr 8.87e-04 | (9927.01 ms | 52814 tok/s) step 7538/76294 | train loss 3.309083 | norm 0.1404 | lr 8.87e-04 | (9887.51 ms | 53025 tok/s) step 7539/76294 | train loss 3.292274 | norm 0.1334 | lr 8.87e-04 | (9895.59 ms | 52982 tok/s) step 7540/76294 | train loss 3.225924 | norm 0.1308 | lr 8.87e-04 | (9895.88 ms | 52980 tok/s) step 7541/76294 | train loss 3.244644 | norm 0.1442 | lr 8.87e-04 | (9929.49 ms | 52801 tok/s) step 7542/76294 | train loss 3.210649 | norm 0.1203 | lr 8.87e-04 | (9968.50 ms | 52594 tok/s) step 7543/76294 | train loss 3.321983 | norm 0.1456 | lr 8.86e-04 | (9887.72 ms | 53024 tok/s) step 7544/76294 | train loss 3.283091 | norm 0.1287 | lr 8.86e-04 | (9953.76 ms | 52672 tok/s) step 7545/76294 | train loss 3.193870 | norm 0.1221 | lr 8.86e-04 | (10511.07 ms | 49880 tok/s) step 7546/76294 | train loss 3.183115 | norm 0.1286 | lr 8.86e-04 | (9962.16 ms | 52628 tok/s) step 7547/76294 | train loss 3.322539 | norm 0.1306 | lr 8.86e-04 | 
(9907.38 ms | 52919 tok/s) step 7548/76294 | train loss 3.374012 | norm 0.1361 | lr 8.86e-04 | (9881.89 ms | 53055 tok/s) step 7549/76294 | train loss 3.260341 | norm 0.1352 | lr 8.86e-04 | (9892.90 ms | 52996 tok/s) step 7550/76294 | train loss 3.222925 | norm 0.1263 | lr 8.86e-04 | (9881.40 ms | 53058 tok/s) step 7551/76294 | train loss 3.311337 | norm 0.1344 | lr 8.86e-04 | (9899.79 ms | 52959 tok/s) step 7552/76294 | train loss 3.217665 | norm 0.1332 | lr 8.86e-04 | (9882.84 ms | 53050 tok/s) step 7553/76294 | train loss 3.234923 | norm 0.1287 | lr 8.86e-04 | (9952.93 ms | 52677 tok/s) step 7554/76294 | train loss 3.265534 | norm 0.1580 | lr 8.86e-04 | (9926.30 ms | 52818 tok/s) step 7555/76294 | train loss 3.214227 | norm 0.1844 | lr 8.85e-04 | (9890.57 ms | 53009 tok/s) step 7556/76294 | train loss 3.302233 | norm 0.1286 | lr 8.85e-04 | (9923.13 ms | 52835 tok/s) step 7557/76294 | train loss 3.276742 | norm 0.1367 | lr 8.85e-04 | (9886.54 ms | 53030 tok/s) step 7558/76294 | train loss 3.318038 | norm 0.1469 | lr 8.85e-04 | (9892.31 ms | 53000 tok/s) step 7559/76294 | train loss 3.295008 | norm 0.1611 | lr 8.85e-04 | (9926.82 ms | 52815 tok/s) step 7560/76294 | train loss 3.213266 | norm 0.1728 | lr 8.85e-04 | (9882.90 ms | 53050 tok/s) step 7561/76294 | train loss 3.273030 | norm 0.1447 | lr 8.85e-04 | (9896.77 ms | 52976 tok/s) step 7562/76294 | train loss 3.305590 | norm 0.1675 | lr 8.85e-04 | (9885.15 ms | 53038 tok/s) step 7563/76294 | train loss 3.343783 | norm 0.1626 | lr 8.85e-04 | (9896.82 ms | 52975 tok/s) step 7564/76294 | train loss 3.244720 | norm 0.1374 | lr 8.85e-04 | (9887.94 ms | 53023 tok/s) step 7565/76294 | train loss 3.304059 | norm 0.1494 | lr 8.85e-04 | (9890.39 ms | 53010 tok/s) step 7566/76294 | train loss 3.263739 | norm 0.1380 | lr 8.85e-04 | (9897.29 ms | 52973 tok/s) step 7567/76294 | train loss 3.313535 | norm 0.1309 | lr 8.84e-04 | (9889.21 ms | 53016 tok/s) step 7568/76294 | train loss 3.255115 | norm 0.1315 | lr 8.84e-04 | 
(9908.57 ms | 52913 tok/s) step 7569/76294 | train loss 3.277353 | norm 0.1253 | lr 8.84e-04 | (9890.34 ms | 53010 tok/s) step 7570/76294 | train loss 3.256834 | norm 0.1454 | lr 8.84e-04 | (9891.59 ms | 53003 tok/s) step 7571/76294 | train loss 3.188924 | norm 0.1319 | lr 8.84e-04 | (9887.40 ms | 53026 tok/s) step 7572/76294 | train loss 3.281964 | norm 0.1329 | lr 8.84e-04 | (9891.78 ms | 53002 tok/s) step 7573/76294 | train loss 3.289779 | norm 0.1387 | lr 8.84e-04 | (9888.92 ms | 53018 tok/s) step 7574/76294 | train loss 3.313183 | norm 0.1327 | lr 8.84e-04 | (9891.75 ms | 53003 tok/s) step 7575/76294 | train loss 3.310827 | norm 0.1723 | lr 8.84e-04 | (9910.56 ms | 52902 tok/s) step 7576/76294 | train loss 3.291040 | norm 0.1370 | lr 8.84e-04 | (9892.23 ms | 53000 tok/s) step 7577/76294 | train loss 3.251963 | norm 0.1288 | lr 8.84e-04 | (9926.83 ms | 52815 tok/s) step 7578/76294 | train loss 3.320262 | norm 0.1366 | lr 8.84e-04 | (9892.09 ms | 53001 tok/s) step 7579/76294 | train loss 3.250234 | norm 0.1233 | lr 8.83e-04 | (9948.32 ms | 52701 tok/s) step 7580/76294 | train loss 3.308850 | norm 0.1161 | lr 8.83e-04 | (9886.78 ms | 53029 tok/s) step 7581/76294 | train loss 3.274153 | norm 0.1278 | lr 8.83e-04 | (9888.57 ms | 53020 tok/s) step 7582/76294 | train loss 3.354283 | norm 0.1317 | lr 8.83e-04 | (9895.17 ms | 52984 tok/s) step 7583/76294 | train loss 3.253649 | norm 0.1373 | lr 8.83e-04 | (9900.76 ms | 52954 tok/s) step 7584/76294 | train loss 3.264236 | norm 0.1385 | lr 8.83e-04 | (9884.22 ms | 53043 tok/s) step 7585/76294 | train loss 3.286452 | norm 0.1308 | lr 8.83e-04 | (9897.93 ms | 52969 tok/s) step 7586/76294 | train loss 3.237059 | norm 0.1486 | lr 8.83e-04 | (9977.90 ms | 52545 tok/s) step 7587/76294 | train loss 3.283239 | norm 0.1426 | lr 8.83e-04 | (9885.78 ms | 53035 tok/s) step 7588/76294 | train loss 3.252604 | norm 0.1303 | lr 8.83e-04 | (9974.69 ms | 52562 tok/s) step 7589/76294 | train loss 3.364632 | norm 0.1653 | lr 8.83e-04 | 
(9887.45 ms | 53026 tok/s) step 7590/76294 | train loss 3.304168 | norm 0.1589 | lr 8.83e-04 | (9885.00 ms | 53039 tok/s) step 7591/76294 | train loss 3.248651 | norm 0.1414 | lr 8.82e-04 | (9888.39 ms | 53021 tok/s) step 7592/76294 | train loss 3.277434 | norm 0.1547 | lr 8.82e-04 | (9928.23 ms | 52808 tok/s) step 7593/76294 | train loss 3.228504 | norm 0.1473 | lr 8.82e-04 | (9892.13 ms | 53000 tok/s) step 7594/76294 | train loss 3.375105 | norm 0.1454 | lr 8.82e-04 | (9900.35 ms | 52957 tok/s) step 7595/76294 | train loss 3.366312 | norm 0.1648 | lr 8.82e-04 | (9932.41 ms | 52786 tok/s) step 7596/76294 | train loss 3.314651 | norm 0.1857 | lr 8.82e-04 | (9889.07 ms | 53017 tok/s) step 7597/76294 | train loss 3.271859 | norm 0.1489 | lr 8.82e-04 | (9932.16 ms | 52787 tok/s) step 7598/76294 | train loss 3.273969 | norm 0.1577 | lr 8.82e-04 | (9888.33 ms | 53021 tok/s) step 7599/76294 | train loss 3.238225 | norm 0.1305 | lr 8.82e-04 | (9900.88 ms | 52954 tok/s) step 7600/76294 | train loss 3.338034 | norm 0.1759 | lr 8.82e-04 | (9888.53 ms | 53020 tok/s) step 7601/76294 | train loss 3.257272 | norm 0.1413 | lr 8.82e-04 | (9898.33 ms | 52967 tok/s) step 7602/76294 | train loss 3.246739 | norm 0.1544 | lr 8.82e-04 | (9941.26 ms | 52739 tok/s) step 7603/76294 | train loss 3.309277 | norm 0.1362 | lr 8.82e-04 | (9890.58 ms | 53009 tok/s) step 7604/76294 | train loss 3.215076 | norm 0.1376 | lr 8.81e-04 | (9893.18 ms | 52995 tok/s) step 7605/76294 | train loss 3.245839 | norm 0.1333 | lr 8.81e-04 | (9895.38 ms | 52983 tok/s) step 7606/76294 | train loss 3.191742 | norm 0.1553 | lr 8.81e-04 | (9898.24 ms | 52968 tok/s) step 7607/76294 | train loss 3.317328 | norm 0.1388 | lr 8.81e-04 | (9892.59 ms | 52998 tok/s) step 7608/76294 | train loss 3.237993 | norm 0.1458 | lr 8.81e-04 | (9901.87 ms | 52948 tok/s) step 7609/76294 | train loss 3.235403 | norm 0.1321 | lr 8.81e-04 | (9907.13 ms | 52920 tok/s) step 7610/76294 | train loss 3.182909 | norm 0.1504 | lr 8.81e-04 | 
(9897.13 ms | 52974 tok/s) step 7611/76294 | train loss 3.179825 | norm 0.1467 | lr 8.81e-04 | (9891.84 ms | 53002 tok/s) step 7612/76294 | train loss 3.256761 | norm 0.1466 | lr 8.81e-04 | (9899.47 ms | 52961 tok/s) step 7613/76294 | train loss 3.222732 | norm 0.1292 | lr 8.81e-04 | (9897.29 ms | 52973 tok/s) step 7614/76294 | train loss 3.308860 | norm 0.1431 | lr 8.81e-04 | (11233.07 ms | 46674 tok/s) step 7615/76294 | train loss 3.214260 | norm 0.1304 | lr 8.81e-04 | (9878.28 ms | 53075 tok/s) step 7616/76294 | train loss 3.318874 | norm 0.1407 | lr 8.80e-04 | (9899.60 ms | 52961 tok/s) step 7617/76294 | train loss 3.224486 | norm 0.1380 | lr 8.80e-04 | (9926.45 ms | 52817 tok/s) step 7618/76294 | train loss 3.250817 | norm 0.1311 | lr 8.80e-04 | (9911.47 ms | 52897 tok/s) step 7619/76294 | train loss 3.302087 | norm 0.1427 | lr 8.80e-04 | (9900.19 ms | 52957 tok/s) step 7620/76294 | train loss 3.397631 | norm 0.1901 | lr 8.80e-04 | (9895.24 ms | 52984 tok/s) step 7621/76294 | train loss 3.271934 | norm 0.1750 | lr 8.80e-04 | (9959.34 ms | 52643 tok/s) step 7622/76294 | train loss 3.273798 | norm 0.1639 | lr 8.80e-04 | (9891.68 ms | 53003 tok/s) step 7623/76294 | train loss 3.279669 | norm 0.1645 | lr 8.80e-04 | (9905.76 ms | 52928 tok/s) step 7624/76294 | train loss 3.192483 | norm 0.1546 | lr 8.80e-04 | (9890.89 ms | 53007 tok/s) step 7625/76294 | train loss 3.344125 | norm 0.1723 | lr 8.80e-04 | (9910.56 ms | 52902 tok/s) step 7626/76294 | train loss 3.297294 | norm 0.1445 | lr 8.80e-04 | (9895.72 ms | 52981 tok/s) step 7627/76294 | train loss 3.273930 | norm 0.1624 | lr 8.80e-04 | (9905.96 ms | 52927 tok/s) step 7628/76294 | train loss 3.197093 | norm 0.1399 | lr 8.79e-04 | (9895.28 ms | 52984 tok/s) step 7629/76294 | train loss 3.261024 | norm 0.1365 | lr 8.79e-04 | (9899.50 ms | 52961 tok/s) step 7630/76294 | train loss 3.232028 | norm 0.1441 | lr 8.79e-04 | (9924.27 ms | 52829 tok/s) step 7631/76294 | train loss 3.302942 | norm 0.1314 | lr 8.79e-04 | 
(9895.20 ms | 52984 tok/s) step 7632/76294 | train loss 3.364437 | norm 0.1489 | lr 8.79e-04 | (9919.79 ms | 52853 tok/s) step 7633/76294 | train loss 3.232851 | norm 0.1335 | lr 8.79e-04 | (9913.38 ms | 52887 tok/s) step 7634/76294 | train loss 3.253581 | norm 0.1316 | lr 8.79e-04 | (9895.16 ms | 52984 tok/s) step 7635/76294 | train loss 3.281556 | norm 0.1227 | lr 8.79e-04 | (9900.89 ms | 52954 tok/s) step 7636/76294 | train loss 3.234152 | norm 0.1350 | lr 8.79e-04 | (9891.62 ms | 53003 tok/s) step 7637/76294 | train loss 3.261395 | norm 0.1318 | lr 8.79e-04 | (9903.36 ms | 52940 tok/s) step 7638/76294 | train loss 3.276326 | norm 0.1372 | lr 8.79e-04 | (9890.06 ms | 53012 tok/s) step 7639/76294 | train loss 3.185077 | norm 0.1193 | lr 8.79e-04 | (9909.68 ms | 52907 tok/s) step 7640/76294 | train loss 3.277544 | norm 0.1479 | lr 8.78e-04 | (9898.68 ms | 52965 tok/s) step 7641/76294 | train loss 3.327174 | norm 0.1371 | lr 8.78e-04 | (9957.94 ms | 52650 tok/s) step 7642/76294 | train loss 3.267470 | norm 0.1441 | lr 8.78e-04 | (9897.24 ms | 52973 tok/s) step 7643/76294 | train loss 3.305226 | norm 0.1287 | lr 8.78e-04 | (9898.64 ms | 52966 tok/s) step 7644/76294 | train loss 3.305031 | norm 0.1383 | lr 8.78e-04 | (9932.69 ms | 52784 tok/s) step 7645/76294 | train loss 3.258992 | norm 0.1275 | lr 8.78e-04 | (9897.54 ms | 52972 tok/s) step 7646/76294 | train loss 3.271654 | norm 0.1286 | lr 8.78e-04 | (9899.45 ms | 52961 tok/s) step 7647/76294 | train loss 3.226479 | norm 0.1307 | lr 8.78e-04 | (9907.17 ms | 52920 tok/s) step 7648/76294 | train loss 3.320372 | norm 0.1523 | lr 8.78e-04 | (9890.13 ms | 53011 tok/s) step 7649/76294 | train loss 3.231267 | norm 0.1314 | lr 8.78e-04 | (9894.70 ms | 52987 tok/s) step 7650/76294 | train loss 3.301233 | norm 0.1341 | lr 8.78e-04 | (9931.53 ms | 52790 tok/s) step 7651/76294 | train loss 3.263896 | norm 0.1282 | lr 8.78e-04 | (9891.95 ms | 53001 tok/s) step 7652/76294 | train loss 3.326137 | norm 0.1262 | lr 8.77e-04 | 
(9898.26 ms | 52968 tok/s) step 7653/76294 | train loss 3.221987 | norm 0.1337 | lr 8.77e-04 | (9894.36 ms | 52989 tok/s) step 7654/76294 | train loss 3.254723 | norm 0.1203 | lr 8.77e-04 | (9904.65 ms | 52934 tok/s) step 7655/76294 | train loss 3.260458 | norm 0.1331 | lr 8.77e-04 | (9892.28 ms | 53000 tok/s) step 7656/76294 | train loss 3.272007 | norm 0.1328 | lr 8.77e-04 | (9900.33 ms | 52957 tok/s) step 7657/76294 | train loss 3.276430 | norm 0.1520 | lr 8.77e-04 | (9893.60 ms | 52993 tok/s) step 7658/76294 | train loss 3.255859 | norm 0.1166 | lr 8.77e-04 | (9894.56 ms | 52987 tok/s) step 7659/76294 | train loss 3.269523 | norm 0.1562 | lr 8.77e-04 | (9892.41 ms | 52999 tok/s) step 7660/76294 | train loss 3.298569 | norm 0.1343 | lr 8.77e-04 | (9895.73 ms | 52981 tok/s) step 7661/76294 | train loss 3.276127 | norm 0.1339 | lr 8.77e-04 | (9893.11 ms | 52995 tok/s) step 7662/76294 | train loss 3.356269 | norm 0.1695 | lr 8.77e-04 | (9909.48 ms | 52908 tok/s) step 7663/76294 | train loss 3.260502 | norm 0.1431 | lr 8.77e-04 | (9894.94 ms | 52985 tok/s) step 7664/76294 | train loss 3.231632 | norm 0.1294 | lr 8.76e-04 | (9899.72 ms | 52960 tok/s) step 7665/76294 | train loss 3.251594 | norm 0.1278 | lr 8.76e-04 | (9894.86 ms | 52986 tok/s) step 7666/76294 | train loss 3.224958 | norm 0.1223 | lr 8.76e-04 | (9894.69 ms | 52987 tok/s) step 7667/76294 | train loss 3.233067 | norm 0.1422 | lr 8.76e-04 | (9898.83 ms | 52965 tok/s) step 7668/76294 | train loss 3.242907 | norm 0.1424 | lr 8.76e-04 | (9937.04 ms | 52761 tok/s) step 7669/76294 | train loss 3.284859 | norm 0.1327 | lr 8.76e-04 | (9958.10 ms | 52649 tok/s) step 7670/76294 | train loss 3.283942 | norm 0.1308 | lr 8.76e-04 | (9892.21 ms | 53000 tok/s) step 7671/76294 | train loss 3.237952 | norm 0.1414 | lr 8.76e-04 | (9955.67 ms | 52662 tok/s) step 7672/76294 | train loss 3.234214 | norm 0.1398 | lr 8.76e-04 | (9894.53 ms | 52988 tok/s) step 7673/76294 | train loss 3.304963 | norm 0.1361 | lr 8.76e-04 | 
(9897.43 ms | 52972 tok/s) step 7674/76294 | train loss 3.222054 | norm 0.1244 | lr 8.76e-04 | (9940.30 ms | 52744 tok/s) step 7675/76294 | train loss 3.307059 | norm 0.1457 | lr 8.76e-04 | (9894.74 ms | 52987 tok/s) step 7676/76294 | train loss 3.298670 | norm 0.1200 | lr 8.76e-04 | (9905.19 ms | 52931 tok/s) step 7677/76294 | train loss 3.270068 | norm 0.1229 | lr 8.75e-04 | (9957.64 ms | 52652 tok/s) step 7678/76294 | train loss 3.265605 | norm 0.1323 | lr 8.75e-04 | (9886.59 ms | 53030 tok/s) step 7679/76294 | train loss 3.311477 | norm 0.1353 | lr 8.75e-04 | (9961.52 ms | 52631 tok/s) step 7680/76294 | train loss 3.275567 | norm 0.1443 | lr 8.75e-04 | (9892.40 ms | 52999 tok/s) step 7681/76294 | train loss 3.266466 | norm 0.1293 | lr 8.75e-04 | (9944.99 ms | 52719 tok/s) step 7682/76294 | train loss 3.368745 | norm 0.1403 | lr 8.75e-04 | (9892.69 ms | 52998 tok/s) step 7683/76294 | train loss 3.297896 | norm 0.1684 | lr 8.75e-04 | (9891.18 ms | 53006 tok/s) step 7684/76294 | train loss 3.228390 | norm 0.1165 | lr 8.75e-04 | (9901.12 ms | 52952 tok/s) step 7685/76294 | train loss 3.240220 | norm 0.1394 | lr 8.75e-04 | (9966.48 ms | 52605 tok/s) step 7686/76294 | train loss 3.232420 | norm 0.1403 | lr 8.75e-04 | (9890.00 ms | 53012 tok/s) step 7687/76294 | train loss 3.240085 | norm 0.1385 | lr 8.75e-04 | (9943.41 ms | 52727 tok/s) step 7688/76294 | train loss 3.268029 | norm 0.1361 | lr 8.75e-04 | (9895.64 ms | 52982 tok/s) step 7689/76294 | train loss 3.218919 | norm 0.1266 | lr 8.74e-04 | (9897.11 ms | 52974 tok/s) step 7690/76294 | train loss 3.304317 | norm 0.1457 | lr 8.74e-04 | (9896.58 ms | 52977 tok/s) step 7691/76294 | train loss 3.240077 | norm 0.1208 | lr 8.74e-04 | (9904.04 ms | 52937 tok/s) step 7692/76294 | train loss 3.273255 | norm 0.1310 | lr 8.74e-04 | (9885.75 ms | 53035 tok/s) step 7693/76294 | train loss 3.242776 | norm 0.1311 | lr 8.74e-04 | (9890.71 ms | 53008 tok/s) step 7694/76294 | train loss 3.289907 | norm 0.1167 | lr 8.74e-04 | 
(9899.22 ms | 52963 tok/s) step 7695/76294 | train loss 3.232619 | norm 0.1385 | lr 8.74e-04 | (9937.30 ms | 52760 tok/s) step 7696/76294 | train loss 3.291927 | norm 0.1368 | lr 8.74e-04 | (9888.29 ms | 53021 tok/s) step 7697/76294 | train loss 3.364305 | norm 0.1208 | lr 8.74e-04 | (9898.70 ms | 52965 tok/s) step 7698/76294 | train loss 3.325634 | norm 0.1515 | lr 8.74e-04 | (9910.30 ms | 52903 tok/s) step 7699/76294 | train loss 3.165259 | norm 0.1286 | lr 8.74e-04 | (9895.62 ms | 52982 tok/s) step 7700/76294 | train loss 3.443427 | norm 0.1527 | lr 8.74e-04 | (9888.72 ms | 53019 tok/s) step 7701/76294 | train loss 3.193736 | norm 0.1214 | lr 8.73e-04 | (9896.97 ms | 52975 tok/s) step 7702/76294 | train loss 3.269704 | norm 0.1318 | lr 8.73e-04 | (9888.24 ms | 53021 tok/s) step 7703/76294 | train loss 3.263680 | norm 0.1279 | lr 8.73e-04 | (9902.25 ms | 52946 tok/s) step 7704/76294 | train loss 3.238547 | norm 0.1250 | lr 8.73e-04 | (9922.35 ms | 52839 tok/s) step 7705/76294 | train loss 3.243932 | norm 0.1363 | lr 8.73e-04 | (9890.99 ms | 53007 tok/s) step 7706/76294 | train loss 3.316519 | norm 0.1447 | lr 8.73e-04 | (9892.05 ms | 53001 tok/s) step 7707/76294 | train loss 3.236391 | norm 0.1353 | lr 8.73e-04 | (9894.62 ms | 52987 tok/s) step 7708/76294 | train loss 3.289957 | norm 0.1322 | lr 8.73e-04 | (9897.56 ms | 52971 tok/s) step 7709/76294 | train loss 3.337122 | norm 0.1591 | lr 8.73e-04 | (9892.17 ms | 53000 tok/s) step 7710/76294 | train loss 3.246439 | norm 0.1302 | lr 8.73e-04 | (9913.21 ms | 52888 tok/s) step 7711/76294 | train loss 3.274388 | norm 0.1547 | lr 8.73e-04 | (9886.93 ms | 53028 tok/s) step 7712/76294 | train loss 3.295572 | norm 0.1303 | lr 8.73e-04 | (11123.78 ms | 47132 tok/s) step 7713/76294 | train loss 3.272541 | norm 0.1484 | lr 8.72e-04 | (9875.50 ms | 53090 tok/s) step 7714/76294 | train loss 3.292718 | norm 0.1352 | lr 8.72e-04 | (9882.43 ms | 53053 tok/s) step 7715/76294 | train loss 3.201602 | norm 0.1348 | lr 8.72e-04 | 
(9925.47 ms | 52822 tok/s) step 7716/76294 | train loss 3.311138 | norm 0.1440 | lr 8.72e-04 | (9884.09 ms | 53044 tok/s) step 7717/76294 | train loss 3.299023 | norm 0.1291 | lr 8.72e-04 | (9925.65 ms | 52822 tok/s) step 7718/76294 | train loss 3.218154 | norm 0.1381 | lr 8.72e-04 | (9941.40 ms | 52738 tok/s) step 7719/76294 | train loss 3.289264 | norm 0.1598 | lr 8.72e-04 | (9916.05 ms | 52873 tok/s) step 7720/76294 | train loss 3.251606 | norm 0.1123 | lr 8.72e-04 | (9889.43 ms | 53015 tok/s) step 7721/76294 | train loss 3.268476 | norm 0.1474 | lr 8.72e-04 | (9890.18 ms | 53011 tok/s) step 7722/76294 | train loss 3.286236 | norm 0.1287 | lr 8.72e-04 | (9916.27 ms | 52872 tok/s) step 7723/76294 | train loss 3.292675 | norm 0.1434 | lr 8.72e-04 | (9894.62 ms | 52987 tok/s) step 7724/76294 | train loss 3.346372 | norm 0.1394 | lr 8.72e-04 | (9946.11 ms | 52713 tok/s) step 7725/76294 | train loss 3.242698 | norm 0.1217 | lr 8.71e-04 | (9884.46 ms | 53042 tok/s) step 7726/76294 | train loss 3.298087 | norm 0.1351 | lr 8.71e-04 | (9915.06 ms | 52878 tok/s) step 7727/76294 | train loss 3.251212 | norm 0.1336 | lr 8.71e-04 | (9889.07 ms | 53017 tok/s) step 7728/76294 | train loss 3.224690 | norm 0.1348 | lr 8.71e-04 | (9895.79 ms | 52981 tok/s) step 7729/76294 | train loss 3.250001 | norm 0.1360 | lr 8.71e-04 | (9889.78 ms | 53013 tok/s) step 7730/76294 | train loss 3.220966 | norm 0.1383 | lr 8.71e-04 | (9897.37 ms | 52972 tok/s) step 7731/76294 | train loss 3.294603 | norm 0.1505 | lr 8.71e-04 | (9920.09 ms | 52851 tok/s) step 7732/76294 | train loss 3.253654 | norm 0.1509 | lr 8.71e-04 | (9880.79 ms | 53061 tok/s) step 7733/76294 | train loss 3.216636 | norm 0.1326 | lr 8.71e-04 | (9889.12 ms | 53017 tok/s) step 7734/76294 | train loss 3.250812 | norm 0.1375 | lr 8.71e-04 | (9893.66 ms | 52992 tok/s) step 7735/76294 | train loss 3.305298 | norm 0.1589 | lr 8.71e-04 | (10233.76 ms | 51231 tok/s) step 7736/76294 | train loss 3.288075 | norm 0.1317 | lr 8.71e-04 | 
(9889.23 ms | 53016 tok/s) step 7737/76294 | train loss 3.240561 | norm 0.1285 | lr 8.70e-04 | (9930.53 ms | 52796 tok/s) step 7738/76294 | train loss 3.259034 | norm 0.1402 | lr 8.70e-04 | (9883.12 ms | 53049 tok/s) step 7739/76294 | train loss 3.264773 | norm 0.1286 | lr 8.70e-04 | (9887.81 ms | 53024 tok/s) step 7740/76294 | train loss 3.293315 | norm 0.1393 | lr 8.70e-04 | (9886.11 ms | 53033 tok/s) step 7741/76294 | train loss 3.449672 | norm 0.1665 | lr 8.70e-04 | (9910.68 ms | 52901 tok/s) step 7742/76294 | train loss 3.236239 | norm 0.1415 | lr 8.70e-04 | (9879.09 ms | 53070 tok/s) step 7743/76294 | train loss 3.300210 | norm 0.1461 | lr 8.70e-04 | (9889.34 ms | 53015 tok/s) step 7744/76294 | train loss 3.250338 | norm 0.1371 | lr 8.70e-04 | (9879.85 ms | 53066 tok/s) step 7745/76294 | train loss 3.244680 | norm 0.1476 | lr 8.70e-04 | (9891.58 ms | 53003 tok/s) step 7746/76294 | train loss 3.297799 | norm 0.1452 | lr 8.70e-04 | (9879.01 ms | 53071 tok/s) step 7747/76294 | train loss 3.244769 | norm 0.1389 | lr 8.70e-04 | (9894.75 ms | 52986 tok/s) step 7748/76294 | train loss 3.228033 | norm 0.1342 | lr 8.70e-04 | (9880.55 ms | 53063 tok/s) step 7749/76294 | train loss 3.238527 | norm 0.1274 | lr 8.69e-04 | (9883.35 ms | 53048 tok/s) step 7750/76294 | train loss 3.316863 | norm 0.1416 | lr 8.69e-04 | (9889.08 ms | 53017 tok/s) val loss: 3.247243 saving model checkpoint to ./results/gpt2-350M-gqa/step_7750.pth step 7751/76294 | train loss 3.256718 | norm 0.1262 | lr 8.69e-04 | (9949.59 ms | 52694 tok/s) step 7752/76294 | train loss 3.278781 | norm 0.1388 | lr 8.69e-04 | (9860.55 ms | 53170 tok/s) step 7753/76294 | train loss 3.297040 | norm 0.1522 | lr 8.69e-04 | (9869.62 ms | 53121 tok/s) step 7754/76294 | train loss 3.312103 | norm 0.1527 | lr 8.69e-04 | (9873.17 ms | 53102 tok/s) step 7755/76294 | train loss 3.345464 | norm 0.1268 | lr 8.69e-04 | (9881.30 ms | 53059 tok/s) step 7756/76294 | train loss 3.282274 | norm 0.1277 | lr 8.69e-04 | (9886.46 ms | 
53031 tok/s) step 7757/76294 | train loss 3.248953 | norm 0.1343 | lr 8.69e-04 | (9893.72 ms | 52992 tok/s) step 7758/76294 | train loss 3.228115 | norm 0.1208 | lr 8.69e-04 | (9914.27 ms | 52882 tok/s) step 7759/76294 | train loss 3.390464 | norm 0.1443 | lr 8.69e-04 | (9942.85 ms | 52730 tok/s) step 7760/76294 | train loss 3.283742 | norm 0.1334 | lr 8.69e-04 | (9886.33 ms | 53032 tok/s) step 7761/76294 | train loss 3.255707 | norm 0.1213 | lr 8.68e-04 | (9886.56 ms | 53030 tok/s) step 7762/76294 | train loss 3.262722 | norm 0.1240 | lr 8.68e-04 | (9960.69 ms | 52636 tok/s) step 7763/76294 | train loss 3.308214 | norm 0.1300 | lr 8.68e-04 | (9897.16 ms | 52974 tok/s) step 7764/76294 | train loss 3.224608 | norm 0.1353 | lr 8.68e-04 | (9925.05 ms | 52825 tok/s) step 7765/76294 | train loss 3.253347 | norm 0.1302 | lr 8.68e-04 | (9891.13 ms | 53006 tok/s) step 7766/76294 | train loss 3.265600 | norm 0.1236 | lr 8.68e-04 | (9899.98 ms | 52959 tok/s) step 7767/76294 | train loss 3.241508 | norm 0.1212 | lr 8.68e-04 | (9908.12 ms | 52915 tok/s) step 7768/76294 | train loss 3.259507 | norm 0.1213 | lr 8.68e-04 | (9897.50 ms | 52972 tok/s) step 7769/76294 | train loss 3.322502 | norm 0.1210 | lr 8.68e-04 | (9902.58 ms | 52945 tok/s) step 7770/76294 | train loss 3.294171 | norm 0.1293 | lr 8.68e-04 | (9894.90 ms | 52986 tok/s) step 7771/76294 | train loss 3.262151 | norm 0.1331 | lr 8.68e-04 | (9942.65 ms | 52731 tok/s) step 7772/76294 | train loss 3.287437 | norm 0.1277 | lr 8.68e-04 | (9897.69 ms | 52971 tok/s) step 7773/76294 | train loss 3.235582 | norm 0.1494 | lr 8.67e-04 | (9890.99 ms | 53007 tok/s) step 7774/76294 | train loss 3.236570 | norm 0.1292 | lr 8.67e-04 | (9919.29 ms | 52855 tok/s) step 7775/76294 | train loss 3.252326 | norm 0.1334 | lr 8.67e-04 | (9893.31 ms | 52994 tok/s) step 7776/76294 | train loss 3.285914 | norm 0.1348 | lr 8.67e-04 | (9900.25 ms | 52957 tok/s) step 7777/76294 | train loss 3.247579 | norm 0.1226 | lr 8.67e-04 | (9949.16 ms | 
52697 tok/s) step 7778/76294 | train loss 3.318765 | norm 0.1573 | lr 8.67e-04 | (9891.24 ms | 53005 tok/s) step 7779/76294 | train loss 3.278177 | norm 0.1819 | lr 8.67e-04 | (9914.20 ms | 52883 tok/s) step 7780/76294 | train loss 3.258071 | norm 0.1591 | lr 8.67e-04 | (9893.09 ms | 52995 tok/s) step 7781/76294 | train loss 3.333555 | norm 0.1450 | lr 8.67e-04 | (9900.54 ms | 52956 tok/s) step 7782/76294 | train loss 3.220973 | norm 0.1449 | lr 8.67e-04 | (9899.86 ms | 52959 tok/s) step 7783/76294 | train loss 3.270170 | norm 0.1442 | lr 8.67e-04 | (9909.24 ms | 52909 tok/s) step 7784/76294 | train loss 3.247217 | norm 0.1658 | lr 8.67e-04 | (9888.92 ms | 53018 tok/s) step 7785/76294 | train loss 3.361921 | norm 0.1306 | lr 8.66e-04 | (9892.65 ms | 52998 tok/s) step 7786/76294 | train loss 3.227496 | norm 0.1443 | lr 8.66e-04 | (9891.74 ms | 53003 tok/s) step 7787/76294 | train loss 3.290879 | norm 0.1340 | lr 8.66e-04 | (9940.53 ms | 52742 tok/s) step 7788/76294 | train loss 3.224871 | norm 0.1329 | lr 8.66e-04 | (9894.74 ms | 52987 tok/s) step 7789/76294 | train loss 3.312862 | norm 0.1369 | lr 8.66e-04 | (9916.93 ms | 52868 tok/s) step 7790/76294 | train loss 3.414935 | norm 0.1338 | lr 8.66e-04 | (9892.83 ms | 52997 tok/s) step 7791/76294 | train loss 3.268074 | norm 0.1389 | lr 8.66e-04 | (9901.48 ms | 52950 tok/s) step 7792/76294 | train loss 3.237870 | norm 0.1296 | lr 8.66e-04 | (9890.78 ms | 53008 tok/s) step 7793/76294 | train loss 3.281486 | norm 0.1382 | lr 8.66e-04 | (9897.36 ms | 52973 tok/s) step 7794/76294 | train loss 3.202607 | norm 0.1244 | lr 8.66e-04 | (9893.30 ms | 52994 tok/s) step 7795/76294 | train loss 3.309257 | norm 0.1391 | lr 8.66e-04 | (9936.76 ms | 52762 tok/s) step 7796/76294 | train loss 3.252723 | norm 0.1255 | lr 8.66e-04 | (9990.48 ms | 52479 tok/s) step 7797/76294 | train loss 3.265738 | norm 0.1373 | lr 8.65e-04 | (9944.14 ms | 52723 tok/s) step 7798/76294 | train loss 3.248968 | norm 0.1153 | lr 8.65e-04 | (9905.13 ms | 
52931 tok/s) step 7799/76294 | train loss 3.293457 | norm 0.1299 | lr 8.65e-04 | (9938.40 ms | 52754 tok/s) step 7800/76294 | train loss 3.259609 | norm 0.1209 | lr 8.65e-04 | (9891.39 ms | 53004 tok/s) step 7801/76294 | train loss 3.262108 | norm 0.1178 | lr 8.65e-04 | (9904.31 ms | 52935 tok/s) step 7802/76294 | train loss 3.223411 | norm 0.1314 | lr 8.65e-04 | (9891.36 ms | 53005 tok/s) step 7803/76294 | train loss 3.294634 | norm 0.1143 | lr 8.65e-04 | (9901.19 ms | 52952 tok/s) step 7804/76294 | train loss 3.302446 | norm 0.1352 | lr 8.65e-04 | (9891.56 ms | 53004 tok/s) step 7805/76294 | train loss 3.245251 | norm 0.1129 | lr 8.65e-04 | (9900.73 ms | 52954 tok/s) step 7806/76294 | train loss 3.175932 | norm 0.1251 | lr 8.65e-04 | (9893.44 ms | 52994 tok/s) step 7807/76294 | train loss 3.316283 | norm 0.1341 | lr 8.65e-04 | (9902.56 ms | 52945 tok/s) step 7808/76294 | train loss 3.261807 | norm 0.1356 | lr 8.65e-04 | (9892.14 ms | 53000 tok/s) step 7809/76294 | train loss 3.237404 | norm 0.1251 | lr 8.64e-04 | (9903.66 ms | 52939 tok/s) step 7810/76294 | train loss 3.368726 | norm 0.1273 | lr 8.64e-04 | (11577.85 ms | 45284 tok/s) step 7811/76294 | train loss 3.259247 | norm 0.1338 | lr 8.64e-04 | (9885.82 ms | 53034 tok/s) step 7812/76294 | train loss 3.236107 | norm 0.1336 | lr 8.64e-04 | (9893.96 ms | 52991 tok/s) step 7813/76294 | train loss 3.233718 | norm 0.1344 | lr 8.64e-04 | (9886.55 ms | 53030 tok/s) step 7814/76294 | train loss 3.279117 | norm 0.1250 | lr 8.64e-04 | (9922.89 ms | 52836 tok/s) step 7815/76294 | train loss 3.268990 | norm 0.1299 | lr 8.64e-04 | (9894.47 ms | 52988 tok/s) step 7816/76294 | train loss 3.285440 | norm 0.1398 | lr 8.64e-04 | (9928.78 ms | 52805 tok/s) step 7817/76294 | train loss 3.299476 | norm 0.1482 | lr 8.64e-04 | (9892.18 ms | 53000 tok/s) step 7818/76294 | train loss 3.205873 | norm 0.1371 | lr 8.64e-04 | (9900.30 ms | 52957 tok/s) step 7819/76294 | train loss 3.204968 | norm 0.1247 | lr 8.64e-04 | (9898.05 ms | 
52969 tok/s) step 7820/76294 | train loss 3.366913 | norm 0.1273 | lr 8.64e-04 | (9953.40 ms | 52674 tok/s) step 7821/76294 | train loss 3.261438 | norm 0.1225 | lr 8.63e-04 | (9902.65 ms | 52944 tok/s) step 7822/76294 | train loss 3.325919 | norm 0.1298 | lr 8.63e-04 | (9957.92 ms | 52650 tok/s) step 7823/76294 | train loss 3.245178 | norm 0.1152 | lr 8.63e-04 | (9893.01 ms | 52996 tok/s) step 7824/76294 | train loss 3.242439 | norm 0.1281 | lr 8.63e-04 | (10041.36 ms | 52213 tok/s) step 7825/76294 | train loss 3.175401 | norm 0.1210 | lr 8.63e-04 | (9967.73 ms | 52599 tok/s) step 7826/76294 | train loss 3.301439 | norm 0.1318 | lr 8.63e-04 | (9953.28 ms | 52675 tok/s) step 7827/76294 | train loss 3.258727 | norm 0.1286 | lr 8.63e-04 | (9891.02 ms | 53006 tok/s) step 7828/76294 | train loss 3.241820 | norm 0.1170 | lr 8.63e-04 | (9904.69 ms | 52933 tok/s) step 7829/76294 | train loss 3.212684 | norm 0.1264 | lr 8.63e-04 | (9933.96 ms | 52777 tok/s) step 7830/76294 | train loss 3.216843 | norm 0.1322 | lr 8.63e-04 | (9886.79 ms | 53029 tok/s) step 7831/76294 | train loss 3.178046 | norm 0.1305 | lr 8.63e-04 | (9955.63 ms | 52662 tok/s) step 7832/76294 | train loss 3.253473 | norm 0.1278 | lr 8.63e-04 | (9891.66 ms | 53003 tok/s) step 7833/76294 | train loss 3.263523 | norm 0.1383 | lr 8.62e-04 | (9919.02 ms | 52857 tok/s) step 7834/76294 | train loss 3.271972 | norm 0.1299 | lr 8.62e-04 | (11668.35 ms | 44933 tok/s) step 7835/76294 | train loss 3.251920 | norm 0.1277 | lr 8.62e-04 | (9873.67 ms | 53100 tok/s) step 7836/76294 | train loss 3.282297 | norm 0.1244 | lr 8.62e-04 | (9983.39 ms | 52516 tok/s) step 7837/76294 | train loss 3.298968 | norm 0.1448 | lr 8.62e-04 | (9888.50 ms | 53020 tok/s) step 7838/76294 | train loss 3.260879 | norm 0.1245 | lr 8.62e-04 | (9906.41 ms | 52924 tok/s) step 7839/76294 | train loss 3.305062 | norm 0.1251 | lr 8.62e-04 | (9888.50 ms | 53020 tok/s) step 7840/76294 | train loss 3.283012 | norm 0.1480 | lr 8.62e-04 | (9890.97 ms | 
53007 tok/s) step 7841/76294 | train loss 3.246436 | norm 0.1402 | lr 8.62e-04 | (9895.12 ms | 52984 tok/s) step 7842/76294 | train loss 3.228855 | norm 0.1256 | lr 8.62e-04 | (9897.43 ms | 52972 tok/s) step 7843/76294 | train loss 3.290095 | norm 0.1299 | lr 8.62e-04 | (11240.45 ms | 46643 tok/s) step 7844/76294 | train loss 3.137153 | norm 0.1318 | lr 8.62e-04 | (9881.34 ms | 53058 tok/s) step 7845/76294 | train loss 3.247801 | norm 0.1219 | lr 8.61e-04 | (9897.89 ms | 52970 tok/s) step 7846/76294 | train loss 3.230697 | norm 0.1252 | lr 8.61e-04 | (11304.81 ms | 46377 tok/s) step 7847/76294 | train loss 3.236266 | norm 0.1348 | lr 8.61e-04 | (9877.61 ms | 53078 tok/s) step 7848/76294 | train loss 3.252581 | norm 0.1208 | lr 8.61e-04 | (9895.37 ms | 52983 tok/s) step 7849/76294 | train loss 3.324962 | norm 0.1188 | lr 8.61e-04 | (9924.56 ms | 52827 tok/s) step 7850/76294 | train loss 3.243164 | norm 0.1298 | lr 8.61e-04 | (9891.45 ms | 53004 tok/s) step 7851/76294 | train loss 3.298718 | norm 0.1430 | lr 8.61e-04 | (9897.89 ms | 52970 tok/s) step 7852/76294 | train loss 3.254091 | norm 0.1319 | lr 8.61e-04 | (9928.02 ms | 52809 tok/s) step 7853/76294 | train loss 3.288354 | norm 0.1380 | lr 8.61e-04 | (10103.93 ms | 51889 tok/s) step 7854/76294 | train loss 3.279297 | norm 0.1458 | lr 8.61e-04 | (9894.78 ms | 52986 tok/s) step 7855/76294 | train loss 3.266661 | norm 0.1273 | lr 8.61e-04 | (9903.06 ms | 52942 tok/s) step 7856/76294 | train loss 3.242055 | norm 0.1320 | lr 8.61e-04 | (9892.39 ms | 52999 tok/s) step 7857/76294 | train loss 3.252926 | norm 0.1370 | lr 8.60e-04 | (9930.83 ms | 52794 tok/s) step 7858/76294 | train loss 3.265824 | norm 0.1301 | lr 8.60e-04 | (9929.71 ms | 52800 tok/s) step 7859/76294 | train loss 3.350096 | norm 0.1368 | lr 8.60e-04 | (9892.08 ms | 53001 tok/s) step 7860/76294 | train loss 3.242284 | norm 0.1227 | lr 8.60e-04 | (9904.40 ms | 52935 tok/s) step 7861/76294 | train loss 3.215160 | norm 0.1706 | lr 8.60e-04 | (9893.06 ms | 
52996 tok/s) step 7862/76294 | train loss 3.324773 | norm 0.1403 | lr 8.60e-04 | (9897.53 ms | 52972 tok/s) step 7863/76294 | train loss 3.263128 | norm 0.1315 | lr 8.60e-04 | (9893.12 ms | 52995 tok/s) step 7864/76294 | train loss 3.300329 | norm 0.1411 | lr 8.60e-04 | (9896.87 ms | 52975 tok/s) step 7865/76294 | train loss 3.309639 | norm 0.1485 | lr 8.60e-04 | (9895.37 ms | 52983 tok/s) step 7866/76294 | train loss 3.282158 | norm 0.1543 | lr 8.60e-04 | (9898.84 ms | 52965 tok/s) step 7867/76294 | train loss 3.255615 | norm 0.1307 | lr 8.60e-04 | (10086.78 ms | 51978 tok/s) step 7868/76294 | train loss 3.243814 | norm 0.1665 | lr 8.60e-04 | (9894.75 ms | 52986 tok/s) step 7869/76294 | train loss 3.253864 | norm 0.1433 | lr 8.59e-04 | (9960.46 ms | 52637 tok/s) step 7870/76294 | train loss 3.243152 | norm 0.1383 | lr 8.59e-04 | (9893.52 ms | 52993 tok/s) step 7871/76294 | train loss 3.285479 | norm 0.1335 | lr 8.59e-04 | (9896.40 ms | 52978 tok/s) step 7872/76294 | train loss 3.393458 | norm 0.1356 | lr 8.59e-04 | (9953.64 ms | 52673 tok/s) step 7873/76294 | train loss 3.305602 | norm 0.1547 | lr 8.59e-04 | (9896.39 ms | 52978 tok/s) step 7874/76294 | train loss 3.281694 | norm 0.1512 | lr 8.59e-04 | (9884.80 ms | 53040 tok/s) step 7875/76294 | train loss 3.290156 | norm 0.1561 | lr 8.59e-04 | (9952.47 ms | 52679 tok/s) step 7876/76294 | train loss 3.299705 | norm 0.1634 | lr 8.59e-04 | (9887.62 ms | 53025 tok/s) step 7877/76294 | train loss 3.342490 | norm 0.1688 | lr 8.59e-04 | (9886.16 ms | 53033 tok/s) step 7878/76294 | train loss 3.268337 | norm 0.1746 | lr 8.59e-04 | (9899.07 ms | 52963 tok/s) step 7879/76294 | train loss 3.255589 | norm 0.1600 | lr 8.59e-04 | (9970.05 ms | 52586 tok/s) step 7880/76294 | train loss 3.269769 | norm 0.1412 | lr 8.59e-04 | (9887.27 ms | 53027 tok/s) step 7881/76294 | train loss 3.241650 | norm 0.1563 | lr 8.58e-04 | (9895.59 ms | 52982 tok/s) step 7882/76294 | train loss 3.283968 | norm 0.1347 | lr 8.58e-04 | (9886.93 ms | 
53028 tok/s) step 7883/76294 | train loss 3.250845 | norm 0.1327 | lr 8.58e-04 | (9901.15 ms | 52952 tok/s) step 7884/76294 | train loss 3.364735 | norm 0.1408 | lr 8.58e-04 | (9923.45 ms | 52833 tok/s) step 7885/76294 | train loss 3.268949 | norm 0.1265 | lr 8.58e-04 | (9887.14 ms | 53027 tok/s) step 7886/76294 | train loss 3.276557 | norm 0.1323 | lr 8.58e-04 | (9948.26 ms | 52701 tok/s) step 7887/76294 | train loss 3.277923 | norm 0.1192 | lr 8.58e-04 | (9891.43 ms | 53004 tok/s) step 7888/76294 | train loss 3.271025 | norm 0.1221 | lr 8.58e-04 | (9886.83 ms | 53029 tok/s) step 7889/76294 | train loss 3.274788 | norm 0.1289 | lr 8.58e-04 | (9896.10 ms | 52979 tok/s) step 7890/76294 | train loss 3.293225 | norm 0.1182 | lr 8.58e-04 | (9927.95 ms | 52809 tok/s) step 7891/76294 | train loss 3.274376 | norm 0.1234 | lr 8.58e-04 | (9888.98 ms | 53017 tok/s) step 7892/76294 | train loss 3.201298 | norm 0.1320 | lr 8.58e-04 | (10237.82 ms | 51211 tok/s) step 7893/76294 | train loss 3.216426 | norm 0.1355 | lr 8.57e-04 | (9891.10 ms | 53006 tok/s) step 7894/76294 | train loss 3.196283 | norm 0.1225 | lr 8.57e-04 | (9921.11 ms | 52846 tok/s) step 7895/76294 | train loss 3.208534 | norm 0.1356 | lr 8.57e-04 | (9893.55 ms | 52993 tok/s) step 7896/76294 | train loss 3.296464 | norm 0.1265 | lr 8.57e-04 | (9933.30 ms | 52781 tok/s) step 7897/76294 | train loss 3.252716 | norm 0.1181 | lr 8.57e-04 | (9890.60 ms | 53009 tok/s) step 7898/76294 | train loss 3.178719 | norm 0.1339 | lr 8.57e-04 | (9913.27 ms | 52887 tok/s) step 7899/76294 | train loss 3.245443 | norm 0.1223 | lr 8.57e-04 | (9889.43 ms | 53015 tok/s) step 7900/76294 | train loss 3.338767 | norm 0.1260 | lr 8.57e-04 | (9908.72 ms | 52912 tok/s) step 7901/76294 | train loss 3.380315 | norm 0.1175 | lr 8.57e-04 | (9886.56 ms | 53030 tok/s) step 7902/76294 | train loss 3.283947 | norm 0.1151 | lr 8.57e-04 | (9898.92 ms | 52964 tok/s) step 7903/76294 | train loss 3.200204 | norm 0.1498 | lr 8.57e-04 | (9889.94 ms | 
53012 tok/s) step 7904/76294 | train loss 3.269087 | norm 0.1326 | lr 8.57e-04 | (9907.41 ms | 52919 tok/s) step 7905/76294 | train loss 3.398849 | norm 0.1322 | lr 8.56e-04 | (9909.83 ms | 52906 tok/s) step 7906/76294 | train loss 3.233417 | norm 0.1351 | lr 8.56e-04 | (9886.08 ms | 53033 tok/s) step 7907/76294 | train loss 3.326900 | norm 0.1294 | lr 8.56e-04 | (9890.98 ms | 53007 tok/s) step 7908/76294 | train loss 3.345287 | norm 0.1400 | lr 8.56e-04 | (11624.89 ms | 45100 tok/s) step 7909/76294 | train loss 3.310822 | norm 0.1372 | lr 8.56e-04 | (9873.18 ms | 53102 tok/s) step 7910/76294 | train loss 3.292746 | norm 0.1496 | lr 8.56e-04 | (9899.06 ms | 52963 tok/s) step 7911/76294 | train loss 3.334730 | norm 0.1639 | lr 8.56e-04 | (9883.54 ms | 53047 tok/s) step 7912/76294 | train loss 3.255996 | norm 0.1491 | lr 8.56e-04 | (9884.78 ms | 53040 tok/s) step 7913/76294 | train loss 3.237734 | norm 0.1414 | lr 8.56e-04 | (9881.16 ms | 53059 tok/s) step 7914/76294 | train loss 3.229968 | norm 0.1331 | lr 8.56e-04 | (9894.11 ms | 52990 tok/s) step 7915/76294 | train loss 3.207753 | norm 0.1262 | lr 8.56e-04 | (9884.01 ms | 53044 tok/s) step 7916/76294 | train loss 3.257612 | norm 0.1274 | lr 8.56e-04 | (9892.27 ms | 53000 tok/s) step 7917/76294 | train loss 3.267329 | norm 0.1298 | lr 8.55e-04 | (9881.76 ms | 53056 tok/s) step 7918/76294 | train loss 3.214777 | norm 0.1257 | lr 8.55e-04 | (9896.17 ms | 52979 tok/s) step 7919/76294 | train loss 3.270512 | norm 0.1267 | lr 8.55e-04 | (9909.81 ms | 52906 tok/s) step 7920/76294 | train loss 3.332581 | norm 0.1435 | lr 8.55e-04 | (9925.62 ms | 52822 tok/s) step 7921/76294 | train loss 3.233904 | norm 0.1343 | lr 8.55e-04 | (9896.35 ms | 52978 tok/s) step 7922/76294 | train loss 3.370049 | norm 0.1422 | lr 8.55e-04 | (9899.19 ms | 52963 tok/s) step 7923/76294 | train loss 3.316767 | norm 0.1238 | lr 8.55e-04 | (9895.25 ms | 52984 tok/s) step 7924/76294 | train loss 3.266843 | norm 0.1252 | lr 8.55e-04 | (9949.30 ms | 
52696 tok/s) step 7925/76294 | train loss 3.261959 | norm 0.1321 | lr 8.55e-04 | (9890.44 ms | 53010 tok/s) step 7926/76294 | train loss 3.239257 | norm 0.1255 | lr 8.55e-04 | (10633.40 ms | 49306 tok/s) step 7927/76294 | train loss 3.236714 | norm 0.1409 | lr 8.55e-04 | (9896.67 ms | 52976 tok/s) step 7928/76294 | train loss 3.249436 | norm 0.1480 | lr 8.55e-04 | (9882.26 ms | 53053 tok/s) step 7929/76294 | train loss 3.216664 | norm 0.1186 | lr 8.54e-04 | (9888.61 ms | 53019 tok/s) step 7930/76294 | train loss 3.242330 | norm 0.1329 | lr 8.54e-04 | (9897.06 ms | 52974 tok/s) step 7931/76294 | train loss 3.236584 | norm 0.1240 | lr 8.54e-04 | (9882.74 ms | 53051 tok/s) step 7932/76294 | train loss 3.250116 | norm 0.1273 | lr 8.54e-04 | (9919.88 ms | 52852 tok/s) step 7933/76294 | train loss 3.263181 | norm 0.1342 | lr 8.54e-04 | (9950.97 ms | 52687 tok/s) step 7934/76294 | train loss 3.244299 | norm 0.1214 | lr 8.54e-04 | (9891.38 ms | 53005 tok/s) step 7935/76294 | train loss 3.271161 | norm 0.1306 | lr 8.54e-04 | (9905.82 ms | 52927 tok/s) step 7936/76294 | train loss 3.229594 | norm 0.1242 | lr 8.54e-04 | (9924.77 ms | 52826 tok/s) step 7937/76294 | train loss 3.235359 | norm 0.1435 | lr 8.54e-04 | (9889.78 ms | 53013 tok/s) step 7938/76294 | train loss 3.226375 | norm 0.1283 | lr 8.54e-04 | (9896.85 ms | 52975 tok/s) step 7939/76294 | train loss 3.282032 | norm 0.1496 | lr 8.54e-04 | (9923.35 ms | 52834 tok/s) step 7940/76294 | train loss 3.169876 | norm 0.1528 | lr 8.54e-04 | (9905.62 ms | 52928 tok/s) step 7941/76294 | train loss 3.255731 | norm 0.1185 | lr 8.53e-04 | (9889.88 ms | 53013 tok/s) step 7942/76294 | train loss 3.188722 | norm 0.1544 | lr 8.53e-04 | (9889.72 ms | 53013 tok/s) step 7943/76294 | train loss 3.225716 | norm 0.1184 | lr 8.53e-04 | (9959.35 ms | 52643 tok/s) step 7944/76294 | train loss 3.255751 | norm 0.1449 | lr 8.53e-04 | (9898.36 ms | 52967 tok/s) step 7945/76294 | train loss 3.279687 | norm 0.1306 | lr 8.53e-04 | (9894.26 ms | 
52989 tok/s) step 7946/76294 | train loss 3.180345 | norm 0.1361 | lr 8.53e-04 | (9924.27 ms | 52829 tok/s) step 7947/76294 | train loss 3.342836 | norm 0.1506 | lr 8.53e-04 | (9891.37 ms | 53005 tok/s) step 7948/76294 | train loss 3.217660 | norm 0.1215 | lr 8.53e-04 | (9893.48 ms | 52993 tok/s) step 7949/76294 | train loss 3.264027 | norm 0.1339 | lr 8.53e-04 | (9890.42 ms | 53010 tok/s) step 7950/76294 | train loss 3.231720 | norm 0.1264 | lr 8.53e-04 | (9899.80 ms | 52959 tok/s) step 7951/76294 | train loss 3.303107 | norm 0.1422 | lr 8.53e-04 | (9895.53 ms | 52982 tok/s) step 7952/76294 | train loss 3.233170 | norm 0.1391 | lr 8.53e-04 | (9897.60 ms | 52971 tok/s) step 7953/76294 | train loss 3.264048 | norm 0.1375 | lr 8.52e-04 | (9892.34 ms | 52999 tok/s) step 7954/76294 | train loss 3.149355 | norm 0.1316 | lr 8.52e-04 | (9896.46 ms | 52977 tok/s) step 7955/76294 | train loss 3.268798 | norm 0.1296 | lr 8.52e-04 | (9899.23 ms | 52962 tok/s) step 7956/76294 | train loss 3.206178 | norm 0.1277 | lr 8.52e-04 | (9933.67 ms | 52779 tok/s) step 7957/76294 | train loss 3.377098 | norm 0.1375 | lr 8.52e-04 | (9924.78 ms | 52826 tok/s) step 7958/76294 | train loss 3.204132 | norm 0.1206 | lr 8.52e-04 | (9927.87 ms | 52810 tok/s) step 7959/76294 | train loss 3.242843 | norm 0.1345 | lr 8.52e-04 | (9915.28 ms | 52877 tok/s) step 7960/76294 | train loss 3.234653 | norm 0.1364 | lr 8.52e-04 | (9895.30 ms | 52984 tok/s) step 7961/76294 | train loss 3.242543 | norm 0.1264 | lr 8.52e-04 | (9898.22 ms | 52968 tok/s) step 7962/76294 | train loss 3.178450 | norm 0.1217 | lr 8.52e-04 | (9896.83 ms | 52975 tok/s) step 7963/76294 | train loss 3.255520 | norm 0.1207 | lr 8.52e-04 | (9892.40 ms | 52999 tok/s) step 7964/76294 | train loss 3.187318 | norm 0.1359 | lr 8.51e-04 | (9896.98 ms | 52975 tok/s) step 7965/76294 | train loss 3.239633 | norm 0.1217 | lr 8.51e-04 | (9901.84 ms | 52949 tok/s) step 7966/76294 | train loss 3.296863 | norm 0.1378 | lr 8.51e-04 | (9931.29 ms | 
52792 tok/s) step 7967/76294 | train loss 3.237665 | norm 0.1406 | lr 8.51e-04 | (9892.37 ms | 52999 tok/s) step 7968/76294 | train loss 3.183410 | norm 0.1461 | lr 8.51e-04 | (9904.14 ms | 52936 tok/s) step 7969/76294 | train loss 3.280257 | norm 0.1287 | lr 8.51e-04 | (9953.64 ms | 52673 tok/s) step 7970/76294 | train loss 3.237768 | norm 0.1317 | lr 8.51e-04 | (9886.43 ms | 53031 tok/s) step 7971/76294 | train loss 3.291039 | norm 0.1233 | lr 8.51e-04 | (9900.77 ms | 52954 tok/s) step 7972/76294 | train loss 3.197431 | norm 0.1314 | lr 8.51e-04 | (9886.78 ms | 53029 tok/s) step 7973/76294 | train loss 3.247334 | norm 0.1246 | lr 8.51e-04 | (9940.68 ms | 52742 tok/s) step 7974/76294 | train loss 3.190352 | norm 0.1628 | lr 8.51e-04 | (9894.81 ms | 52986 tok/s) step 7975/76294 | train loss 3.218643 | norm 0.1348 | lr 8.51e-04 | (9891.06 ms | 53006 tok/s) step 7976/76294 | train loss 3.232669 | norm 0.1577 | lr 8.50e-04 | (9942.57 ms | 52732 tok/s) step 7977/76294 | train loss 3.216694 | norm 0.1395 | lr 8.50e-04 | (9906.08 ms | 52926 tok/s) step 7978/76294 | train loss 3.193039 | norm 0.1461 | lr 8.50e-04 | (9904.63 ms | 52934 tok/s) step 7979/76294 | train loss 3.237452 | norm 0.1527 | lr 8.50e-04 | (9894.50 ms | 52988 tok/s) step 7980/76294 | train loss 3.233475 | norm 0.1381 | lr 8.50e-04 | (9899.39 ms | 52962 tok/s) step 7981/76294 | train loss 3.242537 | norm 0.1511 | lr 8.50e-04 | (9963.06 ms | 52623 tok/s) step 7982/76294 | train loss 3.335124 | norm 0.1481 | lr 8.50e-04 | (9886.58 ms | 53030 tok/s) step 7983/76294 | train loss 3.206693 | norm 0.1431 | lr 8.50e-04 | (9922.37 ms | 52839 tok/s) step 7984/76294 | train loss 3.257107 | norm 0.1364 | lr 8.50e-04 | (9888.71 ms | 53019 tok/s) step 7985/76294 | train loss 3.200446 | norm 0.1359 | lr 8.50e-04 | (9900.09 ms | 52958 tok/s) step 7986/76294 | train loss 3.246031 | norm 0.1362 | lr 8.50e-04 | (9938.32 ms | 52754 tok/s) step 7987/76294 | train loss 3.200969 | norm 0.1314 | lr 8.50e-04 | (9895.18 ms | 
52984 tok/s) step 7988/76294 | train loss 3.323095 | norm 0.1350 | lr 8.49e-04 | (9905.50 ms | 52929 tok/s) step 7989/76294 | train loss 3.167138 | norm 0.1249 | lr 8.49e-04 | (9894.36 ms | 52989 tok/s) step 7990/76294 | train loss 3.192791 | norm 0.2005 | lr 8.49e-04 | (9900.19 ms | 52957 tok/s) step 7991/76294 | train loss 3.197909 | norm 0.1392 | lr 8.49e-04 | (9898.95 ms | 52964 tok/s) step 7992/76294 | train loss 3.158518 | norm 0.1488 | lr 8.49e-04 | (9899.11 ms | 52963 tok/s) step 7993/76294 | train loss 3.168349 | norm 0.1374 | lr 8.49e-04 | (9895.17 ms | 52984 tok/s) step 7994/76294 | train loss 3.284118 | norm 0.1573 | lr 8.49e-04 | (9949.68 ms | 52694 tok/s) step 7995/76294 | train loss 3.238366 | norm 0.1479 | lr 8.49e-04 | (9892.46 ms | 52999 tok/s) step 7996/76294 | train loss 3.292161 | norm 0.1491 | lr 8.49e-04 | (9901.31 ms | 52951 tok/s) step 7997/76294 | train loss 3.277142 | norm 0.1393 | lr 8.49e-04 | (9909.54 ms | 52907 tok/s) step 7998/76294 | train loss 3.300157 | norm 0.1460 | lr 8.49e-04 | (9891.60 ms | 53003 tok/s) step 7999/76294 | train loss 3.219647 | norm 0.1414 | lr 8.49e-04 | (9901.02 ms | 52953 tok/s) step 8000/76294 | train loss 3.252188 | norm 0.1291 | lr 8.48e-04 | (9894.10 ms | 52990 tok/s) val loss: 3.244706 saving model checkpoint to ./results/gpt2-350M-gqa/step_8000.pth step 8001/76294 | train loss 3.201024 | norm 0.1440 | lr 8.48e-04 | (9951.82 ms | 52683 tok/s) step 8002/76294 | train loss 3.304193 | norm 0.1236 | lr 8.48e-04 | (9872.26 ms | 53107 tok/s) step 8003/76294 | train loss 3.287669 | norm 0.1290 | lr 8.48e-04 | (9894.15 ms | 52990 tok/s) step 8004/76294 | train loss 3.271073 | norm 0.1211 | lr 8.48e-04 | (9881.75 ms | 53056 tok/s) step 8005/76294 | train loss 3.275006 | norm 0.1269 | lr 8.48e-04 | (10886.57 ms | 48159 tok/s) step 8006/76294 | train loss 3.245544 | norm 0.1402 | lr 8.48e-04 | (9885.84 ms | 53034 tok/s) step 8007/76294 | train loss 3.242517 | norm 0.1302 | lr 8.48e-04 | (9888.96 ms | 53018 tok/s) 
step 8008/76294 | train loss 3.217240 | norm 0.1454 | lr 8.48e-04 | (9900.43 ms | 52956 tok/s) step 8009/76294 | train loss 3.226919 | norm 0.1280 | lr 8.48e-04 | (9890.11 ms | 53011 tok/s) step 8010/76294 | train loss 3.282757 | norm 0.1318 | lr 8.48e-04 | (9884.54 ms | 53041 tok/s) step 8011/76294 | train loss 3.192938 | norm 0.1544 | lr 8.48e-04 | (9889.99 ms | 53012 tok/s) step 8012/76294 | train loss 3.330418 | norm 0.1371 | lr 8.47e-04 | (9961.83 ms | 52630 tok/s) step 8013/76294 | train loss 3.201733 | norm 0.1450 | lr 8.47e-04 | (9893.31 ms | 52994 tok/s) step 8014/76294 | train loss 3.254256 | norm 0.1190 | lr 8.47e-04 | (9905.37 ms | 52930 tok/s) step 8015/76294 | train loss 3.150075 | norm 0.1329 | lr 8.47e-04 | (9937.08 ms | 52761 tok/s) step 8016/76294 | train loss 3.255004 | norm 0.1358 | lr 8.47e-04 | (9888.63 ms | 53019 tok/s) step 8017/76294 | train loss 3.177095 | norm 0.1333 | lr 8.47e-04 | (9905.13 ms | 52931 tok/s) step 8018/76294 | train loss 3.207362 | norm 0.1362 | lr 8.47e-04 | (9893.20 ms | 52995 tok/s) step 8019/76294 | train loss 3.222757 | norm 0.1468 | lr 8.47e-04 | (9955.14 ms | 52665 tok/s) step 8020/76294 | train loss 3.263230 | norm 0.1359 | lr 8.47e-04 | (9897.73 ms | 52971 tok/s) step 8021/76294 | train loss 3.235264 | norm 0.1446 | lr 8.47e-04 | (9901.52 ms | 52950 tok/s) step 8022/76294 | train loss 3.215895 | norm 0.1282 | lr 8.47e-04 | (9887.92 ms | 53023 tok/s) step 8023/76294 | train loss 3.294370 | norm 0.1376 | lr 8.47e-04 | (9917.48 ms | 52865 tok/s) step 8024/76294 | train loss 3.237937 | norm 0.1356 | lr 8.46e-04 | (9893.78 ms | 52992 tok/s) step 8025/76294 | train loss 3.196891 | norm 0.1518 | lr 8.46e-04 | (9915.80 ms | 52874 tok/s) step 8026/76294 | train loss 3.241673 | norm 0.1252 | lr 8.46e-04 | (9892.80 ms | 52997 tok/s) step 8027/76294 | train loss 3.193542 | norm 0.1392 | lr 8.46e-04 | (9906.75 ms | 52922 tok/s) step 8028/76294 | train loss 3.296761 | norm 0.1366 | lr 8.46e-04 | (9891.38 ms | 53005 tok/s) step 
8029/76294 | train loss 3.169773 | norm 0.1296 | lr 8.46e-04 | (9901.72 ms | 52949 tok/s) step 8030/76294 | train loss 3.310144 | norm 0.1472 | lr 8.46e-04 | (9889.84 ms | 53013 tok/s) step 8031/76294 | train loss 3.208173 | norm 0.1276 | lr 8.46e-04 | (9916.20 ms | 52872 tok/s) step 8032/76294 | train loss 3.238118 | norm 0.1370 | lr 8.46e-04 | (9889.81 ms | 53013 tok/s) step 8033/76294 | train loss 3.246844 | norm 0.1181 | lr 8.46e-04 | (9909.91 ms | 52905 tok/s) step 8034/76294 | train loss 3.289535 | norm 0.1305 | lr 8.46e-04 | (9932.70 ms | 52784 tok/s) step 8035/76294 | train loss 3.238849 | norm 0.1286 | lr 8.46e-04 | (9892.22 ms | 53000 tok/s) step 8036/76294 | train loss 3.245007 | norm 0.1340 | lr 8.45e-04 | (9897.48 ms | 52972 tok/s) step 8037/76294 | train loss 3.194024 | norm 0.1315 | lr 8.45e-04 | (9895.88 ms | 52980 tok/s) step 8038/76294 | train loss 3.258249 | norm 0.1242 | lr 8.45e-04 | (9895.15 ms | 52984 tok/s) step 8039/76294 | train loss 3.171711 | norm 0.1312 | lr 8.45e-04 | (9891.96 ms | 53001 tok/s) step 8040/76294 | train loss 3.266956 | norm 0.1490 | lr 8.45e-04 | (9902.21 ms | 52947 tok/s) step 8041/76294 | train loss 3.257330 | norm 0.1372 | lr 8.45e-04 | (9959.83 ms | 52640 tok/s) step 8042/76294 | train loss 3.469468 | norm 0.1420 | lr 8.45e-04 | (9895.79 ms | 52981 tok/s) step 8043/76294 | train loss 3.236073 | norm 0.1351 | lr 8.45e-04 | (9890.94 ms | 53007 tok/s) step 8044/76294 | train loss 3.237448 | norm 0.1236 | lr 8.45e-04 | (9932.15 ms | 52787 tok/s) step 8045/76294 | train loss 3.199428 | norm 0.1214 | lr 8.45e-04 | (9929.37 ms | 52802 tok/s) step 8046/76294 | train loss 3.287282 | norm 0.1253 | lr 8.45e-04 | (9890.43 ms | 53010 tok/s) step 8047/76294 | train loss 3.208993 | norm 0.1263 | lr 8.44e-04 | (9899.95 ms | 52959 tok/s) step 8048/76294 | train loss 3.250870 | norm 0.1339 | lr 8.44e-04 | (9925.82 ms | 52821 tok/s) step 8049/76294 | train loss 3.198996 | norm 0.1299 | lr 8.44e-04 | (9891.38 ms | 53005 tok/s) step 
8050/76294 | train loss 3.229793 | norm 0.1356 | lr 8.44e-04 | (9890.75 ms | 53008 tok/s) step 8051/76294 | train loss 3.209406 | norm 0.1265 | lr 8.44e-04 | (9895.98 ms | 52980 tok/s) step 8052/76294 | train loss 3.238425 | norm 0.1300 | lr 8.44e-04 | (9894.87 ms | 52986 tok/s) step 8053/76294 | train loss 3.189931 | norm 0.1228 | lr 8.44e-04 | (9891.77 ms | 53002 tok/s) step 8054/76294 | train loss 3.285462 | norm 0.1259 | lr 8.44e-04 | (9892.83 ms | 52997 tok/s) step 8055/76294 | train loss 3.235979 | norm 0.1528 | lr 8.44e-04 | (9896.48 ms | 52977 tok/s) step 8056/76294 | train loss 3.272404 | norm 0.1260 | lr 8.44e-04 | (9938.24 ms | 52755 tok/s) step 8057/76294 | train loss 3.213310 | norm 0.1385 | lr 8.44e-04 | (9896.90 ms | 52975 tok/s) step 8058/76294 | train loss 3.231872 | norm 0.1390 | lr 8.44e-04 | (9958.18 ms | 52649 tok/s) step 8059/76294 | train loss 3.219345 | norm 0.1292 | lr 8.43e-04 | (9884.74 ms | 53040 tok/s) step 8060/76294 | train loss 3.251806 | norm 0.1510 | lr 8.43e-04 | (9887.31 ms | 53026 tok/s) step 8061/76294 | train loss 3.198793 | norm 0.1394 | lr 8.43e-04 | (9963.04 ms | 52623 tok/s) step 8062/76294 | train loss 3.337207 | norm 0.1264 | lr 8.43e-04 | (9891.00 ms | 53007 tok/s) step 8063/76294 | train loss 3.213802 | norm 0.1373 | lr 8.43e-04 | (9880.87 ms | 53061 tok/s) step 8064/76294 | train loss 3.265143 | norm 0.1259 | lr 8.43e-04 | (9879.46 ms | 53068 tok/s) step 8065/76294 | train loss 3.218826 | norm 0.1219 | lr 8.43e-04 | (9903.94 ms | 52937 tok/s) step 8066/76294 | train loss 3.392441 | norm 0.1392 | lr 8.43e-04 | (9925.64 ms | 52822 tok/s) step 8067/76294 | train loss 3.204541 | norm 0.1374 | lr 8.43e-04 | (9887.41 ms | 53026 tok/s) step 8068/76294 | train loss 3.272099 | norm 0.1493 | lr 8.43e-04 | (9895.60 ms | 52982 tok/s) step 8069/76294 | train loss 3.224933 | norm 0.1360 | lr 8.43e-04 | (9940.05 ms | 52745 tok/s) step 8070/76294 | train loss 3.248573 | norm 0.1333 | lr 8.43e-04 | (9882.31 ms | 53053 tok/s) step 
8071/76294 | train loss 3.156990 | norm 0.1314 | lr 8.42e-04 | (9963.93 ms | 52619 tok/s) step 8072/76294 | train loss 3.262648 | norm 0.1233 | lr 8.42e-04 | (9885.32 ms | 53037 tok/s) step 8073/76294 | train loss 3.193220 | norm 0.1335 | lr 8.42e-04 | (9886.98 ms | 53028 tok/s) step 8074/76294 | train loss 3.186033 | norm 0.1341 | lr 8.42e-04 | (9897.49 ms | 52972 tok/s) step 8075/76294 | train loss 3.297244 | norm 0.1357 | lr 8.42e-04 | (9939.28 ms | 52749 tok/s) step 8076/76294 | train loss 3.252805 | norm 0.1415 | lr 8.42e-04 | (9884.30 ms | 53042 tok/s) step 8077/76294 | train loss 3.241899 | norm 0.1294 | lr 8.42e-04 | (9897.34 ms | 52973 tok/s) step 8078/76294 | train loss 3.301782 | norm 0.1295 | lr 8.42e-04 | (9889.65 ms | 53014 tok/s) step 8079/76294 | train loss 3.245231 | norm 0.1139 | lr 8.42e-04 | (9890.12 ms | 53011 tok/s) step 8080/76294 | train loss 3.267549 | norm 0.1335 | lr 8.42e-04 | (9889.93 ms | 53012 tok/s) step 8081/76294 | train loss 3.256296 | norm 0.1363 | lr 8.42e-04 | (9959.19 ms | 52644 tok/s) step 8082/76294 | train loss 3.235269 | norm 0.1239 | lr 8.42e-04 | (9894.58 ms | 52987 tok/s) step 8083/76294 | train loss 3.142647 | norm 0.1625 | lr 8.41e-04 | (9888.93 ms | 53018 tok/s) step 8084/76294 | train loss 3.263456 | norm 0.1357 | lr 8.41e-04 | (9923.04 ms | 52835 tok/s) step 8085/76294 | train loss 3.260722 | norm 0.1305 | lr 8.41e-04 | (9926.48 ms | 52817 tok/s) step 8086/76294 | train loss 3.323230 | norm 0.1264 | lr 8.41e-04 | (9887.59 ms | 53025 tok/s) step 8087/76294 | train loss 3.248339 | norm 0.1362 | lr 8.41e-04 | (9896.31 ms | 52978 tok/s) step 8088/76294 | train loss 3.247792 | norm 0.1256 | lr 8.41e-04 | (9884.12 ms | 53043 tok/s) step 8089/76294 | train loss 3.212232 | norm 0.1227 | lr 8.41e-04 | (9894.62 ms | 52987 tok/s) step 8090/76294 | train loss 3.327859 | norm 0.1854 | lr 8.41e-04 | (9888.72 ms | 53019 tok/s) step 8091/76294 | train loss 3.271570 | norm 0.1739 | lr 8.41e-04 | (9897.73 ms | 52971 tok/s) step 
8092/76294 | train loss 3.243613 | norm 0.1629 | lr 8.41e-04 | (9882.87 ms | 53050 tok/s) step 8093/76294 | train loss 3.255169 | norm 0.1593 | lr 8.41e-04 | (9895.60 ms | 52982 tok/s) step 8094/76294 | train loss 3.230953 | norm 0.1538 | lr 8.41e-04 | (9890.23 ms | 53011 tok/s) step 8095/76294 | train loss 3.235269 | norm 0.1254 | lr 8.40e-04 | (9952.90 ms | 52677 tok/s) step 8096/76294 | train loss 3.222937 | norm 0.1533 | lr 8.40e-04 | (9883.58 ms | 53046 tok/s) step 8097/76294 | train loss 3.221686 | norm 0.1227 | lr 8.40e-04 | (9912.30 ms | 52893 tok/s) step 8098/76294 | train loss 3.246706 | norm 0.1369 | lr 8.40e-04 | (9918.87 ms | 52858 tok/s) step 8099/76294 | train loss 3.242471 | norm 0.1294 | lr 8.40e-04 | (9896.75 ms | 52976 tok/s) step 8100/76294 | train loss 3.216326 | norm 0.1228 | lr 8.40e-04 | (9891.38 ms | 53005 tok/s) step 8101/76294 | train loss 3.232357 | norm 0.1313 | lr 8.40e-04 | (9891.08 ms | 53006 tok/s) step 8102/76294 | train loss 3.258843 | norm 0.1465 | lr 8.40e-04 | (9889.55 ms | 53014 tok/s) step 8103/76294 | train loss 3.229210 | norm 0.1198 | lr 8.40e-04 | (10932.36 ms | 47957 tok/s) step 8104/76294 | train loss 3.218608 | norm 0.1455 | lr 8.40e-04 | (9878.42 ms | 53074 tok/s) step 8105/76294 | train loss 3.302621 | norm 0.1167 | lr 8.40e-04 | (9897.23 ms | 52973 tok/s) step 8106/76294 | train loss 3.158369 | norm 0.1385 | lr 8.39e-04 | (9952.65 ms | 52678 tok/s) step 8107/76294 | train loss 3.261531 | norm 0.1311 | lr 8.39e-04 | (9891.79 ms | 53002 tok/s) step 8108/76294 | train loss 3.180728 | norm 0.1126 | lr 8.39e-04 | (9898.81 ms | 52965 tok/s) step 8109/76294 | train loss 3.201626 | norm 0.1337 | lr 8.39e-04 | (9925.48 ms | 52822 tok/s) step 8110/76294 | train loss 3.180636 | norm 0.1163 | lr 8.39e-04 | (9920.42 ms | 52849 tok/s) step 8111/76294 | train loss 3.253261 | norm 0.1274 | lr 8.39e-04 | (9967.73 ms | 52599 tok/s) step 8112/76294 | train loss 3.236161 | norm 0.1238 | lr 8.39e-04 | (9896.68 ms | 52976 tok/s) step 
8113/76294 | train loss 3.226376 | norm 0.1220 | lr 8.39e-04 | (9962.68 ms | 52625 tok/s) step 8114/76294 | train loss 3.234229 | norm 0.1220 | lr 8.39e-04 | (9894.56 ms | 52987 tok/s) step 8115/76294 | train loss 3.256244 | norm 0.1231 | lr 8.39e-04 | (9897.24 ms | 52973 tok/s) step 8116/76294 | train loss 3.246622 | norm 0.1645 | lr 8.39e-04 | (9937.95 ms | 52756 tok/s) step 8117/76294 | train loss 3.265499 | norm 0.1257 | lr 8.39e-04 | (10334.23 ms | 50733 tok/s) step 8118/76294 | train loss 3.268609 | norm 0.2104 | lr 8.38e-04 | (9966.95 ms | 52603 tok/s) step 8119/76294 | train loss 3.294648 | norm 0.1288 | lr 8.38e-04 | (9963.74 ms | 52620 tok/s) step 8120/76294 | train loss 3.287890 | norm 0.1501 | lr 8.38e-04 | (9907.44 ms | 52919 tok/s) step 8121/76294 | train loss 3.311902 | norm 0.1308 | lr 8.38e-04 | (9898.84 ms | 52965 tok/s) step 8122/76294 | train loss 3.375902 | norm 0.1506 | lr 8.38e-04 | (9907.97 ms | 52916 tok/s) step 8123/76294 | train loss 3.254043 | norm 0.1313 | lr 8.38e-04 | (9895.12 ms | 52984 tok/s) step 8124/76294 | train loss 3.254992 | norm 0.1287 | lr 8.38e-04 | (9891.65 ms | 53003 tok/s) step 8125/76294 | train loss 3.288318 | norm 0.1251 | lr 8.38e-04 | (9896.81 ms | 52975 tok/s) step 8126/76294 | train loss 3.264384 | norm 0.1367 | lr 8.38e-04 | (9890.47 ms | 53009 tok/s) step 8127/76294 | train loss 3.249076 | norm 0.1161 | lr 8.38e-04 | (9897.66 ms | 52971 tok/s) step 8128/76294 | train loss 3.308035 | norm 0.1521 | lr 8.38e-04 | (9895.13 ms | 52984 tok/s) step 8129/76294 | train loss 3.298635 | norm 0.1201 | lr 8.38e-04 | (9956.27 ms | 52659 tok/s) step 8130/76294 | train loss 3.255413 | norm 0.1359 | lr 8.37e-04 | (9926.42 ms | 52817 tok/s) step 8131/76294 | train loss 3.385565 | norm 0.1305 | lr 8.37e-04 | (9906.77 ms | 52922 tok/s) step 8132/76294 | train loss 3.264588 | norm 0.1343 | lr 8.37e-04 | (9928.17 ms | 52808 tok/s) step 8133/76294 | train loss 3.239223 | norm 0.1393 | lr 8.37e-04 | (9894.48 ms | 52988 tok/s) step 
8134/76294 | train loss 3.281093 | norm 0.1412 | lr 8.37e-04 | (9925.83 ms | 52821 tok/s) step 8135/76294 | train loss 3.312426 | norm 0.1337 | lr 8.37e-04 | (9901.37 ms | 52951 tok/s) step 8136/76294 | train loss 3.247427 | norm 0.1248 | lr 8.37e-04 | (9899.15 ms | 52963 tok/s) step 8137/76294 | train loss 3.225631 | norm 0.1402 | lr 8.37e-04 | (9892.89 ms | 52996 tok/s) step 8138/76294 | train loss 3.278309 | norm 0.1333 | lr 8.37e-04 | (9890.22 ms | 53011 tok/s) step 8139/76294 | train loss 3.334605 | norm 0.1824 | lr 8.37e-04 | (9942.65 ms | 52731 tok/s) step 8140/76294 | train loss 3.248746 | norm 0.1653 | lr 8.37e-04 | (9915.60 ms | 52875 tok/s) step 8141/76294 | train loss 3.272581 | norm 0.1497 | lr 8.37e-04 | (9902.60 ms | 52944 tok/s) step 8142/76294 | train loss 3.250217 | norm 0.1480 | lr 8.36e-04 | (9915.95 ms | 52873 tok/s) step 8143/76294 | train loss 3.235265 | norm 0.1643 | lr 8.36e-04 | (9908.96 ms | 52911 tok/s) step 8144/76294 | train loss 3.261922 | norm 0.1280 | lr 8.36e-04 | (9932.11 ms | 52787 tok/s) step 8145/76294 | train loss 3.235455 | norm 0.1525 | lr 8.36e-04 | (9898.28 ms | 52968 tok/s) step 8146/76294 | train loss 3.310596 | norm 0.1323 | lr 8.36e-04 | (9899.90 ms | 52959 tok/s) step 8147/76294 | train loss 3.261180 | norm 0.1462 | lr 8.36e-04 | (9897.37 ms | 52972 tok/s) step 8148/76294 | train loss 3.254220 | norm 0.1432 | lr 8.36e-04 | (9899.14 ms | 52963 tok/s) step 8149/76294 | train loss 3.286904 | norm 0.1412 | lr 8.36e-04 | (9897.82 ms | 52970 tok/s) step 8150/76294 | train loss 3.327708 | norm 0.1387 | lr 8.36e-04 | (9898.88 ms | 52964 tok/s) step 8151/76294 | train loss 3.242655 | norm 0.1263 | lr 8.36e-04 | (9892.93 ms | 52996 tok/s) step 8152/76294 | train loss 3.253601 | norm 0.1518 | lr 8.36e-04 | (9898.52 ms | 52966 tok/s) step 8153/76294 | train loss 3.236888 | norm 0.1456 | lr 8.35e-04 | (9901.67 ms | 52949 tok/s) step 8154/76294 | train loss 3.316652 | norm 0.1460 | lr 8.35e-04 | (9936.94 ms | 52762 tok/s) step 
8155/76294 | train loss 3.315062 | norm 0.1366 | lr 8.35e-04 | (9891.56 ms | 53004 tok/s) step 8156/76294 | train loss 3.301524 | norm 0.1354 | lr 8.35e-04 | (9895.47 ms | 52983 tok/s) step 8157/76294 | train loss 3.291559 | norm 0.1315 | lr 8.35e-04 | (9935.48 ms | 52769 tok/s) step 8158/76294 | train loss 3.226900 | norm 0.1281 | lr 8.35e-04 | (9887.30 ms | 53026 tok/s) step 8159/76294 | train loss 3.235747 | norm 0.1458 | lr 8.35e-04 | (9896.52 ms | 52977 tok/s) step 8160/76294 | train loss 3.279492 | norm 0.1412 | lr 8.35e-04 | (9888.15 ms | 53022 tok/s) step 8161/76294 | train loss 3.268059 | norm 0.1334 | lr 8.35e-04 | (9916.45 ms | 52871 tok/s) step 8162/76294 | train loss 3.265863 | norm 0.1530 | lr 8.35e-04 | (9884.85 ms | 53040 tok/s) step 8163/76294 | train loss 3.199729 | norm 0.1260 | lr 8.35e-04 | (9896.99 ms | 52974 tok/s) step 8164/76294 | train loss 3.298000 | norm 0.1535 | lr 8.35e-04 | (9930.55 ms | 52795 tok/s) step 8165/76294 | train loss 3.275418 | norm 0.1155 | lr 8.34e-04 | (9892.22 ms | 53000 tok/s) step 8166/76294 | train loss 3.202737 | norm 0.1361 | lr 8.34e-04 | (9895.89 ms | 52980 tok/s) step 8167/76294 | train loss 3.255047 | norm 0.1388 | lr 8.34e-04 | (9889.57 ms | 53014 tok/s) step 8168/76294 | train loss 3.289593 | norm 0.1287 | lr 8.34e-04 | (9969.40 ms | 52590 tok/s) step 8169/76294 | train loss 3.317788 | norm 0.1338 | lr 8.34e-04 | (9891.03 ms | 53006 tok/s) step 8170/76294 | train loss 3.269661 | norm 0.1298 | lr 8.34e-04 | (9894.56 ms | 52987 tok/s) step 8171/76294 | train loss 3.290925 | norm 0.1224 | lr 8.34e-04 | (9956.06 ms | 52660 tok/s) step 8172/76294 | train loss 3.288252 | norm 0.1359 | lr 8.34e-04 | (9893.93 ms | 52991 tok/s) step 8173/76294 | train loss 3.265079 | norm 0.1152 | lr 8.34e-04 | (9910.21 ms | 52904 tok/s) step 8174/76294 | train loss 3.312864 | norm 0.1252 | lr 8.34e-04 | (9926.55 ms | 52817 tok/s) step 8175/76294 | train loss 3.257796 | norm 0.1187 | lr 8.34e-04 | (9895.63 ms | 52982 tok/s) step 
8176/76294 | train loss 3.268749 | norm 0.1315 | lr 8.34e-04 | (9903.26 ms | 52941 tok/s) step 8177/76294 | train loss 3.266161 | norm 0.1298 | lr 8.33e-04 | (9901.49 ms | 52950 tok/s) step 8178/76294 | train loss 3.263973 | norm 0.1231 | lr 8.33e-04 | (9889.12 ms | 53017 tok/s) step 8179/76294 | train loss 3.252236 | norm 0.1347 | lr 8.33e-04 | (9929.42 ms | 52801 tok/s) step 8180/76294 | train loss 3.269259 | norm 0.1233 | lr 8.33e-04 | (9897.82 ms | 52970 tok/s) step 8181/76294 | train loss 3.260825 | norm 0.1381 | lr 8.33e-04 | (9958.45 ms | 52648 tok/s) step 8182/76294 | train loss 3.293098 | norm 0.1168 | lr 8.33e-04 | (9892.38 ms | 52999 tok/s) step 8183/76294 | train loss 3.256865 | norm 0.1507 | lr 8.33e-04 | (9903.76 ms | 52938 tok/s) step 8184/76294 | train loss 3.226441 | norm 0.1452 | lr 8.33e-04 | (9924.40 ms | 52828 tok/s) step 8185/76294 | train loss 3.264521 | norm 0.1331 | lr 8.33e-04 | (9891.32 ms | 53005 tok/s) step 8186/76294 | train loss 3.337121 | norm 0.1445 | lr 8.33e-04 | (9910.36 ms | 52903 tok/s) step 8187/76294 | train loss 3.232709 | norm 0.1357 | lr 8.33e-04 | (9913.55 ms | 52886 tok/s) step 8188/76294 | train loss 3.247039 | norm 0.1312 | lr 8.33e-04 | (9894.14 ms | 52990 tok/s) step 8189/76294 | train loss 3.312013 | norm 0.1229 | lr 8.32e-04 | (9889.67 ms | 53014 tok/s) step 8190/76294 | train loss 3.238560 | norm 0.1528 | lr 8.32e-04 | (9892.66 ms | 52998 tok/s) step 8191/76294 | train loss 3.306594 | norm 0.1264 | lr 8.32e-04 | (9895.70 ms | 52981 tok/s) step 8192/76294 | train loss 3.279794 | norm 0.1564 | lr 8.32e-04 | (9890.99 ms | 53007 tok/s) step 8193/76294 | train loss 3.262179 | norm 0.1433 | lr 8.32e-04 | (9934.36 ms | 52775 tok/s) step 8194/76294 | train loss 3.269141 | norm 0.1411 | lr 8.32e-04 | (9888.75 ms | 53019 tok/s) step 8195/76294 | train loss 3.288260 | norm 0.1292 | lr 8.32e-04 | (9896.30 ms | 52978 tok/s) step 8196/76294 | train loss 3.213042 | norm 0.1332 | lr 8.32e-04 | (9886.29 ms | 53032 tok/s) step 
8197/76294 | train loss 3.279415 | norm 0.1163 | lr 8.32e-04 | (9894.90 ms | 52986 tok/s) step 8198/76294 | train loss 3.313427 | norm 0.1392 | lr 8.32e-04 | (9885.19 ms | 53038 tok/s) step 8199/76294 | train loss 3.270347 | norm 0.1230 | lr 8.32e-04 | (9908.99 ms | 52910 tok/s) step 8200/76294 | train loss 3.272626 | norm 0.1251 | lr 8.31e-04 | (11183.51 ms | 46880 tok/s) step 8201/76294 | train loss 3.181902 | norm 0.1396 | lr 8.31e-04 | (9878.19 ms | 53075 tok/s) step 8202/76294 | train loss 3.331238 | norm 0.1322 | lr 8.31e-04 | (9885.29 ms | 53037 tok/s) step 8203/76294 | train loss 3.333252 | norm 0.1444 | lr 8.31e-04 | (9884.99 ms | 53039 tok/s) step 8204/76294 | train loss 3.202327 | norm 0.1203 | lr 8.31e-04 | (9887.45 ms | 53026 tok/s) step 8205/76294 | train loss 3.215292 | norm 0.1359 | lr 8.31e-04 | (9885.02 ms | 53039 tok/s) step 8206/76294 | train loss 3.274406 | norm 0.1096 | lr 8.31e-04 | (9909.31 ms | 52909 tok/s) step 8207/76294 | train loss 3.225214 | norm 0.1279 | lr 8.31e-04 | (9888.54 ms | 53020 tok/s) step 8208/76294 | train loss 3.299774 | norm 0.1283 | lr 8.31e-04 | (9910.73 ms | 52901 tok/s) step 8209/76294 | train loss 3.278384 | norm 0.1185 | lr 8.31e-04 | (9985.13 ms | 52507 tok/s) step 8210/76294 | train loss 3.219925 | norm 0.1291 | lr 8.31e-04 | (9884.66 ms | 53041 tok/s) step 8211/76294 | train loss 3.306756 | norm 0.1195 | lr 8.31e-04 | (9925.09 ms | 52825 tok/s) step 8212/76294 | train loss 3.240570 | norm 0.1291 | lr 8.30e-04 | (9885.39 ms | 53037 tok/s) step 8213/76294 | train loss 3.234435 | norm 0.1403 | lr 8.30e-04 | (9895.42 ms | 52983 tok/s) step 8214/76294 | train loss 3.251649 | norm 0.1832 | lr 8.30e-04 | (9887.13 ms | 53027 tok/s) step 8215/76294 | train loss 3.302721 | norm 0.1888 | lr 8.30e-04 | (9892.72 ms | 52997 tok/s) step 8216/76294 | train loss 3.269488 | norm 0.2039 | lr 8.30e-04 | (9928.43 ms | 52807 tok/s) step 8217/76294 | train loss 3.267490 | norm 0.1574 | lr 8.30e-04 | (9892.15 ms | 53000 tok/s) step 
8218/76294 | train loss 3.245098 | norm 0.1291 | lr 8.30e-04 | (9900.25 ms | 52957 tok/s) step 8219/76294 | train loss 3.287729 | norm 0.1456 | lr 8.30e-04 | (9954.29 ms | 52670 tok/s) step 8220/76294 | train loss 3.302708 | norm 0.1369 | lr 8.30e-04 | (9887.70 ms | 53024 tok/s) step 8221/76294 | train loss 3.319820 | norm 0.1533 | lr 8.30e-04 | (9954.71 ms | 52667 tok/s) step 8222/76294 | train loss 3.287861 | norm 0.1290 | lr 8.30e-04 | (9884.03 ms | 53044 tok/s) step 8223/76294 | train loss 3.282814 | norm 0.1337 | lr 8.30e-04 | (9885.46 ms | 53036 tok/s) step 8224/76294 | train loss 3.222769 | norm 0.1393 | lr 8.29e-04 | (9938.23 ms | 52755 tok/s) step 8225/76294 | train loss 3.277610 | norm 0.1182 | lr 8.29e-04 | (9888.46 ms | 53020 tok/s) step 8226/76294 | train loss 3.245804 | norm 0.1349 | lr 8.29e-04 | (10148.28 ms | 51663 tok/s) step 8227/76294 | train loss 3.184924 | norm 0.1174 | lr 8.29e-04 | (11324.78 ms | 46296 tok/s) step 8228/76294 | train loss 3.308414 | norm 0.1366 | lr 8.29e-04 | (9873.56 ms | 53100 tok/s) step 8229/76294 | train loss 3.226537 | norm 0.1312 | lr 8.29e-04 | (9944.58 ms | 52721 tok/s) step 8230/76294 | train loss 3.307644 | norm 0.1325 | lr 8.29e-04 | (9876.97 ms | 53082 tok/s) step 8231/76294 | train loss 3.270768 | norm 0.1369 | lr 8.29e-04 | (9891.57 ms | 53004 tok/s) step 8232/76294 | train loss 3.244181 | norm 0.1437 | lr 8.29e-04 | (9885.05 ms | 53039 tok/s) step 8233/76294 | train loss 3.275941 | norm 0.1405 | lr 8.29e-04 | (9914.84 ms | 52879 tok/s) step 8234/76294 | train loss 3.191447 | norm 0.1341 | lr 8.29e-04 | (11059.82 ms | 47405 tok/s) step 8235/76294 | train loss 3.345815 | norm 0.1661 | lr 8.28e-04 | (9907.89 ms | 52916 tok/s) step 8236/76294 | train loss 3.263192 | norm 0.1338 | lr 8.28e-04 | (9894.98 ms | 52985 tok/s) step 8237/76294 | train loss 3.264925 | norm 0.1581 | lr 8.28e-04 | (9886.86 ms | 53029 tok/s) step 8238/76294 | train loss 3.218830 | norm 0.1501 | lr 8.28e-04 | (9887.23 ms | 53027 tok/s) step 
8239/76294 | train loss 3.275828 | norm 0.1352 | lr 8.28e-04 | (11609.74 ms | 45159 tok/s) step 8240/76294 | train loss 3.289428 | norm 0.1548 | lr 8.28e-04 | (9876.31 ms | 53085 tok/s) step 8241/76294 | train loss 3.220387 | norm 0.1400 | lr 8.28e-04 | (9887.56 ms | 53025 tok/s) step 8242/76294 | train loss 3.199011 | norm 0.1499 | lr 8.28e-04 | (9889.23 ms | 53016 tok/s) step 8243/76294 | train loss 3.302029 | norm 0.1582 | lr 8.28e-04 | (9897.63 ms | 52971 tok/s) step 8244/76294 | train loss 3.290673 | norm 0.1318 | lr 8.28e-04 | (9917.75 ms | 52864 tok/s) step 8245/76294 | train loss 3.332147 | norm 0.1388 | lr 8.28e-04 | (9923.36 ms | 52834 tok/s) step 8246/76294 | train loss 3.245773 | norm 0.1397 | lr 8.28e-04 | (9890.83 ms | 53007 tok/s) step 8247/76294 | train loss 3.227936 | norm 0.1229 | lr 8.27e-04 | (9916.53 ms | 52870 tok/s) step 8248/76294 | train loss 3.211080 | norm 0.1313 | lr 8.27e-04 | (9892.69 ms | 52998 tok/s) step 8249/76294 | train loss 3.270453 | norm 0.1186 | lr 8.27e-04 | (9895.94 ms | 52980 tok/s) step 8250/76294 | train loss 3.379176 | norm 0.1421 | lr 8.27e-04 | (9893.31 ms | 52994 tok/s) val loss: 3.241382 saving model checkpoint to ./results/gpt2-350M-gqa/step_8250.pth step 8251/76294 | train loss 3.220940 | norm 0.1183 | lr 8.27e-04 | (9942.05 ms | 52734 tok/s) step 8252/76294 | train loss 3.270386 | norm 0.1211 | lr 8.27e-04 | (9870.78 ms | 53115 tok/s) step 8253/76294 | train loss 3.268525 | norm 0.1223 | lr 8.27e-04 | (9920.36 ms | 52850 tok/s) step 8254/76294 | train loss 3.290075 | norm 0.1323 | lr 8.27e-04 | (9881.49 ms | 53058 tok/s) step 8255/76294 | train loss 3.262379 | norm 0.1171 | lr 8.27e-04 | (9893.50 ms | 52993 tok/s) step 8256/76294 | train loss 3.239268 | norm 0.1206 | lr 8.27e-04 | (9889.85 ms | 53013 tok/s) step 8257/76294 | train loss 3.294769 | norm 0.1219 | lr 8.27e-04 | (9900.72 ms | 52955 tok/s) step 8258/76294 | train loss 3.253648 | norm 0.1443 | lr 8.27e-04 | (9901.36 ms | 52951 tok/s) step 8259/76294 | 
train loss 3.349044 | norm 0.1311 | lr 8.26e-04 | (9903.79 ms | 52938 tok/s) step 8260/76294 | train loss 3.459213 | norm 0.1616 | lr 8.26e-04 | (9898.30 ms | 52967 tok/s) step 8261/76294 | train loss 3.221624 | norm 0.1282 | lr 8.26e-04 | (9906.65 ms | 52923 tok/s) step 8262/76294 | train loss 3.233130 | norm 0.1330 | lr 8.26e-04 | (9894.23 ms | 52989 tok/s) step 8263/76294 | train loss 3.255448 | norm 0.1147 | lr 8.26e-04 | (9926.12 ms | 52819 tok/s) step 8264/76294 | train loss 3.285998 | norm 0.1367 | lr 8.26e-04 | (9894.76 ms | 52986 tok/s) step 8265/76294 | train loss 3.285149 | norm 0.1091 | lr 8.26e-04 | (9899.22 ms | 52963 tok/s) step 8266/76294 | train loss 3.296405 | norm 0.1123 | lr 8.26e-04 | (9916.46 ms | 52870 tok/s) step 8267/76294 | train loss 3.292726 | norm 0.1311 | lr 8.26e-04 | (9901.96 ms | 52948 tok/s) step 8268/76294 | train loss 3.285408 | norm 0.1353 | lr 8.26e-04 | (9951.52 ms | 52684 tok/s) step 8269/76294 | train loss 3.272511 | norm 0.1370 | lr 8.26e-04 | (9901.31 ms | 52951 tok/s) step 8270/76294 | train loss 3.266428 | norm 0.1325 | lr 8.25e-04 | (9917.01 ms | 52868 tok/s) step 8271/76294 | train loss 3.279063 | norm 0.1382 | lr 8.25e-04 | (9893.42 ms | 52994 tok/s) step 8272/76294 | train loss 3.269128 | norm 0.1280 | lr 8.25e-04 | (9902.07 ms | 52947 tok/s) step 8273/76294 | train loss 3.293886 | norm 0.1260 | lr 8.25e-04 | (9898.06 ms | 52969 tok/s) step 8274/76294 | train loss 3.237406 | norm 0.1364 | lr 8.25e-04 | (9935.91 ms | 52767 tok/s) step 8275/76294 | train loss 3.203147 | norm 0.1351 | lr 8.25e-04 | (9898.47 ms | 52967 tok/s) step 8276/76294 | train loss 3.368644 | norm 0.1753 | lr 8.25e-04 | (9895.13 ms | 52984 tok/s) step 8277/76294 | train loss 3.295722 | norm 0.1538 | lr 8.25e-04 | (9929.74 ms | 52800 tok/s) step 8278/76294 | train loss 3.220216 | norm 0.1668 | lr 8.25e-04 | (9908.63 ms | 52912 tok/s) step 8279/76294 | train loss 3.268668 | norm 0.1501 | lr 8.25e-04 | (9962.02 ms | 52629 tok/s) step 8280/76294 | 
train loss 3.368709 | norm 0.1643 | lr 8.25e-04 | (9896.02 ms | 52980 tok/s) step 8281/76294 | train loss 3.275102 | norm 0.1631 | lr 8.25e-04 | (9940.45 ms | 52743 tok/s) step 8282/76294 | train loss 3.298685 | norm 0.1567 | lr 8.24e-04 | (9895.90 ms | 52980 tok/s) step 8283/76294 | train loss 3.275562 | norm 0.1584 | lr 8.24e-04 | (9924.82 ms | 52826 tok/s) step 8284/76294 | train loss 3.197375 | norm 0.1587 | lr 8.24e-04 | (10515.55 ms | 49858 tok/s) step 8285/76294 | train loss 3.253136 | norm 0.1339 | lr 8.24e-04 | (9881.73 ms | 53056 tok/s) step 8286/76294 | train loss 3.272066 | norm 0.1464 | lr 8.24e-04 | (9920.80 ms | 52847 tok/s) step 8287/76294 | train loss 3.281732 | norm 0.1319 | lr 8.24e-04 | (9957.26 ms | 52654 tok/s) step 8288/76294 | train loss 3.249361 | norm 0.1268 | lr 8.24e-04 | (9895.59 ms | 52982 tok/s) step 8289/76294 | train loss 3.247517 | norm 0.1306 | lr 8.24e-04 | (9890.72 ms | 53008 tok/s) step 8290/76294 | train loss 3.339794 | norm 0.1232 | lr 8.24e-04 | (9927.00 ms | 52814 tok/s) step 8291/76294 | train loss 3.309258 | norm 0.1363 | lr 8.24e-04 | (9887.44 ms | 53026 tok/s) step 8292/76294 | train loss 3.234114 | norm 0.1263 | lr 8.24e-04 | (9893.79 ms | 52992 tok/s) step 8293/76294 | train loss 3.263276 | norm 0.1279 | lr 8.24e-04 | (9883.12 ms | 53049 tok/s) step 8294/76294 | train loss 3.240674 | norm 0.1464 | lr 8.23e-04 | (9898.62 ms | 52966 tok/s) step 8295/76294 | train loss 3.296068 | norm 0.1529 | lr 8.23e-04 | (9888.47 ms | 53020 tok/s) step 8296/76294 | train loss 3.303694 | norm 0.1236 | lr 8.23e-04 | (9892.60 ms | 52998 tok/s) step 8297/76294 | train loss 3.257939 | norm 0.1344 | lr 8.23e-04 | (9886.18 ms | 53032 tok/s) step 8298/76294 | train loss 3.221998 | norm 0.1379 | lr 8.23e-04 | (11228.98 ms | 46691 tok/s) step 8299/76294 | train loss 3.270944 | norm 0.1278 | lr 8.23e-04 | (9877.27 ms | 53080 tok/s) step 8300/76294 | train loss 3.196304 | norm 0.1277 | lr 8.23e-04 | (11959.95 ms | 43837 tok/s) step 8301/76294 | 
train loss 3.360898 | norm 0.1315 | lr 8.23e-04 | (9910.60 ms | 52902 tok/s) step 8302/76294 | train loss 3.232973 | norm 0.1408 | lr 8.23e-04 | (9874.21 ms | 53097 tok/s) step 8303/76294 | train loss 3.307613 | norm 0.1241 | lr 8.23e-04 | (9879.65 ms | 53067 tok/s) step 8304/76294 | train loss 3.262861 | norm 0.1561 | lr 8.23e-04 | (9924.02 ms | 52830 tok/s) step 8305/76294 | train loss 3.346110 | norm 0.1297 | lr 8.22e-04 | (9878.78 ms | 53072 tok/s) step 8306/76294 | train loss 3.334882 | norm 0.1418 | lr 8.22e-04 | (9888.39 ms | 53021 tok/s) step 8307/76294 | train loss 3.275021 | norm 0.1407 | lr 8.22e-04 | (9881.96 ms | 53055 tok/s) step 8308/76294 | train loss 3.282747 | norm 0.1457 | lr 8.22e-04 | (10653.34 ms | 49213 tok/s) step 8309/76294 | train loss 3.329934 | norm 0.1335 | lr 8.22e-04 | (9876.51 ms | 53084 tok/s) step 8310/76294 | train loss 3.275543 | norm 0.1500 | lr 8.22e-04 | (9882.17 ms | 53054 tok/s) step 8311/76294 | train loss 3.237459 | norm 0.1337 | lr 8.22e-04 | (9880.15 ms | 53065 tok/s) step 8312/76294 | train loss 3.212632 | norm 0.1285 | lr 8.22e-04 | (9898.11 ms | 52968 tok/s) step 8313/76294 | train loss 3.233629 | norm 0.1409 | lr 8.22e-04 | (9884.94 ms | 53039 tok/s) step 8314/76294 | train loss 3.265476 | norm 0.1233 | lr 8.22e-04 | (9889.06 ms | 53017 tok/s) step 8315/76294 | train loss 3.265374 | norm 0.1189 | lr 8.22e-04 | (9887.97 ms | 53023 tok/s) step 8316/76294 | train loss 3.230990 | norm 0.1244 | lr 8.22e-04 | (9891.67 ms | 53003 tok/s) step 8317/76294 | train loss 3.265456 | norm 0.1227 | lr 8.21e-04 | (9887.82 ms | 53024 tok/s) step 8318/76294 | train loss 3.238724 | norm 0.1168 | lr 8.21e-04 | (9925.40 ms | 52823 tok/s) step 8319/76294 | train loss 3.250926 | norm 0.1317 | lr 8.21e-04 | (9958.90 ms | 52645 tok/s) step 8320/76294 | train loss 3.309045 | norm 0.1195 | lr 8.21e-04 | (9886.86 ms | 53029 tok/s) step 8321/76294 | train loss 3.213670 | norm 0.1117 | lr 8.21e-04 | (9899.70 ms | 52960 tok/s) step 8322/76294 | 
train loss 3.272852 | norm 0.1258 | lr 8.21e-04 | (9884.72 ms | 53040 tok/s) step 8323/76294 | train loss 3.216439 | norm 0.1124 | lr 8.21e-04 | (9928.62 ms | 52806 tok/s) step 8324/76294 | train loss 3.198357 | norm 0.1242 | lr 8.21e-04 | (9887.26 ms | 53027 tok/s) step 8325/76294 | train loss 3.259631 | norm 0.1287 | lr 8.21e-04 | (9894.76 ms | 52986 tok/s) step 8326/76294 | train loss 3.194854 | norm 0.1201 | lr 8.21e-04 | (9885.82 ms | 53034 tok/s) step 8327/76294 | train loss 3.234232 | norm 0.1331 | lr 8.21e-04 | (9962.09 ms | 52628 tok/s) step 8328/76294 | train loss 3.301019 | norm 0.1129 | lr 8.21e-04 | (9894.92 ms | 52986 tok/s) step 8329/76294 | train loss 3.264452 | norm 0.1220 | lr 8.20e-04 | (9902.12 ms | 52947 tok/s) step 8330/76294 | train loss 3.252642 | norm 0.1204 | lr 8.20e-04 | (9892.54 ms | 52998 tok/s) step 8331/76294 | train loss 3.262937 | norm 0.1582 | lr 8.20e-04 | (9938.99 ms | 52751 tok/s) step 8332/76294 | train loss 3.268450 | norm 0.1375 | lr 8.20e-04 | (9886.06 ms | 53033 tok/s) step 8333/76294 | train loss 3.231576 | norm 0.1597 | lr 8.20e-04 | (9938.57 ms | 52753 tok/s) step 8334/76294 | train loss 3.209187 | norm 0.1510 | lr 8.20e-04 | (9891.64 ms | 53003 tok/s) step 8335/76294 | train loss 3.251131 | norm 0.1695 | lr 8.20e-04 | (9896.98 ms | 52975 tok/s) step 8336/76294 | train loss 3.287472 | norm 0.1499 | lr 8.20e-04 | (9937.37 ms | 52759 tok/s) step 8337/76294 | train loss 3.239273 | norm 0.1595 | lr 8.20e-04 | (9932.53 ms | 52785 tok/s) step 8338/76294 | train loss 3.253083 | norm 0.1603 | lr 8.20e-04 | (9945.56 ms | 52716 tok/s) step 8339/76294 | train loss 3.237609 | norm 0.1533 | lr 8.20e-04 | (9900.52 ms | 52956 tok/s) step 8340/76294 | train loss 3.285545 | norm 0.1434 | lr 8.19e-04 | (9889.56 ms | 53014 tok/s) step 8341/76294 | train loss 3.259156 | norm 0.1638 | lr 8.19e-04 | (9912.20 ms | 52893 tok/s) step 8342/76294 | train loss 3.245440 | norm 0.1290 | lr 8.19e-04 | (9940.53 ms | 52742 tok/s) step 8343/76294 | 
train loss 3.288673 | norm 0.1440 | lr 8.19e-04 | (9894.24 ms | 52989 tok/s) step 8344/76294 | train loss 3.281727 | norm 0.1328 | lr 8.19e-04 | (9898.02 ms | 52969 tok/s) step 8345/76294 | train loss 3.220736 | norm 0.1443 | lr 8.19e-04 | (9893.86 ms | 52991 tok/s) step 8346/76294 | train loss 3.229139 | norm 0.1304 | lr 8.19e-04 | (9894.74 ms | 52987 tok/s) step 8347/76294 | train loss 3.256782 | norm 0.1254 | lr 8.19e-04 | (9893.11 ms | 52995 tok/s) step 8348/76294 | train loss 3.187678 | norm 0.1258 | lr 8.19e-04 | (9897.91 ms | 52970 tok/s) step 8349/76294 | train loss 3.221455 | norm 0.1260 | lr 8.19e-04 | (9928.43 ms | 52807 tok/s) step 8350/76294 | train loss 3.308229 | norm 0.1311 | lr 8.19e-04 | (9894.44 ms | 52988 tok/s) step 8351/76294 | train loss 3.247962 | norm 0.1226 | lr 8.19e-04 | (9896.49 ms | 52977 tok/s) step 8352/76294 | train loss 3.250451 | norm 0.1177 | lr 8.18e-04 | (9896.36 ms | 52978 tok/s) step 8353/76294 | train loss 3.215389 | norm 0.1202 | lr 8.18e-04 | (9894.44 ms | 52988 tok/s) step 8354/76294 | train loss 3.256060 | norm 0.1409 | lr 8.18e-04 | (9894.33 ms | 52989 tok/s) step 8355/76294 | train loss 3.279632 | norm 0.1267 | lr 8.18e-04 | (9949.16 ms | 52697 tok/s) step 8356/76294 | train loss 3.219602 | norm 0.1319 | lr 8.18e-04 | (9897.41 ms | 52972 tok/s) step 8357/76294 | train loss 3.239628 | norm 0.1138 | lr 8.18e-04 | (9898.02 ms | 52969 tok/s) step 8358/76294 | train loss 3.272081 | norm 0.1381 | lr 8.18e-04 | (9896.01 ms | 52980 tok/s) step 8359/76294 | train loss 3.231026 | norm 0.1148 | lr 8.18e-04 | (9893.84 ms | 52991 tok/s) step 8360/76294 | train loss 3.219571 | norm 0.1387 | lr 8.18e-04 | (9899.72 ms | 52960 tok/s) step 8361/76294 | train loss 3.488363 | norm 0.1254 | lr 8.18e-04 | (9896.14 ms | 52979 tok/s) step 8362/76294 | train loss 3.270213 | norm 0.1405 | lr 8.18e-04 | (9891.51 ms | 53004 tok/s) step 8363/76294 | train loss 3.280132 | norm 0.1574 | lr 8.18e-04 | (9928.56 ms | 52806 tok/s) step 8364/76294 | 
train loss 3.291527 | norm 0.1350 | lr 8.17e-04 | (9925.59 ms | 52822 tok/s) step 8365/76294 | train loss 3.247644 | norm 0.1499 | lr 8.17e-04 | (9891.68 ms | 53003 tok/s) step 8366/76294 | train loss 3.216717 | norm 0.1258 | lr 8.17e-04 | (9952.10 ms | 52681 tok/s) step 8367/76294 | train loss 3.244358 | norm 0.1356 | lr 8.17e-04 | (9887.92 ms | 53023 tok/s) step 8368/76294 | train loss 3.192185 | norm 0.1288 | lr 8.17e-04 | (9885.33 ms | 53037 tok/s) step 8369/76294 | train loss 3.248440 | norm 0.1876 | lr 8.17e-04 | (9899.96 ms | 52959 tok/s) step 8370/76294 | train loss 3.326230 | norm 0.1845 | lr 8.17e-04 | (9937.72 ms | 52757 tok/s) step 8371/76294 | train loss 3.176921 | norm 0.1407 | lr 8.17e-04 | (9890.69 ms | 53008 tok/s) step 8372/76294 | train loss 3.206878 | norm 0.1472 | lr 8.17e-04 | (9891.93 ms | 53002 tok/s) step 8373/76294 | train loss 3.243509 | norm 0.1448 | lr 8.17e-04 | (9922.33 ms | 52839 tok/s) step 8374/76294 | train loss 3.163917 | norm 0.1531 | lr 8.17e-04 | (9950.03 ms | 52692 tok/s) step 8375/76294 | train loss 3.329819 | norm 0.1439 | lr 8.16e-04 | (9906.01 ms | 52926 tok/s) step 8376/76294 | train loss 3.301360 | norm 0.1346 | lr 8.16e-04 | (9884.95 ms | 53039 tok/s) step 8377/76294 | train loss 3.279861 | norm 0.1486 | lr 8.16e-04 | (9916.30 ms | 52871 tok/s) step 8378/76294 | train loss 3.267066 | norm 0.1301 | lr 8.16e-04 | (9881.26 ms | 53059 tok/s) step 8379/76294 | train loss 3.260333 | norm 0.1363 | lr 8.16e-04 | (9890.57 ms | 53009 tok/s) step 8380/76294 | train loss 3.242274 | norm 0.1347 | lr 8.16e-04 | (9923.87 ms | 52831 tok/s) step 8381/76294 | train loss 3.223038 | norm 0.1304 | lr 8.16e-04 | (9887.75 ms | 53024 tok/s) step 8382/76294 | train loss 3.233089 | norm 0.1319 | lr 8.16e-04 | (9930.42 ms | 52796 tok/s) step 8383/76294 | train loss 3.226319 | norm 0.1270 | lr 8.16e-04 | (9891.68 ms | 53003 tok/s) step 8384/76294 | train loss 3.186051 | norm 0.1250 | lr 8.16e-04 | (9890.89 ms | 53007 tok/s) step 8385/76294 | 
train loss 3.315788 | norm 0.1254 | lr 8.16e-04 | (9885.72 ms | 53035 tok/s) step 8386/76294 | train loss 3.218300 | norm 0.1348 | lr 8.16e-04 | (9892.59 ms | 52998 tok/s) step 8387/76294 | train loss 3.200365 | norm 0.1236 | lr 8.15e-04 | (9892.37 ms | 52999 tok/s) step 8388/76294 | train loss 3.339875 | norm 0.1294 | lr 8.15e-04 | (9884.76 ms | 53040 tok/s) step 8389/76294 | train loss 3.232744 | norm 0.1200 | lr 8.15e-04 | (9932.40 ms | 52786 tok/s) step 8390/76294 | train loss 3.263963 | norm 0.1254 | lr 8.15e-04 | (9885.68 ms | 53035 tok/s) step 8391/76294 | train loss 3.294888 | norm 0.1177 | lr 8.15e-04 | (9888.54 ms | 53020 tok/s) step 8392/76294 | train loss 3.239890 | norm 0.1196 | lr 8.15e-04 | (9917.79 ms | 52863 tok/s) step 8393/76294 | train loss 3.361159 | norm 0.1227 | lr 8.15e-04 | (9888.29 ms | 53021 tok/s) step 8394/76294 | train loss 3.177800 | norm 0.1328 | lr 8.15e-04 | (9890.58 ms | 53009 tok/s) step 8395/76294 | train loss 3.280069 | norm 0.1284 | lr 8.15e-04 | (9886.29 ms | 53032 tok/s) step 8396/76294 | train loss 3.226284 | norm 0.1320 | lr 8.15e-04 | (11424.31 ms | 45892 tok/s) step 8397/76294 | train loss 3.314382 | norm 0.1367 | lr 8.15e-04 | (9876.84 ms | 53083 tok/s) step 8398/76294 | train loss 3.245154 | norm 0.1278 | lr 8.14e-04 | (9880.46 ms | 53063 tok/s) step 8399/76294 | train loss 3.253686 | norm 0.1434 | lr 8.14e-04 | (9884.68 ms | 53040 tok/s) step 8400/76294 | train loss 3.258542 | norm 0.1258 | lr 8.14e-04 | (9881.04 ms | 53060 tok/s) step 8401/76294 | train loss 3.274008 | norm 0.1248 | lr 8.14e-04 | (9886.01 ms | 53033 tok/s) step 8402/76294 | train loss 3.210238 | norm 0.1252 | lr 8.14e-04 | (9889.43 ms | 53015 tok/s) step 8403/76294 | train loss 3.284169 | norm 0.1271 | lr 8.14e-04 | (9889.63 ms | 53014 tok/s) step 8404/76294 | train loss 3.227133 | norm 0.1419 | lr 8.14e-04 | (9910.30 ms | 52903 tok/s) step 8405/76294 | train loss 3.254990 | norm 0.1287 | lr 8.14e-04 | (9922.72 ms | 52837 tok/s) step 8406/76294 | 
train loss 3.207515 | norm 0.1253 | lr 8.14e-04 | (9886.04 ms | 53033 tok/s) step 8407/76294 | train loss 3.214515 | norm 0.1291 | lr 8.14e-04 | (9899.23 ms | 52962 tok/s) step 8408/76294 | train loss 3.266288 | norm 0.1329 | lr 8.14e-04 | (9887.63 ms | 53025 tok/s) step 8409/76294 | train loss 3.271530 | norm 0.1194 | lr 8.14e-04 | (9983.57 ms | 52515 tok/s) step 8410/76294 | train loss 3.224643 | norm 0.1220 | lr 8.13e-04 | (9967.49 ms | 52600 tok/s) step 8411/76294 | train loss 3.263804 | norm 0.1233 | lr 8.13e-04 | (9904.68 ms | 52933 tok/s) step 8412/76294 | train loss 3.238011 | norm 0.1195 | lr 8.13e-04 | (9895.19 ms | 52984 tok/s) step 8413/76294 | train loss 3.258020 | norm 0.1458 | lr 8.13e-04 | (9901.58 ms | 52950 tok/s) step 8414/76294 | train loss 3.261482 | norm 0.1255 | lr 8.13e-04 | (9925.48 ms | 52822 tok/s) step 8415/76294 | train loss 3.249094 | norm 0.1292 | lr 8.13e-04 | (9890.69 ms | 53008 tok/s) step 8416/76294 | train loss 3.193518 | norm 0.1194 | lr 8.13e-04 | (9898.53 ms | 52966 tok/s) step 8417/76294 | train loss 3.244922 | norm 0.1253 | lr 8.13e-04 | (9900.12 ms | 52958 tok/s) step 8418/76294 | train loss 3.317709 | norm 0.1210 | lr 8.13e-04 | (9956.75 ms | 52657 tok/s) step 8419/76294 | train loss 3.238685 | norm 0.1198 | lr 8.13e-04 | (9890.41 ms | 53010 tok/s) step 8420/76294 | train loss 3.245509 | norm 0.1370 | lr 8.13e-04 | (9928.32 ms | 52807 tok/s) step 8421/76294 | train loss 3.203439 | norm 0.1479 | lr 8.13e-04 | (9916.08 ms | 52872 tok/s) step 8422/76294 | train loss 3.205583 | norm 0.1345 | lr 8.12e-04 | (9887.90 ms | 53023 tok/s) step 8423/76294 | train loss 3.279775 | norm 0.1128 | lr 8.12e-04 | (9896.05 ms | 52980 tok/s) step 8424/76294 | train loss 3.249248 | norm 0.1341 | lr 8.12e-04 | (9936.47 ms | 52764 tok/s) step 8425/76294 | train loss 3.247710 | norm 0.1135 | lr 8.12e-04 | (9892.64 ms | 52998 tok/s) step 8426/76294 | train loss 3.282312 | norm 0.1379 | lr 8.12e-04 | (9898.33 ms | 52967 tok/s) step 8427/76294 | 
train loss 3.267663 | norm 0.1301 | lr 8.12e-04 | (9891.28 ms | 53005 tok/s) step 8428/76294 | train loss 3.228056 | norm 0.1230 | lr 8.12e-04 | (9953.25 ms | 52675 tok/s) step 8429/76294 | train loss 3.360384 | norm 0.1498 | lr 8.12e-04 | (9896.79 ms | 52976 tok/s) step 8430/76294 | train loss 3.269988 | norm 0.1362 | lr 8.12e-04 | (9939.46 ms | 52748 tok/s) step 8431/76294 | train loss 3.212701 | norm 0.1371 | lr 8.12e-04 | (9962.19 ms | 52628 tok/s) step 8432/76294 | train loss 3.284514 | norm 0.1363 | lr 8.12e-04 | (9896.59 ms | 52977 tok/s) step 8433/76294 | train loss 3.279993 | norm 0.1395 | lr 8.11e-04 | (9894.58 ms | 52987 tok/s) step 8434/76294 | train loss 3.172071 | norm 0.1421 | lr 8.11e-04 | (9902.52 ms | 52945 tok/s) step 8435/76294 | train loss 3.280755 | norm 0.1243 | lr 8.11e-04 | (9942.33 ms | 52733 tok/s) step 8436/76294 | train loss 3.267171 | norm 0.1516 | lr 8.11e-04 | (9891.77 ms | 53002 tok/s) step 8437/76294 | train loss 3.271367 | norm 0.1151 | lr 8.11e-04 | (9907.38 ms | 52919 tok/s) step 8438/76294 | train loss 3.276318 | norm 0.1326 | lr 8.11e-04 | (9893.04 ms | 52996 tok/s) step 8439/76294 | train loss 3.217971 | norm 0.1160 | lr 8.11e-04 | (9899.98 ms | 52958 tok/s) step 8440/76294 | train loss 3.282404 | norm 0.1559 | lr 8.11e-04 | (9891.64 ms | 53003 tok/s) step 8441/76294 | train loss 3.431093 | norm 0.1465 | lr 8.11e-04 | (9900.28 ms | 52957 tok/s) step 8442/76294 | train loss 3.280857 | norm 0.1519 | lr 8.11e-04 | (9888.66 ms | 53019 tok/s) step 8443/76294 | train loss 3.369901 | norm 0.1190 | lr 8.11e-04 | (9900.20 ms | 52957 tok/s) step 8444/76294 | train loss 3.260641 | norm 0.1387 | lr 8.11e-04 | (9890.53 ms | 53009 tok/s) step 8445/76294 | train loss 3.240793 | norm 0.1319 | lr 8.10e-04 | (9916.54 ms | 52870 tok/s) step 8446/76294 | train loss 3.255781 | norm 0.1341 | lr 8.10e-04 | (9919.86 ms | 52852 tok/s) step 8447/76294 | train loss 3.301999 | norm 0.1512 | lr 8.10e-04 | (9886.84 ms | 53029 tok/s) step 8448/76294 | 
train loss 3.309301 | norm 0.1251 | lr 8.10e-04 | (9916.93 ms | 52868 tok/s) step 8449/76294 | train loss 3.188311 | norm 0.1438 | lr 8.10e-04 | (9884.83 ms | 53040 tok/s) step 8450/76294 | train loss 3.222148 | norm 0.1392 | lr 8.10e-04 | (9931.13 ms | 52792 tok/s) step 8451/76294 | train loss 3.235839 | norm 0.1479 | lr 8.10e-04 | (9890.47 ms | 53009 tok/s) step 8452/76294 | train loss 3.205737 | norm 0.1280 | lr 8.10e-04 | (9894.78 ms | 52986 tok/s) step 8453/76294 | train loss 3.311929 | norm 0.1741 | lr 8.10e-04 | (9888.40 ms | 53021 tok/s) step 8454/76294 | train loss 3.253111 | norm 0.1498 | lr 8.10e-04 | (9883.30 ms | 53048 tok/s) step 8455/76294 | train loss 3.214296 | norm 0.1484 | lr 8.10e-04 | (9893.87 ms | 52991 tok/s) step 8456/76294 | train loss 3.284957 | norm 0.1394 | lr 8.09e-04 | (9890.23 ms | 53011 tok/s) step 8457/76294 | train loss 3.283701 | norm 0.1391 | lr 8.09e-04 | (9887.78 ms | 53024 tok/s) step 8458/76294 | train loss 3.257766 | norm 0.1326 | lr 8.09e-04 | (9957.26 ms | 52654 tok/s) step 8459/76294 | train loss 3.245958 | norm 0.1359 | lr 8.09e-04 | (9892.09 ms | 53001 tok/s) step 8460/76294 | train loss 3.272224 | norm 0.1398 | lr 8.09e-04 | (9883.24 ms | 53048 tok/s) step 8461/76294 | train loss 3.279492 | norm 0.1310 | lr 8.09e-04 | (9961.41 ms | 52632 tok/s) step 8462/76294 | train loss 3.325579 | norm 0.1450 | lr 8.09e-04 | (9884.96 ms | 53039 tok/s) step 8463/76294 | train loss 3.256383 | norm 0.1330 | lr 8.09e-04 | (9901.65 ms | 52950 tok/s) step 8464/76294 | train loss 3.302267 | norm 0.1335 | lr 8.09e-04 | (9927.59 ms | 52811 tok/s) step 8465/76294 | train loss 3.250840 | norm 0.1366 | lr 8.09e-04 | (9912.60 ms | 52891 tok/s) step 8466/76294 | train loss 3.322316 | norm 0.1560 | lr 8.09e-04 | (9890.17 ms | 53011 tok/s) step 8467/76294 | train loss 3.198383 | norm 0.1807 | lr 8.09e-04 | (9901.48 ms | 52950 tok/s) step 8468/76294 | train loss 3.216067 | norm 0.1247 | lr 8.08e-04 | (9887.21 ms | 53027 tok/s) step 8469/76294 | 
train loss 3.227297 | norm 0.1468 | lr 8.08e-04 | (9890.07 ms | 53012 tok/s) step 8470/76294 | train loss 3.274186 | norm 0.1339 | lr 8.08e-04 | (9971.66 ms | 52578 tok/s) step 8471/76294 | train loss 3.233448 | norm 0.1530 | lr 8.08e-04 | (9898.71 ms | 52965 tok/s) step 8472/76294 | train loss 3.245753 | norm 0.1382 | lr 8.08e-04 | (9886.18 ms | 53032 tok/s) step 8473/76294 | train loss 3.440760 | norm 0.1397 | lr 8.08e-04 | (9886.90 ms | 53029 tok/s) step 8474/76294 | train loss 3.247763 | norm 0.1530 | lr 8.08e-04 | (9888.10 ms | 53022 tok/s) step 8475/76294 | train loss 3.205009 | norm 0.1281 | lr 8.08e-04 | (9888.43 ms | 53020 tok/s) step 8476/76294 | train loss 3.342758 | norm 0.1441 | lr 8.08e-04 | (9891.94 ms | 53002 tok/s) step 8477/76294 | train loss 3.234746 | norm 0.1227 | lr 8.08e-04 | (9890.85 ms | 53007 tok/s) step 8478/76294 | train loss 3.283521 | norm 0.1351 | lr 8.08e-04 | (9893.46 ms | 52993 tok/s) step 8479/76294 | train loss 3.197134 | norm 0.1335 | lr 8.07e-04 | (9889.52 ms | 53014 tok/s) step 8480/76294 | train loss 3.223994 | norm 0.1431 | lr 8.07e-04 | (9883.50 ms | 53047 tok/s) step 8481/76294 | train loss 3.254198 | norm 0.1376 | lr 8.07e-04 | (9888.64 ms | 53019 tok/s) step 8482/76294 | train loss 3.273863 | norm 0.1330 | lr 8.07e-04 | (9926.92 ms | 52815 tok/s) step 8483/76294 | train loss 3.290527 | norm 0.1226 | lr 8.07e-04 | (9895.30 ms | 52984 tok/s) step 8484/76294 | train loss 3.194838 | norm 0.1428 | lr 8.07e-04 | (9889.09 ms | 53017 tok/s) step 8485/76294 | train loss 3.153113 | norm 0.1385 | lr 8.07e-04 | (9894.25 ms | 52989 tok/s) step 8486/76294 | train loss 3.229541 | norm 0.1231 | lr 8.07e-04 | (9885.41 ms | 53037 tok/s) step 8487/76294 | train loss 3.251112 | norm 0.1283 | lr 8.07e-04 | (9903.34 ms | 52941 tok/s) step 8488/76294 | train loss 3.250038 | norm 0.1282 | lr 8.07e-04 | (9886.53 ms | 53031 tok/s) step 8489/76294 | train loss 3.322398 | norm 0.1284 | lr 8.07e-04 | (9893.39 ms | 52994 tok/s) step 8490/76294 | 
train loss 3.348043 | norm 0.1514 | lr 8.07e-04 | (9888.74 ms | 53019 tok/s) step 8491/76294 | train loss 3.282709 | norm 0.1352 | lr 8.06e-04 | (9928.14 ms | 52808 tok/s) step 8492/76294 | train loss 3.266500 | norm 0.1337 | lr 8.06e-04 | (9881.95 ms | 53055 tok/s) step 8493/76294 | train loss 3.252374 | norm 0.1244 | lr 8.06e-04 | (10726.99 ms | 48876 tok/s) step 8494/76294 | train loss 3.328489 | norm 0.1192 | lr 8.06e-04 | (9872.75 ms | 53105 tok/s) step 8495/76294 | train loss 3.201202 | norm 0.1294 | lr 8.06e-04 | (9890.68 ms | 53008 tok/s) step 8496/76294 | train loss 3.275782 | norm 0.1258 | lr 8.06e-04 | (9876.96 ms | 53082 tok/s) step 8497/76294 | train loss 3.237277 | norm 0.1315 | lr 8.06e-04 | (9881.29 ms | 53059 tok/s) step 8498/76294 | train loss 3.222495 | norm 0.1520 | lr 8.06e-04 | (10167.07 ms | 51567 tok/s) step 8499/76294 | train loss 3.287486 | norm 0.1297 | lr 8.06e-04 | (9880.54 ms | 53063 tok/s) step 8500/76294 | train loss 3.206380 | norm 0.1308 | lr 8.06e-04 | (9917.68 ms | 52864 tok/s) val loss: 3.235009 saving model checkpoint to ./results/gpt2-350M-gqa/step_8500.pth step 8501/76294 | train loss 3.344092 | norm 0.1276 | lr 8.06e-04 | (9912.39 ms | 52892 tok/s) step 8502/76294 | train loss 3.281377 | norm 0.1293 | lr 8.05e-04 | (9914.04 ms | 52883 tok/s) step 8503/76294 | train loss 3.226812 | norm 0.1415 | lr 8.05e-04 | (9859.31 ms | 53177 tok/s) step 8504/76294 | train loss 3.259007 | norm 0.1285 | lr 8.05e-04 | (9859.25 ms | 53177 tok/s) step 8505/76294 | train loss 3.296297 | norm 0.1580 | lr 8.05e-04 | (9878.38 ms | 53074 tok/s) step 8506/76294 | train loss 3.240888 | norm 0.1268 | lr 8.05e-04 | (9874.68 ms | 53094 tok/s) step 8507/76294 | train loss 3.217204 | norm 0.1469 | lr 8.05e-04 | (9879.15 ms | 53070 tok/s) step 8508/76294 | train loss 3.223589 | norm 0.1241 | lr 8.05e-04 | (9894.11 ms | 52990 tok/s) step 8509/76294 | train loss 3.236421 | norm 0.1357 | lr 8.05e-04 | (9947.61 ms | 52705 tok/s) step 8510/76294 | train loss 
3.250398 | norm 0.1235 | lr 8.05e-04 | (9881.62 ms | 53057 tok/s) step 8511/76294 | train loss 3.331939 | norm 0.1419 | lr 8.05e-04 | (9934.55 ms | 52774 tok/s) step 8512/76294 | train loss 3.290909 | norm 0.1217 | lr 8.05e-04 | (9887.44 ms | 53026 tok/s) step 8513/76294 | train loss 3.233740 | norm 0.1262 | lr 8.05e-04 | (9902.91 ms | 52943 tok/s) step 8514/76294 | train loss 3.309409 | norm 0.1234 | lr 8.04e-04 | (9927.10 ms | 52814 tok/s) step 8515/76294 | train loss 3.230163 | norm 0.1315 | lr 8.04e-04 | (9910.54 ms | 52902 tok/s) step 8516/76294 | train loss 3.201481 | norm 0.1301 | lr 8.04e-04 | (9917.35 ms | 52866 tok/s) step 8517/76294 | train loss 3.297670 | norm 0.1385 | lr 8.04e-04 | (9913.17 ms | 52888 tok/s) step 8518/76294 | train loss 3.247237 | norm 0.1434 | lr 8.04e-04 | (10672.32 ms | 49126 tok/s) step 8519/76294 | train loss 3.272824 | norm 0.1247 | lr 8.04e-04 | (9890.98 ms | 53007 tok/s) step 8520/76294 | train loss 3.259792 | norm 0.1396 | lr 8.04e-04 | (9884.85 ms | 53040 tok/s) step 8521/76294 | train loss 3.298402 | norm 0.1303 | lr 8.04e-04 | (9918.72 ms | 52858 tok/s) step 8522/76294 | train loss 3.185940 | norm 0.1403 | lr 8.04e-04 | (9932.42 ms | 52786 tok/s) step 8523/76294 | train loss 3.258785 | norm 0.1308 | lr 8.04e-04 | (9899.86 ms | 52959 tok/s) step 8524/76294 | train loss 3.404870 | norm 0.1494 | lr 8.04e-04 | (9909.27 ms | 52909 tok/s) step 8525/76294 | train loss 3.256028 | norm 0.1299 | lr 8.04e-04 | (9900.72 ms | 52955 tok/s) step 8526/76294 | train loss 3.250845 | norm 0.1547 | lr 8.03e-04 | (9908.84 ms | 52911 tok/s) step 8527/76294 | train loss 3.173345 | norm 0.1304 | lr 8.03e-04 | (9895.86 ms | 52981 tok/s) step 8528/76294 | train loss 3.308739 | norm 0.1559 | lr 8.03e-04 | (9897.22 ms | 52973 tok/s) step 8529/76294 | train loss 3.241057 | norm 0.1420 | lr 8.03e-04 | (9896.48 ms | 52977 tok/s) step 8530/76294 | train loss 3.290482 | norm 0.1544 | lr 8.03e-04 | (9900.56 ms | 52955 tok/s) step 8531/76294 | train loss 
3.252199 | norm 0.1264 | lr 8.03e-04 | (9896.54 ms | 52977 tok/s) step 8532/76294 | train loss 3.243181 | norm 0.1410 | lr 8.03e-04 | (9903.62 ms | 52939 tok/s) step 8533/76294 | train loss 3.206430 | norm 0.1328 | lr 8.03e-04 | (9924.82 ms | 52826 tok/s) step 8534/76294 | train loss 3.278128 | norm 0.1456 | lr 8.03e-04 | (9889.26 ms | 53016 tok/s) step 8535/76294 | train loss 3.252227 | norm 0.1250 | lr 8.03e-04 | (9938.17 ms | 52755 tok/s) step 8536/76294 | train loss 3.205782 | norm 0.1279 | lr 8.03e-04 | (9892.19 ms | 53000 tok/s) step 8537/76294 | train loss 3.253770 | norm 0.1261 | lr 8.02e-04 | (9908.70 ms | 52912 tok/s) step 8538/76294 | train loss 3.229005 | norm 0.1305 | lr 8.02e-04 | (9894.44 ms | 52988 tok/s) step 8539/76294 | train loss 3.236798 | norm 0.1272 | lr 8.02e-04 | (9911.67 ms | 52896 tok/s) step 8540/76294 | train loss 3.199941 | norm 0.1410 | lr 8.02e-04 | (9891.08 ms | 53006 tok/s) step 8541/76294 | train loss 3.360278 | norm 0.1973 | lr 8.02e-04 | (9899.67 ms | 52960 tok/s) step 8542/76294 | train loss 3.223357 | norm 0.1633 | lr 8.02e-04 | (9893.93 ms | 52991 tok/s) step 8543/76294 | train loss 3.288851 | norm 0.1365 | lr 8.02e-04 | (9897.61 ms | 52971 tok/s) step 8544/76294 | train loss 3.264274 | norm 0.1479 | lr 8.02e-04 | (9890.84 ms | 53007 tok/s) step 8545/76294 | train loss 3.281155 | norm 0.1430 | lr 8.02e-04 | (9901.35 ms | 52951 tok/s) step 8546/76294 | train loss 3.231762 | norm 0.1689 | lr 8.02e-04 | (9904.27 ms | 52936 tok/s) step 8547/76294 | train loss 3.325665 | norm 0.1308 | lr 8.02e-04 | (9892.45 ms | 52999 tok/s) step 8548/76294 | train loss 3.228279 | norm 0.1564 | lr 8.02e-04 | (9905.02 ms | 52932 tok/s) step 8549/76294 | train loss 3.190599 | norm 0.1403 | lr 8.01e-04 | (10043.20 ms | 52203 tok/s) step 8550/76294 | train loss 3.216549 | norm 0.1558 | lr 8.01e-04 | (9905.03 ms | 52931 tok/s) step 8551/76294 | train loss 3.228432 | norm 0.1390 | lr 8.01e-04 | (9891.56 ms | 53004 tok/s) step 8552/76294 | train loss 
3.271304 | norm 0.1389 | lr 8.01e-04 | (9886.98 ms | 53028 tok/s) step 8553/76294 | train loss 3.238102 | norm 0.1491 | lr 8.01e-04 | (9934.44 ms | 52775 tok/s) step 8554/76294 | train loss 3.281781 | norm 0.1295 | lr 8.01e-04 | (9892.46 ms | 52999 tok/s) step 8555/76294 | train loss 3.256481 | norm 0.1723 | lr 8.01e-04 | (9929.97 ms | 52799 tok/s) step 8556/76294 | train loss 3.238333 | norm 0.1231 | lr 8.01e-04 | (9888.56 ms | 53020 tok/s) step 8557/76294 | train loss 3.296551 | norm 0.1539 | lr 8.01e-04 | (9944.62 ms | 52721 tok/s) step 8558/76294 | train loss 3.180486 | norm 0.1527 | lr 8.01e-04 | (9899.65 ms | 52960 tok/s) step 8559/76294 | train loss 3.313142 | norm 0.1454 | lr 8.01e-04 | (9902.84 ms | 52943 tok/s) step 8560/76294 | train loss 3.276999 | norm 0.1397 | lr 8.00e-04 | (9973.24 ms | 52569 tok/s) step 8561/76294 | train loss 3.201559 | norm 0.1601 | lr 8.00e-04 | (9899.42 ms | 52961 tok/s) step 8562/76294 | train loss 3.217762 | norm 0.1190 | lr 8.00e-04 | (9897.95 ms | 52969 tok/s) step 8563/76294 | train loss 3.218638 | norm 0.1704 | lr 8.00e-04 | (9915.96 ms | 52873 tok/s) step 8564/76294 | train loss 3.263355 | norm 0.1268 | lr 8.00e-04 | (9896.29 ms | 52978 tok/s) step 8565/76294 | train loss 3.241114 | norm 0.1486 | lr 8.00e-04 | (9897.38 ms | 52972 tok/s) step 8566/76294 | train loss 3.215039 | norm 0.1293 | lr 8.00e-04 | (9896.84 ms | 52975 tok/s) step 8567/76294 | train loss 3.322527 | norm 0.1508 | lr 8.00e-04 | (9897.06 ms | 52974 tok/s) step 8568/76294 | train loss 3.261330 | norm 0.1369 | lr 8.00e-04 | (9894.43 ms | 52988 tok/s) step 8569/76294 | train loss 3.235646 | norm 0.1444 | lr 8.00e-04 | (9895.48 ms | 52983 tok/s) step 8570/76294 | train loss 3.233105 | norm 0.1192 | lr 8.00e-04 | (9896.02 ms | 52980 tok/s) step 8571/76294 | train loss 3.284570 | norm 0.1214 | lr 8.00e-04 | (9896.54 ms | 52977 tok/s) step 8572/76294 | train loss 3.215955 | norm 0.1367 | lr 7.99e-04 | (9932.94 ms | 52783 tok/s) step 8573/76294 | train loss 
3.268453 | norm 0.1176 | lr 7.99e-04 | (9929.26 ms | 52802 tok/s) step 8574/76294 | train loss 3.263343 | norm 0.1286 | lr 7.99e-04 | (9891.36 ms | 53005 tok/s) step 8575/76294 | train loss 3.333520 | norm 0.1389 | lr 7.99e-04 | (9909.17 ms | 52909 tok/s) step 8576/76294 | train loss 3.269426 | norm 0.1195 | lr 7.99e-04 | (9886.77 ms | 53029 tok/s) step 8577/76294 | train loss 3.270023 | norm 0.1257 | lr 7.99e-04 | (9901.15 ms | 52952 tok/s) step 8578/76294 | train loss 3.288494 | norm 0.1230 | lr 7.99e-04 | (9890.93 ms | 53007 tok/s) step 8579/76294 | train loss 3.209670 | norm 0.1238 | lr 7.99e-04 | (10064.26 ms | 52094 tok/s) step 8580/76294 | train loss 3.295835 | norm 0.1177 | lr 7.99e-04 | (9886.66 ms | 53030 tok/s) step 8581/76294 | train loss 3.276771 | norm 0.1223 | lr 7.99e-04 | (10283.52 ms | 50983 tok/s) step 8582/76294 | train loss 3.261399 | norm 0.1157 | lr 7.99e-04 | (9883.00 ms | 53049 tok/s) step 8583/76294 | train loss 3.245147 | norm 0.1321 | lr 7.98e-04 | (9901.54 ms | 52950 tok/s) step 8584/76294 | train loss 3.196147 | norm 0.1183 | lr 7.98e-04 | (9890.28 ms | 53010 tok/s) step 8585/76294 | train loss 3.255307 | norm 0.1218 | lr 7.98e-04 | (9897.27 ms | 52973 tok/s) step 8586/76294 | train loss 3.245252 | norm 0.1288 | lr 7.98e-04 | (9888.47 ms | 53020 tok/s) step 8587/76294 | train loss 3.219232 | norm 0.1189 | lr 7.98e-04 | (9901.39 ms | 52951 tok/s) step 8588/76294 | train loss 3.313609 | norm 0.1201 | lr 7.98e-04 | (9884.09 ms | 53044 tok/s) step 8589/76294 | train loss 3.258232 | norm 0.1253 | lr 7.98e-04 | (9919.11 ms | 52856 tok/s) step 8590/76294 | train loss 3.173239 | norm 0.1354 | lr 7.98e-04 | (9925.95 ms | 52820 tok/s) step 8591/76294 | train loss 3.302016 | norm 0.1411 | lr 7.98e-04 | (10794.60 ms | 48569 tok/s) step 8592/76294 | train loss 3.372133 | norm 0.1734 | lr 7.98e-04 | (9897.31 ms | 52973 tok/s) step 8593/76294 | train loss 3.230572 | norm 0.1359 | lr 7.98e-04 | (9909.60 ms | 52907 tok/s) step 8594/76294 | train loss 
3.267257 | norm 0.1600 | lr 7.98e-04 | (9916.76 ms | 52869 tok/s) step 8595/76294 | train loss 3.233414 | norm 0.1529 | lr 7.97e-04 | (9888.89 ms | 53018 tok/s) step 8596/76294 | train loss 3.251312 | norm 0.1424 | lr 7.97e-04 | (9920.00 ms | 52852 tok/s) step 8597/76294 | train loss 3.213769 | norm 0.1460 | lr 7.97e-04 | (9957.76 ms | 52651 tok/s) step 8598/76294 | train loss 3.354373 | norm 0.1547 | lr 7.97e-04 | (9888.18 ms | 53022 tok/s) step 8599/76294 | train loss 3.279183 | norm 0.1581 | lr 7.97e-04 | (9902.76 ms | 52944 tok/s) step 8600/76294 | train loss 3.237121 | norm 0.1470 | lr 7.97e-04 | (9896.55 ms | 52977 tok/s) step 8601/76294 | train loss 3.294402 | norm 0.1549 | lr 7.97e-04 | (9890.80 ms | 53008 tok/s) step 8602/76294 | train loss 3.240583 | norm 0.1344 | lr 7.97e-04 | (9896.03 ms | 52980 tok/s) step 8603/76294 | train loss 3.144948 | norm 0.1596 | lr 7.97e-04 | (9888.73 ms | 53019 tok/s) step 8604/76294 | train loss 3.281402 | norm 0.1365 | lr 7.97e-04 | (9900.77 ms | 52954 tok/s) step 8605/76294 | train loss 3.263360 | norm 0.1310 | lr 7.97e-04 | (9916.15 ms | 52872 tok/s) step 8606/76294 | train loss 3.192631 | norm 0.1275 | lr 7.96e-04 | (9890.64 ms | 53008 tok/s) step 8607/76294 | train loss 3.270951 | norm 0.1471 | lr 7.96e-04 | (9895.36 ms | 52983 tok/s) step 8608/76294 | train loss 3.413982 | norm 0.1199 | lr 7.96e-04 | (9934.76 ms | 52773 tok/s) step 8609/76294 | train loss 3.277124 | norm 0.1320 | lr 7.96e-04 | (10366.91 ms | 50573 tok/s) step 8610/76294 | train loss 3.234444 | norm 0.1391 | lr 7.96e-04 | (9895.61 ms | 52982 tok/s) step 8611/76294 | train loss 3.204459 | norm 0.1381 | lr 7.96e-04 | (9889.26 ms | 53016 tok/s) step 8612/76294 | train loss 3.244861 | norm 0.1295 | lr 7.96e-04 | (9899.32 ms | 52962 tok/s) step 8613/76294 | train loss 3.303488 | norm 0.1285 | lr 7.96e-04 | (9898.73 ms | 52965 tok/s) step 8614/76294 | train loss 3.242498 | norm 0.1321 | lr 7.96e-04 | (9890.19 ms | 53011 tok/s) step 8615/76294 | train loss 
3.225961 | norm 0.1387 | lr 7.96e-04 | (9900.81 ms | 52954 tok/s) step 8616/76294 | train loss 3.215803 | norm 0.1400 | lr 7.96e-04 | (9889.84 ms | 53013 tok/s) step 8617/76294 | train loss 3.258698 | norm 0.1253 | lr 7.96e-04 | (9900.15 ms | 52958 tok/s) step 8618/76294 | train loss 3.255263 | norm 0.1316 | lr 7.95e-04 | (11346.50 ms | 46207 tok/s) step 8619/76294 | train loss 3.206307 | norm 0.1255 | lr 7.95e-04 | (9910.03 ms | 52905 tok/s) step 8620/76294 | train loss 3.237922 | norm 0.1384 | lr 7.95e-04 | (9902.46 ms | 52945 tok/s) step 8621/76294 | train loss 3.271973 | norm 0.1122 | lr 7.95e-04 | (9926.75 ms | 52816 tok/s) step 8622/76294 | train loss 3.269859 | norm 0.1207 | lr 7.95e-04 | (9884.79 ms | 53040 tok/s) step 8623/76294 | train loss 3.233824 | norm 0.1312 | lr 7.95e-04 | (9892.84 ms | 52997 tok/s) step 8624/76294 | train loss 3.230436 | norm 0.1235 | lr 7.95e-04 | (9886.71 ms | 53030 tok/s) step 8625/76294 | train loss 3.237675 | norm 0.1317 | lr 7.95e-04 | (9902.46 ms | 52945 tok/s) step 8626/76294 | train loss 3.325576 | norm 0.1272 | lr 7.95e-04 | (11068.67 ms | 47367 tok/s) step 8627/76294 | train loss 3.279244 | norm 0.1274 | lr 7.95e-04 | (9896.80 ms | 52976 tok/s) step 8628/76294 | train loss 3.222634 | norm 0.1410 | lr 7.95e-04 | (9905.62 ms | 52928 tok/s) step 8629/76294 | train loss 3.222140 | norm 0.1359 | lr 7.94e-04 | (9890.50 ms | 53009 tok/s) step 8630/76294 | train loss 3.240301 | norm 0.1196 | lr 7.94e-04 | (9882.04 ms | 53055 tok/s) step 8631/76294 | train loss 3.237382 | norm 0.1455 | lr 7.94e-04 | (11172.38 ms | 46927 tok/s) step 8632/76294 | train loss 3.208586 | norm 0.1091 | lr 7.94e-04 | (9874.78 ms | 53094 tok/s) step 8633/76294 | train loss 3.161857 | norm 0.1251 | lr 7.94e-04 | (9891.07 ms | 53006 tok/s) step 8634/76294 | train loss 3.363830 | norm 0.1728 | lr 7.94e-04 | (9903.41 ms | 52940 tok/s) step 8635/76294 | train loss 3.225867 | norm 0.2315 | lr 7.94e-04 | (9884.84 ms | 53040 tok/s) step 8636/76294 | train loss 
3.240014 | norm 0.1299 | lr 7.94e-04 | (9916.74 ms | 52869 tok/s) step 8637/76294 | train loss 3.274188 | norm 0.1962 | lr 7.94e-04 | (9896.95 ms | 52975 tok/s) step 8638/76294 | train loss 3.302428 | norm 0.1517 | lr 7.94e-04 | (9899.96 ms | 52959 tok/s) step 8639/76294 | train loss 3.243830 | norm 0.1487 | lr 7.94e-04 | (9891.25 ms | 53005 tok/s) step 8640/76294 | train loss 3.340587 | norm 0.2171 | lr 7.93e-04 | (9936.28 ms | 52765 tok/s) step 8641/76294 | train loss 3.198003 | norm 0.1642 | lr 7.93e-04 | (9890.55 ms | 53009 tok/s) step 8642/76294 | train loss 3.205917 | norm 0.1555 | lr 7.93e-04 | (9893.91 ms | 52991 tok/s) step 8643/76294 | train loss 3.186701 | norm 0.1619 | lr 7.93e-04 | (9905.09 ms | 52931 tok/s) step 8644/76294 | train loss 3.197308 | norm 0.1418 | lr 7.93e-04 | (9892.59 ms | 52998 tok/s) step 8645/76294 | train loss 3.320782 | norm 0.1575 | lr 7.93e-04 | (9937.24 ms | 52760 tok/s) step 8646/76294 | train loss 3.160228 | norm 0.1359 | lr 7.93e-04 | (9904.14 ms | 52936 tok/s) step 8647/76294 | train loss 3.280028 | norm 0.1480 | lr 7.93e-04 | (9919.83 ms | 52852 tok/s) step 8648/76294 | train loss 3.310111 | norm 0.1339 | lr 7.93e-04 | (9901.01 ms | 52953 tok/s) step 8649/76294 | train loss 3.200536 | norm 0.1410 | lr 7.93e-04 | (9887.59 ms | 53025 tok/s) step 8650/76294 | train loss 3.191473 | norm 0.1417 | lr 7.93e-04 | (9984.48 ms | 52510 tok/s) step 8651/76294 | train loss 3.274658 | norm 0.1385 | lr 7.93e-04 | (9912.24 ms | 52893 tok/s) step 8652/76294 | train loss 3.261959 | norm 0.1412 | lr 7.92e-04 | (9903.17 ms | 52941 tok/s) step 8653/76294 | train loss 3.245885 | norm 0.1383 | lr 7.92e-04 | (9936.40 ms | 52764 tok/s) step 8654/76294 | train loss 3.286140 | norm 0.1366 | lr 7.92e-04 | (9889.88 ms | 53013 tok/s) step 8655/76294 | train loss 3.298424 | norm 0.1331 | lr 7.92e-04 | (9898.70 ms | 52965 tok/s) step 8656/76294 | train loss 3.264342 | norm 0.1430 | lr 7.92e-04 | (9886.84 ms | 53029 tok/s) step 8657/76294 | train loss 
3.250883 | norm 0.1316 | lr 7.92e-04 | (9902.58 ms | 52945 tok/s) step 8658/76294 | train loss 3.250597 | norm 0.1354 | lr 7.92e-04 | (9893.94 ms | 52991 tok/s) step 8659/76294 | train loss 3.224787 | norm 0.1245 | lr 7.92e-04 | (9899.43 ms | 52961 tok/s) step 8660/76294 | train loss 3.233536 | norm 0.1349 | lr 7.92e-04 | (9891.58 ms | 53003 tok/s) step 8661/76294 | train loss 3.201432 | norm 0.1141 | lr 7.92e-04 | (9901.60 ms | 52950 tok/s) step 8662/76294 | train loss 3.277424 | norm 0.1301 | lr 7.92e-04 | (9893.70 ms | 52992 tok/s) step 8663/76294 | train loss 3.282078 | norm 0.1296 | lr 7.91e-04 | (9976.83 ms | 52551 tok/s) step 8664/76294 | train loss 3.245376 | norm 0.1493 | lr 7.91e-04 | (9891.70 ms | 53003 tok/s) step 8665/76294 | train loss 3.218866 | norm 0.1303 | lr 7.91e-04 | (9909.53 ms | 52907 tok/s) step 8666/76294 | train loss 3.260641 | norm 0.1383 | lr 7.91e-04 | (9888.98 ms | 53017 tok/s) step 8667/76294 | train loss 3.270392 | norm 0.1196 | lr 7.91e-04 | (9894.79 ms | 52986 tok/s) step 8668/76294 | train loss 3.263168 | norm 0.1146 | lr 7.91e-04 | (9920.61 ms | 52848 tok/s) step 8669/76294 | train loss 3.338685 | norm 0.1262 | lr 7.91e-04 | (9894.14 ms | 52990 tok/s) step 8670/76294 | train loss 3.206061 | norm 0.1097 | lr 7.91e-04 | (9895.39 ms | 52983 tok/s) step 8671/76294 | train loss 3.233961 | norm 0.1295 | lr 7.91e-04 | (9891.78 ms | 53002 tok/s) step 8672/76294 | train loss 3.478483 | norm 0.1233 | lr 7.91e-04 | (9917.72 ms | 52864 tok/s) step 8673/76294 | train loss 3.242304 | norm 0.1505 | lr 7.91e-04 | (9944.85 ms | 52720 tok/s) step 8674/76294 | train loss 3.227638 | norm 0.1175 | lr 7.91e-04 | (9894.75 ms | 52986 tok/s) step 8675/76294 | train loss 3.266156 | norm 0.2560 | lr 7.90e-04 | (9894.76 ms | 52986 tok/s) step 8676/76294 | train loss 3.270438 | norm 0.1489 | lr 7.90e-04 | (9907.90 ms | 52916 tok/s) step 8677/76294 | train loss 3.280982 | norm 0.1598 | lr 7.90e-04 | (9919.80 ms | 52853 tok/s) step 8678/76294 | train loss 
3.307020 | norm 0.1618 | lr 7.90e-04 | (9902.59 ms | 52945 tok/s) step 8679/76294 | train loss 3.268251 | norm 0.1749 | lr 7.90e-04 | (9894.38 ms | 52988 tok/s) step 8680/76294 | train loss 3.244133 | norm 0.1523 | lr 7.90e-04 | (9995.24 ms | 52454 tok/s) step 8681/76294 | train loss 3.279592 | norm 0.1385 | lr 7.90e-04 | (9991.73 ms | 52472 tok/s) step 8682/76294 | train loss 3.253129 | norm 0.1533 | lr 7.90e-04 | (9887.52 ms | 53025 tok/s) step 8683/76294 | train loss 3.211574 | norm 0.1414 | lr 7.90e-04 | (9952.77 ms | 52678 tok/s) step 8684/76294 | train loss 3.264881 | norm 0.1416 | lr 7.90e-04 | (9896.77 ms | 52976 tok/s) step 8685/76294 | train loss 3.308532 | norm 0.1275 | lr 7.90e-04 | (9895.38 ms | 52983 tok/s) step 8686/76294 | train loss 3.188885 | norm 0.1257 | lr 7.89e-04 | (9924.44 ms | 52828 tok/s) step 8687/76294 | train loss 3.438196 | norm 0.1522 | lr 7.89e-04 | (9892.98 ms | 52996 tok/s) step 8688/76294 | train loss 3.278564 | norm 0.1432 | lr 7.89e-04 | (11435.00 ms | 45849 tok/s) step 8689/76294 | train loss 3.209915 | norm 0.1299 | lr 7.89e-04 | (10595.58 ms | 49482 tok/s) step 8690/76294 | train loss 3.174162 | norm 0.1204 | lr 7.89e-04 | (9869.09 ms | 53124 tok/s) step 8691/76294 | train loss 3.262141 | norm 0.1260 | lr 7.89e-04 | (9883.98 ms | 53044 tok/s) step 8692/76294 | train loss 3.215917 | norm 0.1314 | lr 7.89e-04 | (9942.70 ms | 52731 tok/s) step 8693/76294 | train loss 3.304048 | norm 0.1364 | lr 7.89e-04 | (9878.95 ms | 53071 tok/s) step 8694/76294 | train loss 3.221802 | norm 0.1184 | lr 7.89e-04 | (9925.77 ms | 52821 tok/s) step 8695/76294 | train loss 3.251537 | norm 0.1355 | lr 7.89e-04 | (9885.92 ms | 53034 tok/s) step 8696/76294 | train loss 3.191177 | norm 0.1119 | lr 7.89e-04 | (9883.37 ms | 53047 tok/s) step 8697/76294 | train loss 3.267969 | norm 0.1304 | lr 7.89e-04 | (9883.36 ms | 53048 tok/s) step 8698/76294 | train loss 3.241318 | norm 0.1308 | lr 7.88e-04 | (9888.79 ms | 53018 tok/s) step 8699/76294 | train loss 
3.258981 | norm 0.1244 | lr 7.88e-04 | (12920.92 ms | 40577 tok/s) step 8700/76294 | train loss 3.208780 | norm 0.1317 | lr 7.88e-04 | (10041.36 ms | 52213 tok/s) step 8701/76294 | train loss 3.290884 | norm 0.1286 | lr 7.88e-04 | (9873.92 ms | 53098 tok/s) step 8702/76294 | train loss 3.205140 | norm 0.1455 | lr 7.88e-04 | (9876.92 ms | 53082 tok/s) step 8703/76294 | train loss 3.229149 | norm 0.1199 | lr 7.88e-04 | (9878.30 ms | 53075 tok/s) step 8704/76294 | train loss 3.231303 | norm 0.1242 | lr 7.88e-04 | (9884.75 ms | 53040 tok/s) step 8705/76294 | train loss 3.261261 | norm 0.1242 | lr 7.88e-04 | (9914.10 ms | 52883 tok/s) step 8706/76294 | train loss 3.268178 | norm 0.1745 | lr 7.88e-04 | (9947.67 ms | 52705 tok/s) step 8707/76294 | train loss 3.271171 | norm 0.1345 | lr 7.88e-04 | (9895.25 ms | 52984 tok/s) step 8708/76294 | train loss 3.208851 | norm 0.1332 | lr 7.88e-04 | (9892.53 ms | 52998 tok/s) step 8709/76294 | train loss 3.242271 | norm 0.1447 | lr 7.87e-04 | (9935.27 ms | 52770 tok/s) step 8710/76294 | train loss 3.267296 | norm 0.1337 | lr 7.87e-04 | (9888.19 ms | 53022 tok/s) step 8711/76294 | train loss 3.239338 | norm 0.1655 | lr 7.87e-04 | (9924.78 ms | 52826 tok/s) step 8712/76294 | train loss 3.149810 | norm 0.1365 | lr 7.87e-04 | (9893.23 ms | 52995 tok/s) step 8713/76294 | train loss 3.290587 | norm 0.1478 | lr 7.87e-04 | (9905.84 ms | 52927 tok/s) step 8714/76294 | train loss 3.190930 | norm 0.1397 | lr 7.87e-04 | (9893.35 ms | 52994 tok/s) step 8715/76294 | train loss 3.196405 | norm 0.1262 | lr 7.87e-04 | (9910.70 ms | 52901 tok/s) step 8716/76294 | train loss 3.248263 | norm 0.1369 | lr 7.87e-04 | (9894.99 ms | 52985 tok/s) step 8717/76294 | train loss 3.131725 | norm 0.1406 | lr 7.87e-04 | (9937.69 ms | 52758 tok/s) step 8718/76294 | train loss 3.282613 | norm 0.1296 | lr 7.87e-04 | (9903.36 ms | 52940 tok/s) step 8719/76294 | train loss 3.202005 | norm 0.1315 | lr 7.87e-04 | (9967.29 ms | 52601 tok/s) step 8720/76294 | train loss 
3.238832 | norm 0.1488 | lr 7.87e-04 | (9894.00 ms | 52991 tok/s) step 8721/76294 | train loss 3.198549 | norm 0.1229 | lr 7.86e-04 | (9901.67 ms | 52949 tok/s) step 8722/76294 | train loss 3.235716 | norm 0.1229 | lr 7.86e-04 | (9892.80 ms | 52997 tok/s) step 8723/76294 | train loss 3.238881 | norm 0.1170 | lr 7.86e-04 | (9900.78 ms | 52954 tok/s) step 8724/76294 | train loss 3.189964 | norm 0.1288 | lr 7.86e-04 | (9893.69 ms | 52992 tok/s) step 8725/76294 | train loss 3.226772 | norm 0.1148 | lr 7.86e-04 | (9900.47 ms | 52956 tok/s) step 8726/76294 | train loss 3.328087 | norm 0.1293 | lr 7.86e-04 | (9893.24 ms | 52995 tok/s) step 8727/76294 | train loss 3.219084 | norm 0.1178 | lr 7.86e-04 | (9904.86 ms | 52932 tok/s) step 8728/76294 | train loss 3.277999 | norm 0.1354 | lr 7.86e-04 | (9891.42 ms | 53004 tok/s) step 8729/76294 | train loss 3.233769 | norm 0.1212 | lr 7.86e-04 | (9912.25 ms | 52893 tok/s) step 8730/76294 | train loss 3.277469 | norm 0.1352 | lr 7.86e-04 | (9890.93 ms | 53007 tok/s) step 8731/76294 | train loss 3.244795 | norm 0.1464 | lr 7.86e-04 | (9899.58 ms | 52961 tok/s) step 8732/76294 | train loss 3.226245 | norm 0.1274 | lr 7.85e-04 | (9891.84 ms | 53002 tok/s) step 8733/76294 | train loss 3.216664 | norm 0.1397 | lr 7.85e-04 | (9905.35 ms | 52930 tok/s) step 8734/76294 | train loss 3.325680 | norm 0.1310 | lr 7.85e-04 | (9897.62 ms | 52971 tok/s) step 8735/76294 | train loss 3.254107 | norm 0.1528 | lr 7.85e-04 | (9931.86 ms | 52789 tok/s) step 8736/76294 | train loss 3.245222 | norm 0.1288 | lr 7.85e-04 | (9927.37 ms | 52812 tok/s) step 8737/76294 | train loss 3.256836 | norm 0.1311 | lr 7.85e-04 | (9917.60 ms | 52864 tok/s) step 8738/76294 | train loss 3.341625 | norm 0.1296 | lr 7.85e-04 | (9963.94 ms | 52619 tok/s) step 8739/76294 | train loss 3.196826 | norm 0.1324 | lr 7.85e-04 | (9926.33 ms | 52818 tok/s) step 8740/76294 | train loss 3.233961 | norm 0.1262 | lr 7.85e-04 | (9895.72 ms | 52981 tok/s) step 8741/76294 | train loss 
3.200213 | norm 0.1183 | lr 7.85e-04 | (9941.21 ms | 52739 tok/s) step 8742/76294 | train loss 3.200963 | norm 0.1297 | lr 7.85e-04 | (9892.72 ms | 52997 tok/s) step 8743/76294 | train loss 3.197912 | norm 0.1222 | lr 7.84e-04 | (9904.45 ms | 52935 tok/s) step 8744/76294 | train loss 3.271100 | norm 0.1223 | lr 7.84e-04 | (9894.05 ms | 52990 tok/s) step 8745/76294 | train loss 3.252699 | norm 0.1322 | lr 7.84e-04 | (9903.11 ms | 52942 tok/s) step 8746/76294 | train loss 3.227177 | norm 0.1282 | lr 7.84e-04 | (9951.08 ms | 52687 tok/s) step 8747/76294 | train loss 3.267251 | norm 0.1182 | lr 7.84e-04 | (9975.80 ms | 52556 tok/s) step 8748/76294 | train loss 3.224273 | norm 0.1198 | lr 7.84e-04 | (9898.74 ms | 52965 tok/s) step 8749/76294 | train loss 3.239241 | norm 0.1461 | lr 7.84e-04 | (9892.39 ms | 52999 tok/s) step 8750/76294 | train loss 3.138176 | norm 0.1647 | lr 7.84e-04 | (9889.84 ms | 53013 tok/s) val loss: 3.228174 saving model checkpoint to ./results/gpt2-350M-gqa/step_8750.pth step 8751/76294 | train loss 3.144637 | norm 0.1275 | lr 7.84e-04 | (9968.82 ms | 52593 tok/s) step 8752/76294 | train loss 3.200655 | norm 0.1496 | lr 7.84e-04 | (9958.76 ms | 52646 tok/s) step 8753/76294 | train loss 3.357261 | norm 0.1300 | lr 7.84e-04 | (9939.75 ms | 52747 tok/s) step 8754/76294 | train loss 3.245434 | norm 0.1407 | lr 7.84e-04 | (9879.70 ms | 53067 tok/s) step 8755/76294 | train loss 3.233064 | norm 0.1310 | lr 7.83e-04 | (9898.31 ms | 52967 tok/s) step 8756/76294 | train loss 3.215994 | norm 0.1245 | lr 7.83e-04 | (9885.09 ms | 53038 tok/s) step 8757/76294 | train loss 3.308466 | norm 0.1553 | lr 7.83e-04 | (9898.43 ms | 52967 tok/s) step 8758/76294 | train loss 3.272341 | norm 0.1354 | lr 7.83e-04 | (9889.38 ms | 53015 tok/s) step 8759/76294 | train loss 3.295631 | norm 0.1340 | lr 7.83e-04 | (9984.50 ms | 52510 tok/s) step 8760/76294 | train loss 3.230165 | norm 0.1370 | lr 7.83e-04 | (9895.60 ms | 52982 tok/s) step 8761/76294 | train loss 3.269050 | norm 
0.1334 | lr 7.83e-04 | (9902.95 ms | 52943 tok/s) step 8762/76294 | train loss 3.158393 | norm 0.1334 | lr 7.83e-04 | (9900.29 ms | 52957 tok/s) step 8763/76294 | train loss 3.261205 | norm 0.1576 | lr 7.83e-04 | (9896.83 ms | 52975 tok/s) step 8764/76294 | train loss 3.254629 | norm 0.1213 | lr 7.83e-04 | (9921.74 ms | 52842 tok/s) step 8765/76294 | train loss 3.154040 | norm 0.1554 | lr 7.83e-04 | (9938.65 ms | 52752 tok/s) step 8766/76294 | train loss 3.250817 | norm 0.1378 | lr 7.82e-04 | (9894.82 ms | 52986 tok/s) step 8767/76294 | train loss 3.209990 | norm 0.1301 | lr 7.82e-04 | (9906.48 ms | 52924 tok/s) step 8768/76294 | train loss 3.279505 | norm 0.1259 | lr 7.82e-04 | (9894.87 ms | 52986 tok/s) step 8769/76294 | train loss 3.174346 | norm 0.1142 | lr 7.82e-04 | (9903.81 ms | 52938 tok/s) step 8770/76294 | train loss 3.290200 | norm 0.1193 | lr 7.82e-04 | (9893.49 ms | 52993 tok/s) step 8771/76294 | train loss 3.244856 | norm 0.1367 | lr 7.82e-04 | (9951.64 ms | 52684 tok/s) step 8772/76294 | train loss 3.212848 | norm 0.1139 | lr 7.82e-04 | (9892.55 ms | 52998 tok/s) step 8773/76294 | train loss 3.133754 | norm 0.1477 | lr 7.82e-04 | (9924.72 ms | 52826 tok/s) step 8774/76294 | train loss 3.250974 | norm 0.1354 | lr 7.82e-04 | (9896.52 ms | 52977 tok/s) step 8775/76294 | train loss 3.237450 | norm 0.1408 | lr 7.82e-04 | (9903.17 ms | 52941 tok/s) step 8776/76294 | train loss 3.237993 | norm 0.1334 | lr 7.82e-04 | (9890.87 ms | 53007 tok/s) step 8777/76294 | train loss 3.211124 | norm 0.1347 | lr 7.82e-04 | (9896.54 ms | 52977 tok/s) step 8778/76294 | train loss 3.240129 | norm 0.1405 | lr 7.81e-04 | (9963.89 ms | 52619 tok/s) step 8779/76294 | train loss 3.244299 | norm 0.1489 | lr 7.81e-04 | (9906.83 ms | 52922 tok/s) step 8780/76294 | train loss 3.215190 | norm 0.1401 | lr 7.81e-04 | (9890.74 ms | 53008 tok/s) step 8781/76294 | train loss 3.223725 | norm 0.1269 | lr 7.81e-04 | (9966.20 ms | 52607 tok/s) step 8782/76294 | train loss 3.205750 | norm 
0.1509 | lr 7.81e-04 | (9902.15 ms | 52947 tok/s) step 8783/76294 | train loss 3.193474 | norm 0.1503 | lr 7.81e-04 | (9900.25 ms | 52957 tok/s) step 8784/76294 | train loss 3.207383 | norm 0.1426 | lr 7.81e-04 | (9964.90 ms | 52613 tok/s) step 8785/76294 | train loss 3.274986 | norm 0.1261 | lr 7.81e-04 | (9895.16 ms | 52984 tok/s) step 8786/76294 | train loss 3.265769 | norm 0.1222 | lr 7.81e-04 | (10695.17 ms | 49021 tok/s) step 8787/76294 | train loss 3.245266 | norm 0.1240 | lr 7.81e-04 | (9888.85 ms | 53018 tok/s) step 8788/76294 | train loss 3.267825 | norm 0.1344 | lr 7.81e-04 | (9943.57 ms | 52726 tok/s) step 8789/76294 | train loss 3.216081 | norm 0.1253 | lr 7.80e-04 | (9892.92 ms | 52996 tok/s) step 8790/76294 | train loss 3.276829 | norm 0.1202 | lr 7.80e-04 | (9908.46 ms | 52913 tok/s) step 8791/76294 | train loss 3.252666 | norm 0.1322 | lr 7.80e-04 | (9897.25 ms | 52973 tok/s) step 8792/76294 | train loss 3.248982 | norm 0.1267 | lr 7.80e-04 | (9940.35 ms | 52743 tok/s) step 8793/76294 | train loss 3.226439 | norm 0.1178 | lr 7.80e-04 | (9894.20 ms | 52989 tok/s) step 8794/76294 | train loss 3.187085 | norm 0.1178 | lr 7.80e-04 | (9896.60 ms | 52977 tok/s) step 8795/76294 | train loss 3.296756 | norm 0.1299 | lr 7.80e-04 | (9894.98 ms | 52985 tok/s) step 8796/76294 | train loss 3.258462 | norm 0.1460 | lr 7.80e-04 | (9897.07 ms | 52974 tok/s) step 8797/76294 | train loss 3.220918 | norm 0.1253 | lr 7.80e-04 | (9895.09 ms | 52985 tok/s) step 8798/76294 | train loss 3.235447 | norm 0.1426 | lr 7.80e-04 | (9893.67 ms | 52992 tok/s) step 8799/76294 | train loss 3.211307 | norm 0.1359 | lr 7.80e-04 | (9907.49 ms | 52918 tok/s) step 8800/76294 | train loss 3.280638 | norm 0.1457 | lr 7.79e-04 | (9892.17 ms | 53000 tok/s) step 8801/76294 | train loss 3.292847 | norm 0.1253 | lr 7.79e-04 | (9897.95 ms | 52969 tok/s) step 8802/76294 | train loss 3.277174 | norm 0.1309 | lr 7.79e-04 | (9928.73 ms | 52805 tok/s) step 8803/76294 | train loss 3.258719 | norm 
0.1370 | lr 7.79e-04 | (9893.98 ms | 52991 tok/s) step 8804/76294 | train loss 3.248991 | norm 0.1383 | lr 7.79e-04 | (9898.99 ms | 52964 tok/s) step 8805/76294 | train loss 3.243365 | norm 0.1381 | lr 7.79e-04 | (9893.43 ms | 52994 tok/s) step 8806/76294 | train loss 3.233341 | norm 0.1426 | lr 7.79e-04 | (10032.91 ms | 52257 tok/s) step 8807/76294 | train loss 3.189058 | norm 0.1554 | lr 7.79e-04 | (9894.31 ms | 52989 tok/s) step 8808/76294 | train loss 3.147774 | norm 0.1300 | lr 7.79e-04 | (9938.50 ms | 52753 tok/s) step 8809/76294 | train loss 3.225710 | norm 0.1451 | lr 7.79e-04 | (9895.20 ms | 52984 tok/s) step 8810/76294 | train loss 3.198882 | norm 0.1240 | lr 7.79e-04 | (9894.59 ms | 52987 tok/s) step 8811/76294 | train loss 3.208488 | norm 0.1395 | lr 7.79e-04 | (9897.38 ms | 52972 tok/s) step 8812/76294 | train loss 3.173462 | norm 0.1363 | lr 7.78e-04 | (9893.45 ms | 52993 tok/s) step 8813/76294 | train loss 3.247651 | norm 0.1371 | lr 7.78e-04 | (9894.65 ms | 52987 tok/s) step 8814/76294 | train loss 3.238769 | norm 0.1415 | lr 7.78e-04 | (9899.39 ms | 52962 tok/s) step 8815/76294 | train loss 3.272046 | norm 0.1246 | lr 7.78e-04 | (9897.60 ms | 52971 tok/s) step 8816/76294 | train loss 3.228308 | norm 0.1310 | lr 7.78e-04 | (9945.03 ms | 52719 tok/s) step 8817/76294 | train loss 3.253174 | norm 0.1327 | lr 7.78e-04 | (9899.70 ms | 52960 tok/s) step 8818/76294 | train loss 3.224972 | norm 0.1180 | lr 7.78e-04 | (9893.69 ms | 52992 tok/s) step 8819/76294 | train loss 3.200851 | norm 0.1262 | lr 7.78e-04 | (9893.46 ms | 52993 tok/s) step 8820/76294 | train loss 3.231355 | norm 0.1202 | lr 7.78e-04 | (9890.83 ms | 53007 tok/s) step 8821/76294 | train loss 3.229631 | norm 0.1407 | lr 7.78e-04 | (9894.58 ms | 52987 tok/s) step 8822/76294 | train loss 3.214769 | norm 0.1135 | lr 7.78e-04 | (9892.00 ms | 53001 tok/s) step 8823/76294 | train loss 3.226466 | norm 0.1431 | lr 7.77e-04 | (9894.99 ms | 52985 tok/s) step 8824/76294 | train loss 3.259245 | norm 
0.1438 | lr 7.77e-04 | (9918.54 ms | 52859 tok/s) step 8825/76294 | train loss 3.270411 | norm 0.1224 | lr 7.77e-04 | (9892.90 ms | 52996 tok/s) step 8826/76294 | train loss 3.201177 | norm 0.1432 | lr 7.77e-04 | (9961.39 ms | 52632 tok/s) step 8827/76294 | train loss 3.213247 | norm 0.1140 | lr 7.77e-04 | (9898.92 ms | 52964 tok/s) step 8828/76294 | train loss 3.252519 | norm 0.1326 | lr 7.77e-04 | (9941.29 ms | 52738 tok/s) step 8829/76294 | train loss 3.332335 | norm 0.1302 | lr 7.77e-04 | (9926.08 ms | 52819 tok/s) step 8830/76294 | train loss 3.242314 | norm 0.1361 | lr 7.77e-04 | (10407.97 ms | 50374 tok/s) step 8831/76294 | train loss 3.225274 | norm 0.1321 | lr 7.77e-04 | (9886.29 ms | 53032 tok/s) step 8832/76294 | train loss 3.206139 | norm 0.1514 | lr 7.77e-04 | (9911.12 ms | 52899 tok/s) step 8833/76294 | train loss 3.293425 | norm 0.1258 | lr 7.77e-04 | (9888.53 ms | 53020 tok/s) step 8834/76294 | train loss 3.264015 | norm 0.1522 | lr 7.77e-04 | (9949.15 ms | 52697 tok/s) step 8835/76294 | train loss 3.246487 | norm 0.1376 | lr 7.76e-04 | (9894.24 ms | 52989 tok/s) step 8836/76294 | train loss 3.276344 | norm 0.1448 | lr 7.76e-04 | (9931.28 ms | 52792 tok/s) step 8837/76294 | train loss 3.206662 | norm 0.1355 | lr 7.76e-04 | (9951.01 ms | 52687 tok/s) step 8838/76294 | train loss 3.209014 | norm 0.1470 | lr 7.76e-04 | (9952.28 ms | 52680 tok/s) step 8839/76294 | train loss 3.228323 | norm 0.1245 | lr 7.76e-04 | (9888.62 ms | 53019 tok/s) step 8840/76294 | train loss 3.194018 | norm 0.1725 | lr 7.76e-04 | (9903.32 ms | 52941 tok/s) step 8841/76294 | train loss 3.231317 | norm 0.1308 | lr 7.76e-04 | (9910.90 ms | 52900 tok/s) step 8842/76294 | train loss 3.295546 | norm 0.1475 | lr 7.76e-04 | (9889.90 ms | 53012 tok/s) step 8843/76294 | train loss 3.205650 | norm 0.1174 | lr 7.76e-04 | (9894.54 ms | 52988 tok/s) step 8844/76294 | train loss 3.261681 | norm 0.1351 | lr 7.76e-04 | (9924.28 ms | 52829 tok/s) step 8845/76294 | train loss 3.175875 | norm 
0.1087 | lr 7.76e-04 | (9983.54 ms | 52515 tok/s) step 8846/76294 | train loss 3.237301 | norm 0.1384 | lr 7.75e-04 | (9888.16 ms | 53022 tok/s) step 8847/76294 | train loss 3.210582 | norm 0.1289 | lr 7.75e-04 | (9955.02 ms | 52666 tok/s) step 8848/76294 | train loss 3.222074 | norm 0.1391 | lr 7.75e-04 | (9890.74 ms | 53008 tok/s) step 8849/76294 | train loss 3.248575 | norm 0.1404 | lr 7.75e-04 | (9967.45 ms | 52600 tok/s) step 8850/76294 | train loss 3.215104 | norm 0.1388 | lr 7.75e-04 | (9893.60 ms | 52993 tok/s) step 8851/76294 | train loss 3.205907 | norm 0.1252 | lr 7.75e-04 | (9900.57 ms | 52955 tok/s) step 8852/76294 | train loss 3.200001 | norm 0.1374 | lr 7.75e-04 | (9906.00 ms | 52926 tok/s) step 8853/76294 | train loss 3.237557 | norm 0.1291 | lr 7.75e-04 | (9896.51 ms | 52977 tok/s) step 8854/76294 | train loss 3.249623 | norm 0.1363 | lr 7.75e-04 | (9888.41 ms | 53020 tok/s) step 8855/76294 | train loss 3.243381 | norm 0.1290 | lr 7.75e-04 | (9896.84 ms | 52975 tok/s) step 8856/76294 | train loss 3.207438 | norm 0.1231 | lr 7.75e-04 | (9888.93 ms | 53018 tok/s) step 8857/76294 | train loss 3.310801 | norm 0.1214 | lr 7.74e-04 | (9896.87 ms | 52975 tok/s) step 8858/76294 | train loss 3.176692 | norm 0.1322 | lr 7.74e-04 | (9890.53 ms | 53009 tok/s) step 8859/76294 | train loss 3.191359 | norm 0.1270 | lr 7.74e-04 | (9960.93 ms | 52634 tok/s) step 8860/76294 | train loss 3.122300 | norm 0.1197 | lr 7.74e-04 | (9895.32 ms | 52983 tok/s) step 8861/76294 | train loss 3.208812 | norm 0.1227 | lr 7.74e-04 | (9894.90 ms | 52986 tok/s) step 8862/76294 | train loss 3.245746 | norm 0.1194 | lr 7.74e-04 | (9968.25 ms | 52596 tok/s) step 8863/76294 | train loss 3.264481 | norm 0.1195 | lr 7.74e-04 | (9892.51 ms | 52998 tok/s) step 8864/76294 | train loss 3.173752 | norm 0.1127 | lr 7.74e-04 | (9890.87 ms | 53007 tok/s) step 8865/76294 | train loss 3.243992 | norm 0.1203 | lr 7.74e-04 | (9892.06 ms | 53001 tok/s) step 8866/76294 | train loss 3.233597 | norm 
0.1265 | lr 7.74e-04 | (9904.28 ms | 52936 tok/s) step 8867/76294 | train loss 3.175753 | norm 0.1158 | lr 7.74e-04 | (9893.27 ms | 52994 tok/s) step 8868/76294 | train loss 3.257764 | norm 0.1277 | lr 7.74e-04 | (9895.59 ms | 52982 tok/s) step 8869/76294 | train loss 3.236708 | norm 0.1398 | lr 7.73e-04 | (9893.50 ms | 52993 tok/s) step 8870/76294 | train loss 3.254170 | norm 0.1234 | lr 7.73e-04 | (9891.65 ms | 53003 tok/s) step 8871/76294 | train loss 3.233767 | norm 0.1244 | lr 7.73e-04 | (9893.58 ms | 52993 tok/s) step 8872/76294 | train loss 3.253428 | norm 0.1501 | lr 7.73e-04 | (9915.60 ms | 52875 tok/s) step 8873/76294 | train loss 3.224095 | norm 0.1445 | lr 7.73e-04 | (9911.18 ms | 52899 tok/s) step 8874/76294 | train loss 3.207806 | norm 0.1345 | lr 7.73e-04 | (9894.06 ms | 52990 tok/s) step 8875/76294 | train loss 3.286093 | norm 0.1278 | lr 7.73e-04 | (9889.36 ms | 53015 tok/s) step 8876/76294 | train loss 3.219426 | norm 0.1427 | lr 7.73e-04 | (9898.44 ms | 52967 tok/s) step 8877/76294 | train loss 3.237215 | norm 0.1316 | lr 7.73e-04 | (9888.68 ms | 53019 tok/s) step 8878/76294 | train loss 3.254145 | norm 0.1254 | lr 7.73e-04 | (9889.14 ms | 53017 tok/s) step 8879/76294 | train loss 3.218672 | norm 0.1239 | lr 7.73e-04 | (9887.73 ms | 53024 tok/s) step 8880/76294 | train loss 3.198489 | norm 0.1155 | lr 7.72e-04 | (10344.19 ms | 50684 tok/s) step 8881/76294 | train loss 3.295071 | norm 0.1241 | lr 7.72e-04 | (9950.38 ms | 52690 tok/s) step 8882/76294 | train loss 3.268894 | norm 0.1329 | lr 7.72e-04 | (9881.36 ms | 53058 tok/s) step 8883/76294 | train loss 3.246225 | norm 0.1257 | lr 7.72e-04 | (9884.37 ms | 53042 tok/s) step 8884/76294 | train loss 3.202150 | norm 0.1156 | lr 7.72e-04 | (11119.17 ms | 47152 tok/s) step 8885/76294 | train loss 3.233711 | norm 0.1327 | lr 7.72e-04 | (9882.11 ms | 53054 tok/s) step 8886/76294 | train loss 3.235138 | norm 0.1145 | lr 7.72e-04 | (9872.76 ms | 53105 tok/s) step 8887/76294 | train loss 3.226477 | norm 
0.1211 | lr 7.72e-04 | (9888.48 ms | 53020 tok/s) step 8888/76294 | train loss 3.282215 | norm 0.1230 | lr 7.72e-04 | (9883.26 ms | 53048 tok/s) step 8889/76294 | train loss 3.237774 | norm 0.1221 | lr 7.72e-04 | (9928.52 ms | 52806 tok/s) step 8890/76294 | train loss 3.228935 | norm 0.1303 | lr 7.72e-04 | (9885.00 ms | 53039 tok/s) step 8891/76294 | train loss 3.262664 | norm 0.1275 | lr 7.71e-04 | (9890.50 ms | 53009 tok/s) step 8892/76294 | train loss 3.231093 | norm 0.1282 | lr 7.71e-04 | (9880.59 ms | 53062 tok/s) step 8893/76294 | train loss 3.209891 | norm 0.1231 | lr 7.71e-04 | (9967.59 ms | 52599 tok/s) step 8894/76294 | train loss 3.144481 | norm 0.1456 | lr 7.71e-04 | (9882.02 ms | 53055 tok/s) step 8895/76294 | train loss 3.232128 | norm 0.1346 | lr 7.71e-04 | (9945.74 ms | 52715 tok/s) step 8896/76294 | train loss 3.284495 | norm 0.1411 | lr 7.71e-04 | (9926.29 ms | 52818 tok/s) step 8897/76294 | train loss 3.196380 | norm 0.1348 | lr 7.71e-04 | (9887.16 ms | 53027 tok/s) step 8898/76294 | train loss 3.230969 | norm 0.1223 | lr 7.71e-04 | (9918.91 ms | 52857 tok/s) step 8899/76294 | train loss 3.223955 | norm 0.1409 | lr 7.71e-04 | (9905.92 ms | 52927 tok/s) step 8900/76294 | train loss 3.169528 | norm 0.1187 | lr 7.71e-04 | (9889.05 ms | 53017 tok/s) step 8901/76294 | train loss 3.246759 | norm 0.1369 | lr 7.71e-04 | (9909.28 ms | 52909 tok/s) step 8902/76294 | train loss 3.288421 | norm 0.1181 | lr 7.71e-04 | (9886.22 ms | 53032 tok/s) step 8903/76294 | train loss 3.268118 | norm 0.1218 | lr 7.70e-04 | (9926.13 ms | 52819 tok/s) step 8904/76294 | train loss 3.282628 | norm 0.1301 | lr 7.70e-04 | (9882.42 ms | 53053 tok/s) step 8905/76294 | train loss 3.328929 | norm 0.1357 | lr 7.70e-04 | (9977.49 ms | 52547 tok/s) step 8906/76294 | train loss 3.228157 | norm 0.1210 | lr 7.70e-04 | (9888.57 ms | 53020 tok/s) step 8907/76294 | train loss 3.255281 | norm 0.1274 | lr 7.70e-04 | (9888.12 ms | 53022 tok/s) step 8908/76294 | train loss 3.233159 | norm 
0.1278 | lr 7.70e-04 | (9928.54 ms | 52806 tok/s) step 8909/76294 | train loss 3.259300 | norm 0.1494 | lr 7.70e-04 | (9886.80 ms | 53029 tok/s) step 8910/76294 | train loss 3.313861 | norm 0.1288 | lr 7.70e-04 | (9893.64 ms | 52992 tok/s) step 8911/76294 | train loss 3.222887 | norm 0.1672 | lr 7.70e-04 | (9889.80 ms | 53013 tok/s) step 8912/76294 | train loss 3.270981 | norm 0.1580 | lr 7.70e-04 | (9889.26 ms | 53016 tok/s) step 8913/76294 | train loss 3.269285 | norm 0.1353 | lr 7.70e-04 | (9887.90 ms | 53023 tok/s) step 8914/76294 | train loss 3.185150 | norm 0.1628 | lr 7.69e-04 | (9917.98 ms | 52862 tok/s) step 8915/76294 | train loss 3.240917 | norm 0.1308 | lr 7.69e-04 | (9912.06 ms | 52894 tok/s) step 8916/76294 | train loss 3.263149 | norm 0.1475 | lr 7.69e-04 | (9894.85 ms | 52986 tok/s) step 8917/76294 | train loss 3.198687 | norm 0.1339 | lr 7.69e-04 | (9892.72 ms | 52997 tok/s) step 8918/76294 | train loss 3.218528 | norm 0.1362 | lr 7.69e-04 | (9927.75 ms | 52810 tok/s) step 8919/76294 | train loss 3.269803 | norm 0.1327 | lr 7.69e-04 | (9885.79 ms | 53035 tok/s) step 8920/76294 | train loss 3.232419 | norm 0.1348 | lr 7.69e-04 | (9901.06 ms | 52953 tok/s) step 8921/76294 | train loss 3.251751 | norm 0.1359 | lr 7.69e-04 | (9901.75 ms | 52949 tok/s) step 8922/76294 | train loss 3.254923 | norm 0.1286 | lr 7.69e-04 | (9891.11 ms | 53006 tok/s) step 8923/76294 | train loss 3.211573 | norm 0.1130 | lr 7.69e-04 | (9886.97 ms | 53028 tok/s) step 8924/76294 | train loss 3.265821 | norm 0.1298 | lr 7.69e-04 | (9898.41 ms | 52967 tok/s) step 8925/76294 | train loss 3.248600 | norm 0.1278 | lr 7.68e-04 | (9894.05 ms | 52990 tok/s) step 8926/76294 | train loss 3.298518 | norm 0.1288 | lr 7.68e-04 | (9891.35 ms | 53005 tok/s) step 8927/76294 | train loss 3.247580 | norm 0.1210 | lr 7.68e-04 | (9890.42 ms | 53010 tok/s) step 8928/76294 | train loss 3.261858 | norm 0.1170 | lr 7.68e-04 | (9892.09 ms | 53001 tok/s) step 8929/76294 | train loss 3.311778 | norm 
0.1252 | lr 7.68e-04 | (9893.09 ms | 52995 tok/s) step 8930/76294 | train loss 3.283084 | norm 0.1299 | lr 7.68e-04 | (9888.09 ms | 53022 tok/s) step 8931/76294 | train loss 3.299196 | norm 0.1150 | lr 7.68e-04 | (9888.63 ms | 53019 tok/s) step 8932/76294 | train loss 3.272606 | norm 0.1157 | lr 7.68e-04 | (9960.54 ms | 52636 tok/s) step 8933/76294 | train loss 3.243992 | norm 0.1357 | lr 7.68e-04 | (9886.94 ms | 53028 tok/s) step 8934/76294 | train loss 3.216393 | norm 0.1209 | lr 7.68e-04 | (9890.17 ms | 53011 tok/s) step 8935/76294 | train loss 3.284461 | norm 0.1342 | lr 7.68e-04 | (9897.99 ms | 52969 tok/s) step 8936/76294 | train loss 3.241541 | norm 0.1290 | lr 7.68e-04 | (9908.08 ms | 52915 tok/s) step 8937/76294 | train loss 3.193809 | norm 0.1159 | lr 7.67e-04 | (9895.40 ms | 52983 tok/s) step 8938/76294 | train loss 3.327275 | norm 0.1390 | lr 7.67e-04 | (9887.18 ms | 53027 tok/s) step 8939/76294 | train loss 3.268273 | norm 0.1303 | lr 7.67e-04 | (9929.53 ms | 52801 tok/s) step 8940/76294 | train loss 3.239334 | norm 0.1371 | lr 7.67e-04 | (9887.77 ms | 53024 tok/s) step 8941/76294 | train loss 3.236725 | norm 0.1323 | lr 7.67e-04 | (9890.59 ms | 53009 tok/s) step 8942/76294 | train loss 3.247339 | norm 0.1233 | lr 7.67e-04 | (9954.04 ms | 52671 tok/s) step 8943/76294 | train loss 3.258210 | norm 0.1273 | lr 7.67e-04 | (9903.02 ms | 52942 tok/s) step 8944/76294 | train loss 3.257246 | norm 0.1273 | lr 7.67e-04 | (9902.56 ms | 52945 tok/s) step 8945/76294 | train loss 3.300375 | norm 0.1385 | lr 7.67e-04 | (9886.86 ms | 53029 tok/s) step 8946/76294 | train loss 3.202839 | norm 0.1294 | lr 7.67e-04 | (9907.80 ms | 52917 tok/s) step 8947/76294 | train loss 3.207146 | norm 0.1229 | lr 7.67e-04 | (9888.86 ms | 53018 tok/s) step 8948/76294 | train loss 3.281891 | norm 0.1342 | lr 7.66e-04 | (9890.47 ms | 53009 tok/s) step 8949/76294 | train loss 3.218386 | norm 0.1352 | lr 7.66e-04 | (9901.25 ms | 52952 tok/s) step 8950/76294 | train loss 3.263483 | norm 
0.1189 | lr 7.66e-04 | (9957.55 ms | 52652 tok/s) step 8951/76294 | train loss 3.293921 | norm 0.1473 | lr 7.66e-04 | (9888.31 ms | 53021 tok/s) step 8952/76294 | train loss 3.192501 | norm 0.1338 | lr 7.66e-04 | (9958.26 ms | 52649 tok/s) step 8953/76294 | train loss 3.365619 | norm 0.1208 | lr 7.66e-04 | (9963.39 ms | 52621 tok/s) step 8954/76294 | train loss 3.278940 | norm 0.1291 | lr 7.66e-04 | (9894.25 ms | 52989 tok/s) step 8955/76294 | train loss 3.251420 | norm 0.1255 | lr 7.66e-04 | (9889.09 ms | 53017 tok/s) step 8956/76294 | train loss 3.265207 | norm 0.1373 | lr 7.66e-04 | (9898.99 ms | 52964 tok/s) step 8957/76294 | train loss 3.193368 | norm 0.1337 | lr 7.66e-04 | (9920.98 ms | 52846 tok/s) step 8958/76294 | train loss 3.269829 | norm 0.1240 | lr 7.66e-04 | (9885.48 ms | 53036 tok/s) step 8959/76294 | train loss 3.216639 | norm 0.1491 | lr 7.65e-04 | (9894.81 ms | 52986 tok/s) step 8960/76294 | train loss 3.327252 | norm 0.1315 | lr 7.65e-04 | (9889.42 ms | 53015 tok/s) step 8961/76294 | train loss 3.359436 | norm 0.1448 | lr 7.65e-04 | (9897.54 ms | 52972 tok/s) step 8962/76294 | train loss 3.239880 | norm 0.1349 | lr 7.65e-04 | (9887.51 ms | 53025 tok/s) step 8963/76294 | train loss 3.281542 | norm 0.1305 | lr 7.65e-04 | (9904.42 ms | 52935 tok/s) step 8964/76294 | train loss 3.241313 | norm 0.1408 | lr 7.65e-04 | (9884.26 ms | 53043 tok/s) step 8965/76294 | train loss 3.249890 | norm 0.1233 | lr 7.65e-04 | (9889.54 ms | 53014 tok/s) step 8966/76294 | train loss 3.254470 | norm 0.1283 | lr 7.65e-04 | (9891.91 ms | 53002 tok/s) step 8967/76294 | train loss 3.289304 | norm 0.1202 | lr 7.65e-04 | (9887.89 ms | 53023 tok/s) step 8968/76294 | train loss 3.202802 | norm 0.1387 | lr 7.65e-04 | (9923.42 ms | 52833 tok/s) step 8969/76294 | train loss 3.240792 | norm 0.1279 | lr 7.65e-04 | (9902.39 ms | 52946 tok/s) step 8970/76294 | train loss 3.267538 | norm 0.1488 | lr 7.65e-04 | (9891.73 ms | 53003 tok/s) step 8971/76294 | train loss 3.204177 | norm 
0.1182 | lr 7.64e-04 | (9922.48 ms | 52838 tok/s) step 8972/76294 | train loss 3.245328 | norm 0.1552 | lr 7.64e-04 | (9885.41 ms | 53037 tok/s) step 8973/76294 | train loss 3.259848 | norm 0.1419 | lr 7.64e-04 | (9895.12 ms | 52984 tok/s) step 8974/76294 | train loss 3.204944 | norm 0.1252 | lr 7.64e-04 | (9885.02 ms | 53039 tok/s) step 8975/76294 | train loss 3.239218 | norm 0.1338 | lr 7.64e-04 | (9892.15 ms | 53000 tok/s) step 8976/76294 | train loss 3.264680 | norm 0.1148 | lr 7.64e-04 | (9887.34 ms | 53026 tok/s) step 8977/76294 | train loss 3.221380 | norm 0.1228 | lr 7.64e-04 | (9897.21 ms | 52973 tok/s) step 8978/76294 | train loss 3.235782 | norm 0.1235 | lr 7.64e-04 | (9883.30 ms | 53048 tok/s) step 8979/76294 | train loss 3.463598 | norm 0.1361 | lr 7.64e-04 | (9896.35 ms | 52978 tok/s) step 8980/76294 | train loss 3.253244 | norm 0.1209 | lr 7.64e-04 | (9884.69 ms | 53040 tok/s) step 8981/76294 | train loss 3.236289 | norm 0.1237 | lr 7.64e-04 | (11239.65 ms | 46646 tok/s) step 8982/76294 | train loss 3.265938 | norm 0.1243 | lr 7.63e-04 | (9876.77 ms | 53083 tok/s) step 8983/76294 | train loss 3.217446 | norm 0.1235 | lr 7.63e-04 | (9880.04 ms | 53065 tok/s) step 8984/76294 | train loss 3.237464 | norm 0.1292 | lr 7.63e-04 | (9877.95 ms | 53077 tok/s) step 8985/76294 | train loss 3.276748 | norm 0.1264 | lr 7.63e-04 | (9889.75 ms | 53013 tok/s) step 8986/76294 | train loss 3.257567 | norm 0.1229 | lr 7.63e-04 | (9887.54 ms | 53025 tok/s) step 8987/76294 | train loss 3.260447 | norm 0.1322 | lr 7.63e-04 | (9881.04 ms | 53060 tok/s) step 8988/76294 | train loss 3.261029 | norm 0.1172 | lr 7.63e-04 | (9892.07 ms | 53001 tok/s) step 8989/76294 | train loss 3.229093 | norm 0.1410 | lr 7.63e-04 | (9886.63 ms | 53030 tok/s) step 8990/76294 | train loss 3.188813 | norm 0.1113 | lr 7.63e-04 | (9905.15 ms | 52931 tok/s) step 8991/76294 | train loss 3.218515 | norm 0.1428 | lr 7.63e-04 | (9893.00 ms | 52996 tok/s) step 8992/76294 | train loss 3.154270 | norm 
0.1459 | lr 7.63e-04 | (9878.42 ms | 53074 tok/s) step 8993/76294 | train loss 3.250906 | norm 0.1352 | lr 7.62e-04 | (9905.29 ms | 52930 tok/s) step 8994/76294 | train loss 3.287385 | norm 0.1350 | lr 7.62e-04 | (9928.53 ms | 52806 tok/s) step 8995/76294 | train loss 3.294060 | norm 0.1267 | lr 7.62e-04 | (9881.58 ms | 53057 tok/s) step 8996/76294 | train loss 3.206676 | norm 0.1326 | lr 7.62e-04 | (9895.68 ms | 52982 tok/s) step 8997/76294 | train loss 3.215154 | norm 0.1223 | lr 7.62e-04 | (9886.21 ms | 53032 tok/s) step 8998/76294 | train loss 3.323959 | norm 0.1372 | lr 7.62e-04 | (9896.41 ms | 52978 tok/s) step 8999/76294 | train loss 3.280907 | norm 0.1535 | lr 7.62e-04 | (9951.92 ms | 52682 tok/s) step 9000/76294 | train loss 3.277951 | norm 0.1275 | lr 7.62e-04 | (9881.98 ms | 53055 tok/s) val loss: 3.222470 saving model checkpoint to ./results/gpt2-350M-gqa/step_9000.pth step 9001/76294 | train loss 3.309733 | norm 0.1280 | lr 7.62e-04 | (10068.01 ms | 52075 tok/s) step 9002/76294 | train loss 3.262953 | norm 0.1185 | lr 7.62e-04 | (9864.04 ms | 53151 tok/s) step 9003/76294 | train loss 3.244890 | norm 0.1384 | lr 7.62e-04 | (9908.47 ms | 52913 tok/s) step 9004/76294 | train loss 3.273940 | norm 0.1198 | lr 7.62e-04 | (9877.88 ms | 53077 tok/s) step 9005/76294 | train loss 3.325740 | norm 0.1340 | lr 7.61e-04 | (9891.06 ms | 53006 tok/s) step 9006/76294 | train loss 3.189543 | norm 0.1171 | lr 7.61e-04 | (9870.86 ms | 53115 tok/s) step 9007/76294 | train loss 3.274642 | norm 0.1193 | lr 7.61e-04 | (9888.26 ms | 53021 tok/s) step 9008/76294 | train loss 3.292519 | norm 0.1295 | lr 7.61e-04 | (9874.28 ms | 53096 tok/s) step 9009/76294 | train loss 3.245203 | norm 0.1225 | lr 7.61e-04 | (10232.11 ms | 51239 tok/s) step 9010/76294 | train loss 3.270291 | norm 0.1273 | lr 7.61e-04 | (10099.70 ms | 51911 tok/s) step 9011/76294 | train loss 3.232493 | norm 0.1231 | lr 7.61e-04 | (9883.93 ms | 53045 tok/s) step 9012/76294 | train loss 3.249187 | norm 0.1383 | lr 
7.61e-04 | (9874.60 ms | 53095 tok/s) step 9013/76294 | train loss 3.205230 | norm 0.1230 | lr 7.61e-04 | (9887.64 ms | 53025 tok/s) step 9014/76294 | train loss 3.253542 | norm 0.1245 | lr 7.61e-04 | (9878.23 ms | 53075 tok/s) step 9015/76294 | train loss 3.270867 | norm 0.1355 | lr 7.61e-04 | (9895.32 ms | 52983 tok/s) step 9016/76294 | train loss 3.212770 | norm 0.1389 | lr 7.60e-04 | (9898.69 ms | 52965 tok/s) step 9017/76294 | train loss 3.196436 | norm 0.1221 | lr 7.60e-04 | (9890.44 ms | 53010 tok/s) step 9018/76294 | train loss 3.238218 | norm 0.1356 | lr 7.60e-04 | (10791.06 ms | 48585 tok/s) step 9019/76294 | train loss 3.249962 | norm 0.1451 | lr 7.60e-04 | (9888.31 ms | 53021 tok/s) step 9020/76294 | train loss 3.240925 | norm 0.1262 | lr 7.60e-04 | (9889.65 ms | 53014 tok/s) step 9021/76294 | train loss 3.211007 | norm 0.1436 | lr 7.60e-04 | (9884.05 ms | 53044 tok/s) step 9022/76294 | train loss 3.238358 | norm 0.1440 | lr 7.60e-04 | (9888.29 ms | 53021 tok/s) step 9023/76294 | train loss 3.230040 | norm 0.1268 | lr 7.60e-04 | (9889.13 ms | 53017 tok/s) step 9024/76294 | train loss 3.246730 | norm 0.1505 | lr 7.60e-04 | (10856.39 ms | 48293 tok/s) step 9025/76294 | train loss 3.332565 | norm 0.1381 | lr 7.60e-04 | (9911.46 ms | 52897 tok/s) step 9026/76294 | train loss 3.259914 | norm 0.1529 | lr 7.60e-04 | (9881.31 ms | 53059 tok/s) step 9027/76294 | train loss 3.271175 | norm 0.1378 | lr 7.59e-04 | (9883.45 ms | 53047 tok/s) step 9028/76294 | train loss 3.248134 | norm 0.1473 | lr 7.59e-04 | (9989.58 ms | 52483 tok/s) step 9029/76294 | train loss 3.290134 | norm 0.1187 | lr 7.59e-04 | (9884.37 ms | 53042 tok/s) step 9030/76294 | train loss 3.283577 | norm 0.1507 | lr 7.59e-04 | (9955.92 ms | 52661 tok/s) step 9031/76294 | train loss 3.309642 | norm 0.1335 | lr 7.59e-04 | (9891.86 ms | 53002 tok/s) step 9032/76294 | train loss 3.192143 | norm 0.1662 | lr 7.59e-04 | (9906.84 ms | 52922 tok/s) step 9033/76294 | train loss 3.240278 | norm 0.1239 | lr 
7.59e-04 | (9925.86 ms | 52820 tok/s) step 9034/76294 | train loss 3.253190 | norm 0.1295 | lr 7.59e-04 | (9888.46 ms | 53020 tok/s) step 9035/76294 | train loss 3.194235 | norm 0.1401 | lr 7.59e-04 | (9895.47 ms | 52983 tok/s) step 9036/76294 | train loss 3.232466 | norm 0.1218 | lr 7.59e-04 | (9887.36 ms | 53026 tok/s) step 9037/76294 | train loss 3.267088 | norm 0.1271 | lr 7.59e-04 | (9900.19 ms | 52957 tok/s) step 9038/76294 | train loss 3.168574 | norm 0.1248 | lr 7.59e-04 | (9950.81 ms | 52688 tok/s) step 9039/76294 | train loss 3.260620 | norm 0.1373 | lr 7.58e-04 | (9957.86 ms | 52651 tok/s) step 9040/76294 | train loss 3.217421 | norm 0.1241 | lr 7.58e-04 | (9890.85 ms | 53007 tok/s) step 9041/76294 | train loss 3.204060 | norm 0.1300 | lr 7.58e-04 | (9985.97 ms | 52502 tok/s) step 9042/76294 | train loss 3.270940 | norm 0.1278 | lr 7.58e-04 | (9892.68 ms | 52998 tok/s) step 9043/76294 | train loss 3.259071 | norm 0.1208 | lr 7.58e-04 | (9932.82 ms | 52783 tok/s) step 9044/76294 | train loss 3.267734 | norm 0.1321 | lr 7.58e-04 | (9892.36 ms | 52999 tok/s) step 9045/76294 | train loss 3.267025 | norm 0.1743 | lr 7.58e-04 | (9896.66 ms | 52976 tok/s) step 9046/76294 | train loss 3.438739 | norm 0.1891 | lr 7.58e-04 | (9915.67 ms | 52875 tok/s) step 9047/76294 | train loss 3.264381 | norm 0.1405 | lr 7.58e-04 | (9900.61 ms | 52955 tok/s) step 9048/76294 | train loss 3.266260 | norm 0.1591 | lr 7.58e-04 | (9898.68 ms | 52965 tok/s) step 9049/76294 | train loss 3.240435 | norm 0.1359 | lr 7.58e-04 | (9901.78 ms | 52949 tok/s) step 9050/76294 | train loss 3.267650 | norm 0.1552 | lr 7.57e-04 | (9901.07 ms | 52953 tok/s) step 9051/76294 | train loss 3.314984 | norm 0.1348 | lr 7.57e-04 | (9902.59 ms | 52945 tok/s) step 9052/76294 | train loss 3.230263 | norm 0.1533 | lr 7.57e-04 | (9902.00 ms | 52948 tok/s) step 9053/76294 | train loss 3.256792 | norm 0.1301 | lr 7.57e-04 | (9904.58 ms | 52934 tok/s) step 9054/76294 | train loss 3.245395 | norm 0.1404 | lr 
7.57e-04 | (9890.46 ms | 53009 tok/s) step 9055/76294 | train loss 3.190780 | norm 0.1120 | lr 7.57e-04 | (9898.19 ms | 52968 tok/s) step 9056/76294 | train loss 3.177465 | norm 0.1337 | lr 7.57e-04 | (9894.94 ms | 52985 tok/s) step 9057/76294 | train loss 3.268301 | norm 0.1258 | lr 7.57e-04 | (9901.59 ms | 52950 tok/s) step 9058/76294 | train loss 3.196215 | norm 0.1203 | lr 7.57e-04 | (9891.14 ms | 53006 tok/s) step 9059/76294 | train loss 3.189743 | norm 0.1227 | lr 7.57e-04 | (9902.61 ms | 52944 tok/s) step 9060/76294 | train loss 3.234061 | norm 0.1262 | lr 7.57e-04 | (9892.87 ms | 52997 tok/s) step 9061/76294 | train loss 3.309567 | norm 0.1210 | lr 7.56e-04 | (9903.64 ms | 52939 tok/s) step 9062/76294 | train loss 3.294636 | norm 0.1223 | lr 7.56e-04 | (9886.38 ms | 53031 tok/s) step 9063/76294 | train loss 3.275032 | norm 0.1225 | lr 7.56e-04 | (9937.64 ms | 52758 tok/s) step 9064/76294 | train loss 3.266699 | norm 0.1541 | lr 7.56e-04 | (9891.12 ms | 53006 tok/s) step 9065/76294 | train loss 3.246051 | norm 0.1192 | lr 7.56e-04 | (9910.43 ms | 52903 tok/s) step 9066/76294 | train loss 3.260974 | norm 0.1433 | lr 7.56e-04 | (9892.16 ms | 53000 tok/s) step 9067/76294 | train loss 3.227926 | norm 0.1317 | lr 7.56e-04 | (9905.77 ms | 52928 tok/s) step 9068/76294 | train loss 3.278543 | norm 0.1307 | lr 7.56e-04 | (9887.53 ms | 53025 tok/s) step 9069/76294 | train loss 3.236746 | norm 0.1239 | lr 7.56e-04 | (9933.60 ms | 52779 tok/s) step 9070/76294 | train loss 3.215950 | norm 0.1237 | lr 7.56e-04 | (9890.47 ms | 53009 tok/s) step 9071/76294 | train loss 3.275438 | norm 0.1275 | lr 7.56e-04 | (10615.80 ms | 49387 tok/s) step 9072/76294 | train loss 3.340404 | norm 0.1256 | lr 7.56e-04 | (9887.14 ms | 53027 tok/s) step 9073/76294 | train loss 3.279651 | norm 0.1165 | lr 7.55e-04 | (9953.83 ms | 52672 tok/s) step 9074/76294 | train loss 3.281735 | norm 0.1377 | lr 7.55e-04 | (9883.24 ms | 53048 tok/s) step 9075/76294 | train loss 3.318369 | norm 0.1531 | lr 
7.55e-04 | (9892.46 ms | 52999 tok/s) step 9076/76294 | train loss 3.255570 | norm 0.1373 | lr 7.55e-04 | (9887.67 ms | 53024 tok/s) step 9077/76294 | train loss 3.270499 | norm 0.1229 | lr 7.55e-04 | (9888.62 ms | 53019 tok/s) step 9078/76294 | train loss 3.250562 | norm 0.1360 | lr 7.55e-04 | (9890.82 ms | 53008 tok/s) step 9079/76294 | train loss 3.243952 | norm 0.1307 | lr 7.55e-04 | (11529.75 ms | 45473 tok/s) step 9080/76294 | train loss 3.272205 | norm 0.1524 | lr 7.55e-04 | (9887.51 ms | 53025 tok/s) step 9081/76294 | train loss 3.283458 | norm 0.1269 | lr 7.55e-04 | (10009.66 ms | 52378 tok/s) step 9082/76294 | train loss 3.343556 | norm 0.1405 | lr 7.55e-04 | (9879.70 ms | 53067 tok/s) step 9083/76294 | train loss 3.292635 | norm 0.1236 | lr 7.55e-04 | (9882.00 ms | 53055 tok/s) step 9084/76294 | train loss 3.223998 | norm 0.1419 | lr 7.54e-04 | (9886.37 ms | 53031 tok/s) step 9085/76294 | train loss 3.171131 | norm 0.1209 | lr 7.54e-04 | (9927.55 ms | 52811 tok/s) step 9086/76294 | train loss 3.252093 | norm 0.1360 | lr 7.54e-04 | (9881.66 ms | 53057 tok/s) step 9087/76294 | train loss 3.271634 | norm 0.1350 | lr 7.54e-04 | (9892.26 ms | 53000 tok/s) step 9088/76294 | train loss 3.262649 | norm 0.1169 | lr 7.54e-04 | (9881.14 ms | 53059 tok/s) step 9089/76294 | train loss 3.253106 | norm 0.1267 | lr 7.54e-04 | (9893.54 ms | 52993 tok/s) step 9090/76294 | train loss 3.254737 | norm 0.1149 | lr 7.54e-04 | (9885.78 ms | 53035 tok/s) step 9091/76294 | train loss 3.296459 | norm 0.1345 | lr 7.54e-04 | (9898.90 ms | 52964 tok/s) step 9092/76294 | train loss 3.220901 | norm 0.1189 | lr 7.54e-04 | (9974.53 ms | 52563 tok/s) step 9093/76294 | train loss 3.260040 | norm 0.1513 | lr 7.54e-04 | (9892.42 ms | 52999 tok/s) step 9094/76294 | train loss 3.239414 | norm 0.1215 | lr 7.54e-04 | (9889.92 ms | 53012 tok/s) step 9095/76294 | train loss 3.185039 | norm 0.1378 | lr 7.53e-04 | (9977.21 ms | 52549 tok/s) step 9096/76294 | train loss 3.269637 | norm 0.1300 | lr 
7.53e-04 | (9885.05 ms | 53038 tok/s) step 9097/76294 | train loss 3.447496 | norm 0.1369 | lr 7.53e-04 | (9939.04 ms | 52750 tok/s) step 9098/76294 | train loss 3.261297 | norm 0.1463 | lr 7.53e-04 | (9913.80 ms | 52885 tok/s) step 9099/76294 | train loss 3.196079 | norm 0.1264 | lr 7.53e-04 | (9894.74 ms | 52987 tok/s) step 9100/76294 | train loss 3.221668 | norm 0.1465 | lr 7.53e-04 | (9892.90 ms | 52996 tok/s) step 9101/76294 | train loss 3.196463 | norm 0.1453 | lr 7.53e-04 | (9890.93 ms | 53007 tok/s) step 9102/76294 | train loss 3.254089 | norm 0.1200 | lr 7.53e-04 | (9891.56 ms | 53004 tok/s) step 9103/76294 | train loss 3.191168 | norm 0.1425 | lr 7.53e-04 | (9888.23 ms | 53021 tok/s) step 9104/76294 | train loss 3.209574 | norm 0.1195 | lr 7.53e-04 | (9914.03 ms | 52883 tok/s) step 9105/76294 | train loss 3.238437 | norm 0.1506 | lr 7.53e-04 | (9892.65 ms | 52998 tok/s) step 9106/76294 | train loss 3.278377 | norm 0.1250 | lr 7.52e-04 | (9894.52 ms | 52988 tok/s) step 9107/76294 | train loss 3.252495 | norm 0.1337 | lr 7.52e-04 | (9890.16 ms | 53011 tok/s) step 9108/76294 | train loss 3.321683 | norm 0.1249 | lr 7.52e-04 | (9895.13 ms | 52984 tok/s) step 9109/76294 | train loss 3.263514 | norm 0.1400 | lr 7.52e-04 | (9890.92 ms | 53007 tok/s) step 9110/76294 | train loss 3.329727 | norm 0.1714 | lr 7.52e-04 | (9896.06 ms | 52979 tok/s) step 9111/76294 | train loss 3.245992 | norm 0.1356 | lr 7.52e-04 | (9894.22 ms | 52989 tok/s) step 9112/76294 | train loss 3.278821 | norm 0.1426 | lr 7.52e-04 | (9895.72 ms | 52981 tok/s) step 9113/76294 | train loss 3.250475 | norm 0.1802 | lr 7.52e-04 | (9892.47 ms | 52999 tok/s) step 9114/76294 | train loss 3.245235 | norm 0.1496 | lr 7.52e-04 | (9892.24 ms | 53000 tok/s) step 9115/76294 | train loss 3.253788 | norm 0.1377 | lr 7.52e-04 | (9889.21 ms | 53016 tok/s) step 9116/76294 | train loss 3.280836 | norm 0.1458 | lr 7.52e-04 | (9892.75 ms | 52997 tok/s) step 9117/76294 | train loss 3.243336 | norm 0.1293 | lr 
7.52e-04 | (9910.53 ms | 52902 tok/s) step 9118/76294 | train loss 3.280096 | norm 0.1297 | lr 7.51e-04 | (9889.62 ms | 53014 tok/s) step 9119/76294 | train loss 3.191755 | norm 0.1358 | lr 7.51e-04 | (9900.59 ms | 52955 tok/s) step 9120/76294 | train loss 3.173177 | norm 0.1316 | lr 7.51e-04 | (9951.68 ms | 52683 tok/s) step 9121/76294 | train loss 3.303302 | norm 0.1554 | lr 7.51e-04 | (9888.79 ms | 53018 tok/s) step 9122/76294 | train loss 3.249305 | norm 0.1281 | lr 7.51e-04 | (9951.44 ms | 52685 tok/s) step 9123/76294 | train loss 3.251683 | norm 0.1664 | lr 7.51e-04 | (9892.40 ms | 52999 tok/s) step 9124/76294 | train loss 3.218117 | norm 0.1636 | lr 7.51e-04 | (9899.71 ms | 52960 tok/s) step 9125/76294 | train loss 3.218798 | norm 0.1293 | lr 7.51e-04 | (9931.42 ms | 52791 tok/s) step 9126/76294 | train loss 3.272362 | norm 0.1449 | lr 7.51e-04 | (9889.55 ms | 53014 tok/s) step 9127/76294 | train loss 3.256195 | norm 0.1265 | lr 7.51e-04 | (9907.49 ms | 52918 tok/s) step 9128/76294 | train loss 3.253344 | norm 0.1374 | lr 7.51e-04 | (9951.19 ms | 52686 tok/s) step 9129/76294 | train loss 3.281213 | norm 0.1130 | lr 7.50e-04 | (9890.51 ms | 53009 tok/s) step 9130/76294 | train loss 3.267250 | norm 0.1252 | lr 7.50e-04 | (9921.00 ms | 52846 tok/s) step 9131/76294 | train loss 3.271752 | norm 0.1139 | lr 7.50e-04 | (9936.95 ms | 52761 tok/s) step 9132/76294 | train loss 3.256588 | norm 0.1150 | lr 7.50e-04 | (9884.66 ms | 53041 tok/s) step 9133/76294 | train loss 3.243786 | norm 0.1212 | lr 7.50e-04 | (9957.44 ms | 52653 tok/s) step 9134/76294 | train loss 3.298236 | norm 0.1173 | lr 7.50e-04 | (9928.36 ms | 52807 tok/s) step 9135/76294 | train loss 3.252360 | norm 0.1234 | lr 7.50e-04 | (9914.97 ms | 52878 tok/s) step 9136/76294 | train loss 3.250042 | norm 0.1173 | lr 7.50e-04 | (9885.98 ms | 53034 tok/s) step 9137/76294 | train loss 3.231893 | norm 0.1167 | lr 7.50e-04 | (9895.14 ms | 52984 tok/s) step 9138/76294 | train loss 3.235860 | norm 0.1205 | lr 
7.50e-04 | (9887.69 ms | 53024 tok/s) step 9139/76294 | train loss 3.206997 | norm 0.1296 | lr 7.50e-04 | (9897.52 ms | 52972 tok/s) step 9140/76294 | train loss 3.245912 | norm 0.1302 | lr 7.49e-04 | (9887.96 ms | 53023 tok/s) step 9141/76294 | train loss 3.209184 | norm 0.1274 | lr 7.49e-04 | (9898.12 ms | 52968 tok/s) step 9142/76294 | train loss 3.262173 | norm 0.1199 | lr 7.49e-04 | (9907.76 ms | 52917 tok/s) step 9143/76294 | train loss 3.270079 | norm 0.1417 | lr 7.49e-04 | (9899.52 ms | 52961 tok/s) step 9144/76294 | train loss 3.229309 | norm 0.1176 | lr 7.49e-04 | (9885.77 ms | 53035 tok/s) step 9145/76294 | train loss 3.200937 | norm 0.1399 | lr 7.49e-04 | (9895.68 ms | 52981 tok/s) step 9146/76294 | train loss 3.252604 | norm 0.1255 | lr 7.49e-04 | (9981.63 ms | 52525 tok/s) step 9147/76294 | train loss 3.239844 | norm 0.1315 | lr 7.49e-04 | (9891.98 ms | 53001 tok/s) step 9148/76294 | train loss 3.306409 | norm 0.1392 | lr 7.49e-04 | (9896.27 ms | 52978 tok/s) step 9149/76294 | train loss 3.238761 | norm 0.1309 | lr 7.49e-04 | (9892.49 ms | 52999 tok/s) step 9150/76294 | train loss 3.256961 | norm 0.1561 | lr 7.49e-04 | (9897.86 ms | 52970 tok/s) step 9151/76294 | train loss 3.256262 | norm 0.1255 | lr 7.48e-04 | (9890.64 ms | 53008 tok/s) step 9152/76294 | train loss 3.277888 | norm 0.1404 | lr 7.48e-04 | (9933.73 ms | 52779 tok/s) step 9153/76294 | train loss 3.251757 | norm 0.1377 | lr 7.48e-04 | (9891.34 ms | 53005 tok/s) step 9154/76294 | train loss 3.179832 | norm 0.1336 | lr 7.48e-04 | (9910.88 ms | 52900 tok/s) step 9155/76294 | train loss 3.253463 | norm 0.1567 | lr 7.48e-04 | (9889.57 ms | 53014 tok/s) step 9156/76294 | train loss 3.232383 | norm 0.1367 | lr 7.48e-04 | (9888.88 ms | 53018 tok/s) step 9157/76294 | train loss 3.257114 | norm 0.1329 | lr 7.48e-04 | (9893.54 ms | 52993 tok/s) step 9158/76294 | train loss 3.267631 | norm 0.1236 | lr 7.48e-04 | (9895.34 ms | 52983 tok/s) step 9159/76294 | train loss 3.346001 | norm 0.1378 | lr 
7.48e-04 | (9897.27 ms | 52973 tok/s) step 9160/76294 | train loss 3.260694 | norm 0.1216 | lr 7.48e-04 | (9900.63 ms | 52955 tok/s) step 9161/76294 | train loss 3.218199 | norm 0.1210 | lr 7.48e-04 | (9896.41 ms | 52978 tok/s) step 9162/76294 | train loss 3.237327 | norm 0.1336 | lr 7.48e-04 | (9897.60 ms | 52971 tok/s) step 9163/76294 | train loss 3.197570 | norm 0.1321 | lr 7.47e-04 | (9894.74 ms | 52987 tok/s) step 9164/76294 | train loss 3.242038 | norm 0.1336 | lr 7.47e-04 | (9893.92 ms | 52991 tok/s) step 9165/76294 | train loss 3.210105 | norm 0.1344 | lr 7.47e-04 | (9896.13 ms | 52979 tok/s) step 9166/76294 | train loss 3.256921 | norm 0.1220 | lr 7.47e-04 | (9905.14 ms | 52931 tok/s) step 9167/76294 | train loss 3.230471 | norm 0.1286 | lr 7.47e-04 | (9894.38 ms | 52988 tok/s) step 9168/76294 | train loss 3.255932 | norm 0.1391 | lr 7.47e-04 | (9929.97 ms | 52799 tok/s) step 9169/76294 | train loss 3.264293 | norm 0.1167 | lr 7.47e-04 | (9895.84 ms | 52981 tok/s) step 9170/76294 | train loss 3.204140 | norm 0.1406 | lr 7.47e-04 | (11143.96 ms | 47047 tok/s) step 9171/76294 | train loss 3.203764 | norm 0.1363 | lr 7.47e-04 | (10508.29 ms | 49893 tok/s) step 9172/76294 | train loss 3.256139 | norm 0.1285 | lr 7.47e-04 | (9965.07 ms | 52613 tok/s) step 9173/76294 | train loss 3.200848 | norm 0.1267 | lr 7.47e-04 | (9887.49 ms | 53025 tok/s) step 9174/76294 | train loss 3.178392 | norm 0.1231 | lr 7.46e-04 | (9890.02 ms | 53012 tok/s) step 9175/76294 | train loss 3.205762 | norm 0.1219 | lr 7.46e-04 | (9893.44 ms | 52994 tok/s) step 9176/76294 | train loss 3.167892 | norm 0.1108 | lr 7.46e-04 | (9901.21 ms | 52952 tok/s) step 9177/76294 | train loss 3.321577 | norm 0.1119 | lr 7.46e-04 | (10959.58 ms | 47838 tok/s) step 9178/76294 | train loss 3.263417 | norm 0.1163 | lr 7.46e-04 | (11595.50 ms | 45215 tok/s) step 9179/76294 | train loss 3.216004 | norm 0.1070 | lr 7.46e-04 | (10013.76 ms | 52357 tok/s) step 9180/76294 | train loss 3.179110 | norm 0.1200 | lr 
7.46e-04 | (9883.79 ms | 53045 tok/s) step 9181/76294 | train loss 3.182577 | norm 0.1240 | lr 7.46e-04 | (9895.30 ms | 52984 tok/s) step 9182/76294 | train loss 3.214661 | norm 0.1370 | lr 7.46e-04 | (9936.60 ms | 52763 tok/s) step 9183/76294 | train loss 3.252264 | norm 0.1316 | lr 7.46e-04 | (9893.82 ms | 52991 tok/s) step 9184/76294 | train loss 3.225573 | norm 0.1243 | lr 7.46e-04 | (9912.27 ms | 52893 tok/s) step 9185/76294 | train loss 3.226945 | norm 0.1434 | lr 7.45e-04 | (9900.84 ms | 52954 tok/s) step 9186/76294 | train loss 3.248827 | norm 0.1419 | lr 7.45e-04 | (9904.30 ms | 52935 tok/s) step 9187/76294 | train loss 3.212008 | norm 0.1381 | lr 7.45e-04 | (9967.52 ms | 52600 tok/s) step 9188/76294 | train loss 3.216124 | norm 0.1536 | lr 7.45e-04 | (9934.19 ms | 52776 tok/s) step 9189/76294 | train loss 3.246679 | norm 0.1344 | lr 7.45e-04 | (9911.97 ms | 52894 tok/s) step 9190/76294 | train loss 3.227343 | norm 0.1419 | lr 7.45e-04 | (9895.25 ms | 52984 tok/s) step 9191/76294 | train loss 3.185664 | norm 0.1510 | lr 7.45e-04 | (9894.51 ms | 52988 tok/s) step 9192/76294 | train loss 3.213740 | norm 0.1268 | lr 7.45e-04 | (9899.20 ms | 52963 tok/s) step 9193/76294 | train loss 3.179615 | norm 0.1485 | lr 7.45e-04 | (9933.42 ms | 52780 tok/s) step 9194/76294 | train loss 3.279666 | norm 0.1354 | lr 7.45e-04 | (9898.51 ms | 52966 tok/s) step 9195/76294 | train loss 3.259771 | norm 0.1340 | lr 7.45e-04 | (9905.69 ms | 52928 tok/s) step 9196/76294 | train loss 3.200175 | norm 0.1187 | lr 7.44e-04 | (9892.41 ms | 52999 tok/s) step 9197/76294 | train loss 3.231336 | norm 0.1269 | lr 7.44e-04 | (9907.46 ms | 52919 tok/s) step 9198/76294 | train loss 3.273710 | norm 0.1184 | lr 7.44e-04 | (9888.97 ms | 53017 tok/s) step 9199/76294 | train loss 3.246637 | norm 0.1267 | lr 7.44e-04 | (9928.82 ms | 52805 tok/s) step 9200/76294 | train loss 3.194327 | norm 0.1292 | lr 7.44e-04 | (9894.40 ms | 52988 tok/s) step 9201/76294 | train loss 3.224900 | norm 0.1270 | lr 
7.44e-04 | (9902.56 ms | 52945 tok/s) step 9202/76294 | train loss 3.168050 | norm 0.1296 | lr 7.44e-04 | (9895.18 ms | 52984 tok/s) step 9203/76294 | train loss 3.293580 | norm 0.1488 | lr 7.44e-04 | (9939.71 ms | 52747 tok/s) step 9204/76294 | train loss 3.247832 | norm 0.1317 | lr 7.44e-04 | (9898.54 ms | 52966 tok/s) step 9205/76294 | train loss 3.207716 | norm 0.1273 | lr 7.44e-04 | (9904.45 ms | 52935 tok/s) step 9206/76294 | train loss 3.228137 | norm 0.1355 | lr 7.44e-04 | (9896.31 ms | 52978 tok/s) step 9207/76294 | train loss 3.198735 | norm 0.1224 | lr 7.44e-04 | (9916.61 ms | 52870 tok/s) step 9208/76294 | train loss 3.225990 | norm 0.1178 | lr 7.43e-04 | (9898.26 ms | 52968 tok/s) step 9209/76294 | train loss 3.227691 | norm 0.1284 | lr 7.43e-04 | (9896.88 ms | 52975 tok/s) step 9210/76294 | train loss 3.142799 | norm 0.1195 | lr 7.43e-04 | (9902.04 ms | 52947 tok/s) step 9211/76294 | train loss 3.198076 | norm 0.1219 | lr 7.43e-04 | (9896.93 ms | 52975 tok/s) step 9212/76294 | train loss 3.271364 | norm 0.1474 | lr 7.43e-04 | (9900.49 ms | 52956 tok/s) step 9213/76294 | train loss 3.236554 | norm 0.1178 | lr 7.43e-04 | (9896.57 ms | 52977 tok/s) step 9214/76294 | train loss 3.224603 | norm 0.1421 | lr 7.43e-04 | (9898.26 ms | 52968 tok/s) step 9215/76294 | train loss 3.201037 | norm 0.1182 | lr 7.43e-04 | (9894.34 ms | 52989 tok/s) step 9216/76294 | train loss 3.210511 | norm 0.1357 | lr 7.43e-04 | (9899.17 ms | 52963 tok/s) step 9217/76294 | train loss 3.183836 | norm 0.1383 | lr 7.43e-04 | (9893.05 ms | 52996 tok/s) step 9218/76294 | train loss 3.218865 | norm 0.1235 | lr 7.43e-04 | (9897.57 ms | 52971 tok/s) step 9219/76294 | train loss 3.270050 | norm 0.1232 | lr 7.42e-04 | (9897.02 ms | 52974 tok/s) step 9220/76294 | train loss 3.241145 | norm 0.1188 | lr 7.42e-04 | (9901.97 ms | 52948 tok/s) step 9221/76294 | train loss 3.285030 | norm 0.1433 | lr 7.42e-04 | (9894.74 ms | 52987 tok/s) step 9222/76294 | train loss 3.199375 | norm 0.1376 | lr 
7.42e-04 | (9900.23 ms | 52957 tok/s) step 9223/76294 | train loss 3.233290 | norm 0.1296 | lr 7.42e-04 | (9907.80 ms | 52917 tok/s) step 9224/76294 | train loss 3.205675 | norm 0.1297 | lr 7.42e-04 | (9938.61 ms | 52753 tok/s) step 9225/76294 | train loss 3.195611 | norm 0.1183 | lr 7.42e-04 | (9898.57 ms | 52966 tok/s) step 9226/76294 | train loss 3.230737 | norm 0.1287 | lr 7.42e-04 | (9953.77 ms | 52672 tok/s) step 9227/76294 | train loss 3.279318 | norm 0.1242 | lr 7.42e-04 | (9930.23 ms | 52797 tok/s) step 9228/76294 | train loss 3.221568 | norm 0.1537 | lr 7.42e-04 | (9897.43 ms | 52972 tok/s) step 9229/76294 | train loss 3.204006 | norm 0.1542 | lr 7.42e-04 | (9896.59 ms | 52977 tok/s) step 9230/76294 | train loss 3.233783 | norm 0.1355 | lr 7.41e-04 | (9888.39 ms | 53021 tok/s) step 9231/76294 | train loss 3.204841 | norm 0.1376 | lr 7.41e-04 | (9906.42 ms | 52924 tok/s) step 9232/76294 | train loss 3.154181 | norm 0.1250 | lr 7.41e-04 | (9893.21 ms | 52995 tok/s) step 9233/76294 | train loss 3.292684 | norm 0.1517 | lr 7.41e-04 | (9937.84 ms | 52757 tok/s) step 9234/76294 | train loss 3.164538 | norm 0.1185 | lr 7.41e-04 | (9892.99 ms | 52996 tok/s) step 9235/76294 | train loss 3.164245 | norm 0.1351 | lr 7.41e-04 | (9911.03 ms | 52899 tok/s) step 9236/76294 | train loss 3.190617 | norm 0.1214 | lr 7.41e-04 | (9888.94 ms | 53018 tok/s) step 9237/76294 | train loss 3.251710 | norm 0.1318 | lr 7.41e-04 | (9963.34 ms | 52622 tok/s) step 9238/76294 | train loss 3.172913 | norm 0.1169 | lr 7.41e-04 | (9898.00 ms | 52969 tok/s) step 9239/76294 | train loss 3.213791 | norm 0.1352 | lr 7.41e-04 | (9930.08 ms | 52798 tok/s) step 9240/76294 | train loss 3.255961 | norm 0.1194 | lr 7.41e-04 | (9889.66 ms | 53014 tok/s) step 9241/76294 | train loss 3.105010 | norm 0.1601 | lr 7.40e-04 | (9917.36 ms | 52866 tok/s) step 9242/76294 | train loss 3.192689 | norm 0.1531 | lr 7.40e-04 | (10035.90 ms | 52241 tok/s) step 9243/76294 | train loss 3.140738 | norm 0.1442 | lr 
7.40e-04 | (9894.04 ms | 52990 tok/s) step 9244/76294 | train loss 3.202770 | norm 0.1412 | lr 7.40e-04 | (9890.56 ms | 53009 tok/s) step 9245/76294 | train loss 3.185200 | norm 0.1344 | lr 7.40e-04 | (9944.38 ms | 52722 tok/s) step 9246/76294 | train loss 3.212255 | norm 0.1321 | lr 7.40e-04 | (9888.51 ms | 53020 tok/s) step 9247/76294 | train loss 3.163430 | norm 0.1279 | lr 7.40e-04 | (9921.32 ms | 52845 tok/s) step 9248/76294 | train loss 3.225438 | norm 0.1328 | lr 7.40e-04 | (9888.08 ms | 53022 tok/s) step 9249/76294 | train loss 3.216441 | norm 0.1249 | lr 7.40e-04 | (9928.24 ms | 52808 tok/s) step 9250/76294 | train loss 3.195883 | norm 0.1321 | lr 7.40e-04 | (9885.51 ms | 53036 tok/s) val loss: 3.212154 saving model checkpoint to ./results/gpt2-350M-gqa/step_9250.pth step 9251/76294 | train loss 3.250019 | norm 0.1185 | lr 7.40e-04 | (9934.09 ms | 52777 tok/s) step 9252/76294 | train loss 3.238206 | norm 0.1299 | lr 7.40e-04 | (9843.61 ms | 53262 tok/s) step 9253/76294 | train loss 3.203488 | norm 0.1214 | lr 7.39e-04 | (9843.68 ms | 53261 tok/s) step 9254/76294 | train loss 3.282791 | norm 0.1480 | lr 7.39e-04 | (9875.56 ms | 53089 tok/s) step 9255/76294 | train loss 3.220132 | norm 0.1283 | lr 7.39e-04 | (9856.88 ms | 53190 tok/s) step 9256/76294 | train loss 3.286269 | norm 0.1291 | lr 7.39e-04 | (9860.61 ms | 53170 tok/s) step 9257/76294 | train loss 3.189593 | norm 0.1224 | lr 7.39e-04 | (9862.22 ms | 53161 tok/s) step 9258/76294 | train loss 3.216244 | norm 0.1213 | lr 7.39e-04 | (9893.19 ms | 52995 tok/s) step 9259/76294 | train loss 3.261819 | norm 0.1363 | lr 7.39e-04 | (9869.18 ms | 53124 tok/s) step 9260/76294 | train loss 3.149822 | norm 0.1301 | lr 7.39e-04 | (9892.38 ms | 52999 tok/s) step 9261/76294 | train loss 3.200625 | norm 0.1263 | lr 7.39e-04 | (10210.45 ms | 51348 tok/s) step 9262/76294 | train loss 3.226219 | norm 0.1203 | lr 7.39e-04 | (9880.78 ms | 53061 tok/s) step 9263/76294 | train loss 3.192669 | norm 0.1156 | lr 7.39e-04 | 
(9873.00 ms | 53103 tok/s) step 9264/76294 | train loss 3.245133 | norm 0.1239 | lr 7.38e-04 | (9867.35 ms | 53134 tok/s) step 9265/76294 | train loss 3.261234 | norm 0.1285 | lr 7.38e-04 | (9876.32 ms | 53085 tok/s) step 9266/76294 | train loss 3.250475 | norm 0.1403 | lr 7.38e-04 | (9870.54 ms | 53116 tok/s) step 9267/76294 | train loss 3.238551 | norm 0.1372 | lr 7.38e-04 | (9882.17 ms | 53054 tok/s) step 9268/76294 | train loss 3.204917 | norm 0.1332 | lr 7.38e-04 | (9869.26 ms | 53123 tok/s) step 9269/76294 | train loss 3.274578 | norm 0.1305 | lr 7.38e-04 | (9881.25 ms | 53059 tok/s) step 9270/76294 | train loss 3.206660 | norm 0.1265 | lr 7.38e-04 | (9915.67 ms | 52875 tok/s) step 9271/76294 | train loss 3.212841 | norm 0.1453 | lr 7.38e-04 | (9874.60 ms | 53095 tok/s) step 9272/76294 | train loss 3.232517 | norm 0.1120 | lr 7.38e-04 | (9895.21 ms | 52984 tok/s) step 9273/76294 | train loss 3.241068 | norm 0.1402 | lr 7.38e-04 | (9876.78 ms | 53083 tok/s) step 9274/76294 | train loss 3.244019 | norm 0.1837 | lr 7.38e-04 | (11171.10 ms | 46933 tok/s) step 9275/76294 | train loss 3.236094 | norm 0.1754 | lr 7.37e-04 | (9866.48 ms | 53138 tok/s) step 9276/76294 | train loss 3.228263 | norm 0.1536 | lr 7.37e-04 | (9884.28 ms | 53043 tok/s) step 9277/76294 | train loss 3.197516 | norm 0.1319 | lr 7.37e-04 | (9888.49 ms | 53020 tok/s) step 9278/76294 | train loss 3.266961 | norm 0.1515 | lr 7.37e-04 | (9873.01 ms | 53103 tok/s) step 9279/76294 | train loss 3.284638 | norm 0.1321 | lr 7.37e-04 | (9918.13 ms | 52862 tok/s) step 9280/76294 | train loss 3.272187 | norm 0.1397 | lr 7.37e-04 | (9937.34 ms | 52759 tok/s) step 9281/76294 | train loss 3.187063 | norm 0.1332 | lr 7.37e-04 | (9876.41 ms | 53085 tok/s) step 9282/76294 | train loss 3.181575 | norm 0.1439 | lr 7.37e-04 | (9884.59 ms | 53041 tok/s) step 9283/76294 | train loss 3.226919 | norm 0.1489 | lr 7.37e-04 | (9909.56 ms | 52907 tok/s) step 9284/76294 | train loss 3.224937 | norm 0.1283 | lr 7.37e-04 | 
(9873.78 ms | 53099 tok/s) step 9285/76294 | train loss 3.201875 | norm 0.1507 | lr 7.37e-04 | (9884.48 ms | 53042 tok/s) step 9286/76294 | train loss 3.241096 | norm 0.1307 | lr 7.36e-04 | (9873.77 ms | 53099 tok/s) step 9287/76294 | train loss 3.213899 | norm 0.1289 | lr 7.36e-04 | (9932.97 ms | 52783 tok/s) step 9288/76294 | train loss 3.231980 | norm 0.1435 | lr 7.36e-04 | (9883.91 ms | 53045 tok/s) step 9289/76294 | train loss 3.267347 | norm 0.1386 | lr 7.36e-04 | (9938.03 ms | 52756 tok/s) step 9290/76294 | train loss 3.234807 | norm 0.1348 | lr 7.36e-04 | (9880.25 ms | 53064 tok/s) step 9291/76294 | train loss 3.237748 | norm 0.1402 | lr 7.36e-04 | (9902.97 ms | 52943 tok/s) step 9292/76294 | train loss 3.209722 | norm 0.1740 | lr 7.36e-04 | (9920.13 ms | 52851 tok/s) step 9293/76294 | train loss 3.232270 | norm 0.1238 | lr 7.36e-04 | (9878.05 ms | 53076 tok/s) step 9294/76294 | train loss 3.259025 | norm 0.1469 | lr 7.36e-04 | (9886.31 ms | 53032 tok/s) step 9295/76294 | train loss 3.214378 | norm 0.1231 | lr 7.36e-04 | (9879.58 ms | 53068 tok/s) step 9296/76294 | train loss 3.252774 | norm 0.1384 | lr 7.36e-04 | (9910.78 ms | 52901 tok/s) step 9297/76294 | train loss 3.259704 | norm 0.1230 | lr 7.36e-04 | (9880.59 ms | 53062 tok/s) step 9298/76294 | train loss 3.171512 | norm 0.1161 | lr 7.35e-04 | (9882.20 ms | 53054 tok/s) step 9299/76294 | train loss 3.095624 | norm 0.1237 | lr 7.35e-04 | (9881.80 ms | 53056 tok/s) step 9300/76294 | train loss 3.257847 | norm 0.1307 | lr 7.35e-04 | (9878.72 ms | 53072 tok/s) step 9301/76294 | train loss 3.242863 | norm 0.1181 | lr 7.35e-04 | (9889.31 ms | 53016 tok/s) step 9302/76294 | train loss 3.230005 | norm 0.1344 | lr 7.35e-04 | (9877.34 ms | 53080 tok/s) step 9303/76294 | train loss 3.290199 | norm 0.1267 | lr 7.35e-04 | (9888.48 ms | 53020 tok/s) step 9304/76294 | train loss 3.197681 | norm 0.1155 | lr 7.35e-04 | (9874.83 ms | 53093 tok/s) step 9305/76294 | train loss 3.151107 | norm 0.1278 | lr 7.35e-04 | 
(9896.53 ms | 52977 tok/s) step 9306/76294 | train loss 3.245070 | norm 0.1132 | lr 7.35e-04 | (9881.26 ms | 53059 tok/s) step 9307/76294 | train loss 3.199717 | norm 0.1442 | lr 7.35e-04 | (9887.60 ms | 53025 tok/s) step 9308/76294 | train loss 3.186602 | norm 0.1248 | lr 7.35e-04 | (9883.59 ms | 53046 tok/s) step 9309/76294 | train loss 3.237814 | norm 0.1224 | lr 7.34e-04 | (9884.56 ms | 53041 tok/s) step 9310/76294 | train loss 3.252579 | norm 0.1387 | lr 7.34e-04 | (9874.46 ms | 53095 tok/s) step 9311/76294 | train loss 3.169549 | norm 0.1160 | lr 7.34e-04 | (9888.25 ms | 53021 tok/s) step 9312/76294 | train loss 3.166631 | norm 0.1590 | lr 7.34e-04 | (9877.29 ms | 53080 tok/s) step 9313/76294 | train loss 3.222338 | norm 0.1238 | lr 7.34e-04 | (9888.77 ms | 53019 tok/s) step 9314/76294 | train loss 3.199582 | norm 0.1581 | lr 7.34e-04 | (9920.30 ms | 52850 tok/s) step 9315/76294 | train loss 3.252270 | norm 0.1362 | lr 7.34e-04 | (9878.34 ms | 53075 tok/s) step 9316/76294 | train loss 3.239591 | norm 0.1346 | lr 7.34e-04 | (9876.94 ms | 53082 tok/s) step 9317/76294 | train loss 3.284690 | norm 0.1289 | lr 7.34e-04 | (9886.94 ms | 53028 tok/s) step 9318/76294 | train loss 3.252205 | norm 0.1625 | lr 7.34e-04 | (9877.34 ms | 53080 tok/s) step 9319/76294 | train loss 3.232823 | norm 0.1330 | lr 7.34e-04 | (9924.20 ms | 52829 tok/s) step 9320/76294 | train loss 3.173025 | norm 0.1539 | lr 7.33e-04 | (9936.72 ms | 52763 tok/s) step 9321/76294 | train loss 3.198217 | norm 0.1605 | lr 7.33e-04 | (9897.98 ms | 52969 tok/s) step 9322/76294 | train loss 3.231910 | norm 0.1438 | lr 7.33e-04 | (9940.92 ms | 52740 tok/s) step 9323/76294 | train loss 3.214902 | norm 0.1528 | lr 7.33e-04 | (9882.15 ms | 53054 tok/s) step 9324/76294 | train loss 3.194557 | norm 0.1337 | lr 7.33e-04 | (9870.96 ms | 53114 tok/s) step 9325/76294 | train loss 3.223329 | norm 0.1689 | lr 7.33e-04 | (9883.20 ms | 53048 tok/s) step 9326/76294 | train loss 3.209930 | norm 0.1364 | lr 7.33e-04 | 
(9917.36 ms | 52866 tok/s) step 9327/76294 | train loss 3.190577 | norm 0.1414 | lr 7.33e-04 | (9883.03 ms | 53049 tok/s) step 9328/76294 | train loss 3.174212 | norm 0.1255 | lr 7.33e-04 | (9919.02 ms | 52857 tok/s) step 9329/76294 | train loss 3.209845 | norm 0.1463 | lr 7.33e-04 | (9878.16 ms | 53075 tok/s) step 9330/76294 | train loss 3.209177 | norm 0.1422 | lr 7.33e-04 | (9906.71 ms | 52923 tok/s) step 9331/76294 | train loss 3.197639 | norm 0.1306 | lr 7.32e-04 | (9883.39 ms | 53047 tok/s) step 9332/76294 | train loss 3.224970 | norm 0.1585 | lr 7.32e-04 | (9981.28 ms | 52527 tok/s) step 9333/76294 | train loss 3.179039 | norm 0.1431 | lr 7.32e-04 | (9878.82 ms | 53072 tok/s) step 9334/76294 | train loss 3.198344 | norm 0.1753 | lr 7.32e-04 | (9931.96 ms | 52788 tok/s) step 9335/76294 | train loss 3.247627 | norm 0.1418 | lr 7.32e-04 | (9912.76 ms | 52890 tok/s) step 9336/76294 | train loss 3.171353 | norm 0.1644 | lr 7.32e-04 | (9876.40 ms | 53085 tok/s) step 9337/76294 | train loss 3.164299 | norm 0.1229 | lr 7.32e-04 | (9885.92 ms | 53034 tok/s) step 9338/76294 | train loss 3.204800 | norm 0.1627 | lr 7.32e-04 | (9874.47 ms | 53095 tok/s) step 9339/76294 | train loss 3.363519 | norm 0.1338 | lr 7.32e-04 | (9930.65 ms | 52795 tok/s) step 9340/76294 | train loss 3.187587 | norm 0.1396 | lr 7.32e-04 | (9880.41 ms | 53063 tok/s) step 9341/76294 | train loss 3.215950 | norm 0.1218 | lr 7.32e-04 | (9890.62 ms | 53009 tok/s) step 9342/76294 | train loss 3.300040 | norm 0.1585 | lr 7.31e-04 | (9877.23 ms | 53080 tok/s) step 9343/76294 | train loss 3.169723 | norm 0.1205 | lr 7.31e-04 | (9909.70 ms | 52907 tok/s) step 9344/76294 | train loss 3.215339 | norm 0.1498 | lr 7.31e-04 | (9876.05 ms | 53087 tok/s) step 9345/76294 | train loss 3.134395 | norm 0.1564 | lr 7.31e-04 | (9880.03 ms | 53065 tok/s) step 9346/76294 | train loss 3.313371 | norm 0.1647 | lr 7.31e-04 | (9901.39 ms | 52951 tok/s) step 9347/76294 | train loss 3.236999 | norm 0.1375 | lr 7.31e-04 | 
(9882.27 ms | 53053 tok/s) step 9348/76294 | train loss 3.166297 | norm 0.1412 | lr 7.31e-04 | (9909.40 ms | 52908 tok/s) step 9349/76294 | train loss 3.158630 | norm 0.1298 | lr 7.31e-04 | (9883.34 ms | 53048 tok/s) step 9350/76294 | train loss 3.308190 | norm 0.1405 | lr 7.31e-04 | (9923.91 ms | 52831 tok/s) step 9351/76294 | train loss 3.190487 | norm 0.2046 | lr 7.31e-04 | (9888.57 ms | 53020 tok/s) step 9352/76294 | train loss 3.243499 | norm 0.1433 | lr 7.31e-04 | (9889.76 ms | 53013 tok/s) step 9353/76294 | train loss 3.225806 | norm 0.1824 | lr 7.31e-04 | (9881.73 ms | 53056 tok/s)
slurmstepd: error: *** JOB 11687021 ON g070 CANCELLED AT 2024-10-13T19:50:24 DUE TO TIME LIMIT ***