2023-06-15 01:56:32,530 INFO [train.py:1056] (3/4) Training started
2023-06-15 01:56:32,530 INFO [train.py:1066] (3/4) Device: cuda:3
2023-06-15 01:56:32,536 INFO [train.py:1075] (3/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Debug', 'k2-with-cuda': True, 'k2-git-sha1': '38211604d6a24b15f320578a1a38f6c12d7a711c', 'k2-git-date': 'Mon Jun 12 10:59:44 2023', 'lhotse-version': '1.15.0.dev+git.f1fd23d.clean', 'torch-version': '2.0.0+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.8', 'icefall-git-branch': 'ted/zipformer', 'icefall-git-sha1': '323a299-dirty', 'icefall-git-date': 'Tue Jun 13 04:47:15 2023', 'icefall-path': '/exp/draj/jsalt2023/icefall', 'k2-path': '/exp/draj/jsalt2023/k2/k2/python/k2/__init__.py', 'lhotse-path': '/exp/draj/jsalt2023/lhotse/lhotse/__init__.py', 'hostname': 'r2n01', 'IP address': '10.1.2.1'}, 'world_size': 4, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 50, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp/v5'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'base_lr': 0.04, 'lr_batches': 7500, 'lr_epochs': 5.0, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 1, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/manifests'), 'max_duration': 1000, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'blank_id': 0, 'vocab_size': 500}
2023-06-15 01:56:32,536 INFO [train.py:1077] (3/4) About to create model
2023-06-15 01:56:33,438 INFO [train.py:1081] (3/4) Number of model parameters: 65549011
2023-06-15 01:56:46,283 INFO [train.py:1096] (3/4) Using DDP
2023-06-15 01:56:46,643 INFO [asr_datamodule.py:356] (3/4) About to get train cuts
2023-06-15 01:56:46,701 INFO [asr_datamodule.py:185] (3/4) Enable SpecAugment
2023-06-15 01:56:46,701 INFO [asr_datamodule.py:186] (3/4) Time warp factor: 80
2023-06-15 01:56:46,701 INFO [asr_datamodule.py:202] (3/4) About to get Musan cuts
2023-06-15 01:56:46,701 INFO [asr_datamodule.py:205] (3/4) Enable MUSAN
2023-06-15 01:56:48,323 INFO [asr_datamodule.py:227] (3/4) About to create train dataset
2023-06-15 01:56:48,323 INFO [asr_datamodule.py:253] (3/4) Using DynamicBucketingSampler.
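The sampler settings in the config above ('max_duration': 1000, 'num_buckets': 30, 'shuffle': True, 'return_cuts': True, 'num_workers': 2) correspond to lhotse's dynamic duration bucketing. Below is a minimal sketch of wiring such a loader together with plain lhotse calls; the manifest path is a placeholder, and the recipe's asr_datamodule.py additionally applies SpecAugment, MUSAN mixing and the other options logged above.

    # Minimal sketch, not the icefall datamodule itself: a duration-bucketed
    # training dataloader built from the knobs visible in the config above.
    from lhotse import CutSet
    from lhotse.dataset import DynamicBucketingSampler, K2SpeechRecognitionDataset
    from torch.utils.data import DataLoader

    # Placeholder manifest path; the recipe reads its cuts from data/manifests.
    cuts = CutSet.from_file("data/manifests/train_cuts.jsonl.gz")

    sampler = DynamicBucketingSampler(
        cuts,
        max_duration=1000,  # seconds of audio per batch ('max_duration': 1000)
        num_buckets=30,     # 'num_buckets': 30
        shuffle=True,       # 'shuffle': True
        drop_last=False,
    )
    dataset = K2SpeechRecognitionDataset(return_cuts=True)  # 'return_cuts': True

    # lhotse samplers yield whole batches, so the DataLoader gets batch_size=None.
    train_dl = DataLoader(dataset, sampler=sampler, batch_size=None, num_workers=2)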
2023-06-15 01:56:50,469 INFO [asr_datamodule.py:274] (3/4) About to create train dataloader 2023-06-15 01:56:50,470 INFO [asr_datamodule.py:361] (3/4) About to get dev cuts 2023-06-15 01:56:50,471 INFO [asr_datamodule.py:295] (3/4) About to create dev dataset 2023-06-15 01:56:50,491 INFO [asr_datamodule.py:314] (3/4) About to create dev dataloader 2023-06-15 01:56:50,491 INFO [train.py:1249] (3/4) Sanity check -- see if any of the batches in epoch 1 would cause OOM. 2023-06-15 01:57:37,208 INFO [scaling.py:962] (3/4) Whitening: name=None, num_groups=4, num_channels=128, metric=12.22 vs. limit=3.0 2023-06-15 01:57:37,530 INFO [scaling.py:962] (3/4) Whitening: name=None, num_groups=1, num_channels=256, metric=38.08 vs. limit=5.0 2023-06-15 01:57:37,874 INFO [train.py:1277] (3/4) Maximum memory allocated so far is 8669MB 2023-06-15 01:57:40,148 INFO [train.py:1277] (3/4) Maximum memory allocated so far is 8785MB 2023-06-15 01:57:51,999 INFO [train.py:1277] (3/4) Maximum memory allocated so far is 11519MB 2023-06-15 01:57:58,253 INFO [train.py:1277] (3/4) Maximum memory allocated so far is 11847MB 2023-06-15 01:58:17,641 INFO [train.py:1277] (3/4) Maximum memory allocated so far is 11847MB 2023-06-15 01:58:27,469 INFO [train.py:1277] (3/4) Maximum memory allocated so far is 11960MB 2023-06-15 01:58:50,927 INFO [train.py:988] (3/4) Epoch 1, batch 0, loss[loss=7.819, simple_loss=7.116, pruned_loss=7.018, over 16339.00 frames. ], tot_loss[loss=7.819, simple_loss=7.116, pruned_loss=7.018, over 16339.00 frames. ], batch size: 52, lr: 2.00e-02, grad_scale: 1.0 2023-06-15 01:58:50,928 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 01:58:57,872 INFO [train.py:1020] (3/4) Epoch 1, validation: loss=7.824, simple_loss=7.131, pruned_loss=6.914, over 143649.00 frames. 2023-06-15 01:58:57,872 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 11960MB 2023-06-15 01:59:05,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=0.0, ans=0.5 2023-06-15 01:59:44,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=66.66666666666667, ans=0.8976666666666667 2023-06-15 02:00:44,383 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=17.50 vs. limit=5.05 2023-06-15 02:00:58,990 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=142.14 vs. limit=7.6 2023-06-15 02:01:02,078 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=187.97 vs. limit=7.6 2023-06-15 02:01:08,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=266.6666666666667, ans=0.4875 2023-06-15 02:01:16,030 INFO [train.py:988] (3/4) Epoch 1, batch 50, loss[loss=1.39, simple_loss=1.254, pruned_loss=1.235, over 20077.00 frames. ], tot_loss[loss=3.39, simple_loss=3.122, pruned_loss=2.619, over 844398.90 frames. ], batch size: 293, lr: 2.20e-02, grad_scale: 0.25 2023-06-15 02:01:20,121 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=63.92 vs. 
limit=7.75 2023-06-15 02:01:29,384 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=212.90 vs. limit=7.625 2023-06-15 02:01:33,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=333.3333333333333, ans=0.484375 2023-06-15 02:01:43,844 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=123.13 vs. limit=7.65 2023-06-15 02:01:45,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=400.0, ans=0.0975 2023-06-15 02:02:06,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=466.6666666666667, ans=0.29533333333333334 2023-06-15 02:02:15,495 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=65.36 vs. limit=7.675 2023-06-15 02:02:17,544 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=27.80 vs. limit=5.233333333333333 2023-06-15 02:02:24,204 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=9.35 vs. limit=4.213333333333333 2023-06-15 02:02:32,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=533.3333333333334, ans=0.08800000000000001 2023-06-15 02:02:48,882 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=84.83 vs. limit=7.725 2023-06-15 02:02:53,906 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=22.82 vs. limit=7.725 2023-06-15 02:02:56,352 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=369.30 vs. limit=7.725 2023-06-15 02:02:57,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=600.0, ans=0.879 2023-06-15 02:03:02,179 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=16.65 vs. limit=7.725 2023-06-15 02:03:06,579 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=32.26 vs. limit=5.333333333333333 2023-06-15 02:03:07,763 INFO [train.py:988] (3/4) Epoch 1, batch 100, loss[loss=1.365, simple_loss=1.179, pruned_loss=1.488, over 18314.00 frames. ], tot_loss[loss=2.247, simple_loss=2.038, pruned_loss=1.921, over 1487573.26 frames. ], batch size: 72, lr: 2.40e-02, grad_scale: 0.5 2023-06-15 02:03:14,168 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.068e+02 1.535e+02 6.068e+02 3.361e+03 1.967e+04, threshold=1.214e+03, percent-clipped=0.0 2023-06-15 02:03:19,728 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=36.10 vs. 
limit=8.0 2023-06-15 02:03:26,265 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=30.97 vs. limit=8.0 2023-06-15 02:03:31,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=733.3333333333334, ans=0.17250000000000001 2023-06-15 02:03:48,834 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=76.83 vs. limit=7.775 2023-06-15 02:03:57,986 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=68.09 vs. limit=7.8 2023-06-15 02:04:01,326 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.07 vs. limit=8.1 2023-06-15 02:04:07,983 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.81 vs. limit=7.8 2023-06-15 02:04:26,534 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=36.44 vs. limit=7.825 2023-06-15 02:04:31,581 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.94 vs. limit=8.15 2023-06-15 02:04:32,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=933.3333333333334, ans=0.45625 2023-06-15 02:04:33,728 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=35.84 vs. limit=7.85 2023-06-15 02:04:39,468 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=116.70 vs. limit=5.0 2023-06-15 02:04:40,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=933.3333333333334, ans=0.04666666666666667 2023-06-15 02:04:44,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=933.3333333333334, ans=0.7593333333333333 2023-06-15 02:04:44,782 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=7.51 vs. limit=4.373333333333333 2023-06-15 02:04:48,965 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.23 vs. limit=8.2 2023-06-15 02:04:51,369 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=16.46 vs. limit=5.466666666666667 2023-06-15 02:04:56,164 INFO [train.py:988] (3/4) Epoch 1, batch 150, loss[loss=0.9756, simple_loss=0.8401, pruned_loss=0.9956, over 20262.00 frames. ], tot_loss[loss=1.769, simple_loss=1.583, pruned_loss=1.609, over 2005148.43 frames. ], batch size: 239, lr: 2.60e-02, grad_scale: 0.5 2023-06-15 02:05:01,285 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=28.47 vs. 
limit=7.875 2023-06-15 02:05:05,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1000.0, ans=0.0775 2023-06-15 02:05:05,786 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.02 vs. limit=7.875 2023-06-15 02:05:06,133 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=187.94 vs. limit=7.875 2023-06-15 02:05:10,301 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.65 vs. limit=7.875 2023-06-15 02:05:11,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1000.0, ans=0.453125 2023-06-15 02:05:12,441 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.72 vs. limit=8.25 2023-06-15 02:05:14,076 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.32 vs. limit=8.25 2023-06-15 02:05:48,930 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.02 vs. limit=7.925 2023-06-15 02:05:52,750 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=61.53 vs. limit=7.925 2023-06-15 02:05:56,601 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=29.49 vs. limit=7.925 2023-06-15 02:06:00,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1200.0, ans=0.04625 2023-06-15 02:06:11,757 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.58 vs. limit=8.4 2023-06-15 02:06:27,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1266.6666666666667, ans=0.1525 2023-06-15 02:06:36,728 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=26.62 vs. limit=7.975 2023-06-15 02:06:39,156 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=44.98 vs. limit=7.975 2023-06-15 02:06:40,874 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.46 vs. limit=8.45 2023-06-15 02:06:44,810 INFO [train.py:988] (3/4) Epoch 1, batch 200, loss[loss=1.017, simple_loss=0.8657, pruned_loss=1.021, over 19670.00 frames. ], tot_loss[loss=1.5, simple_loss=1.328, pruned_loss=1.405, over 2417093.12 frames. 
], batch size: 110, lr: 2.80e-02, grad_scale: 1.0 2023-06-15 02:06:50,994 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.982e+01 9.369e+01 1.058e+02 1.160e+02 1.378e+03, threshold=2.115e+02, percent-clipped=1.0 2023-06-15 02:06:59,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1333.3333333333333, ans=0.4375 2023-06-15 02:07:05,351 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=169.39 vs. limit=8.025 2023-06-15 02:07:21,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1400.0, ans=0.236 2023-06-15 02:07:24,820 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.40 vs. limit=8.025 2023-06-15 02:07:29,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1466.6666666666667, ans=0.43125 2023-06-15 02:07:34,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1466.6666666666667, ans=0.43125 2023-06-15 02:07:35,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1466.6666666666667, ans=0.145 2023-06-15 02:07:40,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1466.6666666666667, ans=0.09083333333333334 2023-06-15 02:07:57,492 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.40 vs. limit=4.613333333333333 2023-06-15 02:08:10,729 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.09 vs. limit=8.1 2023-06-15 02:08:13,160 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.44 vs. limit=8.1 2023-06-15 02:08:17,246 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=18.42 vs. limit=8.1 2023-06-15 02:08:21,580 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=86.30 vs. limit=5.8 2023-06-15 02:08:30,658 INFO [train.py:988] (3/4) Epoch 1, batch 250, loss[loss=0.8108, simple_loss=0.6961, pruned_loss=0.7435, over 16921.00 frames. ], tot_loss[loss=1.338, simple_loss=1.173, pruned_loss=1.267, over 2720008.63 frames. ], batch size: 391, lr: 3.00e-02, grad_scale: 1.0 2023-06-15 02:08:38,129 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=55.72 vs. 
limit=8.125 2023-06-15 02:08:39,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1666.6666666666667, ans=0.421875 2023-06-15 02:08:39,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=1666.6666666666667, ans=8.125 2023-06-15 02:08:40,061 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=25.35 vs. limit=8.125 2023-06-15 02:08:48,308 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.33 vs. limit=4.666666666666667 2023-06-15 02:08:52,315 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.68 vs. limit=5.433333333333334 2023-06-15 02:08:53,465 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 02:08:56,088 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.44 vs. limit=5.433333333333334 2023-06-15 02:08:58,406 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.44 vs. limit=5.433333333333334 2023-06-15 02:09:02,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1733.3333333333333, ans=0.41875 2023-06-15 02:09:09,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1733.3333333333333, ans=0.061000000000000006 2023-06-15 02:09:09,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1733.3333333333333, ans=0.061000000000000006 2023-06-15 02:09:21,058 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=24.44 vs. limit=8.175 2023-06-15 02:09:24,723 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=31.23 vs. limit=8.175 2023-06-15 02:09:32,186 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 02:09:38,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1866.6666666666667, ans=0.2813333333333333 2023-06-15 02:09:38,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1866.6666666666667, ans=0.057999999999999996 2023-06-15 02:09:39,409 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.99 vs. limit=8.9 2023-06-15 02:09:44,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1866.6666666666667, ans=0.4125 2023-06-15 02:09:49,657 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.69 vs. 
limit=8.9 2023-06-15 02:09:50,173 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=14.06 vs. limit=5.466666666666667 2023-06-15 02:09:55,741 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=25.11 vs. limit=8.225 2023-06-15 02:10:05,018 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.76 vs. limit=5.966666666666667 2023-06-15 02:10:09,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2.whitening_limit, batch_count=1933.3333333333333, ans=5.966666666666667 2023-06-15 02:10:10,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1933.3333333333333, ans=0.2806666666666667 2023-06-15 02:10:16,473 INFO [train.py:988] (3/4) Epoch 1, batch 300, loss[loss=0.9479, simple_loss=0.7955, pruned_loss=0.9015, over 19880.00 frames. ], tot_loss[loss=1.223, simple_loss=1.064, pruned_loss=1.16, over 2967992.23 frames. ], batch size: 120, lr: 3.20e-02, grad_scale: 2.0 2023-06-15 02:10:18,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2000.0, ans=0.25 2023-06-15 02:10:20,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=2000.0, ans=0.27999999999999997 2023-06-15 02:10:21,246 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=46.69 vs. limit=8.25 2023-06-15 02:10:22,488 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.963e+01 1.052e+02 1.248e+02 1.628e+02 2.864e+02, threshold=2.496e+02, percent-clipped=3.0 2023-06-15 02:10:29,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2000.0, ans=0.055 2023-06-15 02:10:31,551 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.63 vs. limit=6.0 2023-06-15 02:10:35,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2066.6666666666665, ans=0.403125 2023-06-15 02:10:37,540 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.02 vs. limit=9.05 2023-06-15 02:10:38,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2066.6666666666665, ans=0.403125 2023-06-15 02:11:02,220 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=74.77 vs. limit=8.3 2023-06-15 02:11:10,738 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=183.95 vs. limit=8.3 2023-06-15 02:11:12,880 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.50 vs. 
limit=9.1 2023-06-15 02:11:14,579 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=13.18 vs. limit=8.3 2023-06-15 02:11:14,682 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.29 vs. limit=8.3 2023-06-15 02:11:19,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2200.0, ans=0.1175 2023-06-15 02:12:02,176 INFO [train.py:988] (3/4) Epoch 1, batch 350, loss[loss=0.9462, simple_loss=0.7885, pruned_loss=0.8798, over 19333.00 frames. ], tot_loss[loss=1.152, simple_loss=0.9944, pruned_loss=1.089, over 3135944.44 frames. ], batch size: 98, lr: 3.40e-02, grad_scale: 2.0 2023-06-15 02:12:12,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2333.3333333333335, ans=0.0475 2023-06-15 02:12:15,471 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.95 vs. limit=9.25 2023-06-15 02:12:31,283 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.91 vs. limit=9.3 2023-06-15 02:12:35,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=2400.0, ans=8.4 2023-06-15 02:12:46,139 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=13.22 vs. limit=8.425 2023-06-15 02:13:21,573 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.54 vs. limit=8.45 2023-06-15 02:13:25,733 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.27 vs. limit=8.475 2023-06-15 02:13:35,079 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.50 vs. limit=8.475 2023-06-15 02:13:35,229 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.79 vs. limit=9.45 2023-06-15 02:13:37,185 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.09 vs. limit=8.475 2023-06-15 02:13:46,989 INFO [train.py:988] (3/4) Epoch 1, batch 400, loss[loss=0.9183, simple_loss=0.7671, pruned_loss=0.8133, over 18935.00 frames. ], tot_loss[loss=1.099, simple_loss=0.9414, pruned_loss=1.028, over 3261412.87 frames. ], batch size: 86, lr: 3.60e-02, grad_scale: 4.0 2023-06-15 02:13:48,066 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=15.09 vs. limit=8.5 2023-06-15 02:13:52,154 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=29.15 vs. 
limit=8.5 2023-06-15 02:13:52,985 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 9.424e+01 1.333e+02 1.553e+02 2.129e+02 3.991e+02, threshold=3.107e+02, percent-clipped=15.0 2023-06-15 02:14:17,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2733.3333333333335, ans=0.0385 2023-06-15 02:14:35,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2800.0, ans=0.36875 2023-06-15 02:14:43,025 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.13 vs. limit=9.6 2023-06-15 02:14:45,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.whiten.whitening_limit, batch_count=2800.0, ans=5.12 2023-06-15 02:14:49,407 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.11 vs. limit=9.65 2023-06-15 02:14:56,183 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=14.57 vs. limit=8.575 2023-06-15 02:15:01,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2866.6666666666665, ans=0.2713333333333333 2023-06-15 02:15:01,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2866.6666666666665, ans=0.7996666666666667 2023-06-15 02:15:11,814 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.01 vs. limit=6.466666666666667 2023-06-15 02:15:33,095 INFO [train.py:988] (3/4) Epoch 1, batch 450, loss[loss=0.8194, simple_loss=0.6881, pruned_loss=0.6894, over 20121.00 frames. ], tot_loss[loss=1.054, simple_loss=0.8985, pruned_loss=0.9701, over 3372637.38 frames. ], batch size: 239, lr: 3.80e-02, grad_scale: 4.0 2023-06-15 02:15:39,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=3000.0, ans=0.27 2023-06-15 02:15:51,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=3066.6666666666665, ans=0.7926666666666667 2023-06-15 02:16:08,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=3066.6666666666665, ans=0.246 2023-06-15 02:16:31,424 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.62 vs. limit=8.675 2023-06-15 02:16:35,318 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.55 vs. limit=8.7 2023-06-15 02:16:38,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=3200.0, ans=0.268 2023-06-15 02:16:43,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=3200.0, ans=0.08 2023-06-15 02:17:00,481 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=10.96 vs. 
limit=8.725 2023-06-15 02:17:06,087 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.80 vs. limit=5.306666666666667 2023-06-15 02:17:06,929 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=5.71 vs. limit=5.0 2023-06-15 02:17:07,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=3266.6666666666665, ans=0.03979166666666667 2023-06-15 02:17:10,578 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.26 vs. limit=5.816666666666666 2023-06-15 02:17:12,983 INFO [train.py:988] (3/4) Epoch 1, batch 500, loss[loss=0.8293, simple_loss=0.7032, pruned_loss=0.6579, over 19662.00 frames. ], tot_loss[loss=1.006, simple_loss=0.856, pruned_loss=0.9049, over 3476928.47 frames. ], batch size: 110, lr: 4.00e-02, grad_scale: 8.0 2023-06-15 02:17:18,740 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 9.868e+01 1.531e+02 1.902e+02 2.695e+02 6.993e+02, threshold=3.804e+02, percent-clipped=16.0 2023-06-15 02:17:22,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=3333.3333333333335, ans=0.34375 2023-06-15 02:17:25,642 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=7.24 vs. limit=5.333333333333334 2023-06-15 02:17:50,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=3466.6666666666665, ans=0.3375 2023-06-15 02:17:52,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=3466.6666666666665, ans=7.166666666666666 2023-06-15 02:17:58,451 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.19 vs. limit=8.8 2023-06-15 02:18:02,369 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.04 vs. limit=10.1 2023-06-15 02:18:07,479 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.17 vs. limit=5.866666666666666 2023-06-15 02:18:43,113 INFO [train.py:988] (3/4) Epoch 2, batch 0, loss[loss=0.8712, simple_loss=0.747, pruned_loss=0.6581, over 18334.00 frames. ], tot_loss[loss=0.8712, simple_loss=0.747, pruned_loss=0.6581, over 18334.00 frames. ], batch size: 72, lr: 3.96e-02, grad_scale: 16.0 2023-06-15 02:18:43,113 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 02:18:49,233 INFO [train.py:1020] (3/4) Epoch 2, validation: loss=0.7911, simple_loss=0.6884, pruned_loss=0.5718, over 143649.00 frames. 2023-06-15 02:18:49,233 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13495MB 2023-06-15 02:19:13,225 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=6.44 vs. 
limit=5.448 2023-06-15 02:19:18,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=3620.0, ans=0.06424999999999997 2023-06-15 02:19:29,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=3686.6666666666665, ans=0.06174999999999997 2023-06-15 02:19:30,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=3686.6666666666665, ans=8.8825 2023-06-15 02:19:31,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=3686.6666666666665, ans=0.32718749999999996 2023-06-15 02:19:32,664 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.04 vs. limit=10.265 2023-06-15 02:20:06,524 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.78 vs. limit=8.9075 2023-06-15 02:20:15,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=3820.0, ans=0.02250000000000002 2023-06-15 02:20:23,985 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.60 vs. limit=8.932500000000001 2023-06-15 02:20:28,447 INFO [train.py:988] (3/4) Epoch 2, batch 50, loss[loss=0.7723, simple_loss=0.6668, pruned_loss=0.5592, over 20267.00 frames. ], tot_loss[loss=0.7864, simple_loss=0.6753, pruned_loss=0.5844, over 848174.13 frames. ], batch size: 141, lr: 3.95e-02, grad_scale: 16.0 2023-06-15 02:20:31,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=3886.6666666666665, ans=0.26113333333333333 2023-06-15 02:20:35,039 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.38 vs. limit=8.9575 2023-06-15 02:20:55,070 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.76 vs. limit=8.9825 2023-06-15 02:20:59,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=3953.3333333333335, ans=0.01104999999999999 2023-06-15 02:21:07,465 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 2.379e+02 3.485e+02 5.608e+02 1.271e+03, threshold=6.971e+02, percent-clipped=46.0 2023-06-15 02:21:18,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=4020.0, ans=0.31156249999999996 2023-06-15 02:21:25,090 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.32 vs. limit=10.515 2023-06-15 02:22:00,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=4153.333333333333, ans=0.3053125 2023-06-15 02:22:06,302 INFO [train.py:988] (3/4) Epoch 2, batch 100, loss[loss=0.7021, simple_loss=0.6114, pruned_loss=0.487, over 20231.00 frames. ], tot_loss[loss=0.7512, simple_loss=0.6488, pruned_loss=0.5431, over 1500133.64 frames. 
], batch size: 141, lr: 3.95e-02, grad_scale: 8.0 2023-06-15 02:22:15,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=4220.0, ans=0.2578 2023-06-15 02:22:26,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=4286.666666666667, ans=0.03660416666666667 2023-06-15 02:22:29,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=4286.666666666667, ans=0.2990625 2023-06-15 02:23:13,112 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.24 vs. limit=9.1575 2023-06-15 02:23:39,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=4486.666666666667, ans=0.009894202898550725 2023-06-15 02:23:45,684 INFO [train.py:988] (3/4) Epoch 2, batch 150, loss[loss=0.7245, simple_loss=0.6423, pruned_loss=0.4709, over 18320.00 frames. ], tot_loss[loss=0.7219, simple_loss=0.627, pruned_loss=0.5086, over 2014104.81 frames. ], batch size: 72, lr: 3.95e-02, grad_scale: 8.0 2023-06-15 02:23:45,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=4553.333333333333, ans=0.2544666666666667 2023-06-15 02:23:49,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=4553.333333333333, ans=0.2865625 2023-06-15 02:24:25,598 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.747e+02 2.805e+02 4.216e+02 6.424e+02 2.276e+03, threshold=8.432e+02, percent-clipped=19.0 2023-06-15 02:24:25,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=4686.666666666667, ans=0.28031249999999996 2023-06-15 02:24:30,300 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.62 vs. limit=5.874666666666666 2023-06-15 02:24:31,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=4686.666666666667, ans=0.2531333333333333 2023-06-15 02:24:38,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=4686.666666666667, ans=0.2531333333333333 2023-06-15 02:24:39,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=4686.666666666667, ans=7.929166666666667 2023-06-15 02:24:50,034 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=9.2825 2023-06-15 02:24:59,523 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.23 vs. limit=6.1883333333333335 2023-06-15 02:25:05,119 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.88 vs. 
limit=6.205 2023-06-15 02:25:05,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=4820.0, ans=0.2740625 2023-06-15 02:25:20,896 INFO [train.py:988] (3/4) Epoch 2, batch 200, loss[loss=0.64, simple_loss=0.5674, pruned_loss=0.4106, over 19873.00 frames. ], tot_loss[loss=0.6971, simple_loss=0.6093, pruned_loss=0.4783, over 2387097.91 frames. ], batch size: 120, lr: 3.95e-02, grad_scale: 8.0 2023-06-15 02:25:34,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=4886.666666666667, ans=0.2709375 2023-06-15 02:26:25,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=5086.666666666667, ans=0.26156250000000003 2023-06-15 02:26:36,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=5153.333333333333, ans=0.2584375 2023-06-15 02:26:56,742 INFO [train.py:988] (3/4) Epoch 2, batch 250, loss[loss=0.6028, simple_loss=0.5496, pruned_loss=0.3555, over 16651.00 frames. ], tot_loss[loss=0.6734, simple_loss=0.5923, pruned_loss=0.4505, over 2702640.97 frames. ], batch size: 59, lr: 3.95e-02, grad_scale: 8.0 2023-06-15 02:27:18,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=5286.666666666667, ans=0.24713333333333332 2023-06-15 02:27:26,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=5286.666666666667, ans=0.24713333333333332 2023-06-15 02:27:30,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=5286.666666666667, ans=0.2521875 2023-06-15 02:27:32,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=5286.666666666667, ans=0.19713333333333333 2023-06-15 02:27:37,763 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.603e+02 2.822e+02 4.791e+02 8.752e+02 2.397e+03, threshold=9.582e+02, percent-clipped=28.0 2023-06-15 02:27:38,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=5353.333333333333, ans=0.24906250000000002 2023-06-15 02:27:55,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=5420.0, ans=0.24593749999999998 2023-06-15 02:28:12,612 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.33 vs. limit=6.371666666666667 2023-06-15 02:28:15,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=5486.666666666667, ans=0.2428125 2023-06-15 02:28:20,093 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.71 vs. 
limit=6.371666666666667 2023-06-15 02:28:28,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=5486.666666666667, ans=0.2428125 2023-06-15 02:28:32,177 INFO [train.py:988] (3/4) Epoch 2, batch 300, loss[loss=0.6174, simple_loss=0.5534, pruned_loss=0.3772, over 18424.00 frames. ], tot_loss[loss=0.6523, simple_loss=0.5769, pruned_loss=0.4263, over 2960895.16 frames. ], batch size: 77, lr: 3.95e-02, grad_scale: 8.0 2023-06-15 02:29:43,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=5753.333333333333, ans=0.24246666666666666 2023-06-15 02:29:54,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=5820.0, ans=0.2418 2023-06-15 02:30:07,018 INFO [train.py:988] (3/4) Epoch 2, batch 350, loss[loss=0.5577, simple_loss=0.4988, pruned_loss=0.3398, over 20318.00 frames. ], tot_loss[loss=0.6317, simple_loss=0.5621, pruned_loss=0.4033, over 3145868.82 frames. ], batch size: 239, lr: 3.95e-02, grad_scale: 8.0 2023-06-15 02:30:07,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=5886.666666666667, ans=0.2240625 2023-06-15 02:30:12,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=5886.666666666667, ans=0.2240625 2023-06-15 02:30:25,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=5953.333333333333, ans=0.2209375 2023-06-15 02:30:25,845 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.59 vs. limit=11.965 2023-06-15 02:30:46,022 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.693e+02 3.241e+02 6.004e+02 9.437e+02 1.791e+03, threshold=1.201e+03, percent-clipped=24.0 2023-06-15 02:30:52,093 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=9.7575 2023-06-15 02:31:15,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=6086.666666666667, ans=0.21468749999999998 2023-06-15 02:31:38,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=6220.0, ans=0.2084375 2023-06-15 02:31:40,061 INFO [train.py:988] (3/4) Epoch 2, batch 400, loss[loss=0.6075, simple_loss=0.5574, pruned_loss=0.3464, over 16259.00 frames. ], tot_loss[loss=0.6161, simple_loss=0.5516, pruned_loss=0.3847, over 3281848.93 frames. ], batch size: 52, lr: 3.95e-02, grad_scale: 16.0 2023-06-15 02:32:03,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=6286.666666666667, ans=0.2053125 2023-06-15 02:33:13,473 INFO [train.py:988] (3/4) Epoch 2, batch 450, loss[loss=0.547, simple_loss=0.5038, pruned_loss=0.3079, over 19701.00 frames. ], tot_loss[loss=0.6006, simple_loss=0.5412, pruned_loss=0.3669, over 3383406.79 frames. 
], batch size: 110, lr: 3.94e-02, grad_scale: 8.0 2023-06-15 02:33:36,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=6620.0, ans=0.1896875 2023-06-15 02:33:41,292 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.30 vs. limit=12.465 2023-06-15 02:33:54,732 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.791e+02 3.705e+02 5.386e+02 8.050e+02 1.837e+03, threshold=1.077e+03, percent-clipped=8.0 2023-06-15 02:34:04,341 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.29 vs. limit=12.515 2023-06-15 02:34:04,636 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.67 vs. limit=10.0075 2023-06-15 02:34:25,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=6820.0, ans=0.038250000000000006 2023-06-15 02:34:40,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=6820.0, ans=0.1803125 2023-06-15 02:34:41,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=6886.666666666667, ans=0.1771875 2023-06-15 02:34:43,319 INFO [train.py:988] (3/4) Epoch 2, batch 500, loss[loss=0.5571, simple_loss=0.5239, pruned_loss=0.2983, over 18313.00 frames. ], tot_loss[loss=0.5845, simple_loss=0.5295, pruned_loss=0.3504, over 3458061.22 frames. ], batch size: 72, lr: 3.94e-02, grad_scale: 8.0 2023-06-15 02:34:47,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=6886.666666666667, ans=0.6589666666666667 2023-06-15 02:34:57,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=6886.666666666667, ans=0.1771875 2023-06-15 02:35:17,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=7020.0, ans=0.17093750000000002 2023-06-15 02:35:32,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=7020.0, ans=0.17093750000000002 2023-06-15 02:36:00,641 INFO [train.py:988] (3/4) Epoch 3, batch 0, loss[loss=0.5188, simple_loss=0.4741, pruned_loss=0.295, over 20232.00 frames. ], tot_loss[loss=0.5188, simple_loss=0.4741, pruned_loss=0.295, over 20232.00 frames. ], batch size: 239, lr: 3.84e-02, grad_scale: 16.0 2023-06-15 02:36:00,641 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 02:36:06,792 INFO [train.py:1020] (3/4) Epoch 3, validation: loss=0.4219, simple_loss=0.4383, pruned_loss=0.1731, over 143649.00 frames. 
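In the [optim.py:471] lines above, the logged threshold consistently works out to Clipping_scale times the middle of the five grad-norm quartile values (for example 2.0 x 5.386e+02 = 1.077e+03 a few entries up), so the clipping threshold appears to track a running median of recent gradient norms. A rough sketch of that bookkeeping follows, assuming the five numbers are the min/25%/50%/75%/max of a window of recent norms; the class name and window size are illustrative, not the recipe's optimizer.

    # Illustrative sketch only: clip gradients at clipping_scale * median of the
    # most recent gradient norms, and report the quartiles seen in the window.
    from collections import deque

    import torch

    class MedianGradClipper:
        def __init__(self, clipping_scale: float = 2.0, window: int = 400):
            self.clipping_scale = clipping_scale
            self.norms = deque(maxlen=window)  # recent global grad norms

        def __call__(self, parameters):
            params = [p for p in parameters if p.grad is not None]
            total_norm = torch.norm(
                torch.stack([p.grad.detach().norm() for p in params])
            ).item()
            self.norms.append(total_norm)

            hist = torch.tensor(list(self.norms))
            quartiles = torch.quantile(hist, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
            threshold = self.clipping_scale * quartiles[2].item()

            if total_norm > threshold:
                # Rescale every gradient so the global norm equals the threshold.
                for p in params:
                    p.grad.mul_(threshold / total_norm)
            return quartiles.tolist(), threshold

    # Usage inside a training step, after loss.backward():
    #   quartiles, threshold = clipper(model.parameters())

The percent-clipped figure in those lines presumably reflects how often this rescaling fires within the reporting window.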
2023-06-15 02:36:06,793 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 02:36:17,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=7100.0, ans=0.1671875 2023-06-15 02:36:37,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=7166.666666666667, ans=0.1640625 2023-06-15 02:36:47,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=7233.333333333333, ans=0.1609375 2023-06-15 02:36:56,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=7233.333333333333, ans=0.22766666666666668 2023-06-15 02:37:06,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=7300.0, ans=0.15781250000000002 2023-06-15 02:37:06,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=7300.0, ans=0.15781250000000002 2023-06-15 02:37:17,512 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.758e+02 3.334e+02 5.511e+02 7.843e+02 1.620e+03, threshold=1.102e+03, percent-clipped=11.0 2023-06-15 02:37:24,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=7366.666666666667, ans=0.15468749999999998 2023-06-15 02:37:36,293 INFO [train.py:988] (3/4) Epoch 3, batch 50, loss[loss=0.5715, simple_loss=0.5356, pruned_loss=0.3077, over 14714.00 frames. ], tot_loss[loss=0.5202, simple_loss=0.4852, pruned_loss=0.2832, over 850256.57 frames. ], batch size: 42, lr: 3.83e-02, grad_scale: 8.0 2023-06-15 02:37:50,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=7433.333333333333, ans=0.035694444444444445 2023-06-15 02:38:07,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=7500.0, ans=0.1484375 2023-06-15 02:38:11,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=7566.666666666667, ans=0.1453125 2023-06-15 02:38:51,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=7700.0, ans=0.223 2023-06-15 02:39:06,075 INFO [train.py:988] (3/4) Epoch 3, batch 100, loss[loss=0.5282, simple_loss=0.4894, pruned_loss=0.2905, over 19535.00 frames. ], tot_loss[loss=0.5138, simple_loss=0.4813, pruned_loss=0.2769, over 1491365.35 frames. ], batch size: 102, lr: 3.83e-02, grad_scale: 8.0 2023-06-15 02:39:27,604 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.87 vs. limit=10.4375 2023-06-15 02:39:41,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=7900.0, ans=0.221 2023-06-15 02:39:42,335 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.93 vs. 
limit=7.16 2023-06-15 02:39:54,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=7900.0, ans=0.0 2023-06-15 02:40:00,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=7966.666666666667, ans=0.03347222222222222 2023-06-15 02:40:19,697 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.660e+02 2.598e+02 5.024e+02 8.933e+02 2.174e+03, threshold=1.005e+03, percent-clipped=11.0 2023-06-15 02:40:32,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=8033.333333333333, ans=0.21966666666666668 2023-06-15 02:40:35,850 INFO [train.py:988] (3/4) Epoch 3, batch 150, loss[loss=0.5277, simple_loss=0.5047, pruned_loss=0.2726, over 17606.00 frames. ], tot_loss[loss=0.506, simple_loss=0.4758, pruned_loss=0.2705, over 2002535.10 frames. ], batch size: 67, lr: 3.83e-02, grad_scale: 8.0 2023-06-15 02:41:18,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=8233.333333333334, ans=0.125 2023-06-15 02:42:04,804 INFO [train.py:988] (3/4) Epoch 3, batch 200, loss[loss=0.5258, simple_loss=0.5053, pruned_loss=0.2695, over 16704.00 frames. ], tot_loss[loss=0.5003, simple_loss=0.4723, pruned_loss=0.2654, over 2399997.57 frames. ], batch size: 59, lr: 3.83e-02, grad_scale: 8.0 2023-06-15 02:42:43,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=8566.666666666666, ans=0.125 2023-06-15 02:43:15,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=8700.0, ans=0.125 2023-06-15 02:43:16,970 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 3.306e+02 5.233e+02 8.261e+02 1.948e+03, threshold=1.047e+03, percent-clipped=15.0 2023-06-15 02:43:31,289 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.71 vs. limit=10.7625 2023-06-15 02:43:32,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=8766.666666666666, ans=0.21233333333333332 2023-06-15 02:43:33,634 INFO [train.py:988] (3/4) Epoch 3, batch 250, loss[loss=0.4655, simple_loss=0.4509, pruned_loss=0.2354, over 20319.00 frames. ], tot_loss[loss=0.4952, simple_loss=0.4686, pruned_loss=0.2615, over 2719799.26 frames. ], batch size: 149, lr: 3.83e-02, grad_scale: 8.0 2023-06-15 02:44:14,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=8900.0, ans=0.125 2023-06-15 02:44:32,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=8966.666666666666, ans=0.125 2023-06-15 02:44:34,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=8966.666666666666, ans=0.21033333333333332 2023-06-15 02:44:34,838 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.06 vs. 
limit=10.8625 2023-06-15 02:44:35,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=8966.666666666666, ans=0.21033333333333332 2023-06-15 02:45:03,173 INFO [train.py:988] (3/4) Epoch 3, batch 300, loss[loss=0.4345, simple_loss=0.4284, pruned_loss=0.2132, over 18791.00 frames. ], tot_loss[loss=0.4886, simple_loss=0.4645, pruned_loss=0.2559, over 2954423.08 frames. ], batch size: 83, lr: 3.82e-02, grad_scale: 8.0 2023-06-15 02:45:21,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=9166.666666666666, ans=0.125 2023-06-15 02:45:29,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=9166.666666666666, ans=0.20833333333333334 2023-06-15 02:45:45,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=9233.333333333334, ans=0.125 2023-06-15 02:45:45,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=9233.333333333334, ans=0.125 2023-06-15 02:45:47,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=9233.333333333334, ans=0.5768333333333333 2023-06-15 02:45:59,517 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.02 vs. limit=7.325 2023-06-15 02:46:15,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=9366.666666666666, ans=0.20633333333333334 2023-06-15 02:46:16,352 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.842e+02 2.844e+02 4.784e+02 7.568e+02 1.827e+03, threshold=9.568e+02, percent-clipped=19.0 2023-06-15 02:46:16,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=9366.666666666666, ans=0.008833333333333334 2023-06-15 02:46:31,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=9433.333333333334, ans=0.027361111111111114 2023-06-15 02:46:32,381 INFO [train.py:988] (3/4) Epoch 3, batch 350, loss[loss=0.4441, simple_loss=0.4345, pruned_loss=0.2217, over 19791.00 frames. ], tot_loss[loss=0.483, simple_loss=0.4615, pruned_loss=0.2509, over 3133224.71 frames. ], batch size: 115, lr: 3.82e-02, grad_scale: 8.0 2023-06-15 02:46:39,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=9433.333333333334, ans=0.125 2023-06-15 02:46:39,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=9433.333333333334, ans=0.125 2023-06-15 02:46:39,727 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.86 vs. 
limit=11.0375 2023-06-15 02:47:24,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=9566.666666666666, ans=0.0 2023-06-15 02:47:27,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=9633.333333333334, ans=0.02652777777777778 2023-06-15 02:47:32,020 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=11.1125 2023-06-15 02:48:02,094 INFO [train.py:988] (3/4) Epoch 3, batch 400, loss[loss=0.4474, simple_loss=0.4362, pruned_loss=0.2252, over 18943.00 frames. ], tot_loss[loss=0.4768, simple_loss=0.458, pruned_loss=0.2456, over 3286105.07 frames. ], batch size: 86, lr: 3.82e-02, grad_scale: 16.0 2023-06-15 02:48:20,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=9833.333333333334, ans=0.5558333333333334 2023-06-15 02:48:36,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=9900.0, ans=0.201 2023-06-15 02:49:15,722 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.932e+02 2.793e+02 4.517e+02 6.574e+02 1.219e+03, threshold=9.033e+02, percent-clipped=7.0 2023-06-15 02:49:26,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=10033.333333333334, ans=0.125 2023-06-15 02:49:31,992 INFO [train.py:988] (3/4) Epoch 3, batch 450, loss[loss=0.4434, simple_loss=0.4342, pruned_loss=0.2222, over 19305.00 frames. ], tot_loss[loss=0.4704, simple_loss=0.4536, pruned_loss=0.241, over 3413235.88 frames. ], batch size: 98, lr: 3.82e-02, grad_scale: 16.0 2023-06-15 02:50:09,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=10233.333333333334, ans=0.125 2023-06-15 02:50:30,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=10300.0, ans=0.125 2023-06-15 02:50:35,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=10300.0, ans=0.025 2023-06-15 02:50:42,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=10366.666666666666, ans=0.125 2023-06-15 02:50:49,745 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.20 vs. limit=11.3875 2023-06-15 02:50:59,003 INFO [train.py:988] (3/4) Epoch 3, batch 500, loss[loss=0.452, simple_loss=0.4492, pruned_loss=0.222, over 18468.00 frames. ], tot_loss[loss=0.466, simple_loss=0.4516, pruned_loss=0.237, over 3500382.99 frames. ], batch size: 77, lr: 3.81e-02, grad_scale: 16.0 2023-06-15 02:51:12,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=10433.333333333334, ans=0.125 2023-06-15 02:51:45,641 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.05 vs. 
limit=10.283333333333333 2023-06-15 02:52:16,086 INFO [train.py:988] (3/4) Epoch 4, batch 0, loss[loss=0.4489, simple_loss=0.4343, pruned_loss=0.2297, over 20580.00 frames. ], tot_loss[loss=0.4489, simple_loss=0.4343, pruned_loss=0.2297, over 20580.00 frames. ], batch size: 189, lr: 3.66e-02, grad_scale: 32.0 2023-06-15 02:52:16,086 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 02:52:22,221 INFO [train.py:1020] (3/4) Epoch 4, validation: loss=0.3338, simple_loss=0.3946, pruned_loss=0.1182, over 143649.00 frames. 2023-06-15 02:52:22,222 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 02:52:28,772 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.26 vs. limit=11.4925 2023-06-15 02:52:34,259 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.31 vs. limit=11.4925 2023-06-15 02:52:38,300 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 2.821e+02 4.565e+02 6.318e+02 1.774e+03, threshold=9.130e+02, percent-clipped=10.0 2023-06-15 02:53:01,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=10780.0, ans=0.008526086956521739 2023-06-15 02:53:18,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=10846.666666666666, ans=0.19153333333333333 2023-06-15 02:53:26,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=10846.666666666666, ans=0.19153333333333333 2023-06-15 02:53:32,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=10913.333333333334, ans=0.125 2023-06-15 02:53:42,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=10913.333333333334, ans=0.125 2023-06-15 02:53:52,042 INFO [train.py:988] (3/4) Epoch 4, batch 50, loss[loss=0.4162, simple_loss=0.4236, pruned_loss=0.1986, over 20115.00 frames. ], tot_loss[loss=0.4374, simple_loss=0.4347, pruned_loss=0.2157, over 848489.11 frames. ], batch size: 133, lr: 3.66e-02, grad_scale: 16.0 2023-06-15 02:54:27,920 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.53 vs. limit=15.835 2023-06-15 02:54:57,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=11180.0, ans=0.18819999999999998 2023-06-15 02:54:58,614 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 02:55:14,063 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.83 vs. limit=11.717500000000001 2023-06-15 02:55:23,236 INFO [train.py:988] (3/4) Epoch 4, batch 100, loss[loss=0.4529, simple_loss=0.4526, pruned_loss=0.2227, over 19094.00 frames. ], tot_loss[loss=0.4343, simple_loss=0.4338, pruned_loss=0.2131, over 1506993.76 frames. 
], batch size: 94, lr: 3.66e-02, grad_scale: 16.0 2023-06-15 02:55:40,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=11380.0, ans=0.008395652173913044 2023-06-15 02:55:41,982 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.65 vs. limit=11.7675 2023-06-15 02:55:42,250 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 3.113e+02 4.609e+02 7.608e+02 1.612e+03, threshold=9.219e+02, percent-clipped=13.0 2023-06-15 02:55:44,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=11380.0, ans=0.025 2023-06-15 02:55:57,671 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.83 vs. limit=16.035 2023-06-15 02:55:58,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=11446.666666666666, ans=0.018972222222222227 2023-06-15 02:56:17,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=11513.333333333334, ans=0.0 2023-06-15 02:56:21,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=11513.333333333334, ans=0.125 2023-06-15 02:56:46,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=11580.0, ans=11.842500000000001 2023-06-15 02:56:53,138 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.75 vs. limit=4.747 2023-06-15 02:56:53,609 INFO [train.py:988] (3/4) Epoch 4, batch 150, loss[loss=0.436, simple_loss=0.3947, pruned_loss=0.2411, over 17242.00 frames. ], tot_loss[loss=0.4305, simple_loss=0.4319, pruned_loss=0.2104, over 1999791.86 frames. ], batch size: 392, lr: 3.66e-02, grad_scale: 16.0 2023-06-15 02:56:54,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=11646.666666666666, ans=0.125 2023-06-15 02:57:08,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=11646.666666666666, ans=0.49236666666666673 2023-06-15 02:57:26,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=11713.333333333334, ans=10.0 2023-06-15 02:57:42,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=11780.0, ans=0.008308695652173913 2023-06-15 02:57:54,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=11846.666666666666, ans=0.008294202898550724 2023-06-15 02:58:14,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=11913.333333333334, ans=0.125 2023-06-15 02:58:23,378 INFO [train.py:988] (3/4) Epoch 4, batch 200, loss[loss=0.4035, simple_loss=0.4222, pruned_loss=0.1882, over 18462.00 frames. ], tot_loss[loss=0.4263, simple_loss=0.4307, pruned_loss=0.2069, over 2407861.19 frames. 
], batch size: 77, lr: 3.65e-02, grad_scale: 16.0 2023-06-15 02:58:40,782 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.791e+02 4.180e+02 6.641e+02 1.358e+03, threshold=8.360e+02, percent-clipped=6.0 2023-06-15 02:58:43,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=12046.666666666666, ans=0.0 2023-06-15 02:58:43,352 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.05 vs. limit=12.0175 2023-06-15 02:59:24,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=12180.0, ans=10.0 2023-06-15 02:59:47,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=12246.666666666666, ans=0.07 2023-06-15 02:59:55,816 INFO [train.py:988] (3/4) Epoch 4, batch 250, loss[loss=0.3983, simple_loss=0.4067, pruned_loss=0.1927, over 20579.00 frames. ], tot_loss[loss=0.4232, simple_loss=0.429, pruned_loss=0.2051, over 2708012.41 frames. ], batch size: 173, lr: 3.65e-02, grad_scale: 16.0 2023-06-15 02:59:56,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=12313.333333333334, ans=0.008192753623188406 2023-06-15 03:00:40,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=12446.666666666666, ans=0.125 2023-06-15 03:01:06,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=12580.0, ans=0.125 2023-06-15 03:01:26,975 INFO [train.py:988] (3/4) Epoch 4, batch 300, loss[loss=0.395, simple_loss=0.3691, pruned_loss=0.2107, over 16824.00 frames. ], tot_loss[loss=0.4203, simple_loss=0.4281, pruned_loss=0.2031, over 2935535.24 frames. ], batch size: 392, lr: 3.65e-02, grad_scale: 16.0 2023-06-15 03:01:44,412 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.894e+02 2.980e+02 4.845e+02 6.504e+02 1.050e+03, threshold=9.691e+02, percent-clipped=10.0 2023-06-15 03:02:39,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=12913.333333333334, ans=0.125 2023-06-15 03:02:41,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=12913.333333333334, ans=0.125 2023-06-15 03:02:41,817 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.45 vs. limit=12.342500000000001 2023-06-15 03:02:48,584 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 03:02:59,286 INFO [train.py:988] (3/4) Epoch 4, batch 350, loss[loss=0.3847, simple_loss=0.4033, pruned_loss=0.1821, over 20127.00 frames. ], tot_loss[loss=0.4155, simple_loss=0.4252, pruned_loss=0.2002, over 3120463.53 frames. ], batch size: 133, lr: 3.64e-02, grad_scale: 16.0 2023-06-15 03:03:06,873 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.58 vs. 
limit=8.245000000000001 2023-06-15 03:03:16,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=13046.666666666666, ans=0.125 2023-06-15 03:03:17,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=13046.666666666666, ans=0.125 2023-06-15 03:03:34,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=13113.333333333334, ans=0.0 2023-06-15 03:03:36,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=13113.333333333334, ans=0.39670000000000005 2023-06-15 03:03:55,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=13180.0, ans=0.125 2023-06-15 03:03:56,023 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.40 vs. limit=17.384999999999998 2023-06-15 03:04:29,364 INFO [train.py:988] (3/4) Epoch 4, batch 400, loss[loss=0.4036, simple_loss=0.4426, pruned_loss=0.1823, over 18311.00 frames. ], tot_loss[loss=0.4114, simple_loss=0.4237, pruned_loss=0.1974, over 3273497.58 frames. ], batch size: 72, lr: 3.64e-02, grad_scale: 32.0 2023-06-15 03:04:46,853 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 3.163e+02 4.789e+02 6.291e+02 1.274e+03, threshold=9.578e+02, percent-clipped=4.0 2023-06-15 03:04:56,375 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-15 03:05:28,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=13513.333333333334, ans=0.010361111111111106 2023-06-15 03:05:59,025 INFO [train.py:988] (3/4) Epoch 4, batch 450, loss[loss=0.4231, simple_loss=0.4555, pruned_loss=0.1953, over 17677.00 frames. ], tot_loss[loss=0.4088, simple_loss=0.4227, pruned_loss=0.1958, over 3390971.17 frames. ], batch size: 67, lr: 3.64e-02, grad_scale: 16.0 2023-06-15 03:05:59,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=13646.666666666666, ans=0.125 2023-06-15 03:06:32,591 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.47 vs. limit=11.89 2023-06-15 03:06:35,955 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.96 vs. limit=17.835 2023-06-15 03:06:42,460 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.88 vs. 
limit=12.6675 2023-06-15 03:06:53,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=13846.666666666666, ans=0.125 2023-06-15 03:06:55,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=13846.666666666666, ans=0.125 2023-06-15 03:07:00,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=13846.666666666666, ans=0.16153333333333333 2023-06-15 03:07:23,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=13980.0, ans=0.4107 2023-06-15 03:07:24,286 INFO [train.py:988] (3/4) Epoch 4, batch 500, loss[loss=0.4099, simple_loss=0.4264, pruned_loss=0.1967, over 19722.00 frames. ], tot_loss[loss=0.408, simple_loss=0.4236, pruned_loss=0.195, over 3462035.71 frames. ], batch size: 110, lr: 3.63e-02, grad_scale: 16.0 2023-06-15 03:07:29,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=13980.0, ans=0.0 2023-06-15 03:07:39,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=14046.666666666666, ans=0.4083666666666667 2023-06-15 03:07:42,293 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=4.71 vs. limit=9.618666666666666 2023-06-15 03:07:42,946 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.873e+02 4.186e+02 6.544e+02 1.200e+03, threshold=8.372e+02, percent-clipped=10.0 2023-06-15 03:08:43,613 INFO [train.py:988] (3/4) Epoch 5, batch 0, loss[loss=0.3701, simple_loss=0.4081, pruned_loss=0.1661, over 18485.00 frames. ], tot_loss[loss=0.3701, simple_loss=0.4081, pruned_loss=0.1661, over 18485.00 frames. ], batch size: 77, lr: 3.47e-02, grad_scale: 32.0 2023-06-15 03:08:43,613 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 03:08:49,786 INFO [train.py:1020] (3/4) Epoch 5, validation: loss=0.2868, simple_loss=0.3756, pruned_loss=0.09898, over 143649.00 frames. 2023-06-15 03:08:49,787 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 03:09:16,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=14260.0, ans=0.125 2023-06-15 03:09:30,022 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.71 vs. limit=18.244999999999997 2023-06-15 03:09:41,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=14393.333333333334, ans=0.025 2023-06-15 03:09:41,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=14393.333333333334, ans=0.125 2023-06-15 03:09:49,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=14393.333333333334, ans=0.125 2023-06-15 03:10:18,924 INFO [train.py:988] (3/4) Epoch 5, batch 50, loss[loss=0.3811, simple_loss=0.4071, pruned_loss=0.1776, over 19073.00 frames. ], tot_loss[loss=0.3896, simple_loss=0.4151, pruned_loss=0.182, over 844180.65 frames. 
], batch size: 89, lr: 3.46e-02, grad_scale: 32.0 2023-06-15 03:10:32,685 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.85 vs. limit=5.179 2023-06-15 03:10:41,095 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=4.60 vs. limit=9.837333333333333 2023-06-15 03:11:10,149 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.888e+02 2.824e+02 3.862e+02 4.906e+02 1.527e+03, threshold=7.724e+02, percent-clipped=12.0 2023-06-15 03:11:18,929 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.42 vs. limit=13.0225 2023-06-15 03:11:22,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=14726.666666666666, ans=0.007668115942028986 2023-06-15 03:11:29,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=14793.333333333334, ans=0.007653623188405797 2023-06-15 03:11:48,674 INFO [train.py:988] (3/4) Epoch 5, batch 100, loss[loss=0.4001, simple_loss=0.4121, pruned_loss=0.1941, over 20589.00 frames. ], tot_loss[loss=0.3866, simple_loss=0.4143, pruned_loss=0.1795, over 1477789.10 frames. ], batch size: 189, lr: 3.46e-02, grad_scale: 32.0 2023-06-15 03:11:52,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=14860.0, ans=0.0 2023-06-15 03:12:10,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=14926.666666666666, ans=0.3775666666666667 2023-06-15 03:12:11,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=14926.666666666666, ans=0.8992666666666667 2023-06-15 03:12:19,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=14926.666666666666, ans=0.007624637681159421 2023-06-15 03:12:23,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=14993.333333333334, ans=0.025 2023-06-15 03:12:27,710 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.57 vs. limit=13.122499999999999 2023-06-15 03:12:31,910 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.55 vs. limit=13.122499999999999 2023-06-15 03:12:47,543 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.36 vs. limit=18.795 2023-06-15 03:13:11,479 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.13 vs. limit=13.1725 2023-06-15 03:13:18,003 INFO [train.py:988] (3/4) Epoch 5, batch 150, loss[loss=0.3837, simple_loss=0.3877, pruned_loss=0.1898, over 20030.00 frames. ], tot_loss[loss=0.3858, simple_loss=0.4125, pruned_loss=0.1795, over 1993840.31 frames. 
], batch size: 294, lr: 3.46e-02, grad_scale: 32.0 2023-06-15 03:13:47,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=15260.0, ans=0.125 2023-06-15 03:14:10,267 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.930e+02 4.326e+02 6.625e+02 9.040e+02, threshold=8.653e+02, percent-clipped=9.0 2023-06-15 03:14:14,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=15393.333333333334, ans=0.0025277777777777746 2023-06-15 03:14:26,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=15393.333333333334, ans=0.007523188405797102 2023-06-15 03:14:47,999 INFO [train.py:988] (3/4) Epoch 5, batch 200, loss[loss=0.3883, simple_loss=0.41, pruned_loss=0.1833, over 20332.00 frames. ], tot_loss[loss=0.3835, simple_loss=0.4106, pruned_loss=0.1782, over 2389834.50 frames. ], batch size: 149, lr: 3.45e-02, grad_scale: 32.0 2023-06-15 03:14:53,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=15526.666666666666, ans=0.125 2023-06-15 03:15:08,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=15593.333333333334, ans=10.0 2023-06-15 03:15:12,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=15593.333333333334, ans=0.02 2023-06-15 03:15:28,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=15660.0, ans=0.3519 2023-06-15 03:15:59,881 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.95 vs. limit=8.948333333333334 2023-06-15 03:16:18,857 INFO [train.py:988] (3/4) Epoch 5, batch 250, loss[loss=0.365, simple_loss=0.4, pruned_loss=0.165, over 19093.00 frames. ], tot_loss[loss=0.3803, simple_loss=0.4084, pruned_loss=0.1761, over 2697710.51 frames. ], batch size: 89, lr: 3.45e-02, grad_scale: 32.0 2023-06-15 03:16:59,713 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.79 vs. limit=10.397333333333334 2023-06-15 03:17:11,020 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.83 vs. limit=19.494999999999997 2023-06-15 03:17:11,400 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 2.885e+02 4.184e+02 6.160e+02 1.201e+03, threshold=8.369e+02, percent-clipped=9.0 2023-06-15 03:17:14,526 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.41 vs. limit=10.424 2023-06-15 03:17:18,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=16060.0, ans=0.05 2023-06-15 03:17:40,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=16126.666666666666, ans=0.125 2023-06-15 03:17:49,392 INFO [train.py:988] (3/4) Epoch 5, batch 300, loss[loss=0.3388, simple_loss=0.3808, pruned_loss=0.1484, over 19691.00 frames. 
], tot_loss[loss=0.3783, simple_loss=0.4067, pruned_loss=0.1749, over 2934411.32 frames. ], batch size: 110, lr: 3.45e-02, grad_scale: 32.0 2023-06-15 03:18:26,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=16326.666666666666, ans=0.0 2023-06-15 03:18:36,612 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.50 vs. limit=13.622499999999999 2023-06-15 03:19:10,993 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.89 vs. limit=13.6725 2023-06-15 03:19:16,530 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.98 vs. limit=19.845 2023-06-15 03:19:18,970 INFO [train.py:988] (3/4) Epoch 5, batch 350, loss[loss=0.3805, simple_loss=0.4111, pruned_loss=0.175, over 19716.00 frames. ], tot_loss[loss=0.3764, simple_loss=0.4058, pruned_loss=0.1735, over 3130598.42 frames. ], batch size: 110, lr: 3.44e-02, grad_scale: 32.0 2023-06-15 03:19:44,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=16593.333333333332, ans=0.125 2023-06-15 03:19:52,985 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=13.747499999999999 2023-06-15 03:19:52,991 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.68 vs. limit=13.747499999999999 2023-06-15 03:19:56,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=16660.0, ans=0.13340000000000002 2023-06-15 03:20:09,910 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.094e+02 2.834e+02 3.544e+02 5.201e+02 9.187e+02, threshold=7.089e+02, percent-clipped=1.0 2023-06-15 03:20:19,638 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.05 vs. limit=20.045 2023-06-15 03:20:23,199 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.49 vs. limit=9.181666666666668 2023-06-15 03:20:24,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=16726.666666666668, ans=0.007233333333333333 2023-06-15 03:20:47,496 INFO [train.py:988] (3/4) Epoch 5, batch 400, loss[loss=0.3451, simple_loss=0.3853, pruned_loss=0.1524, over 19481.00 frames. ], tot_loss[loss=0.3741, simple_loss=0.4047, pruned_loss=0.1718, over 3272655.05 frames. ], batch size: 105, lr: 3.44e-02, grad_scale: 32.0 2023-06-15 03:21:11,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=16926.666666666668, ans=0.125 2023-06-15 03:21:23,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=16993.333333333332, ans=0.125 2023-06-15 03:21:24,324 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.13 vs. 
limit=13.496666666666666 2023-06-15 03:21:33,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=16993.333333333332, ans=0.125 2023-06-15 03:21:46,324 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.05 vs. limit=13.53 2023-06-15 03:22:01,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=17126.666666666668, ans=0.0 2023-06-15 03:22:17,698 INFO [train.py:988] (3/4) Epoch 5, batch 450, loss[loss=0.3669, simple_loss=0.4039, pruned_loss=0.1649, over 19455.00 frames. ], tot_loss[loss=0.373, simple_loss=0.4041, pruned_loss=0.1709, over 3396009.70 frames. ], batch size: 105, lr: 3.44e-02, grad_scale: 32.0 2023-06-15 03:22:18,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=17193.333333333332, ans=0.007131884057971015 2023-06-15 03:23:07,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=17326.666666666668, ans=0.125 2023-06-15 03:23:08,860 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 3.161e+02 3.951e+02 6.319e+02 1.120e+03, threshold=7.903e+02, percent-clipped=17.0 2023-06-15 03:23:16,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=17393.333333333332, ans=0.125 2023-06-15 03:23:18,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=17393.333333333332, ans=0.125 2023-06-15 03:23:37,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=17460.0, ans=0.28890000000000005 2023-06-15 03:23:45,502 INFO [train.py:988] (3/4) Epoch 5, batch 500, loss[loss=0.3647, simple_loss=0.3963, pruned_loss=0.1666, over 19854.00 frames. ], tot_loss[loss=0.3705, simple_loss=0.4028, pruned_loss=0.169, over 3473414.98 frames. ], batch size: 120, lr: 3.43e-02, grad_scale: 32.0 2023-06-15 03:23:52,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=17526.666666666668, ans=0.28656666666666675 2023-06-15 03:24:05,060 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.60 vs. limit=14.0975 2023-06-15 03:24:12,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=17593.333333333332, ans=0.125 2023-06-15 03:24:24,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=17660.0, ans=0.007030434782608696 2023-06-15 03:25:04,647 INFO [train.py:988] (3/4) Epoch 6, batch 0, loss[loss=0.3494, simple_loss=0.3862, pruned_loss=0.1564, over 19837.00 frames. ], tot_loss[loss=0.3494, simple_loss=0.3862, pruned_loss=0.1564, over 19837.00 frames. ], batch size: 120, lr: 3.27e-02, grad_scale: 32.0 2023-06-15 03:25:04,648 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 03:25:10,739 INFO [train.py:1020] (3/4) Epoch 6, validation: loss=0.268, simple_loss=0.365, pruned_loss=0.08554, over 143649.00 frames. 
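The per-epoch validation losses logged above (e.g. loss=0.3338 at epoch 4, 0.2868 at epoch 5, 0.268 at epoch 6) are easier to compare when pulled out of the log programmatically rather than read inline. Below is a minimal sketch, assuming the log is available on disk with one entry per physical line as train.py writes it; the file name "log-train" and the helper name validation_losses are placeholders for illustration, not part of icefall.

import re

# Matches entries of the form written by train.py:1020, e.g.
#   "Epoch 6, validation: loss=0.268, simple_loss=0.365, pruned_loss=0.08554, over 143649.00 frames."
VAL_RE = re.compile(
    r"Epoch (\d+), validation: loss=([\d.]+), "
    r"simple_loss=([\d.]+), pruned_loss=([\d.]+)"
)

def validation_losses(path):
    """Return (epoch, loss, simple_loss, pruned_loss) tuples found in the log."""
    rows = []
    with open(path, errors="replace") as f:
        for line in f:
            m = VAL_RE.search(line)
            if m:
                rows.append((int(m.group(1)),
                             float(m.group(2)),
                             float(m.group(3)),
                             float(m.group(4))))
    return rows

if __name__ == "__main__":
    # "log-train" is a placeholder path; point it at the actual log file.
    for epoch, loss, simple, pruned in validation_losses("log-train"):
        print(f"epoch {epoch:2d}: loss={loss:.4f} "
              f"simple_loss={simple:.4f} pruned_loss={pruned:.4f}")

The same pattern extends to the other recurring entries in this log, for example the optim.py lines ("grad-norm quartiles ... threshold=... percent-clipped=...") if one wants to track how often gradient clipping fires as training progresses.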
2023-06-15 03:25:10,741 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 03:25:40,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=17813.333333333332, ans=0.125 2023-06-15 03:25:41,322 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.19 vs. limit=11.125333333333334 2023-06-15 03:25:56,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=17880.0, ans=0.125 2023-06-15 03:26:07,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=17946.666666666668, ans=0.12053333333333333 2023-06-15 03:26:12,449 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 03:26:25,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=18013.333333333332, ans=0.11986666666666668 2023-06-15 03:26:30,212 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.867e+02 2.488e+02 3.063e+02 4.314e+02 9.185e+02, threshold=6.126e+02, percent-clipped=4.0 2023-06-15 03:26:37,311 INFO [train.py:988] (3/4) Epoch 6, batch 50, loss[loss=0.3415, simple_loss=0.3819, pruned_loss=0.1505, over 19216.00 frames. ], tot_loss[loss=0.3646, simple_loss=0.398, pruned_loss=0.1657, over 859239.14 frames. ], batch size: 92, lr: 3.26e-02, grad_scale: 32.0 2023-06-15 03:26:41,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=18080.0, ans=0.006939130434782609 2023-06-15 03:27:12,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=18213.333333333332, ans=0.11786666666666668 2023-06-15 03:27:12,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=18213.333333333332, ans=0.09899494936611666 2023-06-15 03:27:45,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=18346.666666666668, ans=0.125 2023-06-15 03:27:56,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=18346.666666666668, ans=0.0 2023-06-15 03:28:02,257 INFO [train.py:988] (3/4) Epoch 6, batch 100, loss[loss=0.3523, simple_loss=0.3933, pruned_loss=0.1557, over 19981.00 frames. ], tot_loss[loss=0.3604, simple_loss=0.3973, pruned_loss=0.1618, over 1494506.02 frames. ], batch size: 126, lr: 3.26e-02, grad_scale: 16.0 2023-06-15 03:28:33,731 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.04 vs. limit=14.43 2023-06-15 03:28:55,832 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.19 vs. 
limit=21.46 2023-06-15 03:29:09,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=18680.0, ans=0.11320000000000002 2023-06-15 03:29:22,872 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.846e+02 2.570e+02 3.710e+02 4.834e+02 1.052e+03, threshold=7.420e+02, percent-clipped=12.0 2023-06-15 03:29:27,803 INFO [train.py:988] (3/4) Epoch 6, batch 150, loss[loss=0.3386, simple_loss=0.3775, pruned_loss=0.1498, over 19971.00 frames. ], tot_loss[loss=0.3572, simple_loss=0.3942, pruned_loss=0.1601, over 1997864.86 frames. ], batch size: 126, lr: 3.25e-02, grad_scale: 16.0 2023-06-15 03:29:57,550 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.63 vs. limit=14.555 2023-06-15 03:30:28,201 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.80 vs. limit=21.71 2023-06-15 03:30:31,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=18946.666666666668, ans=0.125 2023-06-15 03:30:33,454 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 03:30:54,540 INFO [train.py:988] (3/4) Epoch 6, batch 200, loss[loss=0.3235, simple_loss=0.3706, pruned_loss=0.1382, over 19488.00 frames. ], tot_loss[loss=0.3572, simple_loss=0.3939, pruned_loss=0.1603, over 2398638.21 frames. ], batch size: 105, lr: 3.25e-02, grad_scale: 16.0 2023-06-15 03:30:54,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=19080.0, ans=0.125 2023-06-15 03:30:56,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=19080.0, ans=0.125 2023-06-15 03:31:08,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=19080.0, ans=0.23220000000000007 2023-06-15 03:31:10,960 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.41 vs. limit=21.86 2023-06-15 03:31:15,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=19146.666666666668, ans=0.125 2023-06-15 03:31:18,132 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.47 vs. limit=14.68 2023-06-15 03:31:24,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=19146.666666666668, ans=0.22986666666666677 2023-06-15 03:31:54,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=19280.0, ans=0.125 2023-06-15 03:32:15,537 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.887e+02 2.813e+02 3.502e+02 4.540e+02 9.537e+02, threshold=7.003e+02, percent-clipped=3.0 2023-06-15 03:32:20,494 INFO [train.py:988] (3/4) Epoch 6, batch 250, loss[loss=0.3837, simple_loss=0.4252, pruned_loss=0.1711, over 16419.00 frames. ], tot_loss[loss=0.3554, simple_loss=0.3925, pruned_loss=0.1591, over 2704836.61 frames. 
], batch size: 52, lr: 3.25e-02, grad_scale: 16.0 2023-06-15 03:32:34,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=19413.333333333332, ans=0.1 2023-06-15 03:32:38,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=19480.0, ans=0.21820000000000006 2023-06-15 03:32:50,984 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.77 vs. limit=14.805 2023-06-15 03:33:10,235 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 03:33:23,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=19613.333333333332, ans=0.125 2023-06-15 03:33:46,379 INFO [train.py:988] (3/4) Epoch 6, batch 300, loss[loss=0.3517, simple_loss=0.3897, pruned_loss=0.1568, over 19442.00 frames. ], tot_loss[loss=0.3532, simple_loss=0.3915, pruned_loss=0.1574, over 2963676.73 frames. ], batch size: 105, lr: 3.24e-02, grad_scale: 16.0 2023-06-15 03:34:14,195 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=5.9719999999999995 2023-06-15 03:34:47,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=19946.666666666668, ans=0.10053333333333334 2023-06-15 03:34:59,006 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.56 vs. limit=15.0 2023-06-15 03:35:08,279 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.795e+02 3.555e+02 5.115e+02 9.485e+02, threshold=7.110e+02, percent-clipped=8.0 2023-06-15 03:35:13,421 INFO [train.py:988] (3/4) Epoch 6, batch 350, loss[loss=0.338, simple_loss=0.3866, pruned_loss=0.1447, over 18607.00 frames. ], tot_loss[loss=0.3526, simple_loss=0.3905, pruned_loss=0.1573, over 3142905.05 frames. ], batch size: 80, lr: 3.24e-02, grad_scale: 16.0 2023-06-15 03:35:41,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=20146.666666666668, ans=0.95 2023-06-15 03:36:14,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=20280.0, ans=0.125 2023-06-15 03:36:40,615 INFO [train.py:988] (3/4) Epoch 6, batch 400, loss[loss=0.3382, simple_loss=0.385, pruned_loss=0.1458, over 19219.00 frames. ], tot_loss[loss=0.3506, simple_loss=0.3896, pruned_loss=0.1559, over 3286817.42 frames. ], batch size: 92, lr: 3.24e-02, grad_scale: 32.0 2023-06-15 03:37:37,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=20613.333333333332, ans=0.95 2023-06-15 03:38:01,340 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.966e+02 2.580e+02 3.237e+02 4.539e+02 1.447e+03, threshold=6.474e+02, percent-clipped=10.0 2023-06-15 03:38:06,910 INFO [train.py:988] (3/4) Epoch 6, batch 450, loss[loss=0.3385, simple_loss=0.3904, pruned_loss=0.1433, over 19457.00 frames. ], tot_loss[loss=0.3491, simple_loss=0.3888, pruned_loss=0.1547, over 3416580.89 frames. 
], batch size: 105, lr: 3.23e-02, grad_scale: 32.0 2023-06-15 03:38:09,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=20746.666666666668, ans=0.1 2023-06-15 03:38:38,775 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.77 vs. limit=15.0 2023-06-15 03:38:58,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=20946.666666666668, ans=0.05 2023-06-15 03:39:31,070 INFO [train.py:988] (3/4) Epoch 6, batch 500, loss[loss=0.3437, simple_loss=0.3795, pruned_loss=0.1539, over 20283.00 frames. ], tot_loss[loss=0.3476, simple_loss=0.3878, pruned_loss=0.1537, over 3491249.42 frames. ], batch size: 141, lr: 3.23e-02, grad_scale: 32.0 2023-06-15 03:39:33,466 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.44 vs. limit=15.0 2023-06-15 03:40:12,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=21213.333333333332, ans=0.125 2023-06-15 03:40:50,456 INFO [train.py:988] (3/4) Epoch 7, batch 0, loss[loss=0.3681, simple_loss=0.392, pruned_loss=0.1721, over 20174.00 frames. ], tot_loss[loss=0.3681, simple_loss=0.392, pruned_loss=0.1721, over 20174.00 frames. ], batch size: 239, lr: 3.07e-02, grad_scale: 32.0 2023-06-15 03:40:50,456 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 03:40:56,358 INFO [train.py:1020] (3/4) Epoch 7, validation: loss=0.2561, simple_loss=0.3562, pruned_loss=0.07803, over 143649.00 frames. 2023-06-15 03:40:56,359 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 03:41:19,480 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.803e+02 2.547e+02 3.083e+02 4.575e+02 1.238e+03, threshold=6.165e+02, percent-clipped=14.0 2023-06-15 03:41:56,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=21500.0, ans=0.125 2023-06-15 03:42:13,776 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.58 vs. limit=22.5 2023-06-15 03:42:21,399 INFO [train.py:988] (3/4) Epoch 7, batch 50, loss[loss=0.3476, simple_loss=0.3711, pruned_loss=0.162, over 19776.00 frames. ], tot_loss[loss=0.3384, simple_loss=0.3822, pruned_loss=0.1473, over 845624.06 frames. ], batch size: 293, lr: 3.07e-02, grad_scale: 32.0 2023-06-15 03:42:28,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=21633.333333333332, ans=0.2 2023-06-15 03:43:26,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=21833.333333333332, ans=0.07 2023-06-15 03:43:49,170 INFO [train.py:988] (3/4) Epoch 7, batch 100, loss[loss=0.3449, simple_loss=0.3998, pruned_loss=0.145, over 17167.00 frames. ], tot_loss[loss=0.3401, simple_loss=0.3837, pruned_loss=0.1482, over 1491501.51 frames. 
], batch size: 60, lr: 3.06e-02, grad_scale: 32.0 2023-06-15 03:44:06,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=22033.333333333332, ans=0.2 2023-06-15 03:44:13,047 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.700e+02 2.359e+02 2.777e+02 3.733e+02 1.026e+03, threshold=5.554e+02, percent-clipped=6.0 2023-06-15 03:44:44,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=22166.666666666668, ans=0.006050724637681159 2023-06-15 03:45:16,416 INFO [train.py:988] (3/4) Epoch 7, batch 150, loss[loss=0.3348, simple_loss=0.3734, pruned_loss=0.1481, over 20520.00 frames. ], tot_loss[loss=0.3388, simple_loss=0.3812, pruned_loss=0.1482, over 2015652.91 frames. ], batch size: 160, lr: 3.06e-02, grad_scale: 32.0 2023-06-15 03:45:41,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=22366.666666666668, ans=0.125 2023-06-15 03:46:19,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=22500.0, ans=0.125 2023-06-15 03:46:45,804 INFO [train.py:988] (3/4) Epoch 7, batch 200, loss[loss=0.3536, simple_loss=0.3943, pruned_loss=0.1565, over 20126.00 frames. ], tot_loss[loss=0.3373, simple_loss=0.3806, pruned_loss=0.147, over 2417165.28 frames. ], batch size: 133, lr: 3.05e-02, grad_scale: 32.0 2023-06-15 03:46:46,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=22633.333333333332, ans=0.2 2023-06-15 03:47:05,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=22700.0, ans=0.05 2023-06-15 03:47:11,642 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.670e+02 3.468e+02 4.313e+02 7.598e+02, threshold=6.936e+02, percent-clipped=8.0 2023-06-15 03:47:17,522 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.84 vs. limit=15.0 2023-06-15 03:47:33,407 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.60 vs. limit=22.5 2023-06-15 03:47:49,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=22833.333333333332, ans=0.5 2023-06-15 03:48:01,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=22900.0, ans=0.1 2023-06-15 03:48:06,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=22900.0, ans=10.0 2023-06-15 03:48:14,963 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.75 vs. limit=22.5 2023-06-15 03:48:15,541 INFO [train.py:988] (3/4) Epoch 7, batch 250, loss[loss=0.3304, simple_loss=0.3809, pruned_loss=0.1399, over 19559.00 frames. ], tot_loss[loss=0.336, simple_loss=0.3797, pruned_loss=0.1462, over 2729740.37 frames. 
], batch size: 102, lr: 3.05e-02, grad_scale: 32.0 2023-06-15 03:48:40,457 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.07 vs. limit=22.5 2023-06-15 03:48:43,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=23033.333333333332, ans=0.0 2023-06-15 03:49:22,971 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.32 vs. limit=15.0 2023-06-15 03:49:23,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=23166.666666666668, ans=0.1 2023-06-15 03:49:34,866 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.68 vs. limit=15.0 2023-06-15 03:49:40,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=23233.333333333332, ans=0.125 2023-06-15 03:49:44,565 INFO [train.py:988] (3/4) Epoch 7, batch 300, loss[loss=0.3407, simple_loss=0.3767, pruned_loss=0.1523, over 20702.00 frames. ], tot_loss[loss=0.3354, simple_loss=0.3795, pruned_loss=0.1456, over 2965134.61 frames. ], batch size: 211, lr: 3.05e-02, grad_scale: 32.0 2023-06-15 03:49:45,199 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=15.0 2023-06-15 03:50:08,804 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.792e+02 2.398e+02 2.879e+02 3.819e+02 6.544e+02, threshold=5.757e+02, percent-clipped=0.0 2023-06-15 03:50:11,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=23366.666666666668, ans=0.005789855072463768 2023-06-15 03:50:14,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=23366.666666666668, ans=0.125 2023-06-15 03:50:37,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=23500.0, ans=0.5 2023-06-15 03:50:37,809 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.72 vs. limit=6.0 2023-06-15 03:50:58,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=23566.666666666668, ans=0.035 2023-06-15 03:51:12,763 INFO [train.py:988] (3/4) Epoch 7, batch 350, loss[loss=0.3448, simple_loss=0.4046, pruned_loss=0.1425, over 15453.00 frames. ], tot_loss[loss=0.3358, simple_loss=0.3805, pruned_loss=0.1455, over 3133431.78 frames. ], batch size: 43, lr: 3.04e-02, grad_scale: 32.0 2023-06-15 03:51:41,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=23700.0, ans=0.125 2023-06-15 03:52:26,240 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.03 vs. limit=15.0 2023-06-15 03:52:41,271 INFO [train.py:988] (3/4) Epoch 7, batch 400, loss[loss=0.3393, simple_loss=0.397, pruned_loss=0.1408, over 16267.00 frames. ], tot_loss[loss=0.3349, simple_loss=0.3798, pruned_loss=0.145, over 3276844.32 frames. 
], batch size: 52, lr: 3.04e-02, grad_scale: 32.0 2023-06-15 03:52:43,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=23966.666666666668, ans=0.125 2023-06-15 03:52:56,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=24033.333333333332, ans=0.125 2023-06-15 03:52:59,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=24033.333333333332, ans=0.1 2023-06-15 03:53:06,056 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.768e+02 3.077e+02 3.837e+02 5.079e+02 9.527e+02, threshold=7.674e+02, percent-clipped=15.0 2023-06-15 03:53:27,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=24100.0, ans=0.0 2023-06-15 03:53:57,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=24233.333333333332, ans=0.125 2023-06-15 03:54:09,652 INFO [train.py:988] (3/4) Epoch 7, batch 450, loss[loss=0.3379, simple_loss=0.4016, pruned_loss=0.1371, over 17640.00 frames. ], tot_loss[loss=0.3343, simple_loss=0.3791, pruned_loss=0.1447, over 3386972.56 frames. ], batch size: 67, lr: 3.04e-02, grad_scale: 32.0 2023-06-15 03:54:10,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=24300.0, ans=0.0 2023-06-15 03:54:12,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=24300.0, ans=0.125 2023-06-15 03:54:26,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=24366.666666666668, ans=22.5 2023-06-15 03:54:28,039 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.35 vs. limit=15.0 2023-06-15 03:54:46,099 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=20.15 vs. limit=15.0 2023-06-15 03:54:48,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=24433.333333333332, ans=0.025 2023-06-15 03:55:11,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=24500.0, ans=0.2 2023-06-15 03:55:11,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=24500.0, ans=0.125 2023-06-15 03:55:28,289 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=9.87 vs. limit=15.0 2023-06-15 03:55:35,133 INFO [train.py:988] (3/4) Epoch 7, batch 500, loss[loss=0.3701, simple_loss=0.427, pruned_loss=0.1566, over 17898.00 frames. ], tot_loss[loss=0.3332, simple_loss=0.3781, pruned_loss=0.1441, over 3477255.70 frames. 
], batch size: 68, lr: 3.03e-02, grad_scale: 32.0 2023-06-15 03:55:58,181 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.744e+02 3.251e+02 4.322e+02 7.093e+02, threshold=6.501e+02, percent-clipped=0.0 2023-06-15 03:55:58,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=24700.0, ans=0.125 2023-06-15 03:56:18,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=24766.666666666668, ans=0.0 2023-06-15 03:56:23,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=24833.333333333332, ans=0.0 2023-06-15 03:56:53,622 INFO [train.py:988] (3/4) Epoch 8, batch 0, loss[loss=0.3381, simple_loss=0.3829, pruned_loss=0.1467, over 18928.00 frames. ], tot_loss[loss=0.3381, simple_loss=0.3829, pruned_loss=0.1467, over 18928.00 frames. ], batch size: 86, lr: 2.89e-02, grad_scale: 32.0 2023-06-15 03:56:53,623 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 03:56:59,692 INFO [train.py:1020] (3/4) Epoch 8, validation: loss=0.2482, simple_loss=0.3483, pruned_loss=0.0741, over 143649.00 frames. 2023-06-15 03:56:59,693 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 03:57:21,248 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.54 vs. limit=15.0 2023-06-15 03:57:52,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=25046.666666666668, ans=0.00542463768115942 2023-06-15 03:58:28,229 INFO [train.py:988] (3/4) Epoch 8, batch 50, loss[loss=0.3181, simple_loss=0.3577, pruned_loss=0.1392, over 20666.00 frames. ], tot_loss[loss=0.3281, simple_loss=0.3734, pruned_loss=0.1414, over 868812.34 frames. ], batch size: 211, lr: 2.88e-02, grad_scale: 32.0 2023-06-15 03:58:36,043 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.76 vs. limit=15.0 2023-06-15 03:59:04,456 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=15.0 2023-06-15 03:59:25,154 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.798e+02 2.660e+02 2.942e+02 3.457e+02 5.575e+02, threshold=5.885e+02, percent-clipped=0.0 2023-06-15 03:59:45,535 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.94 vs. limit=22.5 2023-06-15 03:59:57,948 INFO [train.py:988] (3/4) Epoch 8, batch 100, loss[loss=0.3369, simple_loss=0.3933, pruned_loss=0.1403, over 18279.00 frames. ], tot_loss[loss=0.3272, simple_loss=0.3728, pruned_loss=0.1408, over 1513013.22 frames. ], batch size: 74, lr: 2.88e-02, grad_scale: 32.0 2023-06-15 04:00:02,339 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.41 vs. 
limit=15.0 2023-06-15 04:01:01,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=25713.333333333332, ans=0.125 2023-06-15 04:01:14,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=25780.0, ans=0.05 2023-06-15 04:01:27,349 INFO [train.py:988] (3/4) Epoch 8, batch 150, loss[loss=0.3089, simple_loss=0.3648, pruned_loss=0.1265, over 19771.00 frames. ], tot_loss[loss=0.3272, simple_loss=0.3723, pruned_loss=0.141, over 2035450.77 frames. ], batch size: 115, lr: 2.87e-02, grad_scale: 32.0 2023-06-15 04:02:10,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=25980.0, ans=0.125 2023-06-15 04:02:11,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=25980.0, ans=15.0 2023-06-15 04:02:12,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=25980.0, ans=0.125 2023-06-15 04:02:23,625 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.658e+02 2.617e+02 3.387e+02 4.495e+02 9.103e+02, threshold=6.774e+02, percent-clipped=7.0 2023-06-15 04:02:43,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=26113.333333333332, ans=0.2 2023-06-15 04:02:49,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=26113.333333333332, ans=0.125 2023-06-15 04:02:55,973 INFO [train.py:988] (3/4) Epoch 8, batch 200, loss[loss=0.3135, simple_loss=0.3621, pruned_loss=0.1325, over 20269.00 frames. ], tot_loss[loss=0.3252, simple_loss=0.3709, pruned_loss=0.1398, over 2431169.35 frames. ], batch size: 149, lr: 2.87e-02, grad_scale: 32.0 2023-06-15 04:03:16,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=26246.666666666668, ans=0.04949747468305833 2023-06-15 04:03:25,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=26246.666666666668, ans=0.0 2023-06-15 04:03:26,689 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-15 04:03:40,996 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.16 vs. limit=15.0 2023-06-15 04:03:47,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=26380.0, ans=0.125 2023-06-15 04:04:01,863 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.53 vs. limit=15.0 2023-06-15 04:04:25,086 INFO [train.py:988] (3/4) Epoch 8, batch 250, loss[loss=0.3548, simple_loss=0.3541, pruned_loss=0.1778, over 16800.00 frames. ], tot_loss[loss=0.3233, simple_loss=0.3697, pruned_loss=0.1385, over 2738112.52 frames. 
], batch size: 391, lr: 2.87e-02, grad_scale: 32.0 2023-06-15 04:04:33,551 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.57 vs. limit=6.0 2023-06-15 04:05:25,749 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 2.237e+02 2.856e+02 3.826e+02 6.923e+02, threshold=5.713e+02, percent-clipped=1.0 2023-06-15 04:05:43,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=26780.0, ans=0.05 2023-06-15 04:05:57,937 INFO [train.py:988] (3/4) Epoch 8, batch 300, loss[loss=0.312, simple_loss=0.3664, pruned_loss=0.1288, over 19063.00 frames. ], tot_loss[loss=0.3227, simple_loss=0.37, pruned_loss=0.1376, over 2975559.43 frames. ], batch size: 89, lr: 2.86e-02, grad_scale: 32.0 2023-06-15 04:06:04,537 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=12.0 2023-06-15 04:06:32,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=26980.0, ans=0.125 2023-06-15 04:06:36,365 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.58 vs. limit=22.5 2023-06-15 04:06:48,389 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.56 vs. limit=15.0 2023-06-15 04:07:04,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=27046.666666666668, ans=0.004989855072463767 2023-06-15 04:07:27,561 INFO [train.py:988] (3/4) Epoch 8, batch 350, loss[loss=0.3134, simple_loss=0.3518, pruned_loss=0.1375, over 20700.00 frames. ], tot_loss[loss=0.3219, simple_loss=0.3699, pruned_loss=0.137, over 3146220.29 frames. ], batch size: 211, lr: 2.86e-02, grad_scale: 32.0 2023-06-15 04:08:08,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=27313.333333333332, ans=0.5 2023-06-15 04:08:09,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=27313.333333333332, ans=0.125 2023-06-15 04:08:18,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=27380.0, ans=0.0 2023-06-15 04:08:24,601 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.830e+02 2.602e+02 3.097e+02 4.121e+02 7.485e+02, threshold=6.195e+02, percent-clipped=4.0 2023-06-15 04:08:33,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=27380.0, ans=0.125 2023-06-15 04:08:56,015 INFO [train.py:988] (3/4) Epoch 8, batch 400, loss[loss=0.3076, simple_loss=0.3653, pruned_loss=0.1249, over 18636.00 frames. ], tot_loss[loss=0.322, simple_loss=0.3705, pruned_loss=0.1367, over 3286918.82 frames. 
], batch size: 80, lr: 2.85e-02, grad_scale: 32.0 2023-06-15 04:08:58,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=27513.333333333332, ans=0.0 2023-06-15 04:09:12,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=27513.333333333332, ans=0.1 2023-06-15 04:09:24,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=27580.0, ans=0.125 2023-06-15 04:09:29,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=27580.0, ans=0.0 2023-06-15 04:09:53,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=27713.333333333332, ans=0.125 2023-06-15 04:09:55,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=27713.333333333332, ans=0.0 2023-06-15 04:10:27,004 INFO [train.py:988] (3/4) Epoch 8, batch 450, loss[loss=0.3309, simple_loss=0.3939, pruned_loss=0.134, over 16295.00 frames. ], tot_loss[loss=0.3222, simple_loss=0.3706, pruned_loss=0.1369, over 3402304.22 frames. ], batch size: 52, lr: 2.85e-02, grad_scale: 32.0 2023-06-15 04:11:08,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=27980.0, ans=0.125 2023-06-15 04:11:18,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=28046.666666666668, ans=0.125 2023-06-15 04:11:23,246 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.630e+02 2.599e+02 3.032e+02 3.753e+02 5.594e+02, threshold=6.064e+02, percent-clipped=0.0 2023-06-15 04:11:37,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=28113.333333333332, ans=0.125 2023-06-15 04:11:45,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=28113.333333333332, ans=0.125 2023-06-15 04:11:53,601 INFO [train.py:988] (3/4) Epoch 8, batch 500, loss[loss=0.3293, simple_loss=0.3664, pruned_loss=0.1461, over 20667.00 frames. ], tot_loss[loss=0.3213, simple_loss=0.3702, pruned_loss=0.1362, over 3496791.26 frames. ], batch size: 211, lr: 2.85e-02, grad_scale: 32.0 2023-06-15 04:11:57,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=28180.0, ans=0.2 2023-06-15 04:12:11,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=28246.666666666668, ans=0.1 2023-06-15 04:12:29,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=28313.333333333332, ans=0.125 2023-06-15 04:12:33,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=28313.333333333332, ans=0.1 2023-06-15 04:13:14,344 INFO [train.py:988] (3/4) Epoch 9, batch 0, loss[loss=0.3351, simple_loss=0.3681, pruned_loss=0.151, over 20630.00 frames. ], tot_loss[loss=0.3351, simple_loss=0.3681, pruned_loss=0.151, over 20630.00 frames. 
], batch size: 189, lr: 2.72e-02, grad_scale: 32.0 2023-06-15 04:13:14,344 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 04:13:20,346 INFO [train.py:1020] (3/4) Epoch 9, validation: loss=0.2394, simple_loss=0.343, pruned_loss=0.06786, over 143649.00 frames. 2023-06-15 04:13:20,348 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 04:13:26,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=28393.333333333332, ans=0.1 2023-06-15 04:13:30,710 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=9.63 vs. limit=15.0 2023-06-15 04:13:47,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=28460.0, ans=0.125 2023-06-15 04:14:49,775 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.336e+02 2.823e+02 3.585e+02 6.203e+02, threshold=5.645e+02, percent-clipped=2.0 2023-06-15 04:14:49,824 INFO [train.py:988] (3/4) Epoch 9, batch 50, loss[loss=0.3556, simple_loss=0.4135, pruned_loss=0.1488, over 18308.00 frames. ], tot_loss[loss=0.3182, simple_loss=0.3665, pruned_loss=0.1349, over 860056.46 frames. ], batch size: 72, lr: 2.71e-02, grad_scale: 32.0 2023-06-15 04:15:40,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=28926.666666666668, ans=10.0 2023-06-15 04:15:48,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=28926.666666666668, ans=0.125 2023-06-15 04:16:10,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=28993.333333333332, ans=0.2 2023-06-15 04:16:17,506 INFO [train.py:988] (3/4) Epoch 9, batch 100, loss[loss=0.3289, simple_loss=0.3791, pruned_loss=0.1393, over 18772.00 frames. ], tot_loss[loss=0.3171, simple_loss=0.3674, pruned_loss=0.1334, over 1518454.77 frames. ], batch size: 83, lr: 2.71e-02, grad_scale: 32.0 2023-06-15 04:16:24,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=29060.0, ans=0.004552173913043479 2023-06-15 04:16:32,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=29060.0, ans=0.1 2023-06-15 04:17:02,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=29193.333333333332, ans=0.2 2023-06-15 04:17:21,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=29260.0, ans=0.125 2023-06-15 04:17:37,835 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.68 vs. limit=22.5 2023-06-15 04:17:44,311 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.757e+02 2.451e+02 3.023e+02 4.117e+02 8.643e+02, threshold=6.045e+02, percent-clipped=4.0 2023-06-15 04:17:44,362 INFO [train.py:988] (3/4) Epoch 9, batch 150, loss[loss=0.2987, simple_loss=0.3441, pruned_loss=0.1266, over 20123.00 frames. ], tot_loss[loss=0.316, simple_loss=0.3663, pruned_loss=0.1329, over 2021008.89 frames. 
], batch size: 133, lr: 2.70e-02, grad_scale: 32.0 2023-06-15 04:17:57,583 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.36 vs. limit=6.0 2023-06-15 04:18:06,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=29460.0, ans=0.1 2023-06-15 04:18:23,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=29526.666666666668, ans=10.0 2023-06-15 04:19:12,169 INFO [train.py:988] (3/4) Epoch 9, batch 200, loss[loss=0.2788, simple_loss=0.3512, pruned_loss=0.1032, over 10669.00 frames. ], tot_loss[loss=0.3159, simple_loss=0.3665, pruned_loss=0.1327, over 2391734.88 frames. ], batch size: 30, lr: 2.70e-02, grad_scale: 32.0 2023-06-15 04:19:21,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=29726.666666666668, ans=0.07 2023-06-15 04:19:28,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=29793.333333333332, ans=0.0043927536231884055 2023-06-15 04:19:55,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=29860.0, ans=0.125 2023-06-15 04:19:59,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=29860.0, ans=0.125 2023-06-15 04:20:11,850 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.75 vs. limit=15.0 2023-06-15 04:20:15,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=29926.666666666668, ans=0.0 2023-06-15 04:20:18,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=29926.666666666668, ans=0.125 2023-06-15 04:20:34,735 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.62 vs. limit=15.0 2023-06-15 04:20:41,747 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.780e+02 2.377e+02 2.781e+02 3.474e+02 5.000e+02, threshold=5.562e+02, percent-clipped=0.0 2023-06-15 04:20:41,798 INFO [train.py:988] (3/4) Epoch 9, batch 250, loss[loss=0.3337, simple_loss=0.3813, pruned_loss=0.143, over 18793.00 frames. ], tot_loss[loss=0.3148, simple_loss=0.3647, pruned_loss=0.1324, over 2715452.17 frames. ], batch size: 83, lr: 2.70e-02, grad_scale: 32.0 2023-06-15 04:20:57,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=30126.666666666668, ans=0.0 2023-06-15 04:21:06,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=30126.666666666668, ans=0.004320289855072463 2023-06-15 04:21:12,676 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.57 vs. 
limit=10.0 2023-06-15 04:21:21,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=30193.333333333332, ans=0.125 2023-06-15 04:21:26,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=30193.333333333332, ans=0.125 2023-06-15 04:21:52,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=30326.666666666668, ans=0.125 2023-06-15 04:22:08,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=30393.333333333332, ans=0.07 2023-06-15 04:22:09,955 INFO [train.py:988] (3/4) Epoch 9, batch 300, loss[loss=0.2967, simple_loss=0.3535, pruned_loss=0.1199, over 18605.00 frames. ], tot_loss[loss=0.3142, simple_loss=0.3648, pruned_loss=0.1318, over 2950042.52 frames. ], batch size: 80, lr: 2.69e-02, grad_scale: 32.0 2023-06-15 04:22:26,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=30460.0, ans=0.0042478260869565215 2023-06-15 04:22:49,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=30526.666666666668, ans=0.125 2023-06-15 04:23:12,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=30593.333333333332, ans=0.1 2023-06-15 04:23:13,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=30593.333333333332, ans=0.125 2023-06-15 04:23:24,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=30660.0, ans=0.004204347826086956 2023-06-15 04:23:24,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=30660.0, ans=0.004204347826086956 2023-06-15 04:23:26,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=30660.0, ans=0.05 2023-06-15 04:23:33,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=30660.0, ans=0.125 2023-06-15 04:23:40,498 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.682e+02 3.189e+02 4.179e+02 7.690e+02, threshold=6.378e+02, percent-clipped=10.0 2023-06-15 04:23:40,553 INFO [train.py:988] (3/4) Epoch 9, batch 350, loss[loss=0.3327, simple_loss=0.3571, pruned_loss=0.1542, over 19871.00 frames. ], tot_loss[loss=0.3136, simple_loss=0.3638, pruned_loss=0.1317, over 3131899.14 frames. ], batch size: 293, lr: 2.69e-02, grad_scale: 32.0 2023-06-15 04:23:45,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=30726.666666666668, ans=0.0 2023-06-15 04:24:01,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=30793.333333333332, ans=0.125 2023-06-15 04:24:56,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=30993.333333333332, ans=0.1 2023-06-15 04:25:09,347 INFO [train.py:988] (3/4) Epoch 9, batch 400, loss[loss=0.3139, simple_loss=0.3612, pruned_loss=0.1333, over 20645.00 frames. 
], tot_loss[loss=0.3135, simple_loss=0.3646, pruned_loss=0.1312, over 3268492.64 frames. ], batch size: 173, lr: 2.68e-02, grad_scale: 32.0 2023-06-15 04:25:28,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=31126.666666666668, ans=0.1 2023-06-15 04:25:31,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=31126.666666666668, ans=0.1 2023-06-15 04:25:49,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=31193.333333333332, ans=0.0 2023-06-15 04:26:15,999 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.71 vs. limit=22.5 2023-06-15 04:26:36,150 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.846e+02 2.307e+02 2.949e+02 3.902e+02 6.879e+02, threshold=5.899e+02, percent-clipped=2.0 2023-06-15 04:26:36,200 INFO [train.py:988] (3/4) Epoch 9, batch 450, loss[loss=0.3115, simple_loss=0.3596, pruned_loss=0.1317, over 20657.00 frames. ], tot_loss[loss=0.3122, simple_loss=0.3635, pruned_loss=0.1305, over 3383850.74 frames. ], batch size: 211, lr: 2.68e-02, grad_scale: 32.0 2023-06-15 04:26:40,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=31393.333333333332, ans=0.05 2023-06-15 04:26:40,830 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2023-06-15 04:26:57,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=31460.0, ans=0.05 2023-06-15 04:27:04,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=31460.0, ans=0.125 2023-06-15 04:27:04,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=31460.0, ans=0.125 2023-06-15 04:27:28,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=31593.333333333332, ans=0.2 2023-06-15 04:27:31,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=31593.333333333332, ans=0.1 2023-06-15 04:27:37,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=31593.333333333332, ans=0.125 2023-06-15 04:27:48,126 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.88 vs. limit=22.5 2023-06-15 04:27:52,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=31660.0, ans=0.1 2023-06-15 04:28:01,928 INFO [train.py:988] (3/4) Epoch 9, batch 500, loss[loss=0.3198, simple_loss=0.3644, pruned_loss=0.1376, over 19960.00 frames. ], tot_loss[loss=0.3118, simple_loss=0.3636, pruned_loss=0.13, over 3468077.45 frames. 
], batch size: 126, lr: 2.68e-02, grad_scale: 64.0 2023-06-15 04:28:10,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=31726.666666666668, ans=0.125 2023-06-15 04:28:12,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=31726.666666666668, ans=0.0 2023-06-15 04:28:20,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=31793.333333333332, ans=0.003957971014492754 2023-06-15 04:28:29,308 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.35 vs. limit=22.5 2023-06-15 04:28:40,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=31860.0, ans=0.2 2023-06-15 04:29:21,629 INFO [train.py:988] (3/4) Epoch 10, batch 0, loss[loss=0.2935, simple_loss=0.3463, pruned_loss=0.1203, over 20296.00 frames. ], tot_loss[loss=0.2935, simple_loss=0.3463, pruned_loss=0.1203, over 20296.00 frames. ], batch size: 141, lr: 2.56e-02, grad_scale: 64.0 2023-06-15 04:29:21,630 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 04:29:28,493 INFO [train.py:1020] (3/4) Epoch 10, validation: loss=0.2327, simple_loss=0.3375, pruned_loss=0.06395, over 143649.00 frames. 2023-06-15 04:29:28,494 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 04:29:39,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=31940.0, ans=0.125 2023-06-15 04:29:50,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=32006.666666666668, ans=0.02 2023-06-15 04:30:01,782 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.729e+02 2.277e+02 2.643e+02 3.234e+02 5.475e+02, threshold=5.286e+02, percent-clipped=0.0 2023-06-15 04:30:15,360 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.01 vs. limit=15.0 2023-06-15 04:30:16,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=32073.333333333332, ans=0.0 2023-06-15 04:30:21,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=32140.0, ans=0.2 2023-06-15 04:30:44,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=32206.666666666668, ans=0.125 2023-06-15 04:30:58,904 INFO [train.py:988] (3/4) Epoch 10, batch 50, loss[loss=0.3079, simple_loss=0.3638, pruned_loss=0.126, over 20325.00 frames. ], tot_loss[loss=0.3079, simple_loss=0.359, pruned_loss=0.1284, over 847052.64 frames. ], batch size: 149, lr: 2.56e-02, grad_scale: 64.0 2023-06-15 04:31:16,339 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.18 vs. 
limit=15.0 2023-06-15 04:31:20,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=32340.0, ans=0.1 2023-06-15 04:31:36,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=32406.666666666668, ans=0.125 2023-06-15 04:32:01,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=32473.333333333332, ans=0.0 2023-06-15 04:32:28,860 INFO [train.py:988] (3/4) Epoch 10, batch 100, loss[loss=0.3013, simple_loss=0.3508, pruned_loss=0.1259, over 20292.00 frames. ], tot_loss[loss=0.307, simple_loss=0.3577, pruned_loss=0.1282, over 1496633.09 frames. ], batch size: 149, lr: 2.55e-02, grad_scale: 64.0 2023-06-15 04:32:33,920 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.29 vs. limit=15.0 2023-06-15 04:33:02,081 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.681e+02 2.450e+02 2.873e+02 3.278e+02 7.765e+02, threshold=5.745e+02, percent-clipped=3.0 2023-06-15 04:33:06,803 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=4.91 vs. limit=15.0 2023-06-15 04:33:09,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=32740.0, ans=0.125 2023-06-15 04:33:39,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=32806.666666666664, ans=0.05 2023-06-15 04:33:59,940 INFO [train.py:988] (3/4) Epoch 10, batch 150, loss[loss=0.3113, simple_loss=0.3701, pruned_loss=0.1263, over 19073.00 frames. ], tot_loss[loss=0.3069, simple_loss=0.36, pruned_loss=0.1269, over 1999927.97 frames. ], batch size: 89, lr: 2.55e-02, grad_scale: 64.0 2023-06-15 04:34:11,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.18 vs. limit=15.0 2023-06-15 04:34:17,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=33006.666666666664, ans=0.2 2023-06-15 04:35:01,168 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.37 vs. limit=22.5 2023-06-15 04:35:30,177 INFO [train.py:988] (3/4) Epoch 10, batch 200, loss[loss=0.3343, simple_loss=0.3944, pruned_loss=0.137, over 17610.00 frames. ], tot_loss[loss=0.3067, simple_loss=0.36, pruned_loss=0.1267, over 2376595.96 frames. 
], batch size: 67, lr: 2.54e-02, grad_scale: 64.0 2023-06-15 04:35:37,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=33273.333333333336, ans=0.125 2023-06-15 04:36:02,397 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 2.312e+02 2.745e+02 3.367e+02 5.641e+02, threshold=5.490e+02, percent-clipped=0.0 2023-06-15 04:36:17,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=33406.666666666664, ans=0.125 2023-06-15 04:36:34,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=33473.333333333336, ans=0.1 2023-06-15 04:36:50,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=33540.0, ans=0.125 2023-06-15 04:36:55,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=33540.0, ans=0.125 2023-06-15 04:36:59,972 INFO [train.py:988] (3/4) Epoch 10, batch 250, loss[loss=0.307, simple_loss=0.3557, pruned_loss=0.1292, over 19962.00 frames. ], tot_loss[loss=0.3065, simple_loss=0.3601, pruned_loss=0.1265, over 2695969.97 frames. ], batch size: 126, lr: 2.54e-02, grad_scale: 64.0 2023-06-15 04:37:04,776 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=5.08 vs. limit=5.0 2023-06-15 04:37:07,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=33606.666666666664, ans=0.0 2023-06-15 04:37:10,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=33606.666666666664, ans=0.125 2023-06-15 04:37:14,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=33606.666666666664, ans=15.0 2023-06-15 04:37:17,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=33673.333333333336, ans=0.07 2023-06-15 04:37:36,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=33740.0, ans=0.1 2023-06-15 04:38:28,343 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.54 vs. limit=22.5 2023-06-15 04:38:29,342 INFO [train.py:988] (3/4) Epoch 10, batch 300, loss[loss=0.3023, simple_loss=0.3544, pruned_loss=0.125, over 19066.00 frames. ], tot_loss[loss=0.304, simple_loss=0.3575, pruned_loss=0.1252, over 2956808.44 frames. ], batch size: 89, lr: 2.54e-02, grad_scale: 64.0 2023-06-15 04:38:44,060 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.78 vs. 
limit=15.0 2023-06-15 04:39:01,818 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.638e+02 2.483e+02 2.953e+02 3.724e+02 5.914e+02, threshold=5.906e+02, percent-clipped=1.0 2023-06-15 04:39:18,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=34073.333333333336, ans=0.2 2023-06-15 04:39:23,908 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.23 vs. limit=15.0 2023-06-15 04:39:35,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=34140.0, ans=0.1 2023-06-15 04:39:48,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=34206.666666666664, ans=0.125 2023-06-15 04:39:59,125 INFO [train.py:988] (3/4) Epoch 10, batch 350, loss[loss=0.3099, simple_loss=0.3579, pruned_loss=0.1309, over 19100.00 frames. ], tot_loss[loss=0.3034, simple_loss=0.3572, pruned_loss=0.1248, over 3139441.75 frames. ], batch size: 94, lr: 2.53e-02, grad_scale: 64.0 2023-06-15 04:40:09,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=34273.333333333336, ans=0.0034188405797101447 2023-06-15 04:40:10,422 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.25 vs. limit=15.0 2023-06-15 04:40:11,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=34273.333333333336, ans=0.1 2023-06-15 04:40:32,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=34340.0, ans=0.1 2023-06-15 04:40:37,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=34406.666666666664, ans=0.125 2023-06-15 04:40:46,902 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.80 vs. limit=6.0 2023-06-15 04:40:51,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=34473.333333333336, ans=0.125 2023-06-15 04:41:01,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=34473.333333333336, ans=0.0 2023-06-15 04:41:13,595 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.11 vs. limit=6.0 2023-06-15 04:41:29,063 INFO [train.py:988] (3/4) Epoch 10, batch 400, loss[loss=0.2799, simple_loss=0.333, pruned_loss=0.1134, over 20107.00 frames. ], tot_loss[loss=0.3046, simple_loss=0.3589, pruned_loss=0.1251, over 3253027.82 frames. 
], batch size: 133, lr: 2.53e-02, grad_scale: 32.0 2023-06-15 04:41:53,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=34673.333333333336, ans=0.125 2023-06-15 04:42:02,403 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 2.389e+02 2.906e+02 3.855e+02 6.206e+02, threshold=5.812e+02, percent-clipped=1.0 2023-06-15 04:42:11,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=34740.0, ans=0.2 2023-06-15 04:42:12,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=34740.0, ans=0.125 2023-06-15 04:42:47,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=34873.333333333336, ans=10.0 2023-06-15 04:42:52,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=34873.333333333336, ans=0.125 2023-06-15 04:42:57,787 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.69 vs. limit=6.0 2023-06-15 04:42:58,379 INFO [train.py:988] (3/4) Epoch 10, batch 450, loss[loss=0.2947, simple_loss=0.3551, pruned_loss=0.1171, over 19234.00 frames. ], tot_loss[loss=0.3027, simple_loss=0.3583, pruned_loss=0.1235, over 3358436.74 frames. ], batch size: 92, lr: 2.52e-02, grad_scale: 32.0 2023-06-15 04:43:13,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=34940.0, ans=0.125 2023-06-15 04:43:19,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=35006.666666666664, ans=0.125 2023-06-15 04:44:04,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=35140.0, ans=0.0 2023-06-15 04:44:04,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=35140.0, ans=0.0 2023-06-15 04:44:20,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=35206.666666666664, ans=0.1 2023-06-15 04:44:24,844 INFO [train.py:988] (3/4) Epoch 10, batch 500, loss[loss=0.3087, simple_loss=0.3592, pruned_loss=0.1291, over 19518.00 frames. ], tot_loss[loss=0.302, simple_loss=0.3573, pruned_loss=0.1234, over 3463309.21 frames. 
], batch size: 102, lr: 2.52e-02, grad_scale: 32.0 2023-06-15 04:44:38,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=35273.333333333336, ans=0.125 2023-06-15 04:44:44,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=35340.0, ans=0.125 2023-06-15 04:44:56,327 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.694e+02 2.429e+02 2.839e+02 3.294e+02 4.521e+02, threshold=5.678e+02, percent-clipped=0.0 2023-06-15 04:44:56,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=35406.666666666664, ans=0.0031724637681159427 2023-06-15 04:45:06,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=35406.666666666664, ans=0.07 2023-06-15 04:45:44,023 INFO [train.py:988] (3/4) Epoch 11, batch 0, loss[loss=0.3008, simple_loss=0.3586, pruned_loss=0.1214, over 19534.00 frames. ], tot_loss[loss=0.3008, simple_loss=0.3586, pruned_loss=0.1214, over 19534.00 frames. ], batch size: 102, lr: 2.42e-02, grad_scale: 32.0 2023-06-15 04:45:44,023 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 04:45:50,094 INFO [train.py:1020] (3/4) Epoch 11, validation: loss=0.2306, simple_loss=0.3357, pruned_loss=0.06271, over 143649.00 frames. 2023-06-15 04:45:50,096 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 04:45:51,532 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.70 vs. limit=15.0 2023-06-15 04:45:58,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=35493.333333333336, ans=0.2 2023-06-15 04:46:41,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=35693.333333333336, ans=0.125 2023-06-15 04:46:43,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=35693.333333333336, ans=0.1 2023-06-15 04:46:52,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=35693.333333333336, ans=0.125 2023-06-15 04:47:03,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=35760.0, ans=0.025 2023-06-15 04:47:19,227 INFO [train.py:988] (3/4) Epoch 11, batch 50, loss[loss=0.2974, simple_loss=0.3153, pruned_loss=0.1398, over 16707.00 frames. ], tot_loss[loss=0.2974, simple_loss=0.3521, pruned_loss=0.1214, over 858742.93 frames. 
], batch size: 391, lr: 2.41e-02, grad_scale: 32.0 2023-06-15 04:47:28,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=35826.666666666664, ans=0.125 2023-06-15 04:47:31,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=35826.666666666664, ans=0.05 2023-06-15 04:47:31,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=35826.666666666664, ans=0.0 2023-06-15 04:47:31,649 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 04:47:39,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=35893.333333333336, ans=0.125 2023-06-15 04:47:51,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=35893.333333333336, ans=0.0 2023-06-15 04:47:56,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=35960.0, ans=0.125 2023-06-15 04:48:22,828 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.697e+02 2.367e+02 2.815e+02 3.714e+02 5.103e+02, threshold=5.629e+02, percent-clipped=0.0 2023-06-15 04:48:32,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=36093.333333333336, ans=0.0 2023-06-15 04:48:32,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=36093.333333333336, ans=0.2 2023-06-15 04:48:47,416 INFO [train.py:988] (3/4) Epoch 11, batch 100, loss[loss=0.2853, simple_loss=0.332, pruned_loss=0.1193, over 20641.00 frames. ], tot_loss[loss=0.2943, simple_loss=0.3506, pruned_loss=0.119, over 1511549.76 frames. ], batch size: 211, lr: 2.41e-02, grad_scale: 32.0 2023-06-15 04:48:49,969 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.49 vs. limit=15.0 2023-06-15 04:48:55,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=36160.0, ans=0.2 2023-06-15 04:49:08,701 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.00 vs. limit=15.0 2023-06-15 04:49:20,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=36226.666666666664, ans=0.125 2023-06-15 04:49:51,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=36360.0, ans=0.1 2023-06-15 04:49:56,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=36360.0, ans=0.0029652173913043475 2023-06-15 04:50:10,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=36426.666666666664, ans=0.0 2023-06-15 04:50:18,757 INFO [train.py:988] (3/4) Epoch 11, batch 150, loss[loss=0.307, simple_loss=0.317, pruned_loss=0.1485, over 17197.00 frames. ], tot_loss[loss=0.2953, simple_loss=0.3519, pruned_loss=0.1194, over 2005776.20 frames. 
], batch size: 391, lr: 2.40e-02, grad_scale: 32.0 2023-06-15 04:50:21,109 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=12.0 2023-06-15 04:51:17,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=36693.333333333336, ans=0.0 2023-06-15 04:51:17,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff2.min_abs, batch_count=36693.333333333336, ans=0.1 2023-06-15 04:51:22,611 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.663e+02 2.254e+02 2.489e+02 3.022e+02 4.758e+02, threshold=4.979e+02, percent-clipped=0.0 2023-06-15 04:51:24,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=36693.333333333336, ans=0.125 2023-06-15 04:51:37,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=36760.0, ans=0.1 2023-06-15 04:51:44,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=36760.0, ans=0.2 2023-06-15 04:51:46,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=36826.666666666664, ans=0.125 2023-06-15 04:51:47,725 INFO [train.py:988] (3/4) Epoch 11, batch 200, loss[loss=0.3076, simple_loss=0.3197, pruned_loss=0.1477, over 16789.00 frames. ], tot_loss[loss=0.2951, simple_loss=0.352, pruned_loss=0.1191, over 2402919.15 frames. ], batch size: 391, lr: 2.40e-02, grad_scale: 32.0 2023-06-15 04:51:48,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=36826.666666666664, ans=0.2 2023-06-15 04:52:01,660 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.33 vs. limit=15.0 2023-06-15 04:52:21,750 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.84 vs. limit=15.0 2023-06-15 04:52:47,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=37026.666666666664, ans=0.2 2023-06-15 04:52:55,766 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 04:53:01,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=37093.333333333336, ans=0.125 2023-06-15 04:53:17,705 INFO [train.py:988] (3/4) Epoch 11, batch 250, loss[loss=0.2732, simple_loss=0.3341, pruned_loss=0.1062, over 19537.00 frames. ], tot_loss[loss=0.2946, simple_loss=0.3511, pruned_loss=0.1191, over 2719627.39 frames. ], batch size: 102, lr: 2.40e-02, grad_scale: 32.0 2023-06-15 04:53:27,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=37160.0, ans=0.125 2023-06-15 04:53:50,865 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.01 vs. 
limit=15.0 2023-06-15 04:53:57,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=37293.333333333336, ans=0.125 2023-06-15 04:54:08,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=37293.333333333336, ans=0.1 2023-06-15 04:54:09,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=37360.0, ans=10.0 2023-06-15 04:54:13,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=37360.0, ans=0.125 2023-06-15 04:54:16,478 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.54 vs. limit=15.0 2023-06-15 04:54:21,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=37360.0, ans=0.1 2023-06-15 04:54:22,494 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.637e+02 2.173e+02 2.592e+02 3.214e+02 4.591e+02, threshold=5.183e+02, percent-clipped=0.0 2023-06-15 04:54:48,160 INFO [train.py:988] (3/4) Epoch 11, batch 300, loss[loss=0.2984, simple_loss=0.3631, pruned_loss=0.1168, over 19059.00 frames. ], tot_loss[loss=0.2941, simple_loss=0.3508, pruned_loss=0.1187, over 2952647.35 frames. ], batch size: 89, lr: 2.39e-02, grad_scale: 32.0 2023-06-15 04:54:57,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=37493.333333333336, ans=0.0027188405797101455 2023-06-15 04:55:37,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=37626.666666666664, ans=0.1 2023-06-15 04:55:51,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=37693.333333333336, ans=0.125 2023-06-15 04:55:57,219 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.40 vs. limit=22.5 2023-06-15 04:56:18,503 INFO [train.py:988] (3/4) Epoch 11, batch 350, loss[loss=0.2981, simple_loss=0.3483, pruned_loss=0.1239, over 20239.00 frames. ], tot_loss[loss=0.2947, simple_loss=0.3506, pruned_loss=0.1195, over 3128348.74 frames. ], batch size: 141, lr: 2.39e-02, grad_scale: 32.0 2023-06-15 04:56:54,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=37960.0, ans=0.2 2023-06-15 04:57:23,406 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.398e+02 2.846e+02 3.298e+02 5.496e+02, threshold=5.692e+02, percent-clipped=3.0 2023-06-15 04:57:48,724 INFO [train.py:988] (3/4) Epoch 11, batch 400, loss[loss=0.2913, simple_loss=0.3348, pruned_loss=0.1239, over 20343.00 frames. ], tot_loss[loss=0.2945, simple_loss=0.351, pruned_loss=0.119, over 3256807.12 frames. ], batch size: 239, lr: 2.38e-02, grad_scale: 32.0 2023-06-15 04:58:08,024 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.82 vs. 
limit=6.0 2023-06-15 04:58:12,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=38226.666666666664, ans=0.125 2023-06-15 04:58:24,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=38293.333333333336, ans=0.125 2023-06-15 04:58:49,960 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.17 vs. limit=15.0 2023-06-15 04:58:52,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=38360.0, ans=0.07 2023-06-15 04:59:18,290 INFO [train.py:988] (3/4) Epoch 11, batch 450, loss[loss=0.2858, simple_loss=0.351, pruned_loss=0.1103, over 18461.00 frames. ], tot_loss[loss=0.2942, simple_loss=0.3516, pruned_loss=0.1184, over 3369571.25 frames. ], batch size: 77, lr: 2.38e-02, grad_scale: 32.0 2023-06-15 04:59:22,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=38493.333333333336, ans=0.05 2023-06-15 05:00:04,366 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=17.59 vs. limit=15.0 2023-06-15 05:00:21,929 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.767e+02 2.134e+02 2.582e+02 3.273e+02 5.590e+02, threshold=5.163e+02, percent-clipped=0.0 2023-06-15 05:00:35,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=38760.0, ans=0.125 2023-06-15 05:00:45,634 INFO [train.py:988] (3/4) Epoch 11, batch 500, loss[loss=0.2636, simple_loss=0.326, pruned_loss=0.1006, over 18777.00 frames. ], tot_loss[loss=0.2942, simple_loss=0.3512, pruned_loss=0.1186, over 3454416.00 frames. ], batch size: 83, lr: 2.38e-02, grad_scale: 32.0 2023-06-15 05:02:04,592 INFO [train.py:988] (3/4) Epoch 12, batch 0, loss[loss=0.3012, simple_loss=0.3762, pruned_loss=0.1131, over 16717.00 frames. ], tot_loss[loss=0.3012, simple_loss=0.3762, pruned_loss=0.1131, over 16717.00 frames. ], batch size: 59, lr: 2.28e-02, grad_scale: 32.0 2023-06-15 05:02:04,592 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 05:02:10,663 INFO [train.py:1020] (3/4) Epoch 12, validation: loss=0.2286, simple_loss=0.3321, pruned_loss=0.06259, over 143649.00 frames. 2023-06-15 05:02:10,664 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 05:02:34,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=39106.666666666664, ans=0.2 2023-06-15 05:02:35,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=39106.666666666664, ans=0.125 2023-06-15 05:02:43,441 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 05:02:52,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=39173.333333333336, ans=0.95 2023-06-15 05:03:03,310 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.34 vs. 
limit=15.0 2023-06-15 05:03:22,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=39306.666666666664, ans=0.2 2023-06-15 05:03:39,957 INFO [train.py:988] (3/4) Epoch 12, batch 50, loss[loss=0.265, simple_loss=0.3328, pruned_loss=0.09859, over 19099.00 frames. ], tot_loss[loss=0.2927, simple_loss=0.3507, pruned_loss=0.1173, over 851872.48 frames. ], batch size: 89, lr: 2.28e-02, grad_scale: 32.0 2023-06-15 05:03:47,412 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.648e+02 2.229e+02 2.614e+02 3.246e+02 5.755e+02, threshold=5.228e+02, percent-clipped=1.0 2023-06-15 05:03:53,893 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.45 vs. limit=6.0 2023-06-15 05:03:58,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=39440.0, ans=0.125 2023-06-15 05:04:01,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=39440.0, ans=0.0 2023-06-15 05:04:10,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=39440.0, ans=0.2 2023-06-15 05:05:08,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=39706.666666666664, ans=0.5 2023-06-15 05:05:09,396 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.35 vs. limit=12.0 2023-06-15 05:05:09,984 INFO [train.py:988] (3/4) Epoch 12, batch 100, loss[loss=0.2879, simple_loss=0.3518, pruned_loss=0.112, over 19937.00 frames. ], tot_loss[loss=0.2916, simple_loss=0.3497, pruned_loss=0.1168, over 1509115.80 frames. ], batch size: 126, lr: 2.28e-02, grad_scale: 32.0 2023-06-15 05:05:15,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=39706.666666666664, ans=0.09899494936611666 2023-06-15 05:05:23,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=39706.666666666664, ans=0.5 2023-06-15 05:05:33,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=39773.333333333336, ans=0.0022231884057971017 2023-06-15 05:06:40,217 INFO [train.py:988] (3/4) Epoch 12, batch 150, loss[loss=0.3104, simple_loss=0.3835, pruned_loss=0.1187, over 16302.00 frames. ], tot_loss[loss=0.2923, simple_loss=0.3506, pruned_loss=0.117, over 2007353.08 frames. ], batch size: 52, lr: 2.27e-02, grad_scale: 32.0 2023-06-15 05:06:46,882 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.645e+02 2.288e+02 2.648e+02 3.128e+02 5.617e+02, threshold=5.296e+02, percent-clipped=1.0 2023-06-15 05:06:57,414 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.98 vs. limit=15.0 2023-06-15 05:06:57,534 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.36 vs. 
limit=10.0 2023-06-15 05:07:55,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=40306.666666666664, ans=0.125 2023-06-15 05:08:09,676 INFO [train.py:988] (3/4) Epoch 12, batch 200, loss[loss=0.3056, simple_loss=0.3736, pruned_loss=0.1187, over 16774.00 frames. ], tot_loss[loss=0.2932, simple_loss=0.3518, pruned_loss=0.1173, over 2400782.55 frames. ], batch size: 59, lr: 2.27e-02, grad_scale: 32.0 2023-06-15 05:08:11,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=40373.333333333336, ans=0.125 2023-06-15 05:08:23,912 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.18 vs. limit=10.0 2023-06-15 05:08:32,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=40440.0, ans=0.1 2023-06-15 05:08:34,249 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.69 vs. limit=15.0 2023-06-15 05:08:49,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=40506.666666666664, ans=0.125 2023-06-15 05:09:26,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=40640.0, ans=0.0 2023-06-15 05:09:39,244 INFO [train.py:988] (3/4) Epoch 12, batch 250, loss[loss=0.3044, simple_loss=0.3735, pruned_loss=0.1176, over 18331.00 frames. ], tot_loss[loss=0.2929, simple_loss=0.3519, pruned_loss=0.1169, over 2692209.49 frames. ], batch size: 72, lr: 2.27e-02, grad_scale: 32.0 2023-06-15 05:09:46,510 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 2.177e+02 2.544e+02 3.093e+02 5.809e+02, threshold=5.088e+02, percent-clipped=2.0 2023-06-15 05:09:47,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=40706.666666666664, ans=0.1 2023-06-15 05:09:51,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=40706.666666666664, ans=0.1 2023-06-15 05:10:12,688 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 05:10:44,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=40906.666666666664, ans=0.125 2023-06-15 05:11:05,440 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.83 vs. limit=15.0 2023-06-15 05:11:09,729 INFO [train.py:988] (3/4) Epoch 12, batch 300, loss[loss=0.3088, simple_loss=0.3198, pruned_loss=0.1489, over 17047.00 frames. ], tot_loss[loss=0.2924, simple_loss=0.3511, pruned_loss=0.1169, over 2945593.30 frames. 
], batch size: 391, lr: 2.26e-02, grad_scale: 32.0 2023-06-15 05:11:19,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=41040.0, ans=0.125 2023-06-15 05:11:26,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=41106.666666666664, ans=10.0 2023-06-15 05:11:41,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=41106.666666666664, ans=0.0019333333333333338 2023-06-15 05:12:28,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=41306.666666666664, ans=0.1 2023-06-15 05:12:39,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=41373.333333333336, ans=0.0018753623188405791 2023-06-15 05:12:40,653 INFO [train.py:988] (3/4) Epoch 12, batch 350, loss[loss=0.2972, simple_loss=0.3697, pruned_loss=0.1124, over 17123.00 frames. ], tot_loss[loss=0.2911, simple_loss=0.3502, pruned_loss=0.116, over 3120737.03 frames. ], batch size: 60, lr: 2.26e-02, grad_scale: 32.0 2023-06-15 05:12:46,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=41373.333333333336, ans=0.125 2023-06-15 05:12:47,499 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.640e+02 2.134e+02 2.417e+02 3.017e+02 4.561e+02, threshold=4.834e+02, percent-clipped=0.0 2023-06-15 05:13:02,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=41440.0, ans=0.0018608695652173914 2023-06-15 05:13:03,831 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=5.30 vs. limit=15.0 2023-06-15 05:13:28,290 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.39 vs. limit=10.0 2023-06-15 05:13:37,282 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.00 vs. limit=6.0 2023-06-15 05:13:43,216 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.89 vs. limit=6.0 2023-06-15 05:13:48,083 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.68 vs. limit=22.5 2023-06-15 05:14:05,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=41640.0, ans=0.125 2023-06-15 05:14:10,557 INFO [train.py:988] (3/4) Epoch 12, batch 400, loss[loss=0.2821, simple_loss=0.3487, pruned_loss=0.1078, over 19232.00 frames. ], tot_loss[loss=0.2894, simple_loss=0.3487, pruned_loss=0.1151, over 3270307.64 frames. ], batch size: 92, lr: 2.25e-02, grad_scale: 32.0 2023-06-15 05:14:27,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=41773.333333333336, ans=0.125 2023-06-15 05:14:40,886 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.32 vs. 
limit=22.5 2023-06-15 05:14:48,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=41840.0, ans=0.125 2023-06-15 05:15:40,744 INFO [train.py:988] (3/4) Epoch 12, batch 450, loss[loss=0.2843, simple_loss=0.3613, pruned_loss=0.1036, over 18307.00 frames. ], tot_loss[loss=0.2887, simple_loss=0.3481, pruned_loss=0.1147, over 3385695.99 frames. ], batch size: 72, lr: 2.25e-02, grad_scale: 32.0 2023-06-15 05:15:45,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=42040.0, ans=0.2 2023-06-15 05:15:46,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=42040.0, ans=0.125 2023-06-15 05:15:48,080 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.759e+02 2.312e+02 2.678e+02 3.302e+02 6.342e+02, threshold=5.355e+02, percent-clipped=8.0 2023-06-15 05:15:57,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=42106.666666666664, ans=0.2 2023-06-15 05:16:33,215 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.41 vs. limit=15.0 2023-06-15 05:16:39,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=42240.0, ans=0.2 2023-06-15 05:16:43,656 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.38 vs. limit=22.5 2023-06-15 05:17:08,283 INFO [train.py:988] (3/4) Epoch 12, batch 500, loss[loss=0.2726, simple_loss=0.337, pruned_loss=0.1041, over 19858.00 frames. ], tot_loss[loss=0.2887, simple_loss=0.348, pruned_loss=0.1147, over 3476328.84 frames. ], batch size: 120, lr: 2.25e-02, grad_scale: 32.0 2023-06-15 05:17:11,047 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.14 vs. limit=22.5 2023-06-15 05:17:30,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=42440.0, ans=0.2 2023-06-15 05:17:37,661 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 05:17:48,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=42506.666666666664, ans=0.2 2023-06-15 05:18:28,971 INFO [train.py:988] (3/4) Epoch 13, batch 0, loss[loss=0.2767, simple_loss=0.3448, pruned_loss=0.1044, over 19342.00 frames. ], tot_loss[loss=0.2767, simple_loss=0.3448, pruned_loss=0.1044, over 19342.00 frames. ], batch size: 98, lr: 2.16e-02, grad_scale: 32.0 2023-06-15 05:18:28,971 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 05:18:35,086 INFO [train.py:1020] (3/4) Epoch 13, validation: loss=0.2246, simple_loss=0.3282, pruned_loss=0.06053, over 143649.00 frames. 
2023-06-15 05:18:35,087 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 05:18:39,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=42593.333333333336, ans=0.125 2023-06-15 05:19:01,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=42660.0, ans=0.0 2023-06-15 05:19:13,317 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.726e+02 2.235e+02 2.660e+02 3.477e+02 4.514e+02, threshold=5.320e+02, percent-clipped=0.0 2023-06-15 05:19:26,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=42726.666666666664, ans=10.0 2023-06-15 05:20:04,779 INFO [train.py:988] (3/4) Epoch 13, batch 50, loss[loss=0.2789, simple_loss=0.3409, pruned_loss=0.1084, over 19837.00 frames. ], tot_loss[loss=0.2873, simple_loss=0.3455, pruned_loss=0.1146, over 852212.97 frames. ], batch size: 120, lr: 2.16e-02, grad_scale: 32.0 2023-06-15 05:20:10,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=42926.666666666664, ans=0.0 2023-06-15 05:21:24,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=43193.333333333336, ans=0.0 2023-06-15 05:21:32,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=43260.0, ans=0.125 2023-06-15 05:21:33,300 INFO [train.py:988] (3/4) Epoch 13, batch 100, loss[loss=0.281, simple_loss=0.3397, pruned_loss=0.1111, over 19338.00 frames. ], tot_loss[loss=0.2849, simple_loss=0.3454, pruned_loss=0.1122, over 1506761.68 frames. ], batch size: 98, lr: 2.16e-02, grad_scale: 32.0 2023-06-15 05:21:43,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=43260.0, ans=0.1 2023-06-15 05:21:49,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=43326.666666666664, ans=0.00145072463768116 2023-06-15 05:21:51,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=43326.666666666664, ans=0.09899494936611666 2023-06-15 05:21:57,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=43326.666666666664, ans=0.125 2023-06-15 05:22:10,413 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.673e+02 1.949e+02 2.269e+02 2.644e+02 4.836e+02, threshold=4.538e+02, percent-clipped=0.0 2023-06-15 05:22:10,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=43393.333333333336, ans=0.0 2023-06-15 05:22:15,030 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-15 05:22:25,049 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.79 vs. 
limit=15.0 2023-06-15 05:22:37,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=43460.0, ans=0.0 2023-06-15 05:22:47,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=43526.666666666664, ans=0.125 2023-06-15 05:23:00,758 INFO [train.py:988] (3/4) Epoch 13, batch 150, loss[loss=0.2856, simple_loss=0.3491, pruned_loss=0.111, over 19354.00 frames. ], tot_loss[loss=0.2844, simple_loss=0.3448, pruned_loss=0.112, over 2010669.85 frames. ], batch size: 98, lr: 2.15e-02, grad_scale: 32.0 2023-06-15 05:23:06,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=43593.333333333336, ans=0.125 2023-06-15 05:23:13,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=43593.333333333336, ans=0.1 2023-06-15 05:23:16,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=43660.0, ans=0.125 2023-06-15 05:23:27,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=43660.0, ans=0.95 2023-06-15 05:23:52,380 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.36 vs. limit=15.0 2023-06-15 05:23:58,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=43793.333333333336, ans=0.1 2023-06-15 05:24:00,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=43793.333333333336, ans=0.0 2023-06-15 05:24:02,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=43793.333333333336, ans=0.1 2023-06-15 05:24:13,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=43860.0, ans=0.125 2023-06-15 05:24:28,509 INFO [train.py:988] (3/4) Epoch 13, batch 200, loss[loss=0.253, simple_loss=0.33, pruned_loss=0.08799, over 19223.00 frames. ], tot_loss[loss=0.2842, simple_loss=0.3449, pruned_loss=0.1118, over 2401944.05 frames. 
], batch size: 92, lr: 2.15e-02, grad_scale: 32.0 2023-06-15 05:24:44,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=43993.333333333336, ans=0.125 2023-06-15 05:24:59,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=43993.333333333336, ans=0.1 2023-06-15 05:25:05,903 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.630e+02 2.172e+02 2.424e+02 2.924e+02 5.184e+02, threshold=4.848e+02, percent-clipped=5.0 2023-06-15 05:25:16,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=44060.0, ans=0.1 2023-06-15 05:25:23,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=44126.666666666664, ans=0.125 2023-06-15 05:25:56,674 INFO [train.py:988] (3/4) Epoch 13, batch 250, loss[loss=0.2535, simple_loss=0.3219, pruned_loss=0.09259, over 19837.00 frames. ], tot_loss[loss=0.2842, simple_loss=0.3445, pruned_loss=0.1119, over 2705826.22 frames. ], batch size: 120, lr: 2.15e-02, grad_scale: 16.0 2023-06-15 05:26:23,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=44326.666666666664, ans=0.2 2023-06-15 05:26:34,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=44393.333333333336, ans=0.5 2023-06-15 05:26:46,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=44393.333333333336, ans=0.125 2023-06-15 05:26:46,837 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=5.484e-03 2023-06-15 05:26:57,322 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.71 vs. limit=10.0 2023-06-15 05:27:14,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=44526.666666666664, ans=0.1 2023-06-15 05:27:18,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=44526.666666666664, ans=0.0011898550724637694 2023-06-15 05:27:24,265 INFO [train.py:988] (3/4) Epoch 13, batch 300, loss[loss=0.2541, simple_loss=0.3206, pruned_loss=0.0938, over 19899.00 frames. ], tot_loss[loss=0.2836, simple_loss=0.3436, pruned_loss=0.1118, over 2933978.05 frames. ], batch size: 120, lr: 2.14e-02, grad_scale: 16.0 2023-06-15 05:27:27,442 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.06 vs. 
limit=15.0 2023-06-15 05:27:28,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=44593.333333333336, ans=0.0 2023-06-15 05:27:58,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=44726.666666666664, ans=0.2 2023-06-15 05:28:03,430 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.717e+02 2.137e+02 2.490e+02 3.192e+02 5.767e+02, threshold=4.980e+02, percent-clipped=3.0 2023-06-15 05:28:19,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=44793.333333333336, ans=0.0 2023-06-15 05:28:52,661 INFO [train.py:988] (3/4) Epoch 13, batch 350, loss[loss=0.2731, simple_loss=0.3337, pruned_loss=0.1063, over 19542.00 frames. ], tot_loss[loss=0.2832, simple_loss=0.3433, pruned_loss=0.1116, over 3104272.76 frames. ], batch size: 102, lr: 2.14e-02, grad_scale: 16.0 2023-06-15 05:29:45,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=45126.666666666664, ans=0.0010594202898550724 2023-06-15 05:30:18,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=45260.0, ans=0.2 2023-06-15 05:30:19,950 INFO [train.py:988] (3/4) Epoch 13, batch 400, loss[loss=0.2551, simple_loss=0.3216, pruned_loss=0.09428, over 19457.00 frames. ], tot_loss[loss=0.2827, simple_loss=0.3435, pruned_loss=0.1109, over 3243173.74 frames. ], batch size: 105, lr: 2.14e-02, grad_scale: 32.0 2023-06-15 05:30:26,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=45260.0, ans=0.125 2023-06-15 05:30:31,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=45260.0, ans=0.1 2023-06-15 05:30:59,581 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.082e+02 2.371e+02 2.770e+02 5.646e+02, threshold=4.742e+02, percent-clipped=0.0 2023-06-15 05:31:00,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=45393.333333333336, ans=0.125 2023-06-15 05:31:03,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=45393.333333333336, ans=0.0 2023-06-15 05:31:08,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=45393.333333333336, ans=0.1 2023-06-15 05:31:48,744 INFO [train.py:988] (3/4) Epoch 13, batch 450, loss[loss=0.2544, simple_loss=0.3245, pruned_loss=0.09215, over 19461.00 frames. ], tot_loss[loss=0.2823, simple_loss=0.3433, pruned_loss=0.1106, over 3366882.43 frames. 
], batch size: 105, lr: 2.13e-02, grad_scale: 32.0 2023-06-15 05:31:52,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=45593.333333333336, ans=0.0009579710144927527 2023-06-15 05:32:03,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=45593.333333333336, ans=10.0 2023-06-15 05:32:16,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=45660.0, ans=0.1 2023-06-15 05:32:16,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=45660.0, ans=0.0009434782608695649 2023-06-15 05:32:17,562 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.58 vs. limit=15.0 2023-06-15 05:32:50,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=45793.333333333336, ans=0.0009144927536231875 2023-06-15 05:33:06,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=45860.0, ans=0.125 2023-06-15 05:33:13,475 INFO [train.py:988] (3/4) Epoch 13, batch 500, loss[loss=0.2876, simple_loss=0.3451, pruned_loss=0.115, over 18946.00 frames. ], tot_loss[loss=0.2812, simple_loss=0.3425, pruned_loss=0.1099, over 3468623.51 frames. ], batch size: 86, lr: 2.13e-02, grad_scale: 32.0 2023-06-15 05:33:30,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=45993.333333333336, ans=0.04949747468305833 2023-06-15 05:33:35,341 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.50 vs. limit=22.5 2023-06-15 05:33:49,328 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.752e+02 2.112e+02 2.450e+02 3.177e+02 4.704e+02, threshold=4.901e+02, percent-clipped=1.0 2023-06-15 05:34:31,003 INFO [train.py:988] (3/4) Epoch 14, batch 0, loss[loss=0.2998, simple_loss=0.3714, pruned_loss=0.1141, over 13513.00 frames. ], tot_loss[loss=0.2998, simple_loss=0.3714, pruned_loss=0.1141, over 13513.00 frames. ], batch size: 38, lr: 2.05e-02, grad_scale: 32.0 2023-06-15 05:34:31,004 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 05:34:37,022 INFO [train.py:1020] (3/4) Epoch 14, validation: loss=0.2205, simple_loss=0.3248, pruned_loss=0.05804, over 143649.00 frames. 2023-06-15 05:34:37,022 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 05:34:54,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=46206.666666666664, ans=0.125 2023-06-15 05:36:03,656 INFO [train.py:988] (3/4) Epoch 14, batch 50, loss[loss=0.2748, simple_loss=0.3426, pruned_loss=0.1035, over 19334.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.3401, pruned_loss=0.1068, over 842476.21 frames. 
], batch size: 98, lr: 2.05e-02, grad_scale: 32.0 2023-06-15 05:36:36,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=46540.0, ans=0.05 2023-06-15 05:36:38,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=46606.666666666664, ans=0.125 2023-06-15 05:36:40,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=46606.666666666664, ans=0.07 2023-06-15 05:37:14,890 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.706e+02 2.122e+02 2.332e+02 2.601e+02 5.252e+02, threshold=4.663e+02, percent-clipped=1.0 2023-06-15 05:37:21,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=46740.0, ans=10.0 2023-06-15 05:37:31,885 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.38 vs. limit=10.0 2023-06-15 05:37:32,791 INFO [train.py:988] (3/4) Epoch 14, batch 100, loss[loss=0.2687, simple_loss=0.3364, pruned_loss=0.1005, over 18632.00 frames. ], tot_loss[loss=0.2794, simple_loss=0.3387, pruned_loss=0.1101, over 1493334.33 frames. ], batch size: 80, lr: 2.05e-02, grad_scale: 32.0 2023-06-15 05:37:47,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=46806.666666666664, ans=0.125 2023-06-15 05:38:30,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=47006.666666666664, ans=0.0 2023-06-15 05:38:47,244 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-15 05:38:59,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=47140.0, ans=0.125 2023-06-15 05:39:01,372 INFO [train.py:988] (3/4) Epoch 14, batch 150, loss[loss=0.2826, simple_loss=0.3362, pruned_loss=0.1145, over 19976.00 frames. ], tot_loss[loss=0.2783, simple_loss=0.339, pruned_loss=0.1088, over 1994760.90 frames. 
], batch size: 126, lr: 2.04e-02, grad_scale: 32.0 2023-06-15 05:39:15,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=47140.0, ans=0.07 2023-06-15 05:39:41,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=47273.333333333336, ans=0.2 2023-06-15 05:39:43,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=47273.333333333336, ans=0.1 2023-06-15 05:40:11,170 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.584e+02 2.106e+02 2.352e+02 2.722e+02 4.842e+02, threshold=4.704e+02, percent-clipped=2.0 2023-06-15 05:40:18,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=47406.666666666664, ans=0.125 2023-06-15 05:40:20,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=47406.666666666664, ans=0.0005637681159420295 2023-06-15 05:40:24,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=47406.666666666664, ans=0.125 2023-06-15 05:40:28,798 INFO [train.py:988] (3/4) Epoch 14, batch 200, loss[loss=0.2799, simple_loss=0.3454, pruned_loss=0.1072, over 18304.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.3381, pruned_loss=0.1078, over 2384970.86 frames. ], batch size: 74, lr: 2.04e-02, grad_scale: 32.0 2023-06-15 05:40:57,403 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.36 vs. limit=22.5 2023-06-15 05:40:59,101 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.78 vs. limit=6.0 2023-06-15 05:41:13,665 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.98 vs. limit=15.0 2023-06-15 05:41:31,480 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.77 vs. limit=15.0 2023-06-15 05:41:49,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=47740.0, ans=0.125 2023-06-15 05:41:56,681 INFO [train.py:988] (3/4) Epoch 14, batch 250, loss[loss=0.2896, simple_loss=0.3466, pruned_loss=0.1163, over 20128.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.3384, pruned_loss=0.1077, over 2689795.28 frames. ], batch size: 133, lr: 2.04e-02, grad_scale: 32.0 2023-06-15 05:41:57,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=47806.666666666664, ans=0.125 2023-06-15 05:42:02,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=47806.666666666664, ans=0.0 2023-06-15 05:42:49,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=48006.666666666664, ans=0.1 2023-06-15 05:42:55,809 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.33 vs. 
limit=10.0 2023-06-15 05:43:06,525 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 2.140e+02 2.401e+02 2.980e+02 6.123e+02, threshold=4.801e+02, percent-clipped=4.0 2023-06-15 05:43:06,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=48073.333333333336, ans=0.125 2023-06-15 05:43:23,495 INFO [train.py:988] (3/4) Epoch 14, batch 300, loss[loss=0.273, simple_loss=0.3416, pruned_loss=0.1022, over 16167.00 frames. ], tot_loss[loss=0.2774, simple_loss=0.339, pruned_loss=0.1079, over 2936458.43 frames. ], batch size: 51, lr: 2.03e-02, grad_scale: 32.0 2023-06-15 05:43:23,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=48140.0, ans=0.1 2023-06-15 05:43:34,045 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.80 vs. limit=15.0 2023-06-15 05:43:35,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=48140.0, ans=0.125 2023-06-15 05:44:42,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=48406.666666666664, ans=0.125 2023-06-15 05:44:50,836 INFO [train.py:988] (3/4) Epoch 14, batch 350, loss[loss=0.3083, simple_loss=0.3176, pruned_loss=0.1496, over 17026.00 frames. ], tot_loss[loss=0.2767, simple_loss=0.3382, pruned_loss=0.1076, over 3123376.20 frames. ], batch size: 392, lr: 2.03e-02, grad_scale: 32.0 2023-06-15 05:44:55,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=48473.333333333336, ans=0.125 2023-06-15 05:45:01,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=48473.333333333336, ans=0.125 2023-06-15 05:45:58,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=48740.0, ans=0.125 2023-06-15 05:45:59,732 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 2.238e+02 2.722e+02 3.535e+02 5.292e+02, threshold=5.444e+02, percent-clipped=1.0 2023-06-15 05:46:03,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=48740.0, ans=0.125 2023-06-15 05:46:05,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=48740.0, ans=0.125 2023-06-15 05:46:16,376 INFO [train.py:988] (3/4) Epoch 14, batch 400, loss[loss=0.2718, simple_loss=0.3327, pruned_loss=0.1055, over 19096.00 frames. ], tot_loss[loss=0.2773, simple_loss=0.3387, pruned_loss=0.108, over 3273355.50 frames. 
], batch size: 89, lr: 2.03e-02, grad_scale: 32.0 2023-06-15 05:46:20,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_na.min_abs, batch_count=48806.666666666664, ans=0.02 2023-06-15 05:47:00,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=48940.0, ans=0.07 2023-06-15 05:47:21,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=49006.666666666664, ans=0.05 2023-06-15 05:47:42,470 INFO [train.py:988] (3/4) Epoch 14, batch 450, loss[loss=0.2705, simple_loss=0.3343, pruned_loss=0.1033, over 18934.00 frames. ], tot_loss[loss=0.2778, simple_loss=0.339, pruned_loss=0.1083, over 3377679.88 frames. ], batch size: 86, lr: 2.02e-02, grad_scale: 32.0 2023-06-15 05:48:04,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=49206.666666666664, ans=0.125 2023-06-15 05:48:10,792 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.71 vs. limit=6.0 2023-06-15 05:48:11,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=49206.666666666664, ans=0.125 2023-06-15 05:48:50,095 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.664e+02 2.110e+02 2.476e+02 3.051e+02 4.850e+02, threshold=4.953e+02, percent-clipped=0.0 2023-06-15 05:49:06,428 INFO [train.py:988] (3/4) Epoch 14, batch 500, loss[loss=0.2485, simple_loss=0.3179, pruned_loss=0.0896, over 19504.00 frames. ], tot_loss[loss=0.2771, simple_loss=0.3382, pruned_loss=0.108, over 3458939.69 frames. ], batch size: 105, lr: 2.02e-02, grad_scale: 32.0 2023-06-15 05:49:22,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=49540.0, ans=10.0 2023-06-15 05:49:35,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=49540.0, ans=0.125 2023-06-15 05:49:36,736 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 05:49:51,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=49606.666666666664, ans=0.125 2023-06-15 05:50:25,279 INFO [train.py:988] (3/4) Epoch 15, batch 0, loss[loss=0.288, simple_loss=0.3613, pruned_loss=0.1074, over 18297.00 frames. ], tot_loss[loss=0.288, simple_loss=0.3613, pruned_loss=0.1074, over 18297.00 frames. ], batch size: 74, lr: 1.95e-02, grad_scale: 32.0 2023-06-15 05:50:25,280 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 05:50:31,410 INFO [train.py:1020] (3/4) Epoch 15, validation: loss=0.2189, simple_loss=0.3232, pruned_loss=0.05727, over 143649.00 frames. 
2023-06-15 05:50:31,411 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 05:50:36,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=49693.333333333336, ans=0.1 2023-06-15 05:50:44,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=49693.333333333336, ans=0.125 2023-06-15 05:50:46,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=49760.0, ans=0.125 2023-06-15 05:51:12,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=49826.666666666664, ans=3.768115942029068e-05 2023-06-15 05:51:32,534 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.98 vs. limit=15.0 2023-06-15 05:51:55,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=49960.0, ans=0.04949747468305833 2023-06-15 05:51:58,710 INFO [train.py:988] (3/4) Epoch 15, batch 50, loss[loss=0.2701, simple_loss=0.3247, pruned_loss=0.1077, over 20702.00 frames. ], tot_loss[loss=0.2756, simple_loss=0.3376, pruned_loss=0.1067, over 861191.03 frames. ], batch size: 211, lr: 1.95e-02, grad_scale: 32.0 2023-06-15 05:52:10,524 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.697e+02 2.143e+02 2.454e+02 2.855e+02 6.420e+02, threshold=4.907e+02, percent-clipped=3.0 2023-06-15 05:53:00,284 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.15 vs. limit=6.0 2023-06-15 05:53:18,690 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.86 vs. limit=15.0 2023-06-15 05:53:26,703 INFO [train.py:988] (3/4) Epoch 15, batch 100, loss[loss=0.2864, simple_loss=0.3059, pruned_loss=0.1335, over 16984.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.338, pruned_loss=0.1064, over 1498400.81 frames. ], batch size: 391, lr: 1.95e-02, grad_scale: 32.0 2023-06-15 05:53:45,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=50426.666666666664, ans=0.125 2023-06-15 05:53:59,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=50493.333333333336, ans=0.1 2023-06-15 05:54:21,723 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.69 vs. limit=15.0 2023-06-15 05:54:54,018 INFO [train.py:988] (3/4) Epoch 15, batch 150, loss[loss=0.2563, simple_loss=0.3292, pruned_loss=0.0917, over 19440.00 frames. ], tot_loss[loss=0.2749, simple_loss=0.3375, pruned_loss=0.1062, over 2007762.12 frames. 
], batch size: 105, lr: 1.94e-02, grad_scale: 32.0 2023-06-15 05:55:06,227 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.747e+02 2.090e+02 2.403e+02 2.891e+02 4.165e+02, threshold=4.806e+02, percent-clipped=0.0 2023-06-15 05:55:15,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=50760.0, ans=0.125 2023-06-15 05:55:52,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=50893.333333333336, ans=0.2 2023-06-15 05:56:02,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=50960.0, ans=0.2 2023-06-15 05:56:10,176 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.91 vs. limit=22.5 2023-06-15 05:56:22,230 INFO [train.py:988] (3/4) Epoch 15, batch 200, loss[loss=0.2662, simple_loss=0.3221, pruned_loss=0.1052, over 20122.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.337, pruned_loss=0.1056, over 2409313.39 frames. ], batch size: 133, lr: 1.94e-02, grad_scale: 32.0 2023-06-15 05:56:36,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=51026.666666666664, ans=0.2 2023-06-15 05:56:49,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=51093.333333333336, ans=0.125 2023-06-15 05:56:59,611 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.32 vs. limit=6.0 2023-06-15 05:57:00,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=51160.0, ans=0.0 2023-06-15 05:57:16,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=51226.666666666664, ans=0.2 2023-06-15 05:57:36,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=51293.333333333336, ans=0.0 2023-06-15 05:57:50,253 INFO [train.py:988] (3/4) Epoch 15, batch 250, loss[loss=0.2671, simple_loss=0.3551, pruned_loss=0.08955, over 18319.00 frames. ], tot_loss[loss=0.2729, simple_loss=0.3364, pruned_loss=0.1047, over 2714695.17 frames. ], batch size: 72, lr: 1.94e-02, grad_scale: 32.0 2023-06-15 05:58:03,087 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 1.983e+02 2.255e+02 2.708e+02 4.170e+02, threshold=4.510e+02, percent-clipped=0.0 2023-06-15 05:58:37,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=51493.333333333336, ans=0.0 2023-06-15 05:58:55,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=51560.0, ans=0.125 2023-06-15 05:59:02,091 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 05:59:19,793 INFO [train.py:988] (3/4) Epoch 15, batch 300, loss[loss=0.2807, simple_loss=0.3408, pruned_loss=0.1103, over 19345.00 frames. ], tot_loss[loss=0.2728, simple_loss=0.336, pruned_loss=0.1048, over 2957116.00 frames. 
], batch size: 98, lr: 1.93e-02, grad_scale: 32.0 2023-06-15 05:59:22,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=51693.333333333336, ans=0.125 2023-06-15 05:59:41,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=51760.0, ans=0.125 2023-06-15 05:59:42,105 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=15.32 vs. limit=15.0 2023-06-15 05:59:52,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=51826.666666666664, ans=0.0 2023-06-15 05:59:52,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=51826.666666666664, ans=0.1 2023-06-15 06:00:31,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=51960.0, ans=0.125 2023-06-15 06:00:47,516 INFO [train.py:988] (3/4) Epoch 15, batch 350, loss[loss=0.2588, simple_loss=0.3258, pruned_loss=0.09596, over 19108.00 frames. ], tot_loss[loss=0.2729, simple_loss=0.3365, pruned_loss=0.1047, over 3143618.33 frames. ], batch size: 94, lr: 1.93e-02, grad_scale: 32.0 2023-06-15 06:00:51,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=52026.666666666664, ans=0.125 2023-06-15 06:01:00,531 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.637e+02 2.062e+02 2.429e+02 2.907e+02 4.781e+02, threshold=4.857e+02, percent-clipped=2.0 2023-06-15 06:01:15,675 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 06:02:02,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=52293.333333333336, ans=0.0 2023-06-15 06:02:07,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=52293.333333333336, ans=0.0 2023-06-15 06:02:09,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=52293.333333333336, ans=0.0 2023-06-15 06:02:13,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=52293.333333333336, ans=0.125 2023-06-15 06:02:16,555 INFO [train.py:988] (3/4) Epoch 15, batch 400, loss[loss=0.2835, simple_loss=0.3653, pruned_loss=0.1008, over 17641.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.3358, pruned_loss=0.1042, over 3286695.50 frames. ], batch size: 67, lr: 1.93e-02, grad_scale: 32.0 2023-06-15 06:02:43,198 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.13 vs. 
limit=22.5 2023-06-15 06:02:49,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_na.min_abs, batch_count=52493.333333333336, ans=0.02 2023-06-15 06:03:38,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=52626.666666666664, ans=0.1 2023-06-15 06:03:43,722 INFO [train.py:988] (3/4) Epoch 15, batch 450, loss[loss=0.2774, simple_loss=0.3541, pruned_loss=0.1004, over 16752.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3356, pruned_loss=0.1038, over 3395999.82 frames. ], batch size: 59, lr: 1.92e-02, grad_scale: 32.0 2023-06-15 06:03:56,850 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.715e+02 2.130e+02 2.421e+02 3.094e+02 4.907e+02, threshold=4.841e+02, percent-clipped=1.0 2023-06-15 06:04:18,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=52826.666666666664, ans=0.0 2023-06-15 06:04:31,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=52826.666666666664, ans=0.125 2023-06-15 06:05:10,617 INFO [train.py:988] (3/4) Epoch 15, batch 500, loss[loss=0.2628, simple_loss=0.3324, pruned_loss=0.09665, over 19488.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3352, pruned_loss=0.1034, over 3479518.03 frames. ], batch size: 105, lr: 1.92e-02, grad_scale: 32.0 2023-06-15 06:05:10,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=53026.666666666664, ans=0.2 2023-06-15 06:05:36,722 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.79 vs. limit=15.0 2023-06-15 06:05:39,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=53093.333333333336, ans=0.125 2023-06-15 06:06:28,436 INFO [train.py:988] (3/4) Epoch 16, batch 0, loss[loss=0.2624, simple_loss=0.3315, pruned_loss=0.09659, over 18511.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.3315, pruned_loss=0.09659, over 18511.00 frames. ], batch size: 77, lr: 1.86e-02, grad_scale: 32.0 2023-06-15 06:06:28,437 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 06:06:34,510 INFO [train.py:1020] (3/4) Epoch 16, validation: loss=0.2134, simple_loss=0.3194, pruned_loss=0.05367, over 143649.00 frames. 
2023-06-15 06:06:34,511 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 06:06:44,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=53240.0, ans=0.125 2023-06-15 06:06:47,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=53240.0, ans=0.125 2023-06-15 06:06:54,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=53306.666666666664, ans=0.0 2023-06-15 06:07:13,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=53373.333333333336, ans=0.125 2023-06-15 06:07:19,735 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.212e+02 2.676e+02 3.191e+02 5.269e+02, threshold=5.353e+02, percent-clipped=1.0 2023-06-15 06:08:03,134 INFO [train.py:988] (3/4) Epoch 16, batch 50, loss[loss=0.2515, simple_loss=0.32, pruned_loss=0.09149, over 18916.00 frames. ], tot_loss[loss=0.2675, simple_loss=0.3314, pruned_loss=0.1018, over 868236.24 frames. ], batch size: 86, lr: 1.86e-02, grad_scale: 32.0 2023-06-15 06:08:10,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=53573.333333333336, ans=0.125 2023-06-15 06:08:53,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=53773.333333333336, ans=0.125 2023-06-15 06:09:29,423 INFO [train.py:988] (3/4) Epoch 16, batch 100, loss[loss=0.2707, simple_loss=0.3421, pruned_loss=0.09965, over 19102.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3318, pruned_loss=0.101, over 1521586.66 frames. ], batch size: 94, lr: 1.85e-02, grad_scale: 32.0 2023-06-15 06:09:57,795 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2023-06-15 06:10:12,497 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.622e+02 1.999e+02 2.214e+02 2.667e+02 3.874e+02, threshold=4.428e+02, percent-clipped=0.0 2023-06-15 06:10:55,318 INFO [train.py:988] (3/4) Epoch 16, batch 150, loss[loss=0.2506, simple_loss=0.3077, pruned_loss=0.09669, over 20329.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3322, pruned_loss=0.1006, over 2009490.94 frames. ], batch size: 239, lr: 1.85e-02, grad_scale: 32.0 2023-06-15 06:11:01,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=54240.0, ans=0.125 2023-06-15 06:11:20,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=54306.666666666664, ans=0.125 2023-06-15 06:11:24,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=54306.666666666664, ans=0.125 2023-06-15 06:11:59,997 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 06:12:22,655 INFO [train.py:988] (3/4) Epoch 16, batch 200, loss[loss=0.2599, simple_loss=0.3198, pruned_loss=0.09995, over 20003.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.3332, pruned_loss=0.1011, over 2401921.93 frames. 
], batch size: 126, lr: 1.85e-02, grad_scale: 32.0 2023-06-15 06:12:28,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=54573.333333333336, ans=0.1 2023-06-15 06:12:42,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=54640.0, ans=0.0 2023-06-15 06:13:05,783 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.741e+02 2.168e+02 2.479e+02 3.039e+02 4.350e+02, threshold=4.958e+02, percent-clipped=0.0 2023-06-15 06:13:17,870 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.51 vs. limit=22.5 2023-06-15 06:13:21,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=54773.333333333336, ans=0.125 2023-06-15 06:13:50,151 INFO [train.py:988] (3/4) Epoch 16, batch 250, loss[loss=0.3024, simple_loss=0.3744, pruned_loss=0.1152, over 18309.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3333, pruned_loss=0.1013, over 2693528.70 frames. ], batch size: 72, lr: 1.85e-02, grad_scale: 32.0 2023-06-15 06:13:57,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=54906.666666666664, ans=0.125 2023-06-15 06:14:04,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=54906.666666666664, ans=0.2 2023-06-15 06:14:16,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=54973.333333333336, ans=0.09899494936611666 2023-06-15 06:14:23,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=55040.0, ans=0.0 2023-06-15 06:14:46,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=55106.666666666664, ans=0.2 2023-06-15 06:14:48,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=55106.666666666664, ans=0.1 2023-06-15 06:14:59,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=55173.333333333336, ans=0.125 2023-06-15 06:15:14,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=55240.0, ans=0.125 2023-06-15 06:15:15,811 INFO [train.py:988] (3/4) Epoch 16, batch 300, loss[loss=0.2884, simple_loss=0.348, pruned_loss=0.1144, over 20597.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3314, pruned_loss=0.1009, over 2950661.14 frames. ], batch size: 173, lr: 1.84e-02, grad_scale: 32.0 2023-06-15 06:15:27,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=55240.0, ans=0.2 2023-06-15 06:15:50,594 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.38 vs. 
limit=15.0 2023-06-15 06:15:59,338 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.606e+02 2.170e+02 2.637e+02 3.215e+02 4.848e+02, threshold=5.274e+02, percent-clipped=0.0 2023-06-15 06:16:41,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=55573.333333333336, ans=10.0 2023-06-15 06:16:43,060 INFO [train.py:988] (3/4) Epoch 16, batch 350, loss[loss=0.2491, simple_loss=0.3208, pruned_loss=0.08866, over 18642.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3312, pruned_loss=0.1011, over 3133645.54 frames. ], batch size: 80, lr: 1.84e-02, grad_scale: 32.0 2023-06-15 06:16:54,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=55573.333333333336, ans=0.1 2023-06-15 06:17:18,487 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 06:17:22,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=55706.666666666664, ans=0.1 2023-06-15 06:18:10,031 INFO [train.py:988] (3/4) Epoch 16, batch 400, loss[loss=0.2639, simple_loss=0.3255, pruned_loss=0.1011, over 20438.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3316, pruned_loss=0.1012, over 3277568.60 frames. ], batch size: 160, lr: 1.84e-02, grad_scale: 32.0 2023-06-15 06:18:10,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=55906.666666666664, ans=0.2 2023-06-15 06:18:14,533 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.55 vs. limit=15.0 2023-06-15 06:18:53,442 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.986e+02 2.238e+02 2.515e+02 3.872e+02, threshold=4.476e+02, percent-clipped=0.0 2023-06-15 06:18:57,847 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.62 vs. limit=15.0 2023-06-15 06:19:08,224 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.48 vs. limit=22.5 2023-06-15 06:19:18,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=56173.333333333336, ans=0.125 2023-06-15 06:19:26,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=56173.333333333336, ans=0.125 2023-06-15 06:19:36,758 INFO [train.py:988] (3/4) Epoch 16, batch 450, loss[loss=0.2728, simple_loss=0.3428, pruned_loss=0.1014, over 16675.00 frames. ], tot_loss[loss=0.2665, simple_loss=0.3307, pruned_loss=0.1012, over 3394671.36 frames. ], batch size: 59, lr: 1.83e-02, grad_scale: 32.0 2023-06-15 06:19:50,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=56240.0, ans=0.05 2023-06-15 06:20:07,028 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.25 vs. limit=15.0 2023-06-15 06:20:59,734 INFO [train.py:988] (3/4) Epoch 16, batch 500, loss[loss=0.2775, simple_loss=0.3296, pruned_loss=0.1127, over 20599.00 frames. 
], tot_loss[loss=0.2662, simple_loss=0.3308, pruned_loss=0.1008, over 3483738.30 frames. ], batch size: 189, lr: 1.83e-02, grad_scale: 32.0 2023-06-15 06:21:09,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=56573.333333333336, ans=0.125 2023-06-15 06:21:14,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=56640.0, ans=0.125 2023-06-15 06:21:18,828 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.07 vs. limit=15.0 2023-06-15 06:21:34,822 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.14 vs. limit=15.0 2023-06-15 06:21:40,077 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 2.020e+02 2.317e+02 2.823e+02 4.617e+02, threshold=4.634e+02, percent-clipped=2.0 2023-06-15 06:21:40,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=56706.666666666664, ans=22.5 2023-06-15 06:22:11,791 INFO [train.py:988] (3/4) Epoch 17, batch 0, loss[loss=0.272, simple_loss=0.3347, pruned_loss=0.1047, over 20101.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3347, pruned_loss=0.1047, over 20101.00 frames. ], batch size: 133, lr: 1.78e-02, grad_scale: 32.0 2023-06-15 06:22:11,792 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 06:22:17,835 INFO [train.py:1020] (3/4) Epoch 17, validation: loss=0.2144, simple_loss=0.3175, pruned_loss=0.05564, over 143649.00 frames. 2023-06-15 06:22:17,836 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 06:22:36,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=56853.333333333336, ans=0.2 2023-06-15 06:22:38,393 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.16 vs. limit=15.0 2023-06-15 06:22:41,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=56853.333333333336, ans=0.125 2023-06-15 06:22:55,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=56920.0, ans=0.05 2023-06-15 06:22:57,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=56920.0, ans=0.2 2023-06-15 06:22:57,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=56920.0, ans=0.125 2023-06-15 06:22:59,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=56920.0, ans=0.125 2023-06-15 06:23:26,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=57053.333333333336, ans=0.125 2023-06-15 06:23:28,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=57053.333333333336, ans=0.125 2023-06-15 06:23:45,531 INFO [train.py:988] (3/4) Epoch 17, batch 50, loss[loss=0.2669, simple_loss=0.3228, pruned_loss=0.1055, over 20709.00 frames. 
], tot_loss[loss=0.2655, simple_loss=0.3273, pruned_loss=0.1018, over 860443.98 frames. ], batch size: 211, lr: 1.77e-02, grad_scale: 32.0 2023-06-15 06:23:48,977 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 06:24:09,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=57186.666666666664, ans=0.0 2023-06-15 06:24:27,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=57253.333333333336, ans=0.1 2023-06-15 06:24:35,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=57253.333333333336, ans=0.1 2023-06-15 06:24:35,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=57253.333333333336, ans=0.0 2023-06-15 06:24:51,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=57320.0, ans=0.1 2023-06-15 06:25:01,590 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 2.063e+02 2.327e+02 2.647e+02 3.796e+02, threshold=4.655e+02, percent-clipped=0.0 2023-06-15 06:25:04,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=57386.666666666664, ans=0.0 2023-06-15 06:25:14,205 INFO [train.py:988] (3/4) Epoch 17, batch 100, loss[loss=0.252, simple_loss=0.3229, pruned_loss=0.09055, over 18624.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3297, pruned_loss=0.09996, over 1509064.22 frames. ], batch size: 80, lr: 1.77e-02, grad_scale: 32.0 2023-06-15 06:25:38,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=57520.0, ans=0.09899494936611666 2023-06-15 06:25:39,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=57520.0, ans=0.0 2023-06-15 06:25:39,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=57520.0, ans=0.125 2023-06-15 06:25:40,066 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-15 06:26:31,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=57720.0, ans=0.125 2023-06-15 06:26:42,439 INFO [train.py:988] (3/4) Epoch 17, batch 150, loss[loss=0.2811, simple_loss=0.3534, pruned_loss=0.1044, over 17060.00 frames. ], tot_loss[loss=0.2647, simple_loss=0.3289, pruned_loss=0.1002, over 2028620.64 frames. 
], batch size: 60, lr: 1.77e-02, grad_scale: 64.0 2023-06-15 06:26:44,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=57786.666666666664, ans=0.125 2023-06-15 06:26:58,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=57853.333333333336, ans=0.0 2023-06-15 06:27:34,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=57986.666666666664, ans=0.125 2023-06-15 06:27:34,334 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.59 vs. limit=15.0 2023-06-15 06:27:49,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=57986.666666666664, ans=0.0 2023-06-15 06:27:53,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=58053.333333333336, ans=0.125 2023-06-15 06:27:57,477 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.716e+02 2.306e+02 2.737e+02 3.252e+02 5.355e+02, threshold=5.474e+02, percent-clipped=3.0 2023-06-15 06:28:01,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=58053.333333333336, ans=0.125 2023-06-15 06:28:01,304 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 06:28:09,783 INFO [train.py:988] (3/4) Epoch 17, batch 200, loss[loss=0.2368, simple_loss=0.3103, pruned_loss=0.08168, over 18931.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3286, pruned_loss=0.09849, over 2417907.01 frames. ], batch size: 86, lr: 1.76e-02, grad_scale: 64.0 2023-06-15 06:28:15,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=58120.0, ans=0.07 2023-06-15 06:28:15,888 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.85 vs. limit=6.0 2023-06-15 06:28:31,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=58186.666666666664, ans=0.125 2023-06-15 06:28:35,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=58186.666666666664, ans=0.95 2023-06-15 06:28:38,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=58186.666666666664, ans=0.125 2023-06-15 06:29:04,504 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.93 vs. limit=15.0 2023-06-15 06:29:11,744 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.29 vs. 
limit=15.0 2023-06-15 06:29:19,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=58386.666666666664, ans=0.0 2023-06-15 06:29:19,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=58386.666666666664, ans=0.07 2023-06-15 06:29:38,281 INFO [train.py:988] (3/4) Epoch 17, batch 250, loss[loss=0.2957, simple_loss=0.3149, pruned_loss=0.1382, over 16628.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3294, pruned_loss=0.09856, over 2717937.66 frames. ], batch size: 392, lr: 1.76e-02, grad_scale: 64.0 2023-06-15 06:29:40,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=58453.333333333336, ans=0.0 2023-06-15 06:29:53,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=58453.333333333336, ans=0.125 2023-06-15 06:30:29,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=58653.333333333336, ans=0.1 2023-06-15 06:30:31,990 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.47 vs. limit=15.0 2023-06-15 06:30:54,261 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.584e+02 1.968e+02 2.205e+02 2.465e+02 3.814e+02, threshold=4.411e+02, percent-clipped=0.0 2023-06-15 06:30:56,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=58720.0, ans=0.1 2023-06-15 06:31:06,207 INFO [train.py:988] (3/4) Epoch 17, batch 300, loss[loss=0.267, simple_loss=0.3321, pruned_loss=0.1009, over 18649.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3295, pruned_loss=0.09867, over 2949803.06 frames. ], batch size: 80, lr: 1.76e-02, grad_scale: 64.0 2023-06-15 06:31:23,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=58853.333333333336, ans=0.125 2023-06-15 06:31:43,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=58920.0, ans=0.1 2023-06-15 06:31:48,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=58920.0, ans=0.0 2023-06-15 06:32:21,045 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.16 vs. limit=15.0 2023-06-15 06:32:26,137 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.19 vs. limit=15.0 2023-06-15 06:32:26,484 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.21 vs. limit=6.0 2023-06-15 06:32:33,738 INFO [train.py:988] (3/4) Epoch 17, batch 350, loss[loss=0.2601, simple_loss=0.3275, pruned_loss=0.09638, over 19328.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3288, pruned_loss=0.09837, over 3149839.19 frames. 
], batch size: 98, lr: 1.76e-02, grad_scale: 64.0 2023-06-15 06:32:44,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=59120.0, ans=0.125 2023-06-15 06:33:01,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=59186.666666666664, ans=0.05 2023-06-15 06:33:25,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=59320.0, ans=0.125 2023-06-15 06:33:49,927 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.577e+02 2.042e+02 2.346e+02 2.711e+02 3.857e+02, threshold=4.693e+02, percent-clipped=0.0 2023-06-15 06:33:52,989 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.73 vs. limit=15.0 2023-06-15 06:34:01,724 INFO [train.py:988] (3/4) Epoch 17, batch 400, loss[loss=0.2641, simple_loss=0.3308, pruned_loss=0.09863, over 19684.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3281, pruned_loss=0.09758, over 3302056.75 frames. ], batch size: 110, lr: 1.75e-02, grad_scale: 64.0 2023-06-15 06:34:05,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=59453.333333333336, ans=0.0 2023-06-15 06:34:14,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=59453.333333333336, ans=0.1 2023-06-15 06:34:15,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=59453.333333333336, ans=0.0 2023-06-15 06:34:26,573 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.32 vs. limit=22.5 2023-06-15 06:34:29,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=59520.0, ans=0.0 2023-06-15 06:34:29,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=59520.0, ans=0.0 2023-06-15 06:34:32,034 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.36 vs. limit=15.0 2023-06-15 06:34:58,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=59653.333333333336, ans=0.125 2023-06-15 06:35:27,600 INFO [train.py:988] (3/4) Epoch 17, batch 450, loss[loss=0.2591, simple_loss=0.3227, pruned_loss=0.0978, over 20555.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3277, pruned_loss=0.09706, over 3412172.01 frames. ], batch size: 189, lr: 1.75e-02, grad_scale: 64.0 2023-06-15 06:35:43,915 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.14 vs. 
limit=22.5 2023-06-15 06:35:57,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=59853.333333333336, ans=0.125 2023-06-15 06:36:41,072 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.612e+02 2.138e+02 2.622e+02 3.155e+02 6.039e+02, threshold=5.245e+02, percent-clipped=6.0 2023-06-15 06:36:52,678 INFO [train.py:988] (3/4) Epoch 17, batch 500, loss[loss=0.2598, simple_loss=0.3238, pruned_loss=0.09784, over 20343.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3282, pruned_loss=0.09787, over 3493611.30 frames. ], batch size: 149, lr: 1.75e-02, grad_scale: 64.0 2023-06-15 06:37:13,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=60186.666666666664, ans=0.1 2023-06-15 06:37:30,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=60253.333333333336, ans=0.0 2023-06-15 06:38:08,707 INFO [train.py:988] (3/4) Epoch 18, batch 0, loss[loss=0.2592, simple_loss=0.3196, pruned_loss=0.09934, over 20060.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3196, pruned_loss=0.09934, over 20060.00 frames. ], batch size: 133, lr: 1.70e-02, grad_scale: 64.0 2023-06-15 06:38:08,707 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 06:38:14,730 INFO [train.py:1020] (3/4) Epoch 18, validation: loss=0.2126, simple_loss=0.3161, pruned_loss=0.05459, over 143649.00 frames. 2023-06-15 06:38:14,731 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 06:38:23,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=60333.333333333336, ans=0.0 2023-06-15 06:38:27,702 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=2.91 vs. limit=15.0 2023-06-15 06:39:35,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=60600.0, ans=0.0 2023-06-15 06:39:40,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=60666.666666666664, ans=0.125 2023-06-15 06:39:42,211 INFO [train.py:988] (3/4) Epoch 18, batch 50, loss[loss=0.2719, simple_loss=0.2907, pruned_loss=0.1265, over 16811.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3244, pruned_loss=0.09965, over 853391.74 frames. ], batch size: 392, lr: 1.69e-02, grad_scale: 64.0 2023-06-15 06:40:00,858 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.965e+02 2.316e+02 2.715e+02 4.312e+02, threshold=4.632e+02, percent-clipped=0.0 2023-06-15 06:40:19,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=60800.0, ans=0.0 2023-06-15 06:40:25,242 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.10 vs. limit=15.0 2023-06-15 06:40:31,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=60800.0, ans=0.0 2023-06-15 06:40:37,048 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.02 vs. 
limit=15.0 2023-06-15 06:40:39,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=60866.666666666664, ans=0.0 2023-06-15 06:41:03,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=60933.333333333336, ans=0.1 2023-06-15 06:41:10,130 INFO [train.py:988] (3/4) Epoch 18, batch 100, loss[loss=0.2757, simple_loss=0.3263, pruned_loss=0.1126, over 20156.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3242, pruned_loss=0.09916, over 1509078.60 frames. ], batch size: 239, lr: 1.69e-02, grad_scale: 64.0 2023-06-15 06:41:15,470 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 06:41:18,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=61000.0, ans=0.05 2023-06-15 06:41:30,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=61066.666666666664, ans=0.2 2023-06-15 06:41:45,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=61133.333333333336, ans=0.2 2023-06-15 06:41:55,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=61133.333333333336, ans=0.125 2023-06-15 06:41:58,174 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.76 vs. limit=15.0 2023-06-15 06:41:59,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=61133.333333333336, ans=0.1 2023-06-15 06:42:05,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=61200.0, ans=0.2 2023-06-15 06:42:07,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=61200.0, ans=0.125 2023-06-15 06:42:37,337 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.40 vs. limit=15.0 2023-06-15 06:42:37,674 INFO [train.py:988] (3/4) Epoch 18, batch 150, loss[loss=0.2642, simple_loss=0.3313, pruned_loss=0.09849, over 19797.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.325, pruned_loss=0.09829, over 1999146.91 frames. ], batch size: 115, lr: 1.69e-02, grad_scale: 64.0 2023-06-15 06:42:57,615 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.664e+02 2.047e+02 2.339e+02 2.700e+02 3.981e+02, threshold=4.677e+02, percent-clipped=0.0 2023-06-15 06:43:06,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=61400.0, ans=0.0 2023-06-15 06:43:15,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=61466.666666666664, ans=0.0 2023-06-15 06:43:22,281 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.39 vs. limit=15.0 2023-06-15 06:43:25,073 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.87 vs. 
limit=15.0 2023-06-15 06:43:36,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=61533.333333333336, ans=0.0 2023-06-15 06:43:48,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=61600.0, ans=0.125 2023-06-15 06:44:06,112 INFO [train.py:988] (3/4) Epoch 18, batch 200, loss[loss=0.271, simple_loss=0.353, pruned_loss=0.09446, over 18320.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3235, pruned_loss=0.09603, over 2396659.56 frames. ], batch size: 72, lr: 1.69e-02, grad_scale: 64.0 2023-06-15 06:44:06,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=61666.666666666664, ans=0.1 2023-06-15 06:45:16,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=61933.333333333336, ans=0.125 2023-06-15 06:45:26,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=61933.333333333336, ans=0.0 2023-06-15 06:45:33,956 INFO [train.py:988] (3/4) Epoch 18, batch 250, loss[loss=0.2734, simple_loss=0.3276, pruned_loss=0.1096, over 20310.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3248, pruned_loss=0.09655, over 2697859.14 frames. ], batch size: 149, lr: 1.68e-02, grad_scale: 64.0 2023-06-15 06:45:53,600 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 2.076e+02 2.248e+02 2.609e+02 3.858e+02, threshold=4.496e+02, percent-clipped=0.0 2023-06-15 06:46:06,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=62066.666666666664, ans=0.2 2023-06-15 06:46:10,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=62133.333333333336, ans=0.0 2023-06-15 06:46:15,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=62133.333333333336, ans=0.1 2023-06-15 06:46:18,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=62133.333333333336, ans=0.125 2023-06-15 06:46:32,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=62200.0, ans=0.05 2023-06-15 06:47:00,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=62333.333333333336, ans=0.1 2023-06-15 06:47:02,688 INFO [train.py:988] (3/4) Epoch 18, batch 300, loss[loss=0.2719, simple_loss=0.3308, pruned_loss=0.1066, over 19970.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3244, pruned_loss=0.09585, over 2942689.29 frames. ], batch size: 126, lr: 1.68e-02, grad_scale: 64.0 2023-06-15 06:48:30,184 INFO [train.py:988] (3/4) Epoch 18, batch 350, loss[loss=0.2518, simple_loss=0.313, pruned_loss=0.09531, over 20130.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3242, pruned_loss=0.09565, over 3131560.24 frames. 
], batch size: 133, lr: 1.68e-02, grad_scale: 32.0 2023-06-15 06:48:30,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=62666.666666666664, ans=0.125 2023-06-15 06:48:32,753 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.93 vs. limit=15.0 2023-06-15 06:48:46,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=62733.333333333336, ans=0.0 2023-06-15 06:48:51,048 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.624e+02 2.098e+02 2.384e+02 2.713e+02 4.621e+02, threshold=4.767e+02, percent-clipped=2.0 2023-06-15 06:49:06,415 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.79 vs. limit=15.0 2023-06-15 06:49:16,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=62800.0, ans=0.125 2023-06-15 06:49:22,285 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.87 vs. limit=12.0 2023-06-15 06:49:29,372 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.12 vs. limit=10.0 2023-06-15 06:49:44,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=62933.333333333336, ans=0.0 2023-06-15 06:49:47,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=62933.333333333336, ans=0.1 2023-06-15 06:49:56,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=63000.0, ans=0.1 2023-06-15 06:49:57,685 INFO [train.py:988] (3/4) Epoch 18, batch 400, loss[loss=0.2551, simple_loss=0.3294, pruned_loss=0.09038, over 18472.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3243, pruned_loss=0.09558, over 3280614.77 frames. ], batch size: 77, lr: 1.68e-02, grad_scale: 32.0 2023-06-15 06:49:59,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=63000.0, ans=0.125 2023-06-15 06:50:03,833 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.78 vs. limit=15.0 2023-06-15 06:50:04,639 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 06:50:10,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=63000.0, ans=0.1 2023-06-15 06:50:14,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=63066.666666666664, ans=0.2 2023-06-15 06:51:26,162 INFO [train.py:988] (3/4) Epoch 18, batch 450, loss[loss=0.2681, simple_loss=0.335, pruned_loss=0.1006, over 20533.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.324, pruned_loss=0.09526, over 3391434.61 frames. 
], batch size: 160, lr: 1.67e-02, grad_scale: 32.0 2023-06-15 06:51:47,894 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 2.045e+02 2.298e+02 2.779e+02 4.422e+02, threshold=4.596e+02, percent-clipped=0.0 2023-06-15 06:51:48,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=63400.0, ans=0.125 2023-06-15 06:51:56,811 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.73 vs. limit=22.5 2023-06-15 06:51:58,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=63400.0, ans=0.125 2023-06-15 06:52:37,719 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.26 vs. limit=15.0 2023-06-15 06:52:48,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=63600.0, ans=0.125 2023-06-15 06:52:48,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=63600.0, ans=0.0 2023-06-15 06:52:51,118 INFO [train.py:988] (3/4) Epoch 18, batch 500, loss[loss=0.2428, simple_loss=0.3184, pruned_loss=0.08361, over 18468.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3247, pruned_loss=0.09564, over 3470852.00 frames. ], batch size: 77, lr: 1.67e-02, grad_scale: 32.0 2023-06-15 06:52:53,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=63666.666666666664, ans=0.125 2023-06-15 06:53:15,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=63733.333333333336, ans=0.0 2023-06-15 06:53:23,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=63800.0, ans=0.1 2023-06-15 06:53:26,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=63800.0, ans=0.125 2023-06-15 06:53:31,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=63800.0, ans=0.0 2023-06-15 06:53:31,677 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.30 vs. limit=15.0 2023-06-15 06:53:34,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=63800.0, ans=0.1 2023-06-15 06:53:40,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=63866.666666666664, ans=0.0 2023-06-15 06:54:08,033 INFO [train.py:988] (3/4) Epoch 19, batch 0, loss[loss=0.2425, simple_loss=0.3133, pruned_loss=0.08586, over 19484.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3133, pruned_loss=0.08586, over 19484.00 frames. ], batch size: 105, lr: 1.62e-02, grad_scale: 32.0 2023-06-15 06:54:08,033 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 06:54:14,157 INFO [train.py:1020] (3/4) Epoch 19, validation: loss=0.2113, simple_loss=0.3157, pruned_loss=0.05349, over 143649.00 frames. 
2023-06-15 06:54:14,158 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 06:54:17,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=63880.0, ans=0.125 2023-06-15 06:54:32,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=63946.666666666664, ans=0.0 2023-06-15 06:54:45,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=63946.666666666664, ans=15.0 2023-06-15 06:55:05,671 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.944e+02 2.133e+02 2.428e+02 3.266e+02, threshold=4.266e+02, percent-clipped=0.0 2023-06-15 06:55:12,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=64080.0, ans=0.125 2023-06-15 06:55:14,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=64080.0, ans=0.1 2023-06-15 06:55:40,317 INFO [train.py:988] (3/4) Epoch 19, batch 50, loss[loss=0.2414, simple_loss=0.3138, pruned_loss=0.08448, over 19176.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3256, pruned_loss=0.09366, over 859144.91 frames. ], batch size: 92, lr: 1.62e-02, grad_scale: 32.0 2023-06-15 06:55:40,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=64213.333333333336, ans=0.125 2023-06-15 06:55:52,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=64213.333333333336, ans=0.0 2023-06-15 06:56:10,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=64280.0, ans=0.0 2023-06-15 06:56:11,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=64280.0, ans=0.0 2023-06-15 06:56:29,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=64346.666666666664, ans=0.2 2023-06-15 06:56:36,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=64413.333333333336, ans=0.05 2023-06-15 06:57:08,425 INFO [train.py:988] (3/4) Epoch 19, batch 100, loss[loss=0.2421, simple_loss=0.3088, pruned_loss=0.08771, over 20271.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3247, pruned_loss=0.09228, over 1503548.70 frames. 
], batch size: 141, lr: 1.62e-02, grad_scale: 32.0 2023-06-15 06:57:08,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=64546.666666666664, ans=0.0 2023-06-15 06:57:17,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=64546.666666666664, ans=10.0 2023-06-15 06:57:30,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=64613.333333333336, ans=0.0 2023-06-15 06:57:44,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=64680.0, ans=0.1 2023-06-15 06:58:00,783 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.583e+02 2.073e+02 2.297e+02 2.619e+02 4.375e+02, threshold=4.594e+02, percent-clipped=1.0 2023-06-15 06:58:05,648 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.73 vs. limit=15.0 2023-06-15 06:58:08,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=64746.666666666664, ans=0.125 2023-06-15 06:58:24,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=64813.333333333336, ans=0.125 2023-06-15 06:58:36,695 INFO [train.py:988] (3/4) Epoch 19, batch 150, loss[loss=0.2489, simple_loss=0.3214, pruned_loss=0.08819, over 19794.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3238, pruned_loss=0.09262, over 2000085.04 frames. ], batch size: 115, lr: 1.62e-02, grad_scale: 32.0 2023-06-15 06:58:42,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=64880.0, ans=0.125 2023-06-15 06:58:53,142 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-06-15 06:58:56,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=64946.666666666664, ans=0.0 2023-06-15 06:59:16,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=65013.333333333336, ans=0.0 2023-06-15 06:59:38,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=65080.0, ans=0.025 2023-06-15 06:59:50,499 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.91 vs. limit=6.0 2023-06-15 06:59:55,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=65146.666666666664, ans=0.125 2023-06-15 06:59:55,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=65146.666666666664, ans=0.125 2023-06-15 06:59:56,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=65146.666666666664, ans=0.2 2023-06-15 07:00:04,048 INFO [train.py:988] (3/4) Epoch 19, batch 200, loss[loss=0.2697, simple_loss=0.2943, pruned_loss=0.1225, over 17279.00 frames. 
], tot_loss[loss=0.2537, simple_loss=0.3226, pruned_loss=0.09239, over 2407332.52 frames. ], batch size: 391, lr: 1.61e-02, grad_scale: 32.0 2023-06-15 07:00:07,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=65213.333333333336, ans=0.125 2023-06-15 07:00:22,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=65280.0, ans=0.125 2023-06-15 07:00:56,938 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.647e+02 1.984e+02 2.245e+02 2.601e+02 3.971e+02, threshold=4.489e+02, percent-clipped=0.0 2023-06-15 07:00:59,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=65413.333333333336, ans=0.0 2023-06-15 07:01:04,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=65413.333333333336, ans=0.125 2023-06-15 07:01:09,095 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.09 vs. limit=15.0 2023-06-15 07:01:32,666 INFO [train.py:988] (3/4) Epoch 19, batch 250, loss[loss=0.2631, simple_loss=0.3276, pruned_loss=0.09933, over 19963.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3225, pruned_loss=0.09208, over 2703436.37 frames. ], batch size: 126, lr: 1.61e-02, grad_scale: 32.0 2023-06-15 07:01:40,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=65546.66666666667, ans=0.2 2023-06-15 07:02:48,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=65813.33333333333, ans=0.2 2023-06-15 07:02:51,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=65813.33333333333, ans=0.125 2023-06-15 07:03:01,179 INFO [train.py:988] (3/4) Epoch 19, batch 300, loss[loss=0.2641, simple_loss=0.3241, pruned_loss=0.1021, over 20527.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3227, pruned_loss=0.09203, over 2945710.77 frames. 
], batch size: 160, lr: 1.61e-02, grad_scale: 32.0 2023-06-15 07:03:01,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=65880.0, ans=0.1 2023-06-15 07:03:03,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=65880.0, ans=0.125 2023-06-15 07:03:13,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=65880.0, ans=0.2 2023-06-15 07:03:29,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=65946.66666666667, ans=0.09899494936611666 2023-06-15 07:03:45,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=66013.33333333333, ans=0.2 2023-06-15 07:03:53,123 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.956e+02 2.157e+02 2.430e+02 3.269e+02, threshold=4.313e+02, percent-clipped=0.0 2023-06-15 07:03:58,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=66080.0, ans=0.125 2023-06-15 07:04:06,538 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:04:17,374 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.42 vs. limit=10.0 2023-06-15 07:04:28,899 INFO [train.py:988] (3/4) Epoch 19, batch 350, loss[loss=0.2567, simple_loss=0.3204, pruned_loss=0.09645, over 20283.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3223, pruned_loss=0.09222, over 3133188.52 frames. ], batch size: 141, lr: 1.61e-02, grad_scale: 32.0 2023-06-15 07:04:41,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=66213.33333333333, ans=0.125 2023-06-15 07:05:09,479 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.85 vs. limit=12.0 2023-06-15 07:05:29,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=66413.33333333333, ans=0.0 2023-06-15 07:05:43,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=66480.0, ans=0.1 2023-06-15 07:05:54,454 INFO [train.py:988] (3/4) Epoch 19, batch 400, loss[loss=0.2495, simple_loss=0.319, pruned_loss=0.09005, over 18938.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3229, pruned_loss=0.09239, over 3281909.24 frames. 
], batch size: 86, lr: 1.60e-02, grad_scale: 32.0 2023-06-15 07:06:02,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=66546.66666666667, ans=0.2 2023-06-15 07:06:14,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=66613.33333333333, ans=0.0 2023-06-15 07:06:47,308 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.583e+02 1.968e+02 2.388e+02 2.889e+02 4.258e+02, threshold=4.776e+02, percent-clipped=0.0 2023-06-15 07:06:54,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=66746.66666666667, ans=0.125 2023-06-15 07:06:57,228 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.46 vs. limit=6.0 2023-06-15 07:07:15,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=66813.33333333333, ans=0.0 2023-06-15 07:07:22,215 INFO [train.py:988] (3/4) Epoch 19, batch 450, loss[loss=0.2507, simple_loss=0.3208, pruned_loss=0.09033, over 20295.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3223, pruned_loss=0.09303, over 3396309.29 frames. ], batch size: 141, lr: 1.60e-02, grad_scale: 32.0 2023-06-15 07:07:59,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=67013.33333333333, ans=0.1 2023-06-15 07:07:59,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=67013.33333333333, ans=0.1 2023-06-15 07:08:00,335 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.70 vs. limit=15.0 2023-06-15 07:08:01,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=67013.33333333333, ans=0.1 2023-06-15 07:08:13,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=67080.0, ans=0.125 2023-06-15 07:08:13,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=67080.0, ans=0.1 2023-06-15 07:08:15,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=67080.0, ans=0.0 2023-06-15 07:08:27,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=67080.0, ans=0.1 2023-06-15 07:08:34,325 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.22 vs. limit=15.0 2023-06-15 07:08:46,269 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.83 vs. limit=6.0 2023-06-15 07:08:48,901 INFO [train.py:988] (3/4) Epoch 19, batch 500, loss[loss=0.2567, simple_loss=0.3306, pruned_loss=0.09143, over 19203.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3218, pruned_loss=0.09305, over 3495917.14 frames. 
], batch size: 92, lr: 1.60e-02, grad_scale: 32.0 2023-06-15 07:08:56,698 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=15.0 2023-06-15 07:09:08,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=67280.0, ans=0.125 2023-06-15 07:09:27,626 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.09 vs. limit=15.0 2023-06-15 07:09:37,683 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.943e+02 2.120e+02 2.445e+02 3.405e+02, threshold=4.239e+02, percent-clipped=0.0 2023-06-15 07:10:07,200 INFO [train.py:988] (3/4) Epoch 20, batch 0, loss[loss=0.2789, simple_loss=0.3536, pruned_loss=0.1021, over 16400.00 frames. ], tot_loss[loss=0.2789, simple_loss=0.3536, pruned_loss=0.1021, over 16400.00 frames. ], batch size: 52, lr: 1.56e-02, grad_scale: 32.0 2023-06-15 07:10:07,201 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 07:10:13,268 INFO [train.py:1020] (3/4) Epoch 20, validation: loss=0.2092, simple_loss=0.3126, pruned_loss=0.05295, over 143649.00 frames. 2023-06-15 07:10:13,269 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 07:10:35,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=67500.0, ans=0.0 2023-06-15 07:10:37,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=67500.0, ans=0.125 2023-06-15 07:11:32,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=67700.0, ans=0.0 2023-06-15 07:11:34,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=67700.0, ans=0.0 2023-06-15 07:11:41,413 INFO [train.py:988] (3/4) Epoch 20, batch 50, loss[loss=0.2498, simple_loss=0.3227, pruned_loss=0.08849, over 19041.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3219, pruned_loss=0.09496, over 865554.79 frames. ], batch size: 89, lr: 1.55e-02, grad_scale: 32.0 2023-06-15 07:12:31,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=67966.66666666667, ans=0.125 2023-06-15 07:12:52,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=68033.33333333333, ans=0.125 2023-06-15 07:12:55,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=68033.33333333333, ans=0.125 2023-06-15 07:13:03,929 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.494e+02 2.103e+02 2.332e+02 2.749e+02 4.592e+02, threshold=4.664e+02, percent-clipped=1.0 2023-06-15 07:13:09,719 INFO [train.py:988] (3/4) Epoch 20, batch 100, loss[loss=0.2332, simple_loss=0.3168, pruned_loss=0.07482, over 18315.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.32, pruned_loss=0.09227, over 1513306.02 frames. 
], batch size: 74, lr: 1.55e-02, grad_scale: 32.0 2023-06-15 07:13:15,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=68100.0, ans=0.125 2023-06-15 07:13:48,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=68233.33333333333, ans=0.125 2023-06-15 07:13:54,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=68233.33333333333, ans=0.2 2023-06-15 07:14:04,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=68300.0, ans=0.035 2023-06-15 07:14:18,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=68366.66666666667, ans=0.0 2023-06-15 07:14:21,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=68366.66666666667, ans=0.125 2023-06-15 07:14:25,088 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.31 vs. limit=12.0 2023-06-15 07:14:37,881 INFO [train.py:988] (3/4) Epoch 20, batch 150, loss[loss=0.2472, simple_loss=0.3099, pruned_loss=0.09228, over 20567.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3214, pruned_loss=0.0914, over 2004923.91 frames. ], batch size: 189, lr: 1.55e-02, grad_scale: 32.0 2023-06-15 07:14:45,718 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-15 07:14:46,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=68433.33333333333, ans=0.125 2023-06-15 07:14:48,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=68433.33333333333, ans=0.2 2023-06-15 07:15:00,991 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:15:21,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=68566.66666666667, ans=0.125 2023-06-15 07:15:56,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=68700.0, ans=0.125 2023-06-15 07:15:59,993 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 1.981e+02 2.212e+02 2.517e+02 3.865e+02, threshold=4.424e+02, percent-clipped=0.0 2023-06-15 07:16:05,159 INFO [train.py:988] (3/4) Epoch 20, batch 200, loss[loss=0.2406, simple_loss=0.3215, pruned_loss=0.07989, over 17636.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3207, pruned_loss=0.09127, over 2402064.04 frames. ], batch size: 67, lr: 1.55e-02, grad_scale: 32.0 2023-06-15 07:16:25,713 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-06-15 07:16:36,276 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.46 vs. 
limit=15.0 2023-06-15 07:16:53,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=68900.0, ans=0.125 2023-06-15 07:16:53,827 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.76 vs. limit=15.0 2023-06-15 07:17:16,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=69033.33333333333, ans=0.0 2023-06-15 07:17:21,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=69033.33333333333, ans=0.1 2023-06-15 07:17:25,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=69033.33333333333, ans=0.0 2023-06-15 07:17:28,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=69033.33333333333, ans=0.125 2023-06-15 07:17:32,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=69100.0, ans=0.125 2023-06-15 07:17:33,540 INFO [train.py:988] (3/4) Epoch 20, batch 250, loss[loss=0.2758, simple_loss=0.3031, pruned_loss=0.1242, over 16720.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.32, pruned_loss=0.0914, over 2707117.19 frames. ], batch size: 392, lr: 1.54e-02, grad_scale: 32.0 2023-06-15 07:17:49,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=69166.66666666667, ans=0.0 2023-06-15 07:18:22,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=69233.33333333333, ans=0.5 2023-06-15 07:18:27,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=69300.0, ans=0.1 2023-06-15 07:18:32,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=69300.0, ans=0.125 2023-06-15 07:18:42,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=69366.66666666667, ans=0.0 2023-06-15 07:18:55,589 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.907e+02 2.171e+02 2.570e+02 4.262e+02, threshold=4.342e+02, percent-clipped=0.0 2023-06-15 07:19:00,687 INFO [train.py:988] (3/4) Epoch 20, batch 300, loss[loss=0.2359, simple_loss=0.3003, pruned_loss=0.08578, over 20633.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3207, pruned_loss=0.09112, over 2930803.43 frames. ], batch size: 189, lr: 1.54e-02, grad_scale: 32.0 2023-06-15 07:20:28,561 INFO [train.py:988] (3/4) Epoch 20, batch 350, loss[loss=0.2638, simple_loss=0.3241, pruned_loss=0.1017, over 20607.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3209, pruned_loss=0.09132, over 3119313.40 frames. ], batch size: 173, lr: 1.54e-02, grad_scale: 32.0 2023-06-15 07:20:37,162 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.27 vs. 
limit=8.0 2023-06-15 07:20:56,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=69833.33333333333, ans=0.05 2023-06-15 07:21:21,634 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.14 vs. limit=15.0 2023-06-15 07:21:24,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=69966.66666666667, ans=0.125 2023-06-15 07:21:30,639 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.11 vs. limit=10.0 2023-06-15 07:21:33,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=69966.66666666667, ans=0.2 2023-06-15 07:21:38,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=70033.33333333333, ans=0.125 2023-06-15 07:21:38,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=70033.33333333333, ans=0.125 2023-06-15 07:21:50,096 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 2.058e+02 2.277e+02 2.664e+02 3.684e+02, threshold=4.554e+02, percent-clipped=0.0 2023-06-15 07:21:52,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=70033.33333333333, ans=0.0 2023-06-15 07:21:55,236 INFO [train.py:988] (3/4) Epoch 20, batch 400, loss[loss=0.2337, simple_loss=0.3116, pruned_loss=0.07785, over 19224.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3213, pruned_loss=0.09133, over 3274127.69 frames. ], batch size: 92, lr: 1.54e-02, grad_scale: 32.0 2023-06-15 07:22:08,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=70100.0, ans=0.125 2023-06-15 07:22:38,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=70233.33333333333, ans=0.0 2023-06-15 07:22:39,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=70233.33333333333, ans=0.0 2023-06-15 07:22:52,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=70300.0, ans=0.125 2023-06-15 07:23:14,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=70366.66666666667, ans=0.0 2023-06-15 07:23:23,990 INFO [train.py:988] (3/4) Epoch 20, batch 450, loss[loss=0.237, simple_loss=0.3105, pruned_loss=0.0818, over 19845.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3201, pruned_loss=0.09101, over 3391977.33 frames. ], batch size: 120, lr: 1.54e-02, grad_scale: 32.0 2023-06-15 07:23:28,671 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.95 vs. 
limit=6.0 2023-06-15 07:23:45,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=70500.0, ans=0.0 2023-06-15 07:23:49,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=70500.0, ans=0.0 2023-06-15 07:24:03,911 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.00 vs. limit=15.0 2023-06-15 07:24:08,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=70566.66666666667, ans=0.125 2023-06-15 07:24:16,835 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.19 vs. limit=22.5 2023-06-15 07:24:35,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=70700.0, ans=0.125 2023-06-15 07:24:43,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=70700.0, ans=0.02 2023-06-15 07:24:44,762 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.638e+02 2.031e+02 2.228e+02 2.537e+02 4.676e+02, threshold=4.456e+02, percent-clipped=1.0 2023-06-15 07:24:49,669 INFO [train.py:988] (3/4) Epoch 20, batch 500, loss[loss=0.2132, simple_loss=0.2935, pruned_loss=0.0665, over 18811.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3195, pruned_loss=0.09059, over 3489460.31 frames. ], batch size: 83, lr: 1.53e-02, grad_scale: 32.0 2023-06-15 07:24:54,270 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.71 vs. limit=15.0 2023-06-15 07:24:55,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=70766.66666666667, ans=0.125 2023-06-15 07:26:07,912 INFO [train.py:988] (3/4) Epoch 21, batch 0, loss[loss=0.2771, simple_loss=0.3556, pruned_loss=0.09932, over 18300.00 frames. ], tot_loss[loss=0.2771, simple_loss=0.3556, pruned_loss=0.09932, over 18300.00 frames. ], batch size: 72, lr: 1.49e-02, grad_scale: 32.0 2023-06-15 07:26:07,913 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 07:26:14,416 INFO [train.py:1020] (3/4) Epoch 21, validation: loss=0.209, simple_loss=0.3126, pruned_loss=0.05274, over 143649.00 frames. 2023-06-15 07:26:14,418 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 07:26:35,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=71046.66666666667, ans=0.125 2023-06-15 07:26:35,596 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.72 vs. 
limit=22.5 2023-06-15 07:27:06,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=71180.0, ans=0.2 2023-06-15 07:27:29,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=71246.66666666667, ans=0.125 2023-06-15 07:27:31,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=71246.66666666667, ans=0.125 2023-06-15 07:27:41,936 INFO [train.py:988] (3/4) Epoch 21, batch 50, loss[loss=0.2499, simple_loss=0.3096, pruned_loss=0.09509, over 20266.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.321, pruned_loss=0.09182, over 854748.04 frames. ], batch size: 239, lr: 1.49e-02, grad_scale: 32.0 2023-06-15 07:27:45,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=71313.33333333333, ans=0.0 2023-06-15 07:27:48,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=71313.33333333333, ans=0.1 2023-06-15 07:27:48,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=71313.33333333333, ans=0.125 2023-06-15 07:28:06,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=71380.0, ans=0.125 2023-06-15 07:28:07,643 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.701e+02 2.043e+02 2.374e+02 2.878e+02 4.060e+02, threshold=4.748e+02, percent-clipped=0.0 2023-06-15 07:28:34,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=71513.33333333333, ans=0.0 2023-06-15 07:28:36,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=71513.33333333333, ans=0.125 2023-06-15 07:28:51,136 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.51 vs. limit=22.5 2023-06-15 07:28:53,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=71580.0, ans=0.2 2023-06-15 07:28:59,026 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:29:09,312 INFO [train.py:988] (3/4) Epoch 21, batch 100, loss[loss=0.2472, simple_loss=0.3144, pruned_loss=0.09001, over 20067.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3197, pruned_loss=0.09157, over 1504083.61 frames. 
], batch size: 133, lr: 1.49e-02, grad_scale: 32.0 2023-06-15 07:30:06,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=71846.66666666667, ans=0.0 2023-06-15 07:30:08,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=71846.66666666667, ans=0.1 2023-06-15 07:30:13,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=71846.66666666667, ans=0.035 2023-06-15 07:30:30,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=71913.33333333333, ans=0.1 2023-06-15 07:30:35,492 INFO [train.py:988] (3/4) Epoch 21, batch 150, loss[loss=0.2571, simple_loss=0.3182, pruned_loss=0.09799, over 20567.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3183, pruned_loss=0.09123, over 2006455.81 frames. ], batch size: 189, lr: 1.49e-02, grad_scale: 32.0 2023-06-15 07:30:39,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=71980.0, ans=0.0 2023-06-15 07:31:01,893 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.683e+02 2.007e+02 2.293e+02 2.740e+02 3.931e+02, threshold=4.586e+02, percent-clipped=0.0 2023-06-15 07:31:35,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=72180.0, ans=0.125 2023-06-15 07:31:40,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=72180.0, ans=0.1 2023-06-15 07:31:42,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=72180.0, ans=0.0 2023-06-15 07:32:02,955 INFO [train.py:988] (3/4) Epoch 21, batch 200, loss[loss=0.2555, simple_loss=0.3024, pruned_loss=0.1043, over 19843.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3186, pruned_loss=0.09017, over 2404443.14 frames. ], batch size: 293, lr: 1.49e-02, grad_scale: 32.0 2023-06-15 07:32:26,913 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.16 vs. limit=22.5 2023-06-15 07:32:48,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=72446.66666666667, ans=0.0 2023-06-15 07:33:19,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=72580.0, ans=0.05 2023-06-15 07:33:21,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=72580.0, ans=0.1 2023-06-15 07:33:29,807 INFO [train.py:988] (3/4) Epoch 21, batch 250, loss[loss=0.2367, simple_loss=0.3142, pruned_loss=0.07953, over 19108.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.318, pruned_loss=0.08951, over 2720319.10 frames. 
], batch size: 94, lr: 1.48e-02, grad_scale: 32.0 2023-06-15 07:33:35,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=72646.66666666667, ans=0.0 2023-06-15 07:33:38,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=72646.66666666667, ans=0.125 2023-06-15 07:33:46,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=72713.33333333333, ans=0.025 2023-06-15 07:33:54,722 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.599e+02 1.979e+02 2.224e+02 2.730e+02 4.525e+02, threshold=4.448e+02, percent-clipped=0.0 2023-06-15 07:33:58,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=72713.33333333333, ans=0.125 2023-06-15 07:34:07,980 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:34:26,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=72846.66666666667, ans=0.125 2023-06-15 07:34:55,986 INFO [train.py:988] (3/4) Epoch 21, batch 300, loss[loss=0.2688, simple_loss=0.3523, pruned_loss=0.09264, over 17622.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3174, pruned_loss=0.0894, over 2959484.00 frames. ], batch size: 67, lr: 1.48e-02, grad_scale: 32.0 2023-06-15 07:35:21,273 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.32 vs. limit=15.0 2023-06-15 07:35:34,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=73113.33333333333, ans=0.07 2023-06-15 07:35:41,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=73113.33333333333, ans=0.1 2023-06-15 07:35:49,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=73180.0, ans=0.0 2023-06-15 07:36:11,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=73246.66666666667, ans=0.125 2023-06-15 07:36:14,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=73246.66666666667, ans=0.125 2023-06-15 07:36:23,406 INFO [train.py:988] (3/4) Epoch 21, batch 350, loss[loss=0.2325, simple_loss=0.3114, pruned_loss=0.07678, over 19197.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3163, pruned_loss=0.08899, over 3157409.99 frames. 
], batch size: 92, lr: 1.48e-02, grad_scale: 32.0 2023-06-15 07:36:23,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=73313.33333333333, ans=0.2 2023-06-15 07:36:29,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=73313.33333333333, ans=0.2 2023-06-15 07:36:49,398 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.513e+02 2.024e+02 2.303e+02 3.042e+02 4.564e+02, threshold=4.607e+02, percent-clipped=2.0 2023-06-15 07:37:22,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=73513.33333333333, ans=0.2 2023-06-15 07:37:24,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=73513.33333333333, ans=0.125 2023-06-15 07:37:25,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=73513.33333333333, ans=0.125 2023-06-15 07:37:39,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=73580.0, ans=0.0 2023-06-15 07:37:49,163 INFO [train.py:988] (3/4) Epoch 21, batch 400, loss[loss=0.236, simple_loss=0.2976, pruned_loss=0.08714, over 20688.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3169, pruned_loss=0.0888, over 3289192.64 frames. ], batch size: 211, lr: 1.48e-02, grad_scale: 32.0 2023-06-15 07:37:49,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=73646.66666666667, ans=0.125 2023-06-15 07:37:56,426 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.78 vs. limit=8.0 2023-06-15 07:38:19,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=73713.33333333333, ans=0.0 2023-06-15 07:39:11,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=73913.33333333333, ans=0.1 2023-06-15 07:39:15,503 INFO [train.py:988] (3/4) Epoch 21, batch 450, loss[loss=0.2502, simple_loss=0.3233, pruned_loss=0.08852, over 19099.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3177, pruned_loss=0.08876, over 3398048.56 frames. 
], batch size: 94, lr: 1.47e-02, grad_scale: 32.0 2023-06-15 07:39:30,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=73980.0, ans=0.2 2023-06-15 07:39:37,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=74046.66666666667, ans=0.05 2023-06-15 07:39:42,140 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.623e+02 2.035e+02 2.412e+02 2.747e+02 5.192e+02, threshold=4.824e+02, percent-clipped=2.0 2023-06-15 07:40:19,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=74180.0, ans=0.1 2023-06-15 07:40:19,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=74180.0, ans=0.1 2023-06-15 07:40:22,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=74246.66666666667, ans=0.125 2023-06-15 07:40:24,671 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.63 vs. limit=15.0 2023-06-15 07:40:40,666 INFO [train.py:988] (3/4) Epoch 21, batch 500, loss[loss=0.2394, simple_loss=0.3135, pruned_loss=0.08264, over 19891.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3174, pruned_loss=0.08832, over 3491046.21 frames. ], batch size: 120, lr: 1.47e-02, grad_scale: 32.0 2023-06-15 07:40:55,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=74380.0, ans=0.125 2023-06-15 07:40:57,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=74380.0, ans=0.125 2023-06-15 07:41:06,328 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.22 vs. limit=22.5 2023-06-15 07:41:08,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=74380.0, ans=0.125 2023-06-15 07:41:12,532 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.99 vs. limit=15.0 2023-06-15 07:41:27,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=74513.33333333333, ans=0.125 2023-06-15 07:42:00,899 INFO [train.py:988] (3/4) Epoch 22, batch 0, loss[loss=0.2296, simple_loss=0.3102, pruned_loss=0.07451, over 19657.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3102, pruned_loss=0.07451, over 19657.00 frames. ], batch size: 110, lr: 1.44e-02, grad_scale: 32.0 2023-06-15 07:42:00,900 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 07:42:07,055 INFO [train.py:1020] (3/4) Epoch 22, validation: loss=0.2075, simple_loss=0.3107, pruned_loss=0.05212, over 143649.00 frames. 
2023-06-15 07:42:07,056 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 07:42:12,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=74533.33333333333, ans=0.125 2023-06-15 07:42:17,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=74533.33333333333, ans=0.2 2023-06-15 07:42:23,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=74600.0, ans=0.125 2023-06-15 07:42:44,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff2.min_abs, batch_count=74666.66666666667, ans=0.1 2023-06-15 07:43:03,103 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.996e+02 2.190e+02 2.519e+02 3.668e+02, threshold=4.380e+02, percent-clipped=0.0 2023-06-15 07:43:16,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=74800.0, ans=0.2 2023-06-15 07:43:27,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=74800.0, ans=0.0 2023-06-15 07:43:35,416 INFO [train.py:988] (3/4) Epoch 22, batch 50, loss[loss=0.2432, simple_loss=0.3178, pruned_loss=0.08433, over 18637.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3177, pruned_loss=0.08767, over 852756.05 frames. ], batch size: 80, lr: 1.43e-02, grad_scale: 32.0 2023-06-15 07:43:50,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=74933.33333333333, ans=0.1 2023-06-15 07:44:14,564 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.30 vs. limit=15.0 2023-06-15 07:44:18,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=75000.0, ans=0.125 2023-06-15 07:45:02,462 INFO [train.py:988] (3/4) Epoch 22, batch 100, loss[loss=0.2613, simple_loss=0.3157, pruned_loss=0.1034, over 20300.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3167, pruned_loss=0.08806, over 1499435.55 frames. ], batch size: 239, lr: 1.43e-02, grad_scale: 32.0 2023-06-15 07:45:39,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=75333.33333333333, ans=0.02 2023-06-15 07:45:59,422 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.974e+02 2.218e+02 2.504e+02 3.922e+02, threshold=4.437e+02, percent-clipped=0.0 2023-06-15 07:46:16,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=75466.66666666667, ans=0.125 2023-06-15 07:46:31,657 INFO [train.py:988] (3/4) Epoch 22, batch 150, loss[loss=0.2647, simple_loss=0.336, pruned_loss=0.09668, over 18628.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3176, pruned_loss=0.0879, over 2008135.97 frames. ], batch size: 80, lr: 1.43e-02, grad_scale: 32.0 2023-06-15 07:47:58,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=75866.66666666667, ans=0.2 2023-06-15 07:48:00,059 INFO [train.py:988] (3/4) Epoch 22, batch 200, loss[loss=0.2323, simple_loss=0.3072, pruned_loss=0.0787, over 18636.00 frames. 
], tot_loss[loss=0.2449, simple_loss=0.3155, pruned_loss=0.08711, over 2406855.47 frames. ], batch size: 80, lr: 1.43e-02, grad_scale: 32.0 2023-06-15 07:48:05,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=75866.66666666667, ans=0.1 2023-06-15 07:48:30,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=75933.33333333333, ans=0.125 2023-06-15 07:48:39,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=76000.0, ans=0.025 2023-06-15 07:48:40,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=76000.0, ans=10.0 2023-06-15 07:48:55,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=76066.66666666667, ans=0.125 2023-06-15 07:48:56,267 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.899e+02 2.203e+02 2.409e+02 3.907e+02, threshold=4.406e+02, percent-clipped=0.0 2023-06-15 07:49:04,602 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.69 vs. limit=15.0 2023-06-15 07:49:16,766 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.44 vs. limit=15.0 2023-06-15 07:49:29,219 INFO [train.py:988] (3/4) Epoch 22, batch 250, loss[loss=0.2283, simple_loss=0.3, pruned_loss=0.07825, over 19528.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3147, pruned_loss=0.08617, over 2725409.33 frames. ], batch size: 102, lr: 1.43e-02, grad_scale: 64.0 2023-06-15 07:49:31,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=76200.0, ans=0.125 2023-06-15 07:49:51,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=76266.66666666667, ans=0.0 2023-06-15 07:50:18,875 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:50:44,709 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.33 vs. limit=5.0 2023-06-15 07:50:47,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=76466.66666666667, ans=0.1 2023-06-15 07:50:56,327 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.86 vs. limit=6.0 2023-06-15 07:50:57,248 INFO [train.py:988] (3/4) Epoch 22, batch 300, loss[loss=0.2605, simple_loss=0.332, pruned_loss=0.09444, over 20089.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3158, pruned_loss=0.08645, over 2949852.43 frames. ], batch size: 133, lr: 1.42e-02, grad_scale: 64.0 2023-06-15 07:51:04,989 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.73 vs. 
limit=22.5 2023-06-15 07:51:20,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=76600.0, ans=0.125 2023-06-15 07:51:23,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=76600.0, ans=0.0 2023-06-15 07:51:37,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=76666.66666666667, ans=0.125 2023-06-15 07:51:53,162 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.638e+02 1.887e+02 2.089e+02 2.496e+02 3.491e+02, threshold=4.177e+02, percent-clipped=0.0 2023-06-15 07:52:04,859 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.41 vs. limit=6.0 2023-06-15 07:52:10,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=76800.0, ans=0.125 2023-06-15 07:52:17,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=76800.0, ans=0.2 2023-06-15 07:52:22,209 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.86 vs. limit=10.0 2023-06-15 07:52:24,361 INFO [train.py:988] (3/4) Epoch 22, batch 350, loss[loss=0.2324, simple_loss=0.3046, pruned_loss=0.08011, over 20262.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3163, pruned_loss=0.08686, over 3136618.83 frames. ], batch size: 141, lr: 1.42e-02, grad_scale: 64.0 2023-06-15 07:52:37,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=76866.66666666667, ans=0.2 2023-06-15 07:52:41,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=76933.33333333333, ans=0.0 2023-06-15 07:52:46,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=76933.33333333333, ans=0.1 2023-06-15 07:53:02,215 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:53:38,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=77133.33333333333, ans=0.125 2023-06-15 07:53:54,157 INFO [train.py:988] (3/4) Epoch 22, batch 400, loss[loss=0.2666, simple_loss=0.3285, pruned_loss=0.1024, over 20498.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3159, pruned_loss=0.0873, over 3280357.73 frames. 
], batch size: 160, lr: 1.42e-02, grad_scale: 64.0 2023-06-15 07:54:00,379 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:54:20,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=77266.66666666667, ans=0.125 2023-06-15 07:54:25,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=77266.66666666667, ans=0.1 2023-06-15 07:54:33,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=77333.33333333333, ans=0.125 2023-06-15 07:54:44,671 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:54:50,822 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.618e+02 1.936e+02 2.140e+02 2.519e+02 4.231e+02, threshold=4.281e+02, percent-clipped=1.0 2023-06-15 07:54:51,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=77400.0, ans=0.1 2023-06-15 07:55:07,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=77466.66666666667, ans=0.07 2023-06-15 07:55:10,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=77466.66666666667, ans=0.2 2023-06-15 07:55:17,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=77466.66666666667, ans=0.125 2023-06-15 07:55:22,554 INFO [train.py:988] (3/4) Epoch 22, batch 450, loss[loss=0.2405, simple_loss=0.3124, pruned_loss=0.08435, over 18802.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.316, pruned_loss=0.08719, over 3383202.04 frames. ], batch size: 83, lr: 1.42e-02, grad_scale: 64.0 2023-06-15 07:55:45,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=77600.0, ans=0.5 2023-06-15 07:56:41,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=77800.0, ans=0.125 2023-06-15 07:56:49,268 INFO [train.py:988] (3/4) Epoch 22, batch 500, loss[loss=0.2559, simple_loss=0.3396, pruned_loss=0.08611, over 15381.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3156, pruned_loss=0.08711, over 3471486.71 frames. ], batch size: 44, lr: 1.42e-02, grad_scale: 64.0 2023-06-15 07:56:51,821 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.05 vs. 
limit=22.5 2023-06-15 07:56:55,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=77866.66666666667, ans=0.125 2023-06-15 07:56:56,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=77866.66666666667, ans=0.0 2023-06-15 07:57:23,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=78000.0, ans=0.125 2023-06-15 07:57:33,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=78000.0, ans=0.0 2023-06-15 07:57:39,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=78066.66666666667, ans=0.125 2023-06-15 07:58:03,836 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.901e+02 2.129e+02 2.481e+02 3.635e+02, threshold=4.258e+02, percent-clipped=0.0 2023-06-15 07:58:03,883 INFO [train.py:988] (3/4) Epoch 23, batch 0, loss[loss=0.2507, simple_loss=0.2796, pruned_loss=0.1109, over 16955.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.2796, pruned_loss=0.1109, over 16955.00 frames. ], batch size: 391, lr: 1.38e-02, grad_scale: 64.0 2023-06-15 07:58:03,884 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 07:58:10,167 INFO [train.py:1020] (3/4) Epoch 23, validation: loss=0.2051, simple_loss=0.3092, pruned_loss=0.05051, over 143649.00 frames. 2023-06-15 07:58:10,168 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 07:58:14,719 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-06-15 07:59:06,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=78280.0, ans=0.125 2023-06-15 07:59:14,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=78280.0, ans=0.04949747468305833 2023-06-15 07:59:35,444 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.64 vs. limit=22.5 2023-06-15 07:59:39,928 INFO [train.py:988] (3/4) Epoch 23, batch 50, loss[loss=0.2468, simple_loss=0.319, pruned_loss=0.0873, over 18276.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3119, pruned_loss=0.08649, over 858025.72 frames. ], batch size: 74, lr: 1.38e-02, grad_scale: 64.0 2023-06-15 08:00:04,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=78480.0, ans=0.125 2023-06-15 08:00:07,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=78480.0, ans=0.0 2023-06-15 08:00:13,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=78480.0, ans=0.1 2023-06-15 08:00:58,580 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=13.69 vs. 
limit=15.0 2023-06-15 08:01:01,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=78680.0, ans=0.1 2023-06-15 08:01:10,780 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 1.902e+02 2.065e+02 2.419e+02 3.199e+02, threshold=4.129e+02, percent-clipped=0.0 2023-06-15 08:01:10,834 INFO [train.py:988] (3/4) Epoch 23, batch 100, loss[loss=0.2463, simple_loss=0.3054, pruned_loss=0.09357, over 20240.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3116, pruned_loss=0.08743, over 1504096.22 frames. ], batch size: 239, lr: 1.38e-02, grad_scale: 64.0 2023-06-15 08:01:34,318 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-15 08:02:21,102 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.83 vs. limit=15.0 2023-06-15 08:02:27,277 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.74 vs. limit=15.0 2023-06-15 08:02:40,326 INFO [train.py:988] (3/4) Epoch 23, batch 150, loss[loss=0.2397, simple_loss=0.3165, pruned_loss=0.08144, over 18254.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.314, pruned_loss=0.08629, over 1994338.31 frames. ], batch size: 74, lr: 1.38e-02, grad_scale: 64.0 2023-06-15 08:02:43,063 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.79 vs. limit=15.0 2023-06-15 08:02:56,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=79146.66666666667, ans=0.0 2023-06-15 08:03:20,955 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.58 vs. limit=22.5 2023-06-15 08:03:50,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=79346.66666666667, ans=0.0 2023-06-15 08:04:09,193 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.913e+02 2.150e+02 2.448e+02 4.177e+02, threshold=4.300e+02, percent-clipped=1.0 2023-06-15 08:04:09,243 INFO [train.py:988] (3/4) Epoch 23, batch 200, loss[loss=0.2436, simple_loss=0.2982, pruned_loss=0.09449, over 19855.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.312, pruned_loss=0.08535, over 2400925.67 frames. ], batch size: 294, lr: 1.37e-02, grad_scale: 64.0 2023-06-15 08:04:18,672 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.16 vs. limit=22.5 2023-06-15 08:05:03,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=79613.33333333333, ans=0.0 2023-06-15 08:05:29,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=79680.0, ans=0.125 2023-06-15 08:05:34,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=79680.0, ans=0.125 2023-06-15 08:05:37,764 INFO [train.py:988] (3/4) Epoch 23, batch 250, loss[loss=0.2178, simple_loss=0.3016, pruned_loss=0.06704, over 19478.00 frames. 
], tot_loss[loss=0.2411, simple_loss=0.3122, pruned_loss=0.08503, over 2705731.20 frames. ], batch size: 105, lr: 1.37e-02, grad_scale: 64.0 2023-06-15 08:05:38,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=79746.66666666667, ans=0.125 2023-06-15 08:05:58,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=79813.33333333333, ans=0.125 2023-06-15 08:06:37,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=79946.66666666667, ans=0.2 2023-06-15 08:06:58,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=80013.33333333333, ans=0.1 2023-06-15 08:07:03,797 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.92 vs. limit=22.5 2023-06-15 08:07:04,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=80013.33333333333, ans=0.1 2023-06-15 08:07:10,031 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.585e+02 1.868e+02 2.154e+02 2.660e+02 5.559e+02, threshold=4.308e+02, percent-clipped=2.0 2023-06-15 08:07:10,080 INFO [train.py:988] (3/4) Epoch 23, batch 300, loss[loss=0.2642, simple_loss=0.3426, pruned_loss=0.09291, over 16192.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3122, pruned_loss=0.08443, over 2938775.77 frames. ], batch size: 51, lr: 1.37e-02, grad_scale: 64.0 2023-06-15 08:07:19,756 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.13 vs. limit=12.0 2023-06-15 08:07:31,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=80146.66666666667, ans=0.125 2023-06-15 08:08:09,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=80280.0, ans=0.015 2023-06-15 08:08:29,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=80346.66666666667, ans=0.07 2023-06-15 08:08:38,454 INFO [train.py:988] (3/4) Epoch 23, batch 350, loss[loss=0.2331, simple_loss=0.3159, pruned_loss=0.07521, over 18244.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3112, pruned_loss=0.08427, over 3134082.26 frames. ], batch size: 74, lr: 1.37e-02, grad_scale: 64.0 2023-06-15 08:08:39,447 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.44 vs. 
limit=10.0 2023-06-15 08:08:50,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=80413.33333333333, ans=0.125 2023-06-15 08:09:06,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=80480.0, ans=0.125 2023-06-15 08:09:16,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=80546.66666666667, ans=0.2 2023-06-15 08:09:47,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=80680.0, ans=0.05 2023-06-15 08:09:56,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=80680.0, ans=0.125 2023-06-15 08:10:05,642 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.945e+02 2.149e+02 2.617e+02 4.906e+02, threshold=4.297e+02, percent-clipped=1.0 2023-06-15 08:10:05,694 INFO [train.py:988] (3/4) Epoch 23, batch 400, loss[loss=0.2408, simple_loss=0.3151, pruned_loss=0.08326, over 19071.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3116, pruned_loss=0.0848, over 3260224.69 frames. ], batch size: 89, lr: 1.37e-02, grad_scale: 64.0 2023-06-15 08:10:23,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=80813.33333333333, ans=0.0 2023-06-15 08:10:44,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=80880.0, ans=0.125 2023-06-15 08:10:46,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=80880.0, ans=0.1 2023-06-15 08:10:59,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=80946.66666666667, ans=0.0 2023-06-15 08:11:00,921 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.61 vs. limit=22.5 2023-06-15 08:11:34,345 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.02 vs. limit=15.0 2023-06-15 08:11:34,907 INFO [train.py:988] (3/4) Epoch 23, batch 450, loss[loss=0.2638, simple_loss=0.3465, pruned_loss=0.09053, over 15476.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3114, pruned_loss=0.08438, over 3376171.62 frames. ], batch size: 44, lr: 1.36e-02, grad_scale: 64.0 2023-06-15 08:12:01,251 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.93 vs. 
limit=22.5 2023-06-15 08:12:14,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=81213.33333333333, ans=0.125 2023-06-15 08:12:49,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=81346.66666666667, ans=0.125 2023-06-15 08:12:54,915 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 08:12:58,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=81346.66666666667, ans=0.0 2023-06-15 08:13:01,017 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.601e+02 2.014e+02 2.230e+02 2.721e+02 4.611e+02, threshold=4.461e+02, percent-clipped=1.0 2023-06-15 08:13:01,066 INFO [train.py:988] (3/4) Epoch 23, batch 500, loss[loss=0.2402, simple_loss=0.307, pruned_loss=0.08666, over 19096.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3113, pruned_loss=0.08371, over 3470595.42 frames. ], batch size: 89, lr: 1.36e-02, grad_scale: 64.0 2023-06-15 08:13:01,797 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.02 vs. limit=22.5 2023-06-15 08:13:34,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=81546.66666666667, ans=0.04949747468305833 2023-06-15 08:13:43,850 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=24.92 vs. limit=22.5 2023-06-15 08:14:20,672 INFO [train.py:988] (3/4) Epoch 24, batch 0, loss[loss=0.2558, simple_loss=0.3334, pruned_loss=0.08917, over 16306.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3334, pruned_loss=0.08917, over 16306.00 frames. ], batch size: 52, lr: 1.33e-02, grad_scale: 64.0 2023-06-15 08:14:20,673 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 08:14:27,200 INFO [train.py:1020] (3/4) Epoch 24, validation: loss=0.2057, simple_loss=0.3089, pruned_loss=0.05123, over 143649.00 frames. 2023-06-15 08:14:27,200 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 08:14:38,120 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.28 vs. limit=15.0 2023-06-15 08:14:39,803 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. limit=6.0 2023-06-15 08:14:47,940 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.60 vs. 
limit=15.0 2023-06-15 08:14:55,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=81693.33333333333, ans=0.1 2023-06-15 08:14:58,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=81693.33333333333, ans=0.1 2023-06-15 08:15:02,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=81760.0, ans=0.125 2023-06-15 08:15:35,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=81826.66666666667, ans=0.0 2023-06-15 08:15:41,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=81893.33333333333, ans=0.0 2023-06-15 08:15:47,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=81893.33333333333, ans=0.1 2023-06-15 08:15:57,120 INFO [train.py:988] (3/4) Epoch 24, batch 50, loss[loss=0.2321, simple_loss=0.3181, pruned_loss=0.07306, over 17631.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3105, pruned_loss=0.08322, over 839460.50 frames. ], batch size: 67, lr: 1.33e-02, grad_scale: 64.0 2023-06-15 08:16:04,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=81960.0, ans=0.0 2023-06-15 08:16:24,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=82026.66666666667, ans=0.2 2023-06-15 08:16:29,072 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.940e+02 2.249e+02 2.625e+02 3.999e+02, threshold=4.499e+02, percent-clipped=0.0 2023-06-15 08:16:58,811 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.89 vs. limit=15.0 2023-06-15 08:16:59,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=82160.0, ans=0.125 2023-06-15 08:17:25,786 INFO [train.py:988] (3/4) Epoch 24, batch 100, loss[loss=0.2553, simple_loss=0.3197, pruned_loss=0.09548, over 20527.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3088, pruned_loss=0.08274, over 1487301.47 frames. ], batch size: 160, lr: 1.33e-02, grad_scale: 64.0 2023-06-15 08:17:39,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=82293.33333333333, ans=0.2 2023-06-15 08:17:50,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=82360.0, ans=0.125 2023-06-15 08:18:15,042 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.54 vs. limit=15.0 2023-06-15 08:18:26,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=82493.33333333333, ans=0.125 2023-06-15 08:18:30,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=82493.33333333333, ans=0.125 2023-06-15 08:18:54,704 INFO [train.py:988] (3/4) Epoch 24, batch 150, loss[loss=0.2313, simple_loss=0.2938, pruned_loss=0.08441, over 20631.00 frames. 
], tot_loss[loss=0.238, simple_loss=0.31, pruned_loss=0.08297, over 2003199.24 frames. ], batch size: 173, lr: 1.33e-02, grad_scale: 64.0 2023-06-15 08:18:55,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=82626.66666666667, ans=0.0 2023-06-15 08:19:26,439 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.86 vs. limit=15.0 2023-06-15 08:19:27,050 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.862e+02 2.076e+02 2.332e+02 3.767e+02, threshold=4.152e+02, percent-clipped=0.0 2023-06-15 08:19:42,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=82760.0, ans=0.0 2023-06-15 08:19:55,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=82826.66666666667, ans=0.035 2023-06-15 08:20:14,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=82893.33333333333, ans=0.0 2023-06-15 08:20:15,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=82893.33333333333, ans=0.0 2023-06-15 08:20:15,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=82893.33333333333, ans=0.125 2023-06-15 08:20:18,372 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-15 08:20:24,118 INFO [train.py:988] (3/4) Epoch 24, batch 200, loss[loss=0.2394, simple_loss=0.3301, pruned_loss=0.07437, over 18340.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3102, pruned_loss=0.08312, over 2408051.82 frames. ], batch size: 72, lr: 1.32e-02, grad_scale: 64.0 2023-06-15 08:20:28,674 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=12.0 2023-06-15 08:21:39,703 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=12.0 2023-06-15 08:21:53,217 INFO [train.py:988] (3/4) Epoch 24, batch 250, loss[loss=0.2516, simple_loss=0.3232, pruned_loss=0.08995, over 19124.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3101, pruned_loss=0.08297, over 2726058.65 frames. ], batch size: 94, lr: 1.32e-02, grad_scale: 64.0 2023-06-15 08:22:01,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=83293.33333333333, ans=0.125 2023-06-15 08:22:20,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=83360.0, ans=10.0 2023-06-15 08:22:24,053 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.82 vs. 
limit=15.0 2023-06-15 08:22:24,748 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.954e+02 2.132e+02 2.511e+02 4.253e+02, threshold=4.265e+02, percent-clipped=1.0 2023-06-15 08:22:32,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=83426.66666666667, ans=0.1 2023-06-15 08:22:35,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=83426.66666666667, ans=0.1 2023-06-15 08:23:21,373 INFO [train.py:988] (3/4) Epoch 24, batch 300, loss[loss=0.2611, simple_loss=0.3423, pruned_loss=0.08994, over 16315.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3109, pruned_loss=0.08273, over 2957757.83 frames. ], batch size: 52, lr: 1.32e-02, grad_scale: 64.0 2023-06-15 08:23:54,046 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.96 vs. limit=15.0 2023-06-15 08:23:57,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=83760.0, ans=0.1 2023-06-15 08:24:06,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=83760.0, ans=0.0 2023-06-15 08:24:18,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=83826.66666666667, ans=0.1 2023-06-15 08:24:18,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=83826.66666666667, ans=0.025 2023-06-15 08:24:31,767 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.99 vs. limit=15.0 2023-06-15 08:24:49,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=83960.0, ans=0.125 2023-06-15 08:24:50,661 INFO [train.py:988] (3/4) Epoch 24, batch 350, loss[loss=0.2667, simple_loss=0.3501, pruned_loss=0.09162, over 16409.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3103, pruned_loss=0.08272, over 3132479.42 frames. ], batch size: 52, lr: 1.32e-02, grad_scale: 64.0 2023-06-15 08:25:22,586 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 1.988e+02 2.360e+02 2.767e+02 4.250e+02, threshold=4.720e+02, percent-clipped=0.0 2023-06-15 08:25:34,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=84093.33333333333, ans=0.125 2023-06-15 08:25:51,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=84160.0, ans=0.0 2023-06-15 08:25:53,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=84160.0, ans=0.1 2023-06-15 08:26:04,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=84226.66666666667, ans=0.1 2023-06-15 08:26:05,216 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.20 vs. 
limit=15.0 2023-06-15 08:26:21,073 INFO [train.py:988] (3/4) Epoch 24, batch 400, loss[loss=0.2214, simple_loss=0.2974, pruned_loss=0.07275, over 19295.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3101, pruned_loss=0.08211, over 3279200.03 frames. ], batch size: 98, lr: 1.32e-02, grad_scale: 64.0 2023-06-15 08:26:26,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=84293.33333333333, ans=0.0 2023-06-15 08:26:27,437 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.80 vs. limit=15.0 2023-06-15 08:27:12,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=84493.33333333333, ans=0.125 2023-06-15 08:27:14,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=84493.33333333333, ans=0.1 2023-06-15 08:27:19,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=84493.33333333333, ans=0.125 2023-06-15 08:27:26,746 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 08:27:28,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=84493.33333333333, ans=0.125 2023-06-15 08:27:40,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=84560.0, ans=0.125 2023-06-15 08:27:49,968 INFO [train.py:988] (3/4) Epoch 24, batch 450, loss[loss=0.253, simple_loss=0.334, pruned_loss=0.08605, over 17021.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3106, pruned_loss=0.08203, over 3384660.45 frames. ], batch size: 60, lr: 1.31e-02, grad_scale: 64.0 2023-06-15 08:28:20,664 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. limit=6.0 2023-06-15 08:28:21,511 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.787e+02 2.025e+02 2.287e+02 3.179e+02, threshold=4.050e+02, percent-clipped=0.0 2023-06-15 08:28:59,703 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.34 vs. limit=15.0 2023-06-15 08:29:15,499 INFO [train.py:988] (3/4) Epoch 24, batch 500, loss[loss=0.2416, simple_loss=0.2962, pruned_loss=0.09354, over 19852.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3114, pruned_loss=0.0826, over 3472380.72 frames. ], batch size: 293, lr: 1.31e-02, grad_scale: 32.0 2023-06-15 08:29:22,847 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.40 vs. 
limit=15.0 2023-06-15 08:29:36,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=85026.66666666667, ans=22.5 2023-06-15 08:29:47,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=85093.33333333333, ans=0.125 2023-06-15 08:30:03,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=85160.0, ans=0.125 2023-06-15 08:30:31,637 INFO [train.py:988] (3/4) Epoch 25, batch 0, loss[loss=0.2134, simple_loss=0.2953, pruned_loss=0.06579, over 19234.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2953, pruned_loss=0.06579, over 19234.00 frames. ], batch size: 92, lr: 1.29e-02, grad_scale: 32.0 2023-06-15 08:30:31,638 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 08:30:37,733 INFO [train.py:1020] (3/4) Epoch 25, validation: loss=0.205, simple_loss=0.3085, pruned_loss=0.05071, over 143649.00 frames. 2023-06-15 08:30:37,735 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 08:31:32,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=85373.33333333333, ans=0.1 2023-06-15 08:31:44,642 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.894e+02 2.218e+02 2.492e+02 3.446e+02, threshold=4.437e+02, percent-clipped=0.0 2023-06-15 08:31:53,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=85440.0, ans=0.2 2023-06-15 08:31:54,596 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.86 vs. limit=22.5 2023-06-15 08:32:07,389 INFO [train.py:988] (3/4) Epoch 25, batch 50, loss[loss=0.2145, simple_loss=0.2956, pruned_loss=0.06667, over 19079.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.308, pruned_loss=0.08068, over 854682.74 frames. ], batch size: 89, lr: 1.28e-02, grad_scale: 32.0 2023-06-15 08:32:15,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=85506.66666666667, ans=0.125 2023-06-15 08:33:13,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=85706.66666666667, ans=0.0 2023-06-15 08:33:34,492 INFO [train.py:988] (3/4) Epoch 25, batch 100, loss[loss=0.2493, simple_loss=0.3243, pruned_loss=0.08717, over 19074.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3095, pruned_loss=0.08011, over 1501425.45 frames. ], batch size: 89, lr: 1.28e-02, grad_scale: 32.0 2023-06-15 08:33:45,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=85840.0, ans=0.1 2023-06-15 08:34:24,715 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.66 vs. 
limit=12.0 2023-06-15 08:34:25,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=86040.0, ans=0.04949747468305833 2023-06-15 08:34:39,969 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.831e+02 2.059e+02 2.312e+02 3.649e+02, threshold=4.117e+02, percent-clipped=0.0 2023-06-15 08:35:02,330 INFO [train.py:988] (3/4) Epoch 25, batch 150, loss[loss=0.2324, simple_loss=0.3118, pruned_loss=0.07649, over 19671.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3099, pruned_loss=0.08008, over 2005749.43 frames. ], batch size: 110, lr: 1.28e-02, grad_scale: 32.0 2023-06-15 08:35:42,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=86306.66666666667, ans=0.2 2023-06-15 08:35:48,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=86306.66666666667, ans=0.125 2023-06-15 08:36:15,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=86440.0, ans=0.0 2023-06-15 08:36:22,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=86440.0, ans=0.0 2023-06-15 08:36:29,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=86506.66666666667, ans=0.125 2023-06-15 08:36:30,648 INFO [train.py:988] (3/4) Epoch 25, batch 200, loss[loss=0.2425, simple_loss=0.3289, pruned_loss=0.07812, over 17584.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3105, pruned_loss=0.08055, over 2404949.03 frames. ], batch size: 67, lr: 1.28e-02, grad_scale: 32.0 2023-06-15 08:36:53,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=86573.33333333333, ans=10.0 2023-06-15 08:36:57,679 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=15.0 2023-06-15 08:37:12,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=86640.0, ans=0.125 2023-06-15 08:37:17,411 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 08:37:31,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=86706.66666666667, ans=0.0 2023-06-15 08:37:33,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=86706.66666666667, ans=0.125 2023-06-15 08:37:33,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=86706.66666666667, ans=0.125 2023-06-15 08:37:33,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=86706.66666666667, ans=0.2 2023-06-15 08:37:34,639 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.840e+02 2.036e+02 2.337e+02 3.806e+02, threshold=4.072e+02, percent-clipped=0.0 2023-06-15 08:37:41,699 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.23 vs. 
limit=12.0 2023-06-15 08:37:58,075 INFO [train.py:988] (3/4) Epoch 25, batch 250, loss[loss=0.2971, simple_loss=0.3696, pruned_loss=0.1123, over 16325.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3104, pruned_loss=0.08144, over 2724819.82 frames. ], batch size: 52, lr: 1.28e-02, grad_scale: 32.0 2023-06-15 08:38:05,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=86840.0, ans=0.125 2023-06-15 08:38:19,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=86906.66666666667, ans=0.125 2023-06-15 08:38:19,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=86906.66666666667, ans=0.0 2023-06-15 08:39:17,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=87106.66666666667, ans=0.125 2023-06-15 08:39:26,086 INFO [train.py:988] (3/4) Epoch 25, batch 300, loss[loss=0.2389, simple_loss=0.3077, pruned_loss=0.08509, over 20275.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3102, pruned_loss=0.08135, over 2947195.40 frames. ], batch size: 141, lr: 1.27e-02, grad_scale: 32.0 2023-06-15 08:39:52,059 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.21 vs. limit=15.0 2023-06-15 08:39:54,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=87240.0, ans=0.1 2023-06-15 08:40:26,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=87373.33333333333, ans=0.025 2023-06-15 08:40:31,120 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.920e+02 2.106e+02 2.371e+02 3.707e+02, threshold=4.212e+02, percent-clipped=0.0 2023-06-15 08:40:34,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=87440.0, ans=0.0 2023-06-15 08:40:54,082 INFO [train.py:988] (3/4) Epoch 25, batch 350, loss[loss=0.2224, simple_loss=0.2935, pruned_loss=0.07563, over 20565.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3088, pruned_loss=0.08143, over 3142145.79 frames. ], batch size: 173, lr: 1.27e-02, grad_scale: 32.0 2023-06-15 08:41:08,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=87506.66666666667, ans=0.1 2023-06-15 08:41:09,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=87573.33333333333, ans=0.125 2023-06-15 08:41:34,032 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.27 vs. limit=15.0 2023-06-15 08:42:09,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=87773.33333333333, ans=0.0 2023-06-15 08:42:20,004 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 08:42:21,289 INFO [train.py:988] (3/4) Epoch 25, batch 400, loss[loss=0.2247, simple_loss=0.3025, pruned_loss=0.07346, over 18952.00 frames. 
], tot_loss[loss=0.2366, simple_loss=0.3091, pruned_loss=0.08207, over 3272510.67 frames. ], batch size: 86, lr: 1.27e-02, grad_scale: 32.0 2023-06-15 08:42:24,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=87840.0, ans=0.0 2023-06-15 08:42:39,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=87906.66666666667, ans=0.125 2023-06-15 08:42:45,835 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.82 vs. limit=15.0 2023-06-15 08:43:04,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=87973.33333333333, ans=0.5 2023-06-15 08:43:21,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=88040.0, ans=10.0 2023-06-15 08:43:21,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=88040.0, ans=0.1 2023-06-15 08:43:26,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=88040.0, ans=0.125 2023-06-15 08:43:26,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=88040.0, ans=0.0 2023-06-15 08:43:27,914 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.933e+02 2.148e+02 2.533e+02 3.587e+02, threshold=4.297e+02, percent-clipped=0.0 2023-06-15 08:43:35,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=88106.66666666667, ans=0.1 2023-06-15 08:43:40,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=88106.66666666667, ans=0.0 2023-06-15 08:43:47,752 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.72 vs. limit=15.0 2023-06-15 08:43:50,324 INFO [train.py:988] (3/4) Epoch 25, batch 450, loss[loss=0.2157, simple_loss=0.288, pruned_loss=0.07174, over 20691.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3087, pruned_loss=0.08207, over 3388767.45 frames. ], batch size: 211, lr: 1.27e-02, grad_scale: 32.0 2023-06-15 08:44:03,759 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.55 vs. limit=6.0 2023-06-15 08:44:04,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=88173.33333333333, ans=0.0 2023-06-15 08:44:06,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=88240.0, ans=10.0 2023-06-15 08:44:24,897 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.06 vs. 
limit=22.5 2023-06-15 08:44:40,720 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 08:45:02,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=88440.0, ans=0.0 2023-06-15 08:45:06,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=88440.0, ans=0.125 2023-06-15 08:45:15,640 INFO [train.py:988] (3/4) Epoch 25, batch 500, loss[loss=0.2288, simple_loss=0.2915, pruned_loss=0.0831, over 20278.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3086, pruned_loss=0.08175, over 3484602.33 frames. ], batch size: 239, lr: 1.27e-02, grad_scale: 32.0 2023-06-15 08:45:30,033 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.76 vs. limit=15.0 2023-06-15 08:45:37,491 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 08:45:42,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=88573.33333333333, ans=0.125 2023-06-15 08:45:58,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=88640.0, ans=0.125 2023-06-15 08:46:04,866 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 08:46:29,693 INFO [train.py:988] (3/4) Epoch 26, batch 0, loss[loss=0.2646, simple_loss=0.3395, pruned_loss=0.09484, over 17581.00 frames. ], tot_loss[loss=0.2646, simple_loss=0.3395, pruned_loss=0.09484, over 17581.00 frames. ], batch size: 67, lr: 1.24e-02, grad_scale: 32.0 2023-06-15 08:46:29,694 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 08:46:35,672 INFO [train.py:1020] (3/4) Epoch 26, validation: loss=0.2057, simple_loss=0.3076, pruned_loss=0.05187, over 143649.00 frames. 
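For reference, the optim.py "Clipping_scale=2.0, grad-norm quartiles ... threshold=..." entries above and below follow a consistent pattern: with Clipping_scale=2.0 the reported threshold is twice the logged median gradient norm (for example, 2.148e+02 * 2.0 = 4.296e+02 in the next entry), and percent-clipped appears to track the share of recent batches whose gradient norm exceeded that threshold. The following is a minimal sketch of that relationship only, not the actual optim.py implementation; the function name and the use of torch.quantile are illustrative assumptions.

import torch

def summarize_grad_norms(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
    # grad_norms: 1-D tensor of per-batch gradient norms from the recent window.
    # Report min/25%/median/75%/max, matching the five quartile values in the log.
    quartiles = [torch.quantile(grad_norms, q).item() for q in (0.0, 0.25, 0.5, 0.75, 1.0)]
    # Threshold consistent with the logged values: clipping_scale * median.
    threshold = clipping_scale * quartiles[2]
    # Percentage of recent batches whose gradient norm exceeded the threshold
    # (assumed interpretation of the logged percent-clipped value).
    percent_clipped = 100.0 * (grad_norms > threshold).float().mean().item()
    return quartiles, threshold, percent_clipped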
2023-06-15 08:46:35,673 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 08:46:42,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=88720.0, ans=0.125 2023-06-15 08:46:43,671 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.988e+02 2.148e+02 2.357e+02 3.601e+02, threshold=4.296e+02, percent-clipped=0.0 2023-06-15 08:46:51,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=88786.66666666667, ans=0.0 2023-06-15 08:46:54,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=88786.66666666667, ans=0.0 2023-06-15 08:46:57,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=88786.66666666667, ans=0.125 2023-06-15 08:47:05,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=88786.66666666667, ans=0.0 2023-06-15 08:47:10,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=88853.33333333333, ans=0.1 2023-06-15 08:47:21,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=88853.33333333333, ans=0.2 2023-06-15 08:47:25,003 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.38 vs. limit=15.0 2023-06-15 08:47:45,018 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 08:47:48,749 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 08:47:51,998 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.18 vs. limit=5.0 2023-06-15 08:48:02,848 INFO [train.py:988] (3/4) Epoch 26, batch 50, loss[loss=0.2266, simple_loss=0.2939, pruned_loss=0.07963, over 20559.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3084, pruned_loss=0.08005, over 874916.86 frames. ], batch size: 189, lr: 1.24e-02, grad_scale: 32.0 2023-06-15 08:48:08,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=89053.33333333333, ans=0.125 2023-06-15 08:48:17,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=89053.33333333333, ans=0.0 2023-06-15 08:48:20,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=89120.0, ans=0.0 2023-06-15 08:48:39,838 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.83 vs. 
limit=12.0 2023-06-15 08:49:05,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=89253.33333333333, ans=0.125 2023-06-15 08:49:25,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=89320.0, ans=0.125 2023-06-15 08:49:33,000 INFO [train.py:988] (3/4) Epoch 26, batch 100, loss[loss=0.2245, simple_loss=0.3004, pruned_loss=0.07425, over 18635.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3071, pruned_loss=0.08067, over 1525631.07 frames. ], batch size: 80, lr: 1.24e-02, grad_scale: 32.0 2023-06-15 08:49:33,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=89386.66666666667, ans=0.0 2023-06-15 08:49:41,547 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.958e+02 2.207e+02 2.501e+02 3.726e+02, threshold=4.413e+02, percent-clipped=0.0 2023-06-15 08:50:08,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=89520.0, ans=0.0 2023-06-15 08:50:12,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=89520.0, ans=0.125 2023-06-15 08:50:19,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=89520.0, ans=0.125 2023-06-15 08:50:38,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=89586.66666666667, ans=0.125 2023-06-15 08:50:47,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=89653.33333333333, ans=0.125 2023-06-15 08:51:01,450 INFO [train.py:988] (3/4) Epoch 26, batch 150, loss[loss=0.2187, simple_loss=0.2997, pruned_loss=0.06878, over 19674.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.308, pruned_loss=0.08006, over 2014411.98 frames. ], batch size: 110, lr: 1.24e-02, grad_scale: 32.0 2023-06-15 08:51:22,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=89786.66666666667, ans=0.2 2023-06-15 08:51:54,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=89920.0, ans=0.1 2023-06-15 08:52:00,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=89920.0, ans=0.125 2023-06-15 08:52:29,976 INFO [train.py:988] (3/4) Epoch 26, batch 200, loss[loss=0.2714, simple_loss=0.3474, pruned_loss=0.09764, over 18281.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3075, pruned_loss=0.0792, over 2413355.39 frames. ], batch size: 74, lr: 1.23e-02, grad_scale: 32.0 2023-06-15 08:52:38,500 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.69 vs. 
limit=15.0 2023-06-15 08:52:39,010 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.858e+02 1.962e+02 2.261e+02 3.889e+02, threshold=3.924e+02, percent-clipped=0.0 2023-06-15 08:53:23,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=90253.33333333333, ans=0.2 2023-06-15 08:53:32,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=90253.33333333333, ans=0.125 2023-06-15 08:53:38,764 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.36 vs. limit=15.0 2023-06-15 08:53:53,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=90320.0, ans=0.0 2023-06-15 08:53:58,449 INFO [train.py:988] (3/4) Epoch 26, batch 250, loss[loss=0.2157, simple_loss=0.2981, pruned_loss=0.0666, over 18792.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3071, pruned_loss=0.07872, over 2721524.34 frames. ], batch size: 83, lr: 1.23e-02, grad_scale: 32.0 2023-06-15 08:54:04,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=90386.66666666667, ans=0.125 2023-06-15 08:54:54,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=90586.66666666667, ans=0.0 2023-06-15 08:55:04,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=90586.66666666667, ans=0.04949747468305833 2023-06-15 08:55:22,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=90653.33333333333, ans=0.0 2023-06-15 08:55:26,657 INFO [train.py:988] (3/4) Epoch 26, batch 300, loss[loss=0.234, simple_loss=0.3104, pruned_loss=0.07881, over 19462.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3076, pruned_loss=0.07992, over 2956830.51 frames. ], batch size: 105, lr: 1.23e-02, grad_scale: 32.0 2023-06-15 08:55:36,598 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.907e+02 2.229e+02 2.657e+02 5.301e+02, threshold=4.457e+02, percent-clipped=1.0 2023-06-15 08:55:38,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=90720.0, ans=0.0 2023-06-15 08:56:33,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=90920.0, ans=0.015 2023-06-15 08:56:37,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=90986.66666666667, ans=0.04949747468305833 2023-06-15 08:56:42,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=90986.66666666667, ans=0.1 2023-06-15 08:56:55,559 INFO [train.py:988] (3/4) Epoch 26, batch 350, loss[loss=0.2417, simple_loss=0.3104, pruned_loss=0.08652, over 20314.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3072, pruned_loss=0.07966, over 3145955.17 frames. 
], batch size: 149, lr: 1.23e-02, grad_scale: 32.0 2023-06-15 08:58:16,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=91320.0, ans=0.0 2023-06-15 08:58:23,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=91386.66666666667, ans=0.125 2023-06-15 08:58:24,518 INFO [train.py:988] (3/4) Epoch 26, batch 400, loss[loss=0.25, simple_loss=0.2825, pruned_loss=0.1088, over 16900.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.307, pruned_loss=0.0794, over 3279521.27 frames. ], batch size: 391, lr: 1.23e-02, grad_scale: 32.0 2023-06-15 08:58:33,462 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.572e+02 1.996e+02 2.310e+02 2.636e+02 4.226e+02, threshold=4.620e+02, percent-clipped=0.0 2023-06-15 08:58:49,235 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=12.0 2023-06-15 08:59:15,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=91520.0, ans=0.0 2023-06-15 08:59:15,748 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.64 vs. limit=15.0 2023-06-15 08:59:18,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=91586.66666666667, ans=0.0 2023-06-15 08:59:40,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=91653.33333333333, ans=0.2 2023-06-15 08:59:53,738 INFO [train.py:988] (3/4) Epoch 26, batch 450, loss[loss=0.2211, simple_loss=0.2894, pruned_loss=0.07643, over 19954.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3072, pruned_loss=0.07938, over 3402711.22 frames. ], batch size: 126, lr: 1.23e-02, grad_scale: 32.0 2023-06-15 09:00:00,547 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.57 vs. limit=12.0 2023-06-15 09:00:49,264 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.90 vs. limit=15.0 2023-06-15 09:01:11,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=91986.66666666667, ans=0.0 2023-06-15 09:01:20,839 INFO [train.py:988] (3/4) Epoch 26, batch 500, loss[loss=0.2394, simple_loss=0.3085, pruned_loss=0.08512, over 20561.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3065, pruned_loss=0.07933, over 3486465.40 frames. ], batch size: 173, lr: 1.22e-02, grad_scale: 32.0 2023-06-15 09:01:28,966 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 1.890e+02 2.061e+02 2.503e+02 4.030e+02, threshold=4.121e+02, percent-clipped=0.0 2023-06-15 09:01:31,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=92053.33333333333, ans=0.125 2023-06-15 09:01:42,429 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.83 vs. 
limit=15.0 2023-06-15 09:01:53,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=92186.66666666667, ans=0.025 2023-06-15 09:01:55,578 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.90 vs. limit=15.0 2023-06-15 09:02:43,439 INFO [train.py:988] (3/4) Epoch 27, batch 0, loss[loss=0.2403, simple_loss=0.3154, pruned_loss=0.08263, over 18769.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3154, pruned_loss=0.08263, over 18769.00 frames. ], batch size: 83, lr: 1.20e-02, grad_scale: 32.0 2023-06-15 09:02:43,440 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 09:02:52,297 INFO [train.py:1020] (3/4) Epoch 27, validation: loss=0.2009, simple_loss=0.305, pruned_loss=0.04841, over 143649.00 frames. 2023-06-15 09:02:52,298 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 09:02:58,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=92273.33333333333, ans=0.125 2023-06-15 09:03:43,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=92406.66666666667, ans=0.125 2023-06-15 09:03:50,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=92473.33333333333, ans=0.0 2023-06-15 09:03:50,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=92473.33333333333, ans=0.125 2023-06-15 09:04:21,893 INFO [train.py:988] (3/4) Epoch 27, batch 50, loss[loss=0.2283, simple_loss=0.3045, pruned_loss=0.07604, over 18924.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3079, pruned_loss=0.0769, over 851272.41 frames. ], batch size: 86, lr: 1.20e-02, grad_scale: 32.0 2023-06-15 09:04:32,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=92606.66666666667, ans=0.125 2023-06-15 09:04:35,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=92606.66666666667, ans=0.015 2023-06-15 09:04:59,690 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.519e+02 1.842e+02 2.115e+02 2.294e+02 3.126e+02, threshold=4.230e+02, percent-clipped=0.0 2023-06-15 09:05:09,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=92740.0, ans=0.2 2023-06-15 09:05:19,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=92806.66666666667, ans=0.5 2023-06-15 09:05:47,908 INFO [train.py:988] (3/4) Epoch 27, batch 100, loss[loss=0.2304, simple_loss=0.3132, pruned_loss=0.07385, over 18920.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3072, pruned_loss=0.07666, over 1471600.63 frames. 
], batch size: 86, lr: 1.20e-02, grad_scale: 32.0 2023-06-15 09:05:57,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=92940.0, ans=0.125 2023-06-15 09:06:21,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=93073.33333333333, ans=0.0 2023-06-15 09:06:37,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=93073.33333333333, ans=0.125 2023-06-15 09:06:41,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=93140.0, ans=0.125 2023-06-15 09:06:52,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=93140.0, ans=0.1 2023-06-15 09:07:14,879 INFO [train.py:988] (3/4) Epoch 27, batch 150, loss[loss=0.2242, simple_loss=0.3062, pruned_loss=0.07114, over 18626.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3063, pruned_loss=0.07718, over 1975768.54 frames. ], batch size: 80, lr: 1.19e-02, grad_scale: 32.0 2023-06-15 09:07:30,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=93273.33333333333, ans=0.125 2023-06-15 09:07:54,430 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.908e+02 2.202e+02 2.518e+02 3.722e+02, threshold=4.404e+02, percent-clipped=0.0 2023-06-15 09:08:06,791 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.11 vs. limit=15.0 2023-06-15 09:08:09,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=93473.33333333333, ans=0.1 2023-06-15 09:08:18,991 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.22 vs. limit=15.0 2023-06-15 09:08:21,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=93473.33333333333, ans=0.125 2023-06-15 09:08:27,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=93540.0, ans=0.2 2023-06-15 09:08:32,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=93540.0, ans=0.125 2023-06-15 09:08:43,697 INFO [train.py:988] (3/4) Epoch 27, batch 200, loss[loss=0.2402, simple_loss=0.327, pruned_loss=0.07666, over 16811.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3058, pruned_loss=0.07781, over 2382683.54 frames. ], batch size: 59, lr: 1.19e-02, grad_scale: 32.0 2023-06-15 09:09:12,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=93673.33333333333, ans=0.125 2023-06-15 09:09:16,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=93673.33333333333, ans=0.0 2023-06-15 09:09:32,348 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.22 vs. 
limit=15.0 2023-06-15 09:09:41,142 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.94 vs. limit=22.5 2023-06-15 09:09:47,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=93806.66666666667, ans=0.1 2023-06-15 09:10:08,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=93873.33333333333, ans=0.125 2023-06-15 09:10:11,429 INFO [train.py:988] (3/4) Epoch 27, batch 250, loss[loss=0.2117, simple_loss=0.2941, pruned_loss=0.06465, over 19454.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3053, pruned_loss=0.07765, over 2684295.54 frames. ], batch size: 105, lr: 1.19e-02, grad_scale: 32.0 2023-06-15 09:10:39,012 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=5.71 vs. limit=12.0 2023-06-15 09:10:41,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=94006.66666666667, ans=0.125 2023-06-15 09:10:42,277 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.22 vs. limit=12.0 2023-06-15 09:10:50,261 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.782e+02 1.944e+02 2.302e+02 3.570e+02, threshold=3.888e+02, percent-clipped=0.0 2023-06-15 09:10:51,361 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.31 vs. limit=15.0 2023-06-15 09:11:01,249 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.60 vs. limit=10.0 2023-06-15 09:11:05,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=94140.0, ans=0.125 2023-06-15 09:11:07,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=94140.0, ans=0.1 2023-06-15 09:11:11,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=94140.0, ans=0.125 2023-06-15 09:11:24,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=94206.66666666667, ans=0.125 2023-06-15 09:11:29,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=94206.66666666667, ans=0.95 2023-06-15 09:11:39,396 INFO [train.py:988] (3/4) Epoch 27, batch 300, loss[loss=0.2318, simple_loss=0.3115, pruned_loss=0.07608, over 18311.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.306, pruned_loss=0.07807, over 2924942.31 frames. 
], batch size: 74, lr: 1.19e-02, grad_scale: 32.0 2023-06-15 09:12:50,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=94540.0, ans=0.035 2023-06-15 09:12:52,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=94540.0, ans=0.05 2023-06-15 09:12:53,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=94540.0, ans=0.1 2023-06-15 09:13:06,369 INFO [train.py:988] (3/4) Epoch 27, batch 350, loss[loss=0.2113, simple_loss=0.2883, pruned_loss=0.0672, over 19440.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3055, pruned_loss=0.07776, over 3113761.35 frames. ], batch size: 105, lr: 1.19e-02, grad_scale: 16.0 2023-06-15 09:13:11,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=94606.66666666667, ans=0.0 2023-06-15 09:13:38,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=94673.33333333333, ans=0.0 2023-06-15 09:13:45,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=94740.0, ans=0.0 2023-06-15 09:13:45,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=94740.0, ans=0.0 2023-06-15 09:13:46,984 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.867e+02 2.054e+02 2.333e+02 3.496e+02, threshold=4.108e+02, percent-clipped=0.0 2023-06-15 09:13:55,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=94740.0, ans=0.1 2023-06-15 09:14:03,527 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.41 vs. limit=6.0 2023-06-15 09:14:05,122 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.12 vs. limit=15.0 2023-06-15 09:14:15,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=94873.33333333333, ans=0.05 2023-06-15 09:14:33,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=94940.0, ans=0.1 2023-06-15 09:14:35,036 INFO [train.py:988] (3/4) Epoch 27, batch 400, loss[loss=0.2212, simple_loss=0.3039, pruned_loss=0.06921, over 18650.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3064, pruned_loss=0.07782, over 3264526.97 frames. ], batch size: 80, lr: 1.19e-02, grad_scale: 32.0 2023-06-15 09:14:39,241 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.53 vs. limit=15.0 2023-06-15 09:14:52,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=95006.66666666667, ans=0.0 2023-06-15 09:15:06,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=95006.66666666667, ans=0.125 2023-06-15 09:15:41,743 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.04 vs. 
limit=15.0 2023-06-15 09:15:43,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=95206.66666666667, ans=0.2 2023-06-15 09:16:02,642 INFO [train.py:988] (3/4) Epoch 27, batch 450, loss[loss=0.2096, simple_loss=0.2911, pruned_loss=0.06401, over 18472.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3055, pruned_loss=0.07756, over 3397867.94 frames. ], batch size: 77, lr: 1.18e-02, grad_scale: 16.0 2023-06-15 09:16:11,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=95273.33333333333, ans=0.1 2023-06-15 09:16:13,790 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.33 vs. limit=10.0 2023-06-15 09:16:16,585 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.87 vs. limit=15.0 2023-06-15 09:16:21,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=95340.0, ans=0.125 2023-06-15 09:16:32,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=95340.0, ans=10.0 2023-06-15 09:16:40,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=95406.66666666667, ans=0.0 2023-06-15 09:16:43,708 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 1.922e+02 2.176e+02 2.834e+02 5.039e+02, threshold=4.352e+02, percent-clipped=1.0 2023-06-15 09:17:06,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=95473.33333333333, ans=0.5 2023-06-15 09:17:18,629 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.44 vs. limit=15.0 2023-06-15 09:17:27,058 INFO [train.py:988] (3/4) Epoch 27, batch 500, loss[loss=0.2346, simple_loss=0.3031, pruned_loss=0.08304, over 20459.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3063, pruned_loss=0.07756, over 3472011.14 frames. ], batch size: 160, lr: 1.18e-02, grad_scale: 16.0 2023-06-15 09:17:27,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=95606.66666666667, ans=0.0 2023-06-15 09:17:37,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=95606.66666666667, ans=0.125 2023-06-15 09:17:57,899 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.08 vs. limit=22.5 2023-06-15 09:18:07,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=95740.0, ans=0.125 2023-06-15 09:18:14,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=95740.0, ans=0.125 2023-06-15 09:18:18,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=95806.66666666667, ans=0.0 2023-06-15 09:18:47,746 INFO [train.py:988] (3/4) Epoch 28, batch 0, loss[loss=0.2194, simple_loss=0.2928, pruned_loss=0.07304, over 20228.00 frames. 
], tot_loss[loss=0.2194, simple_loss=0.2928, pruned_loss=0.07304, over 20228.00 frames. ], batch size: 141, lr: 1.16e-02, grad_scale: 32.0 2023-06-15 09:18:47,747 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 09:18:53,820 INFO [train.py:1020] (3/4) Epoch 28, validation: loss=0.203, simple_loss=0.307, pruned_loss=0.0495, over 143649.00 frames. 2023-06-15 09:18:53,822 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 09:18:54,820 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.08 vs. limit=15.0 2023-06-15 09:19:27,906 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.76 vs. limit=15.0 2023-06-15 09:19:35,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=95960.0, ans=0.125 2023-06-15 09:19:48,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=96026.66666666667, ans=0.0 2023-06-15 09:20:05,432 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.841e+02 2.127e+02 2.560e+02 4.411e+02, threshold=4.254e+02, percent-clipped=1.0 2023-06-15 09:20:07,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=96093.33333333333, ans=0.0 2023-06-15 09:20:08,968 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.03 vs. limit=10.0 2023-06-15 09:20:21,886 INFO [train.py:988] (3/4) Epoch 28, batch 50, loss[loss=0.2469, simple_loss=0.3305, pruned_loss=0.08169, over 16347.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3054, pruned_loss=0.0771, over 840101.88 frames. ], batch size: 52, lr: 1.16e-02, grad_scale: 32.0 2023-06-15 09:20:28,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=96160.0, ans=0.0 2023-06-15 09:20:32,800 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 09:20:45,567 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.24 vs. limit=22.5 2023-06-15 09:21:03,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=96293.33333333333, ans=0.0 2023-06-15 09:21:49,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=96493.33333333333, ans=0.125 2023-06-15 09:21:51,039 INFO [train.py:988] (3/4) Epoch 28, batch 100, loss[loss=0.2364, simple_loss=0.2952, pruned_loss=0.08884, over 19831.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3042, pruned_loss=0.07814, over 1486092.63 frames. 
], batch size: 293, lr: 1.16e-02, grad_scale: 32.0 2023-06-15 09:21:51,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=96493.33333333333, ans=0.1 2023-06-15 09:21:51,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=96493.33333333333, ans=0.1 2023-06-15 09:22:04,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=96493.33333333333, ans=0.2 2023-06-15 09:22:08,192 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 09:22:59,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=96760.0, ans=0.125 2023-06-15 09:23:02,542 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.826e+02 2.059e+02 2.321e+02 3.339e+02, threshold=4.117e+02, percent-clipped=0.0 2023-06-15 09:23:18,519 INFO [train.py:988] (3/4) Epoch 28, batch 150, loss[loss=0.2153, simple_loss=0.2927, pruned_loss=0.06893, over 19522.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3043, pruned_loss=0.077, over 1993267.66 frames. ], batch size: 102, lr: 1.16e-02, grad_scale: 32.0 2023-06-15 09:23:57,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=96960.0, ans=15.0 2023-06-15 09:24:44,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=97160.0, ans=0.2 2023-06-15 09:24:46,178 INFO [train.py:988] (3/4) Epoch 28, batch 200, loss[loss=0.2449, simple_loss=0.3043, pruned_loss=0.09275, over 20212.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3047, pruned_loss=0.07719, over 2379485.83 frames. ], batch size: 239, lr: 1.15e-02, grad_scale: 32.0 2023-06-15 09:24:48,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=97160.0, ans=0.125 2023-06-15 09:25:20,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=97293.33333333333, ans=0.1 2023-06-15 09:25:43,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=97360.0, ans=0.1 2023-06-15 09:25:56,380 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.620e+02 1.821e+02 1.977e+02 2.337e+02 4.646e+02, threshold=3.954e+02, percent-clipped=1.0 2023-06-15 09:26:11,651 INFO [train.py:988] (3/4) Epoch 28, batch 250, loss[loss=0.2482, simple_loss=0.3312, pruned_loss=0.08261, over 17607.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.306, pruned_loss=0.07679, over 2675870.02 frames. 
], batch size: 67, lr: 1.15e-02, grad_scale: 32.0 2023-06-15 09:26:29,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=97560.0, ans=0.125 2023-06-15 09:26:31,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=97560.0, ans=0.125 2023-06-15 09:26:36,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=97560.0, ans=0.125 2023-06-15 09:26:52,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=97626.66666666667, ans=0.0 2023-06-15 09:27:09,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=97693.33333333333, ans=0.1 2023-06-15 09:27:12,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=97693.33333333333, ans=0.0 2023-06-15 09:27:32,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=97760.0, ans=0.125 2023-06-15 09:27:41,184 INFO [train.py:988] (3/4) Epoch 28, batch 300, loss[loss=0.2169, simple_loss=0.2998, pruned_loss=0.06703, over 19515.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3046, pruned_loss=0.07706, over 2926355.69 frames. ], batch size: 102, lr: 1.15e-02, grad_scale: 32.0 2023-06-15 09:28:06,963 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=15.0 2023-06-15 09:28:17,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=97960.0, ans=0.125 2023-06-15 09:28:52,630 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.613e+02 1.947e+02 2.273e+02 2.757e+02 4.812e+02, threshold=4.546e+02, percent-clipped=3.0 2023-06-15 09:29:07,336 INFO [train.py:988] (3/4) Epoch 28, batch 350, loss[loss=0.2103, simple_loss=0.2893, pruned_loss=0.06564, over 19491.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3038, pruned_loss=0.07644, over 3123765.26 frames. ], batch size: 105, lr: 1.15e-02, grad_scale: 32.0 2023-06-15 09:29:11,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=98160.0, ans=0.2 2023-06-15 09:29:15,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=98160.0, ans=0.0 2023-06-15 09:29:37,192 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.24 vs. limit=22.5 2023-06-15 09:30:20,180 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=22.5 2023-06-15 09:30:34,473 INFO [train.py:988] (3/4) Epoch 28, batch 400, loss[loss=0.2382, simple_loss=0.3088, pruned_loss=0.08379, over 20298.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3038, pruned_loss=0.07671, over 3274360.66 frames. 
], batch size: 149, lr: 1.15e-02, grad_scale: 32.0 2023-06-15 09:30:44,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=98493.33333333333, ans=0.125 2023-06-15 09:30:56,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=98560.0, ans=0.125 2023-06-15 09:30:58,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=98560.0, ans=0.0 2023-06-15 09:31:07,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=98560.0, ans=0.125 2023-06-15 09:31:33,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=98693.33333333333, ans=0.0 2023-06-15 09:31:47,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=98760.0, ans=0.2 2023-06-15 09:31:48,682 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.911e+02 2.084e+02 2.355e+02 3.960e+02, threshold=4.169e+02, percent-clipped=0.0 2023-06-15 09:31:49,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=98760.0, ans=0.0 2023-06-15 09:31:49,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=98760.0, ans=0.1 2023-06-15 09:31:49,716 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.95 vs. limit=15.0 2023-06-15 09:32:01,902 INFO [train.py:988] (3/4) Epoch 28, batch 450, loss[loss=0.2179, simple_loss=0.2874, pruned_loss=0.0742, over 20529.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3035, pruned_loss=0.07705, over 3401717.19 frames. ], batch size: 189, lr: 1.15e-02, grad_scale: 32.0 2023-06-15 09:32:20,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=98893.33333333333, ans=0.1 2023-06-15 09:32:25,084 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.55 vs. 
limit=12.0 2023-06-15 09:32:29,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=98893.33333333333, ans=0.125 2023-06-15 09:32:52,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=99026.66666666667, ans=0.125 2023-06-15 09:32:54,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=99026.66666666667, ans=0.125 2023-06-15 09:33:09,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=99093.33333333333, ans=0.0 2023-06-15 09:33:11,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=99093.33333333333, ans=0.125 2023-06-15 09:33:16,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=99093.33333333333, ans=0.0 2023-06-15 09:33:24,394 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.21 vs. limit=15.0 2023-06-15 09:33:28,927 INFO [train.py:988] (3/4) Epoch 28, batch 500, loss[loss=0.2452, simple_loss=0.3083, pruned_loss=0.09104, over 20739.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3039, pruned_loss=0.07726, over 3482374.61 frames. ], batch size: 211, lr: 1.15e-02, grad_scale: 32.0 2023-06-15 09:33:42,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff2.min_abs, batch_count=99160.0, ans=0.1 2023-06-15 09:33:45,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=99226.66666666667, ans=0.2 2023-06-15 09:34:46,546 INFO [train.py:988] (3/4) Epoch 29, batch 0, loss[loss=0.2268, simple_loss=0.3083, pruned_loss=0.07262, over 19091.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3083, pruned_loss=0.07262, over 19091.00 frames. ], batch size: 94, lr: 1.12e-02, grad_scale: 32.0 2023-06-15 09:34:46,547 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 09:34:52,705 INFO [train.py:1020] (3/4) Epoch 29, validation: loss=0.2012, simple_loss=0.3049, pruned_loss=0.04872, over 143649.00 frames. 2023-06-15 09:34:52,706 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 09:35:07,435 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.808e+02 2.025e+02 2.226e+02 3.535e+02, threshold=4.050e+02, percent-clipped=0.0 2023-06-15 09:35:12,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=99446.66666666667, ans=0.1 2023-06-15 09:36:13,115 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 09:36:20,604 INFO [train.py:988] (3/4) Epoch 29, batch 50, loss[loss=0.204, simple_loss=0.2883, pruned_loss=0.05983, over 18916.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3021, pruned_loss=0.07543, over 851518.16 frames. 
], batch size: 86, lr: 1.12e-02, grad_scale: 32.0 2023-06-15 09:36:27,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=99713.33333333333, ans=0.125 2023-06-15 09:36:52,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=99780.0, ans=10.0 2023-06-15 09:37:00,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=99846.66666666667, ans=0.125 2023-06-15 09:37:06,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=99846.66666666667, ans=0.125 2023-06-15 09:37:08,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=99846.66666666667, ans=0.125 2023-06-15 09:37:08,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=99846.66666666667, ans=0.125 2023-06-15 09:37:15,714 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.42 vs. limit=15.0 2023-06-15 09:37:38,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=99980.0, ans=0.0 2023-06-15 09:37:46,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=100046.66666666667, ans=0.1 2023-06-15 09:37:48,024 INFO [train.py:988] (3/4) Epoch 29, batch 100, loss[loss=0.2341, simple_loss=0.299, pruned_loss=0.08465, over 20571.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3008, pruned_loss=0.07497, over 1505208.35 frames. ], batch size: 189, lr: 1.12e-02, grad_scale: 32.0 2023-06-15 09:37:56,329 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.78 vs. limit=22.5 2023-06-15 09:38:03,418 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.850e+02 2.039e+02 2.478e+02 3.886e+02, threshold=4.079e+02, percent-clipped=0.0 2023-06-15 09:38:25,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=100180.0, ans=0.125 2023-06-15 09:38:50,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=100246.66666666667, ans=0.1 2023-06-15 09:39:15,400 INFO [train.py:988] (3/4) Epoch 29, batch 150, loss[loss=0.2319, simple_loss=0.3073, pruned_loss=0.07819, over 20270.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3022, pruned_loss=0.07442, over 2012451.88 frames. ], batch size: 141, lr: 1.12e-02, grad_scale: 32.0 2023-06-15 09:40:23,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=100646.66666666667, ans=0.125 2023-06-15 09:40:29,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=100646.66666666667, ans=0.125 2023-06-15 09:40:40,938 INFO [train.py:988] (3/4) Epoch 29, batch 200, loss[loss=0.2432, simple_loss=0.306, pruned_loss=0.09015, over 20574.00 frames. 
], tot_loss[loss=0.2259, simple_loss=0.3036, pruned_loss=0.07409, over 2403694.10 frames. ], batch size: 173, lr: 1.12e-02, grad_scale: 32.0 2023-06-15 09:40:57,018 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.792e+02 2.001e+02 2.365e+02 3.519e+02, threshold=4.002e+02, percent-clipped=0.0 2023-06-15 09:41:22,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=100846.66666666667, ans=0.125 2023-06-15 09:41:24,604 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.37 vs. limit=15.0 2023-06-15 09:41:30,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=100846.66666666667, ans=0.125 2023-06-15 09:42:08,462 INFO [train.py:988] (3/4) Epoch 29, batch 250, loss[loss=0.2283, simple_loss=0.3089, pruned_loss=0.07383, over 18287.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3019, pruned_loss=0.07421, over 2717749.58 frames. ], batch size: 74, lr: 1.12e-02, grad_scale: 32.0 2023-06-15 09:42:17,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=101046.66666666667, ans=0.125 2023-06-15 09:42:26,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=101113.33333333333, ans=0.125 2023-06-15 09:42:43,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=101180.0, ans=0.125 2023-06-15 09:42:43,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=101180.0, ans=0.125 2023-06-15 09:42:48,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=101180.0, ans=0.2 2023-06-15 09:42:55,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=101180.0, ans=0.125 2023-06-15 09:43:35,041 INFO [train.py:988] (3/4) Epoch 29, batch 300, loss[loss=0.2119, simple_loss=0.295, pruned_loss=0.06435, over 19053.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3019, pruned_loss=0.07418, over 2968447.33 frames. ], batch size: 89, lr: 1.12e-02, grad_scale: 32.0 2023-06-15 09:43:46,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=101380.0, ans=0.125 2023-06-15 09:43:50,906 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.806e+02 2.034e+02 2.286e+02 3.270e+02, threshold=4.068e+02, percent-clipped=0.0 2023-06-15 09:44:00,243 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-15 09:44:08,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=101513.33333333333, ans=0.1 2023-06-15 09:44:13,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=101513.33333333333, ans=0.125 2023-06-15 09:45:02,601 INFO [train.py:988] (3/4) Epoch 29, batch 350, loss[loss=0.2122, simple_loss=0.2844, pruned_loss=0.07004, over 19471.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3018, pruned_loss=0.07406, over 3157234.36 frames. 
], batch size: 105, lr: 1.11e-02, grad_scale: 16.0 2023-06-15 09:45:13,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=101713.33333333333, ans=0.1 2023-06-15 09:45:18,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=101780.0, ans=0.1 2023-06-15 09:45:28,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=101780.0, ans=0.125 2023-06-15 09:45:29,144 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.31 vs. limit=22.5 2023-06-15 09:45:29,358 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.99 vs. limit=15.0 2023-06-15 09:45:42,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=101846.66666666667, ans=0.0 2023-06-15 09:45:46,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=101846.66666666667, ans=0.05 2023-06-15 09:45:54,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=101913.33333333333, ans=0.125 2023-06-15 09:46:24,821 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.74 vs. limit=15.0 2023-06-15 09:46:29,150 INFO [train.py:988] (3/4) Epoch 29, batch 400, loss[loss=0.2314, simple_loss=0.3124, pruned_loss=0.07516, over 19054.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3014, pruned_loss=0.07487, over 3297776.34 frames. ], batch size: 89, lr: 1.11e-02, grad_scale: 32.0 2023-06-15 09:46:34,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=102046.66666666667, ans=0.2 2023-06-15 09:46:45,197 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 09:46:45,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=102113.33333333333, ans=0.05 2023-06-15 09:46:46,573 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 1.969e+02 2.274e+02 2.602e+02 3.577e+02, threshold=4.548e+02, percent-clipped=0.0 2023-06-15 09:46:56,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=102113.33333333333, ans=0.125 2023-06-15 09:46:57,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=102113.33333333333, ans=0.1 2023-06-15 09:47:03,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=102180.0, ans=0.0 2023-06-15 09:47:09,619 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.54 vs. 
limit=15.0 2023-06-15 09:47:10,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=102180.0, ans=0.07 2023-06-15 09:47:40,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=102313.33333333333, ans=0.125 2023-06-15 09:47:55,546 INFO [train.py:988] (3/4) Epoch 29, batch 450, loss[loss=0.2471, simple_loss=0.3417, pruned_loss=0.07623, over 18349.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3014, pruned_loss=0.07462, over 3415829.30 frames. ], batch size: 72, lr: 1.11e-02, grad_scale: 32.0 2023-06-15 09:48:18,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=102446.66666666667, ans=0.2 2023-06-15 09:48:36,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=102513.33333333333, ans=0.0 2023-06-15 09:48:41,636 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.89 vs. limit=15.0 2023-06-15 09:48:48,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=102580.0, ans=0.125 2023-06-15 09:48:57,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=102580.0, ans=0.125 2023-06-15 09:49:19,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=102713.33333333333, ans=0.125 2023-06-15 09:49:20,989 INFO [train.py:988] (3/4) Epoch 29, batch 500, loss[loss=0.2253, simple_loss=0.2948, pruned_loss=0.07791, over 20559.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3024, pruned_loss=0.07493, over 3501011.36 frames. ], batch size: 189, lr: 1.11e-02, grad_scale: 32.0 2023-06-15 09:49:37,640 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.900e+02 2.167e+02 2.548e+02 3.918e+02, threshold=4.335e+02, percent-clipped=0.0 2023-06-15 09:49:50,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=102780.0, ans=0.125 2023-06-15 09:49:50,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=102780.0, ans=0.07 2023-06-15 09:49:54,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=102846.66666666667, ans=0.0 2023-06-15 09:49:57,825 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.66 vs. limit=22.5 2023-06-15 09:49:58,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=102846.66666666667, ans=0.0 2023-06-15 09:50:00,047 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=102846.66666666667, ans=0.0 2023-06-15 09:50:35,337 INFO [train.py:988] (3/4) Epoch 30, batch 0, loss[loss=0.2174, simple_loss=0.3022, pruned_loss=0.0663, over 19083.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.3022, pruned_loss=0.0663, over 19083.00 frames. 
], batch size: 89, lr: 1.09e-02, grad_scale: 32.0 2023-06-15 09:50:35,338 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 09:50:41,562 INFO [train.py:1020] (3/4) Epoch 30, validation: loss=0.2006, simple_loss=0.3036, pruned_loss=0.04881, over 143649.00 frames. 2023-06-15 09:50:41,563 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 09:50:51,064 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.75 vs. limit=15.0 2023-06-15 09:50:51,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=102926.66666666667, ans=0.125 2023-06-15 09:51:12,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=102993.33333333333, ans=0.0 2023-06-15 09:51:41,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=103126.66666666667, ans=0.0 2023-06-15 09:52:01,714 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 09:52:07,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=103260.0, ans=0.1 2023-06-15 09:52:08,101 INFO [train.py:988] (3/4) Epoch 30, batch 50, loss[loss=0.2099, simple_loss=0.2851, pruned_loss=0.0674, over 20469.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3014, pruned_loss=0.07429, over 862615.34 frames. ], batch size: 160, lr: 1.09e-02, grad_scale: 32.0 2023-06-15 09:52:46,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=103393.33333333333, ans=0.125 2023-06-15 09:52:49,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=103393.33333333333, ans=0.2 2023-06-15 09:52:55,885 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.895e+02 2.155e+02 2.482e+02 4.117e+02, threshold=4.310e+02, percent-clipped=0.0 2023-06-15 09:52:56,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=103393.33333333333, ans=0.125 2023-06-15 09:53:07,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=103460.0, ans=0.07 2023-06-15 09:53:34,758 INFO [train.py:988] (3/4) Epoch 30, batch 100, loss[loss=0.2168, simple_loss=0.2948, pruned_loss=0.06934, over 19455.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3019, pruned_loss=0.07426, over 1500962.69 frames. ], batch size: 105, lr: 1.09e-02, grad_scale: 32.0 2023-06-15 09:53:39,307 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.07 vs. limit=6.0 2023-06-15 09:53:41,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=103593.33333333333, ans=0.2 2023-06-15 09:53:54,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=103660.0, ans=0.125 2023-06-15 09:54:03,624 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.19 vs. 
limit=6.0 2023-06-15 09:54:27,948 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.28 vs. limit=22.5 2023-06-15 09:54:48,208 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 09:54:55,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=103860.0, ans=0.07 2023-06-15 09:55:01,994 INFO [train.py:988] (3/4) Epoch 30, batch 150, loss[loss=0.2154, simple_loss=0.3014, pruned_loss=0.06473, over 19306.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3021, pruned_loss=0.07507, over 2007848.77 frames. ], batch size: 98, lr: 1.09e-02, grad_scale: 32.0 2023-06-15 09:55:27,600 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.55 vs. limit=22.5 2023-06-15 09:55:40,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=104060.0, ans=0.125 2023-06-15 09:55:51,194 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.525e+02 1.792e+02 2.087e+02 2.377e+02 3.180e+02, threshold=4.174e+02, percent-clipped=0.0 2023-06-15 09:55:54,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=104126.66666666667, ans=0.125 2023-06-15 09:56:07,604 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.94 vs. limit=10.0 2023-06-15 09:56:26,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=104193.33333333333, ans=0.0 2023-06-15 09:56:29,876 INFO [train.py:988] (3/4) Epoch 30, batch 200, loss[loss=0.2232, simple_loss=0.2954, pruned_loss=0.07546, over 19852.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.302, pruned_loss=0.07456, over 2414428.28 frames. ], batch size: 120, lr: 1.08e-02, grad_scale: 32.0 2023-06-15 09:56:30,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=104260.0, ans=0.07 2023-06-15 09:56:40,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=104260.0, ans=0.0 2023-06-15 09:56:42,438 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=2.91 vs. limit=12.0 2023-06-15 09:57:17,691 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.71 vs. limit=15.0 2023-06-15 09:57:56,080 INFO [train.py:988] (3/4) Epoch 30, batch 250, loss[loss=0.2179, simple_loss=0.2958, pruned_loss=0.06996, over 20316.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3024, pruned_loss=0.07462, over 2716960.95 frames. 
], batch size: 141, lr: 1.08e-02, grad_scale: 32.0 2023-06-15 09:57:56,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=104593.33333333333, ans=0.125 2023-06-15 09:58:33,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=104726.66666666667, ans=0.2 2023-06-15 09:58:45,945 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.806e+02 1.918e+02 2.139e+02 3.492e+02, threshold=3.836e+02, percent-clipped=0.0 2023-06-15 09:59:11,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=104860.0, ans=0.025 2023-06-15 09:59:17,923 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.15 vs. limit=15.0 2023-06-15 09:59:23,423 INFO [train.py:988] (3/4) Epoch 30, batch 300, loss[loss=0.2114, simple_loss=0.2932, pruned_loss=0.06482, over 19695.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3017, pruned_loss=0.0741, over 2958817.70 frames. ], batch size: 110, lr: 1.08e-02, grad_scale: 32.0 2023-06-15 09:59:31,638 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.85 vs. limit=10.0 2023-06-15 09:59:36,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=104926.66666666667, ans=0.1 2023-06-15 09:59:39,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=104993.33333333333, ans=0.2 2023-06-15 09:59:41,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=104993.33333333333, ans=0.125 2023-06-15 09:59:44,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=104993.33333333333, ans=0.1 2023-06-15 10:00:08,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=105060.0, ans=0.125 2023-06-15 10:00:32,482 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.12 vs. limit=10.0 2023-06-15 10:00:36,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=105193.33333333333, ans=0.0 2023-06-15 10:00:46,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=105193.33333333333, ans=0.0 2023-06-15 10:00:49,825 INFO [train.py:988] (3/4) Epoch 30, batch 350, loss[loss=0.2466, simple_loss=0.3334, pruned_loss=0.07993, over 16353.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3005, pruned_loss=0.07412, over 3145842.62 frames. 
], batch size: 52, lr: 1.08e-02, grad_scale: 32.0 2023-06-15 10:01:38,550 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.770e+02 1.941e+02 2.233e+02 2.810e+02, threshold=3.882e+02, percent-clipped=0.0 2023-06-15 10:01:42,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=105460.0, ans=0.125 2023-06-15 10:01:49,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=105460.0, ans=0.0 2023-06-15 10:01:52,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=105460.0, ans=0.125 2023-06-15 10:02:02,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=105526.66666666667, ans=0.125 2023-06-15 10:02:14,716 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.06 vs. limit=15.0 2023-06-15 10:02:15,103 INFO [train.py:988] (3/4) Epoch 30, batch 400, loss[loss=0.1932, simple_loss=0.279, pruned_loss=0.05366, over 18445.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3006, pruned_loss=0.07419, over 3274127.74 frames. ], batch size: 77, lr: 1.08e-02, grad_scale: 32.0 2023-06-15 10:02:17,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=105593.33333333333, ans=0.0 2023-06-15 10:02:36,064 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-06-15 10:02:55,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=105726.66666666667, ans=0.125 2023-06-15 10:03:18,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=105793.33333333333, ans=0.2 2023-06-15 10:03:22,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=105860.0, ans=0.1 2023-06-15 10:03:40,284 INFO [train.py:988] (3/4) Epoch 30, batch 450, loss[loss=0.2186, simple_loss=0.304, pruned_loss=0.06659, over 18924.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3006, pruned_loss=0.07408, over 3385694.92 frames. 
], batch size: 86, lr: 1.08e-02, grad_scale: 32.0 2023-06-15 10:04:04,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=105993.33333333333, ans=6.0 2023-06-15 10:04:08,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=105993.33333333333, ans=0.0 2023-06-15 10:04:10,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=105993.33333333333, ans=0.125 2023-06-15 10:04:13,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=106060.0, ans=0.1 2023-06-15 10:04:26,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=106060.0, ans=0.0 2023-06-15 10:04:28,184 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.828e+02 2.027e+02 2.323e+02 3.183e+02, threshold=4.054e+02, percent-clipped=0.0 2023-06-15 10:04:51,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=106193.33333333333, ans=0.1 2023-06-15 10:05:04,414 INFO [train.py:988] (3/4) Epoch 30, batch 500, loss[loss=0.2241, simple_loss=0.3047, pruned_loss=0.07176, over 18936.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3, pruned_loss=0.07345, over 3470506.50 frames. ], batch size: 86, lr: 1.08e-02, grad_scale: 32.0 2023-06-15 10:05:17,442 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.51 vs. limit=15.0 2023-06-15 10:06:22,329 INFO [train.py:988] (3/4) Epoch 31, batch 0, loss[loss=0.2304, simple_loss=0.3054, pruned_loss=0.07776, over 19075.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3054, pruned_loss=0.07776, over 19075.00 frames. ], batch size: 89, lr: 1.06e-02, grad_scale: 32.0 2023-06-15 10:06:22,330 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 10:06:28,528 INFO [train.py:1020] (3/4) Epoch 31, validation: loss=0.2014, simple_loss=0.3032, pruned_loss=0.0498, over 143649.00 frames. 2023-06-15 10:06:28,529 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 10:07:22,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=106680.0, ans=0.125 2023-06-15 10:07:25,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=106680.0, ans=0.125 2023-06-15 10:07:37,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=106746.66666666667, ans=0.0 2023-06-15 10:07:48,110 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.719e+02 1.932e+02 2.142e+02 3.238e+02, threshold=3.865e+02, percent-clipped=0.0 2023-06-15 10:07:53,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=106746.66666666667, ans=0.1 2023-06-15 10:07:56,261 INFO [train.py:988] (3/4) Epoch 31, batch 50, loss[loss=0.2051, simple_loss=0.2855, pruned_loss=0.06232, over 10639.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2988, pruned_loss=0.0756, over 839795.09 frames. 
], batch size: 30, lr: 1.06e-02, grad_scale: 32.0 2023-06-15 10:08:10,993 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=5.26 vs. limit=15.0 2023-06-15 10:08:33,778 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.50 vs. limit=15.0 2023-06-15 10:08:50,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=107013.33333333333, ans=0.125 2023-06-15 10:08:50,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=107013.33333333333, ans=0.125 2023-06-15 10:09:17,600 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.32 vs. limit=5.0 2023-06-15 10:09:22,806 INFO [train.py:988] (3/4) Epoch 31, batch 100, loss[loss=0.2211, simple_loss=0.2877, pruned_loss=0.07722, over 20119.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2986, pruned_loss=0.07389, over 1513577.49 frames. ], batch size: 239, lr: 1.05e-02, grad_scale: 32.0 2023-06-15 10:09:52,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=107213.33333333333, ans=0.0 2023-06-15 10:09:55,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=107280.0, ans=0.1 2023-06-15 10:10:06,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=107280.0, ans=0.1 2023-06-15 10:10:33,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=107413.33333333333, ans=0.1 2023-06-15 10:10:40,187 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.809e+02 2.076e+02 2.464e+02 3.768e+02, threshold=4.152e+02, percent-clipped=0.0 2023-06-15 10:10:40,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=107413.33333333333, ans=0.0 2023-06-15 10:10:49,500 INFO [train.py:988] (3/4) Epoch 31, batch 150, loss[loss=0.216, simple_loss=0.285, pruned_loss=0.0735, over 20549.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2976, pruned_loss=0.07346, over 2011609.11 frames. ], batch size: 173, lr: 1.05e-02, grad_scale: 32.0 2023-06-15 10:10:51,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=107480.0, ans=0.0 2023-06-15 10:11:58,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=107746.66666666667, ans=0.2 2023-06-15 10:12:15,593 INFO [train.py:988] (3/4) Epoch 31, batch 200, loss[loss=0.2081, simple_loss=0.2837, pruned_loss=0.06624, over 19988.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2993, pruned_loss=0.07362, over 2385155.96 frames. 
], batch size: 126, lr: 1.05e-02, grad_scale: 32.0 2023-06-15 10:12:22,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=107813.33333333333, ans=0.125 2023-06-15 10:12:40,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=107880.0, ans=0.1 2023-06-15 10:13:03,694 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.51 vs. limit=22.5 2023-06-15 10:13:10,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=108013.33333333333, ans=0.125 2023-06-15 10:13:29,126 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.04 vs. limit=15.0 2023-06-15 10:13:33,051 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.862e+02 2.218e+02 2.544e+02 4.456e+02, threshold=4.436e+02, percent-clipped=1.0 2023-06-15 10:13:41,908 INFO [train.py:988] (3/4) Epoch 31, batch 250, loss[loss=0.2201, simple_loss=0.2937, pruned_loss=0.07324, over 20314.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2998, pruned_loss=0.07306, over 2682700.82 frames. ], batch size: 149, lr: 1.05e-02, grad_scale: 32.0 2023-06-15 10:13:54,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=108146.66666666667, ans=0.125 2023-06-15 10:13:58,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=108213.33333333333, ans=0.0 2023-06-15 10:14:03,888 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.86 vs. limit=15.0 2023-06-15 10:14:10,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=108213.33333333333, ans=0.2 2023-06-15 10:14:10,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=108213.33333333333, ans=22.5 2023-06-15 10:14:34,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=108346.66666666667, ans=0.125 2023-06-15 10:15:09,064 INFO [train.py:988] (3/4) Epoch 31, batch 300, loss[loss=0.225, simple_loss=0.3112, pruned_loss=0.06939, over 18310.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2989, pruned_loss=0.07344, over 2933779.45 frames. ], batch size: 72, lr: 1.05e-02, grad_scale: 32.0 2023-06-15 10:15:13,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=108480.0, ans=0.1 2023-06-15 10:15:37,904 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.42 vs. limit=10.0 2023-06-15 10:15:39,600 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.35 vs. limit=22.5 2023-06-15 10:15:40,960 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.51 vs. 
limit=6.0 2023-06-15 10:16:04,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=108680.0, ans=0.1 2023-06-15 10:16:11,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=108680.0, ans=0.1 2023-06-15 10:16:20,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=108746.66666666667, ans=0.125 2023-06-15 10:16:26,651 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.814e+02 2.012e+02 2.294e+02 3.766e+02, threshold=4.023e+02, percent-clipped=0.0 2023-06-15 10:16:34,828 INFO [train.py:988] (3/4) Epoch 31, batch 350, loss[loss=0.2128, simple_loss=0.2973, pruned_loss=0.06415, over 19463.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2991, pruned_loss=0.07317, over 3125213.10 frames. ], batch size: 105, lr: 1.05e-02, grad_scale: 32.0 2023-06-15 10:16:42,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=108813.33333333333, ans=0.125 2023-06-15 10:16:54,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=108880.0, ans=0.125 2023-06-15 10:17:16,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=108946.66666666667, ans=0.125 2023-06-15 10:17:41,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=109013.33333333333, ans=0.07 2023-06-15 10:17:43,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=109080.0, ans=0.125 2023-06-15 10:18:00,833 INFO [train.py:988] (3/4) Epoch 31, batch 400, loss[loss=0.2183, simple_loss=0.2921, pruned_loss=0.07221, over 20110.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2993, pruned_loss=0.07263, over 3270579.59 frames. ], batch size: 133, lr: 1.05e-02, grad_scale: 32.0 2023-06-15 10:19:12,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=109413.33333333333, ans=0.05 2023-06-15 10:19:19,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=109413.33333333333, ans=0.2 2023-06-15 10:19:20,537 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.873e+02 2.079e+02 2.420e+02 3.208e+02, threshold=4.158e+02, percent-clipped=0.0 2023-06-15 10:19:21,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=109413.33333333333, ans=0.95 2023-06-15 10:19:27,273 INFO [train.py:988] (3/4) Epoch 31, batch 450, loss[loss=0.2007, simple_loss=0.2849, pruned_loss=0.05826, over 19471.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2998, pruned_loss=0.07258, over 3378458.27 frames. ], batch size: 105, lr: 1.05e-02, grad_scale: 32.0 2023-06-15 10:19:39,390 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 10:20:01,660 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.49 vs. 
limit=6.0 2023-06-15 10:20:29,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=109680.0, ans=0.125 2023-06-15 10:20:44,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=109746.66666666667, ans=10.0 2023-06-15 10:20:51,579 INFO [train.py:988] (3/4) Epoch 31, batch 500, loss[loss=0.2328, simple_loss=0.2991, pruned_loss=0.08323, over 20762.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.299, pruned_loss=0.07284, over 3475362.59 frames. ], batch size: 211, lr: 1.04e-02, grad_scale: 32.0 2023-06-15 10:20:53,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=109813.33333333333, ans=0.125 2023-06-15 10:21:00,520 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.79 vs. limit=15.0 2023-06-15 10:21:01,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=109813.33333333333, ans=0.0 2023-06-15 10:21:01,809 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=15.0 2023-06-15 10:21:23,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=109946.66666666667, ans=0.0 2023-06-15 10:22:07,562 INFO [train.py:988] (3/4) Epoch 32, batch 0, loss[loss=0.2513, simple_loss=0.327, pruned_loss=0.0878, over 18276.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.327, pruned_loss=0.0878, over 18276.00 frames. ], batch size: 74, lr: 1.03e-02, grad_scale: 32.0 2023-06-15 10:22:07,563 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 10:22:13,531 INFO [train.py:1020] (3/4) Epoch 32, validation: loss=0.1996, simple_loss=0.3022, pruned_loss=0.04853, over 143649.00 frames. 
2023-06-15 10:22:13,532 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 10:22:15,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=110026.66666666667, ans=0.1 2023-06-15 10:22:17,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=110026.66666666667, ans=0.125 2023-06-15 10:22:32,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=110093.33333333333, ans=0.1 2023-06-15 10:22:34,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=110093.33333333333, ans=0.125 2023-06-15 10:22:37,962 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.793e+02 2.026e+02 2.443e+02 4.216e+02, threshold=4.052e+02, percent-clipped=1.0 2023-06-15 10:22:57,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=110160.0, ans=0.0 2023-06-15 10:23:00,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=110160.0, ans=0.125 2023-06-15 10:23:22,619 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-15 10:23:32,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=110293.33333333333, ans=0.125 2023-06-15 10:23:35,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=110293.33333333333, ans=0.0 2023-06-15 10:23:40,691 INFO [train.py:988] (3/4) Epoch 32, batch 50, loss[loss=0.2194, simple_loss=0.2884, pruned_loss=0.07521, over 20554.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.299, pruned_loss=0.07265, over 852679.80 frames. ], batch size: 189, lr: 1.03e-02, grad_scale: 32.0 2023-06-15 10:23:51,449 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.72 vs. limit=12.0 2023-06-15 10:23:58,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.max_abs, batch_count=110426.66666666667, ans=10.0 2023-06-15 10:24:10,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=110426.66666666667, ans=0.1 2023-06-15 10:24:29,821 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.32 vs. limit=22.5 2023-06-15 10:24:30,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff2.min_abs, batch_count=110560.0, ans=0.1 2023-06-15 10:24:32,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=110560.0, ans=0.0 2023-06-15 10:24:39,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=110560.0, ans=0.125 2023-06-15 10:25:08,076 INFO [train.py:988] (3/4) Epoch 32, batch 100, loss[loss=0.2171, simple_loss=0.3, pruned_loss=0.06712, over 19798.00 frames. 
], tot_loss[loss=0.2213, simple_loss=0.2994, pruned_loss=0.0716, over 1510752.03 frames. ], batch size: 115, lr: 1.02e-02, grad_scale: 32.0 2023-06-15 10:25:31,504 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.473e+02 1.727e+02 1.862e+02 2.037e+02 3.271e+02, threshold=3.724e+02, percent-clipped=0.0 2023-06-15 10:25:51,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=110826.66666666667, ans=0.125 2023-06-15 10:26:27,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=110960.0, ans=0.09899494936611666 2023-06-15 10:26:34,706 INFO [train.py:988] (3/4) Epoch 32, batch 150, loss[loss=0.229, simple_loss=0.27, pruned_loss=0.09401, over 16884.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2992, pruned_loss=0.07229, over 2021022.34 frames. ], batch size: 392, lr: 1.02e-02, grad_scale: 32.0 2023-06-15 10:26:38,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=111026.66666666667, ans=0.0 2023-06-15 10:26:39,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=111026.66666666667, ans=0.125 2023-06-15 10:27:08,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=111160.0, ans=0.1 2023-06-15 10:27:34,137 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.77 vs. limit=12.0 2023-06-15 10:27:42,502 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.26 vs. limit=22.5 2023-06-15 10:27:59,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=111360.0, ans=0.0 2023-06-15 10:28:00,895 INFO [train.py:988] (3/4) Epoch 32, batch 200, loss[loss=0.2339, simple_loss=0.3275, pruned_loss=0.07015, over 18307.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.299, pruned_loss=0.07217, over 2433674.23 frames. ], batch size: 72, lr: 1.02e-02, grad_scale: 32.0 2023-06-15 10:28:04,040 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.93 vs. 
limit=15.0 2023-06-15 10:28:11,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=111360.0, ans=0.0 2023-06-15 10:28:24,823 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.849e+02 2.097e+02 2.469e+02 3.862e+02, threshold=4.194e+02, percent-clipped=1.0 2023-06-15 10:28:33,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=111493.33333333333, ans=0.1 2023-06-15 10:28:34,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=111493.33333333333, ans=0.0 2023-06-15 10:28:39,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=111493.33333333333, ans=0.1 2023-06-15 10:29:00,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=111560.0, ans=0.125 2023-06-15 10:29:08,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=111626.66666666667, ans=0.2 2023-06-15 10:29:09,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=111626.66666666667, ans=0.125 2023-06-15 10:29:09,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=111626.66666666667, ans=0.0 2023-06-15 10:29:26,812 INFO [train.py:988] (3/4) Epoch 32, batch 250, loss[loss=0.2133, simple_loss=0.2908, pruned_loss=0.06789, over 19525.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2972, pruned_loss=0.07178, over 2746528.36 frames. ], batch size: 102, lr: 1.02e-02, grad_scale: 32.0 2023-06-15 10:29:45,001 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.38 vs. limit=22.5 2023-06-15 10:30:53,805 INFO [train.py:988] (3/4) Epoch 32, batch 300, loss[loss=0.2458, simple_loss=0.3352, pruned_loss=0.07827, over 16876.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2971, pruned_loss=0.0715, over 2982127.23 frames. ], batch size: 59, lr: 1.02e-02, grad_scale: 32.0 2023-06-15 10:30:54,701 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.74 vs. limit=15.0 2023-06-15 10:30:55,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=112026.66666666667, ans=0.1 2023-06-15 10:31:03,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=112026.66666666667, ans=0.125 2023-06-15 10:31:13,375 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 10:31:18,116 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.817e+02 2.017e+02 2.252e+02 3.365e+02, threshold=4.033e+02, percent-clipped=0.0 2023-06-15 10:32:16,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=112293.33333333333, ans=0.1 2023-06-15 10:32:20,598 INFO [train.py:988] (3/4) Epoch 32, batch 350, loss[loss=0.2297, simple_loss=0.3035, pruned_loss=0.07791, over 20103.00 frames. 
], tot_loss[loss=0.2209, simple_loss=0.2974, pruned_loss=0.07223, over 3174209.14 frames. ], batch size: 133, lr: 1.02e-02, grad_scale: 32.0 2023-06-15 10:32:54,226 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 10:33:11,582 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 10:33:43,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=112693.33333333333, ans=0.125 2023-06-15 10:33:45,058 INFO [train.py:988] (3/4) Epoch 32, batch 400, loss[loss=0.2132, simple_loss=0.2885, pruned_loss=0.06898, over 19664.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2979, pruned_loss=0.07182, over 3313400.76 frames. ], batch size: 110, lr: 1.02e-02, grad_scale: 32.0 2023-06-15 10:33:53,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=112693.33333333333, ans=0.07 2023-06-15 10:34:01,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=112760.0, ans=0.0 2023-06-15 10:34:09,525 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.921e+02 2.168e+02 2.475e+02 4.297e+02, threshold=4.337e+02, percent-clipped=1.0 2023-06-15 10:34:27,515 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.56 vs. limit=15.0 2023-06-15 10:34:35,419 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 10:35:04,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=112960.0, ans=0.0 2023-06-15 10:35:11,177 INFO [train.py:988] (3/4) Epoch 32, batch 450, loss[loss=0.2248, simple_loss=0.3026, pruned_loss=0.07352, over 19498.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2977, pruned_loss=0.07188, over 3423898.96 frames. ], batch size: 105, lr: 1.02e-02, grad_scale: 32.0 2023-06-15 10:35:25,025 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.22 vs. limit=15.0 2023-06-15 10:36:08,610 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.97 vs. limit=15.0 2023-06-15 10:36:10,103 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-15 10:36:16,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=113226.66666666667, ans=0.125 2023-06-15 10:36:24,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=113293.33333333333, ans=0.125 2023-06-15 10:36:36,130 INFO [train.py:988] (3/4) Epoch 32, batch 500, loss[loss=0.2216, simple_loss=0.2601, pruned_loss=0.09153, over 16768.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2974, pruned_loss=0.07169, over 3501753.46 frames. 
], batch size: 391, lr: 1.01e-02, grad_scale: 32.0 2023-06-15 10:36:47,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=113360.0, ans=0.0 2023-06-15 10:36:59,354 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.786e+02 2.074e+02 2.314e+02 3.487e+02, threshold=4.148e+02, percent-clipped=0.0 2023-06-15 10:37:01,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=113426.66666666667, ans=0.125 2023-06-15 10:37:45,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=113573.33333333333, ans=0.125 2023-06-15 10:37:52,786 INFO [train.py:988] (3/4) Epoch 33, batch 0, loss[loss=0.2039, simple_loss=0.2772, pruned_loss=0.06534, over 20670.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2772, pruned_loss=0.06534, over 20670.00 frames. ], batch size: 189, lr: 9.98e-03, grad_scale: 32.0 2023-06-15 10:37:52,787 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 10:37:58,957 INFO [train.py:1020] (3/4) Epoch 33, validation: loss=0.2021, simple_loss=0.3035, pruned_loss=0.05038, over 143649.00 frames. 2023-06-15 10:37:58,957 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 10:38:05,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=113573.33333333333, ans=0.125 2023-06-15 10:38:29,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=113640.0, ans=0.125 2023-06-15 10:38:33,595 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.89 vs. limit=10.0 2023-06-15 10:38:41,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=113706.66666666667, ans=0.1 2023-06-15 10:39:02,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=113773.33333333333, ans=0.0 2023-06-15 10:39:13,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=113840.0, ans=0.0 2023-06-15 10:39:26,787 INFO [train.py:988] (3/4) Epoch 33, batch 50, loss[loss=0.2154, simple_loss=0.276, pruned_loss=0.0774, over 20125.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2957, pruned_loss=0.07315, over 851505.72 frames. 
], batch size: 294, lr: 9.96e-03, grad_scale: 32.0 2023-06-15 10:39:27,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=113906.66666666667, ans=0.2 2023-06-15 10:39:28,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=113906.66666666667, ans=0.125 2023-06-15 10:39:30,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=113906.66666666667, ans=0.125 2023-06-15 10:39:39,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=113906.66666666667, ans=0.0 2023-06-15 10:39:56,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=113973.33333333333, ans=0.0 2023-06-15 10:39:59,553 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.01 vs. limit=15.0 2023-06-15 10:40:13,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=114040.0, ans=0.1 2023-06-15 10:40:22,224 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.812e+02 2.020e+02 2.321e+02 4.264e+02, threshold=4.041e+02, percent-clipped=1.0 2023-06-15 10:40:53,377 INFO [train.py:988] (3/4) Epoch 33, batch 100, loss[loss=0.1908, simple_loss=0.2766, pruned_loss=0.05253, over 19206.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.297, pruned_loss=0.07204, over 1515245.71 frames. ], batch size: 92, lr: 9.95e-03, grad_scale: 32.0 2023-06-15 10:40:59,466 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.05 vs. limit=10.0 2023-06-15 10:41:05,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=114240.0, ans=0.0 2023-06-15 10:41:06,485 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.08 vs. limit=15.0 2023-06-15 10:41:14,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=114306.66666666667, ans=0.125 2023-06-15 10:41:19,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=114306.66666666667, ans=0.125 2023-06-15 10:41:21,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=114306.66666666667, ans=0.1 2023-06-15 10:41:21,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=114306.66666666667, ans=0.125 2023-06-15 10:42:00,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=114440.0, ans=22.5 2023-06-15 10:42:13,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=114506.66666666667, ans=0.125 2023-06-15 10:42:19,602 INFO [train.py:988] (3/4) Epoch 33, batch 150, loss[loss=0.2152, simple_loss=0.2816, pruned_loss=0.07444, over 20322.00 frames. 
], tot_loss[loss=0.219, simple_loss=0.2956, pruned_loss=0.07123, over 2034237.12 frames. ], batch size: 149, lr: 9.94e-03, grad_scale: 32.0 2023-06-15 10:42:44,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=114640.0, ans=0.0 2023-06-15 10:43:14,643 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.854e+02 2.036e+02 2.409e+02 3.930e+02, threshold=4.072e+02, percent-clipped=0.0 2023-06-15 10:43:31,328 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.84 vs. limit=15.0 2023-06-15 10:43:45,589 INFO [train.py:988] (3/4) Epoch 33, batch 200, loss[loss=0.2207, simple_loss=0.2974, pruned_loss=0.07203, over 19082.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2962, pruned_loss=0.0703, over 2427557.37 frames. ], batch size: 89, lr: 9.93e-03, grad_scale: 32.0 2023-06-15 10:43:59,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=114906.66666666667, ans=0.09899494936611666 2023-06-15 10:44:06,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=114973.33333333333, ans=0.125 2023-06-15 10:44:12,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=114973.33333333333, ans=0.0 2023-06-15 10:44:46,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=115106.66666666667, ans=0.0 2023-06-15 10:44:48,786 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.67 vs. limit=12.0 2023-06-15 10:45:03,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=115173.33333333333, ans=0.0 2023-06-15 10:45:11,584 INFO [train.py:988] (3/4) Epoch 33, batch 250, loss[loss=0.2228, simple_loss=0.3054, pruned_loss=0.07012, over 19060.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2962, pruned_loss=0.07042, over 2730667.43 frames. ], batch size: 89, lr: 9.92e-03, grad_scale: 32.0 2023-06-15 10:45:22,955 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.73 vs. 
limit=6.0 2023-06-15 10:45:42,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=115306.66666666667, ans=15.0 2023-06-15 10:45:48,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=115373.33333333333, ans=0.1 2023-06-15 10:45:49,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=115373.33333333333, ans=0.0 2023-06-15 10:46:06,694 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.758e+02 1.982e+02 2.404e+02 3.997e+02, threshold=3.964e+02, percent-clipped=0.0 2023-06-15 10:46:16,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=115440.0, ans=0.0 2023-06-15 10:46:27,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=115506.66666666667, ans=0.125 2023-06-15 10:46:28,356 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.24 vs. limit=6.0 2023-06-15 10:46:36,427 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.85 vs. limit=15.0 2023-06-15 10:46:37,244 INFO [train.py:988] (3/4) Epoch 33, batch 300, loss[loss=0.1979, simple_loss=0.2837, pruned_loss=0.05607, over 19087.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2967, pruned_loss=0.06977, over 2970678.14 frames. ], batch size: 94, lr: 9.90e-03, grad_scale: 32.0 2023-06-15 10:46:42,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=115573.33333333333, ans=0.125 2023-06-15 10:47:16,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=115706.66666666667, ans=0.125 2023-06-15 10:47:26,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=115773.33333333333, ans=0.1 2023-06-15 10:47:43,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=115840.0, ans=0.0 2023-06-15 10:48:02,187 INFO [train.py:988] (3/4) Epoch 33, batch 350, loss[loss=0.1942, simple_loss=0.2795, pruned_loss=0.05449, over 19343.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2965, pruned_loss=0.06961, over 3142371.95 frames. ], batch size: 98, lr: 9.89e-03, grad_scale: 32.0 2023-06-15 10:48:12,098 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.89 vs. 
limit=15.0 2023-06-15 10:48:22,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=115973.33333333333, ans=0.125 2023-06-15 10:48:57,391 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.859e+02 2.087e+02 2.471e+02 4.224e+02, threshold=4.174e+02, percent-clipped=1.0 2023-06-15 10:49:02,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=116106.66666666667, ans=0.1 2023-06-15 10:49:28,270 INFO [train.py:988] (3/4) Epoch 33, batch 400, loss[loss=0.2156, simple_loss=0.2991, pruned_loss=0.06601, over 18622.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2963, pruned_loss=0.06995, over 3290450.01 frames. ], batch size: 80, lr: 9.88e-03, grad_scale: 32.0 2023-06-15 10:49:39,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=116240.0, ans=0.0 2023-06-15 10:49:52,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=116306.66666666667, ans=0.125 2023-06-15 10:50:06,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=116373.33333333333, ans=0.125 2023-06-15 10:50:08,208 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.98 vs. limit=15.0 2023-06-15 10:50:08,775 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 10:50:10,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=116373.33333333333, ans=0.035 2023-06-15 10:50:16,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=116373.33333333333, ans=0.125 2023-06-15 10:50:36,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=116506.66666666667, ans=0.2 2023-06-15 10:50:54,729 INFO [train.py:988] (3/4) Epoch 33, batch 450, loss[loss=0.2108, simple_loss=0.3051, pruned_loss=0.05831, over 14960.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2964, pruned_loss=0.06987, over 3407369.50 frames. ], batch size: 42, lr: 9.87e-03, grad_scale: 32.0 2023-06-15 10:51:44,774 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.93 vs. limit=15.0 2023-06-15 10:51:50,041 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.786e+02 1.951e+02 2.072e+02 2.874e+02, threshold=3.901e+02, percent-clipped=0.0 2023-06-15 10:51:53,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=116773.33333333333, ans=0.0 2023-06-15 10:52:01,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=116840.0, ans=0.0 2023-06-15 10:52:17,728 INFO [train.py:988] (3/4) Epoch 33, batch 500, loss[loss=0.247, simple_loss=0.3421, pruned_loss=0.07597, over 15479.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2968, pruned_loss=0.07049, over 3479382.36 frames. 
], batch size: 44, lr: 9.86e-03, grad_scale: 32.0 2023-06-15 10:52:22,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=116906.66666666667, ans=0.2 2023-06-15 10:53:34,843 INFO [train.py:988] (3/4) Epoch 34, batch 0, loss[loss=0.202, simple_loss=0.2821, pruned_loss=0.06099, over 20353.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2821, pruned_loss=0.06099, over 20353.00 frames. ], batch size: 149, lr: 9.70e-03, grad_scale: 32.0 2023-06-15 10:53:34,844 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 10:53:41,137 INFO [train.py:1020] (3/4) Epoch 34, validation: loss=0.2011, simple_loss=0.3024, pruned_loss=0.04991, over 143649.00 frames. 2023-06-15 10:53:41,138 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 10:54:09,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=117193.33333333333, ans=0.04949747468305833 2023-06-15 10:54:13,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=117193.33333333333, ans=0.0 2023-06-15 10:54:40,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=117326.66666666667, ans=0.1 2023-06-15 10:55:10,949 INFO [train.py:988] (3/4) Epoch 34, batch 50, loss[loss=0.2125, simple_loss=0.2782, pruned_loss=0.07347, over 20190.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.294, pruned_loss=0.06838, over 855638.50 frames. ], batch size: 239, lr: 9.69e-03, grad_scale: 16.0 2023-06-15 10:55:12,583 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.849e+02 2.100e+02 2.383e+02 3.120e+02, threshold=4.200e+02, percent-clipped=0.0 2023-06-15 10:55:36,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=117526.66666666667, ans=0.125 2023-06-15 10:55:36,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=117526.66666666667, ans=0.0 2023-06-15 10:55:56,752 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=117593.33333333333, ans=0.2 2023-06-15 10:56:14,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=117660.0, ans=0.07 2023-06-15 10:56:17,430 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=12.0 2023-06-15 10:56:40,475 INFO [train.py:988] (3/4) Epoch 34, batch 100, loss[loss=0.2205, simple_loss=0.293, pruned_loss=0.074, over 20075.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2938, pruned_loss=0.0679, over 1514603.49 frames. ], batch size: 133, lr: 9.68e-03, grad_scale: 16.0 2023-06-15 10:57:04,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=117860.0, ans=0.125 2023-06-15 10:57:53,360 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.27 vs. limit=12.0 2023-06-15 10:57:56,561 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.93 vs. 
limit=15.0 2023-06-15 10:57:57,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=118060.0, ans=0.125 2023-06-15 10:57:59,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=118060.0, ans=0.1 2023-06-15 10:58:09,683 INFO [train.py:988] (3/4) Epoch 34, batch 150, loss[loss=0.2006, simple_loss=0.2851, pruned_loss=0.05809, over 19464.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2941, pruned_loss=0.06822, over 2033200.52 frames. ], batch size: 105, lr: 9.67e-03, grad_scale: 16.0 2023-06-15 10:58:11,354 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.765e+02 1.971e+02 2.282e+02 3.738e+02, threshold=3.942e+02, percent-clipped=0.0 2023-06-15 10:58:33,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=118193.33333333333, ans=0.125 2023-06-15 10:59:09,454 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.71 vs. limit=12.0 2023-06-15 10:59:16,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=118326.66666666667, ans=0.2 2023-06-15 10:59:30,484 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.17 vs. limit=15.0 2023-06-15 10:59:37,547 INFO [train.py:988] (3/4) Epoch 34, batch 200, loss[loss=0.207, simple_loss=0.2917, pruned_loss=0.06116, over 19843.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2945, pruned_loss=0.0688, over 2425220.18 frames. ], batch size: 115, lr: 9.65e-03, grad_scale: 16.0 2023-06-15 10:59:39,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=118460.0, ans=0.125 2023-06-15 11:00:41,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=118660.0, ans=0.125 2023-06-15 11:01:07,019 INFO [train.py:988] (3/4) Epoch 34, batch 250, loss[loss=0.2249, simple_loss=0.3003, pruned_loss=0.07475, over 19957.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2941, pruned_loss=0.06904, over 2738502.76 frames. ], batch size: 126, lr: 9.64e-03, grad_scale: 16.0 2023-06-15 11:01:09,083 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.911e+02 2.159e+02 2.494e+02 3.715e+02, threshold=4.319e+02, percent-clipped=0.0 2023-06-15 11:01:09,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=118793.33333333333, ans=0.125 2023-06-15 11:01:13,654 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.47 vs. 
limit=22.5 2023-06-15 11:01:25,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=118860.0, ans=0.125 2023-06-15 11:01:36,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=118860.0, ans=0.0 2023-06-15 11:01:36,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=118860.0, ans=0.1 2023-06-15 11:01:42,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=118926.66666666667, ans=0.125 2023-06-15 11:01:49,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=118926.66666666667, ans=0.2 2023-06-15 11:02:19,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=119060.0, ans=0.2 2023-06-15 11:02:34,407 INFO [train.py:988] (3/4) Epoch 34, batch 300, loss[loss=0.1919, simple_loss=0.2784, pruned_loss=0.05268, over 19213.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2951, pruned_loss=0.06893, over 2970599.08 frames. ], batch size: 92, lr: 9.63e-03, grad_scale: 16.0 2023-06-15 11:03:20,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=119260.0, ans=0.1 2023-06-15 11:03:35,873 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=15.0 2023-06-15 11:03:55,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=119393.33333333333, ans=0.2 2023-06-15 11:04:03,799 INFO [train.py:988] (3/4) Epoch 34, batch 350, loss[loss=0.2332, simple_loss=0.2915, pruned_loss=0.08745, over 19849.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2957, pruned_loss=0.06972, over 3149433.43 frames. ], batch size: 293, lr: 9.62e-03, grad_scale: 16.0 2023-06-15 11:04:05,489 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.557e+02 1.947e+02 2.157e+02 2.600e+02 3.674e+02, threshold=4.313e+02, percent-clipped=0.0 2023-06-15 11:04:05,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=119460.0, ans=0.04949747468305833 2023-06-15 11:04:10,525 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.95 vs. limit=22.5 2023-06-15 11:04:32,544 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.33 vs. 
limit=15.0 2023-06-15 11:04:44,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=119593.33333333333, ans=0.125 2023-06-15 11:04:56,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=119660.0, ans=0.0 2023-06-15 11:04:58,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=119660.0, ans=0.1 2023-06-15 11:05:07,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=119660.0, ans=0.0 2023-06-15 11:05:34,489 INFO [train.py:988] (3/4) Epoch 34, batch 400, loss[loss=0.1988, simple_loss=0.2829, pruned_loss=0.05737, over 19318.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2952, pruned_loss=0.07021, over 3287402.07 frames. ], batch size: 98, lr: 9.61e-03, grad_scale: 32.0 2023-06-15 11:05:56,810 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.24 vs. limit=15.0 2023-06-15 11:05:57,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=119860.0, ans=0.2 2023-06-15 11:05:59,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=119860.0, ans=0.125 2023-06-15 11:06:30,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=119993.33333333333, ans=0.125 2023-06-15 11:06:33,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=119993.33333333333, ans=0.125 2023-06-15 11:06:44,858 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 11:06:52,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=120060.0, ans=0.0 2023-06-15 11:07:03,746 INFO [train.py:988] (3/4) Epoch 34, batch 450, loss[loss=0.2113, simple_loss=0.2869, pruned_loss=0.0679, over 18776.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2951, pruned_loss=0.06938, over 3399979.64 frames. ], batch size: 83, lr: 9.60e-03, grad_scale: 32.0 2023-06-15 11:07:05,362 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.842e+02 2.161e+02 2.491e+02 3.686e+02, threshold=4.322e+02, percent-clipped=0.0 2023-06-15 11:07:12,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=120126.66666666667, ans=0.1 2023-06-15 11:07:47,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=120260.0, ans=0.1 2023-06-15 11:08:03,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=120326.66666666667, ans=10.0 2023-06-15 11:08:29,406 INFO [train.py:988] (3/4) Epoch 34, batch 500, loss[loss=0.218, simple_loss=0.3017, pruned_loss=0.06717, over 18280.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2949, pruned_loss=0.06968, over 3469298.97 frames. 
], batch size: 74, lr: 9.59e-03, grad_scale: 32.0 2023-06-15 11:08:33,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=120460.0, ans=0.0 2023-06-15 11:08:38,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=120460.0, ans=0.125 2023-06-15 11:08:56,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=120526.66666666667, ans=0.2 2023-06-15 11:09:43,663 INFO [train.py:988] (3/4) Epoch 35, batch 0, loss[loss=0.2486, simple_loss=0.2806, pruned_loss=0.1082, over 16811.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.2806, pruned_loss=0.1082, over 16811.00 frames. ], batch size: 391, lr: 9.44e-03, grad_scale: 32.0 2023-06-15 11:09:43,663 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 11:09:49,769 INFO [train.py:1020] (3/4) Epoch 35, validation: loss=0.2016, simple_loss=0.3016, pruned_loss=0.05077, over 143649.00 frames. 2023-06-15 11:09:49,770 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 11:10:21,593 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.525e+02 1.805e+02 2.012e+02 2.315e+02 3.975e+02, threshold=4.025e+02, percent-clipped=0.0 2023-06-15 11:10:31,054 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 11:10:32,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=120813.33333333333, ans=0.1 2023-06-15 11:10:43,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=120880.0, ans=0.1 2023-06-15 11:10:49,047 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=120880.0, ans=0.125 2023-06-15 11:10:56,530 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.83 vs. limit=15.0 2023-06-15 11:11:01,837 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.88 vs. limit=15.0 2023-06-15 11:11:19,002 INFO [train.py:988] (3/4) Epoch 35, batch 50, loss[loss=0.2134, simple_loss=0.2995, pruned_loss=0.0637, over 18280.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2926, pruned_loss=0.06701, over 850405.24 frames. ], batch size: 74, lr: 9.43e-03, grad_scale: 32.0 2023-06-15 11:11:43,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=121080.0, ans=0.125 2023-06-15 11:11:50,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=121080.0, ans=0.1 2023-06-15 11:11:52,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=121146.66666666667, ans=0.0 2023-06-15 11:12:47,382 INFO [train.py:988] (3/4) Epoch 35, batch 100, loss[loss=0.2394, simple_loss=0.2779, pruned_loss=0.1004, over 17077.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2925, pruned_loss=0.06889, over 1508706.27 frames. 
], batch size: 392, lr: 9.42e-03, grad_scale: 32.0 2023-06-15 11:13:05,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=121413.33333333333, ans=0.2 2023-06-15 11:13:18,552 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.621e+02 1.883e+02 2.096e+02 2.428e+02 4.337e+02, threshold=4.193e+02, percent-clipped=1.0 2023-06-15 11:13:27,618 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.75 vs. limit=22.5 2023-06-15 11:13:39,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=121546.66666666667, ans=0.125 2023-06-15 11:14:15,098 INFO [train.py:988] (3/4) Epoch 35, batch 150, loss[loss=0.211, simple_loss=0.2959, pruned_loss=0.06302, over 19483.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2938, pruned_loss=0.06807, over 2003504.47 frames. ], batch size: 105, lr: 9.41e-03, grad_scale: 32.0 2023-06-15 11:14:28,312 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.83 vs. limit=6.0 2023-06-15 11:14:43,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=121746.66666666667, ans=0.1 2023-06-15 11:14:58,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=121813.33333333333, ans=0.125 2023-06-15 11:14:58,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=121813.33333333333, ans=0.1 2023-06-15 11:15:03,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=121813.33333333333, ans=0.125 2023-06-15 11:15:15,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=121880.0, ans=0.1 2023-06-15 11:15:15,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=121880.0, ans=0.125 2023-06-15 11:15:15,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=121880.0, ans=0.125 2023-06-15 11:15:26,549 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.28 vs. limit=15.0 2023-06-15 11:15:43,556 INFO [train.py:988] (3/4) Epoch 35, batch 200, loss[loss=0.2145, simple_loss=0.3045, pruned_loss=0.06226, over 18625.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2934, pruned_loss=0.06818, over 2398119.55 frames. ], batch size: 80, lr: 9.40e-03, grad_scale: 32.0 2023-06-15 11:15:46,012 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.94 vs. limit=15.0 2023-06-15 11:16:14,787 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.845e+02 2.048e+02 2.405e+02 3.914e+02, threshold=4.095e+02, percent-clipped=0.0 2023-06-15 11:17:07,269 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.25 vs. 
limit=15.0 2023-06-15 11:17:09,618 INFO [train.py:988] (3/4) Epoch 35, batch 250, loss[loss=0.2015, simple_loss=0.2873, pruned_loss=0.05784, over 19817.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2936, pruned_loss=0.06801, over 2704384.92 frames. ], batch size: 115, lr: 9.38e-03, grad_scale: 32.0 2023-06-15 11:17:12,502 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.88 vs. limit=15.0 2023-06-15 11:17:22,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=122346.66666666667, ans=0.0 2023-06-15 11:17:31,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=122413.33333333333, ans=0.125 2023-06-15 11:17:43,107 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 11:18:00,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=122546.66666666667, ans=0.2 2023-06-15 11:18:36,389 INFO [train.py:988] (3/4) Epoch 35, batch 300, loss[loss=0.223, simple_loss=0.3083, pruned_loss=0.06884, over 18305.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2941, pruned_loss=0.06824, over 2940036.61 frames. ], batch size: 74, lr: 9.37e-03, grad_scale: 32.0 2023-06-15 11:19:06,916 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.736e+02 1.889e+02 2.139e+02 2.972e+02, threshold=3.778e+02, percent-clipped=0.0 2023-06-15 11:19:45,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=122946.66666666667, ans=0.2 2023-06-15 11:19:48,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=122946.66666666667, ans=0.125 2023-06-15 11:20:01,988 INFO [train.py:988] (3/4) Epoch 35, batch 350, loss[loss=0.2169, simple_loss=0.3076, pruned_loss=0.06309, over 17626.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2945, pruned_loss=0.06824, over 3133284.90 frames. ], batch size: 67, lr: 9.36e-03, grad_scale: 32.0 2023-06-15 11:20:59,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=123213.33333333333, ans=0.2 2023-06-15 11:21:13,333 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.69 vs. limit=6.0 2023-06-15 11:21:28,659 INFO [train.py:988] (3/4) Epoch 35, batch 400, loss[loss=0.2143, simple_loss=0.2904, pruned_loss=0.06908, over 20492.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2946, pruned_loss=0.06826, over 3277989.85 frames. ], batch size: 160, lr: 9.35e-03, grad_scale: 32.0 2023-06-15 11:21:36,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=123346.66666666667, ans=0.125 2023-06-15 11:21:46,799 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.74 vs. limit=15.0 2023-06-15 11:21:49,330 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.75 vs. 
limit=15.0 2023-06-15 11:21:50,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=123413.33333333333, ans=0.2 2023-06-15 11:21:56,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=123413.33333333333, ans=0.125 2023-06-15 11:21:58,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=123413.33333333333, ans=0.0 2023-06-15 11:21:59,502 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.852e+02 2.101e+02 2.531e+02 3.269e+02, threshold=4.203e+02, percent-clipped=0.0 2023-06-15 11:22:29,903 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.36 vs. limit=10.0 2023-06-15 11:22:54,577 INFO [train.py:988] (3/4) Epoch 35, batch 450, loss[loss=0.2335, simple_loss=0.3084, pruned_loss=0.07928, over 20099.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2938, pruned_loss=0.06835, over 3395569.27 frames. ], batch size: 133, lr: 9.34e-03, grad_scale: 32.0 2023-06-15 11:22:58,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=123680.0, ans=0.125 2023-06-15 11:22:58,967 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.24 vs. limit=22.5 2023-06-15 11:23:59,630 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.24 vs. limit=6.0 2023-06-15 11:24:15,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=123946.66666666667, ans=0.0 2023-06-15 11:24:15,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=123946.66666666667, ans=0.0 2023-06-15 11:24:18,940 INFO [train.py:988] (3/4) Epoch 35, batch 500, loss[loss=0.2159, simple_loss=0.2961, pruned_loss=0.0679, over 19528.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2938, pruned_loss=0.06739, over 3479309.35 frames. ], batch size: 102, lr: 9.33e-03, grad_scale: 32.0 2023-06-15 11:24:27,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=124013.33333333333, ans=0.95 2023-06-15 11:24:41,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=124080.0, ans=0.1 2023-06-15 11:24:47,760 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.833e+02 2.002e+02 2.181e+02 2.864e+02, threshold=4.004e+02, percent-clipped=0.0 2023-06-15 11:25:33,795 INFO [train.py:988] (3/4) Epoch 36, batch 0, loss[loss=0.2033, simple_loss=0.2756, pruned_loss=0.06551, over 20554.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2756, pruned_loss=0.06551, over 20554.00 frames. ], batch size: 189, lr: 9.19e-03, grad_scale: 32.0 2023-06-15 11:25:33,795 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 11:25:39,894 INFO [train.py:1020] (3/4) Epoch 36, validation: loss=0.2014, simple_loss=0.3017, pruned_loss=0.05055, over 143649.00 frames. 
2023-06-15 11:25:39,895 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 11:25:47,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=124226.66666666667, ans=0.0 2023-06-15 11:26:01,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=124293.33333333333, ans=0.07 2023-06-15 11:26:08,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=124293.33333333333, ans=0.125 2023-06-15 11:26:17,283 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.66 vs. limit=10.0 2023-06-15 11:26:27,404 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 11:26:40,775 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.88 vs. limit=15.0 2023-06-15 11:27:05,067 INFO [train.py:988] (3/4) Epoch 36, batch 50, loss[loss=0.2114, simple_loss=0.2984, pruned_loss=0.06218, over 18265.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2951, pruned_loss=0.06929, over 842308.60 frames. ], batch size: 74, lr: 9.18e-03, grad_scale: 32.0 2023-06-15 11:27:19,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=124560.0, ans=0.125 2023-06-15 11:27:29,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=124626.66666666667, ans=0.125 2023-06-15 11:28:07,142 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.578e+02 1.827e+02 2.009e+02 2.333e+02 3.474e+02, threshold=4.018e+02, percent-clipped=0.0 2023-06-15 11:28:15,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=124826.66666666667, ans=0.5 2023-06-15 11:28:23,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=124826.66666666667, ans=0.0 2023-06-15 11:28:31,573 INFO [train.py:988] (3/4) Epoch 36, batch 100, loss[loss=0.2217, simple_loss=0.3078, pruned_loss=0.06775, over 16487.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2935, pruned_loss=0.06845, over 1503959.73 frames. ], batch size: 52, lr: 9.17e-03, grad_scale: 32.0 2023-06-15 11:28:51,425 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.64 vs. limit=15.0 2023-06-15 11:29:02,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=124960.0, ans=0.0 2023-06-15 11:29:03,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=124960.0, ans=0.1 2023-06-15 11:29:09,315 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.66 vs. limit=15.0 2023-06-15 11:29:41,580 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.81 vs. 
limit=15.0 2023-06-15 11:29:46,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=125160.0, ans=0.125 2023-06-15 11:29:55,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=125160.0, ans=0.0 2023-06-15 11:29:58,777 INFO [train.py:988] (3/4) Epoch 36, batch 150, loss[loss=0.2024, simple_loss=0.2886, pruned_loss=0.0581, over 19349.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2935, pruned_loss=0.06838, over 1993276.74 frames. ], batch size: 98, lr: 9.16e-03, grad_scale: 16.0 2023-06-15 11:30:20,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=125293.33333333333, ans=0.0 2023-06-15 11:30:42,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=125360.0, ans=0.125 2023-06-15 11:30:42,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=125360.0, ans=0.0 2023-06-15 11:30:59,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=125426.66666666667, ans=0.1 2023-06-15 11:31:02,865 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.954e+02 2.200e+02 2.726e+02 5.615e+02, threshold=4.401e+02, percent-clipped=3.0 2023-06-15 11:31:25,657 INFO [train.py:988] (3/4) Epoch 36, batch 200, loss[loss=0.1987, simple_loss=0.2833, pruned_loss=0.05709, over 18907.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2928, pruned_loss=0.06818, over 2400300.50 frames. ], batch size: 86, lr: 9.15e-03, grad_scale: 16.0 2023-06-15 11:31:26,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=125560.0, ans=22.5 2023-06-15 11:31:39,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=125560.0, ans=0.125 2023-06-15 11:31:41,991 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.76 vs. limit=15.0 2023-06-15 11:31:42,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=125626.66666666667, ans=0.125 2023-06-15 11:32:36,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=125826.66666666667, ans=0.125 2023-06-15 11:32:52,603 INFO [train.py:988] (3/4) Epoch 36, batch 250, loss[loss=0.2126, simple_loss=0.2923, pruned_loss=0.06642, over 20446.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2935, pruned_loss=0.06808, over 2715563.37 frames. 
], batch size: 160, lr: 9.14e-03, grad_scale: 16.0 2023-06-15 11:32:59,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=125893.33333333333, ans=0.125 2023-06-15 11:33:26,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=126026.66666666667, ans=0.125 2023-06-15 11:33:27,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=126026.66666666667, ans=0.0 2023-06-15 11:33:56,224 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.785e+02 1.964e+02 2.188e+02 3.452e+02, threshold=3.927e+02, percent-clipped=0.0 2023-06-15 11:33:58,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=126093.33333333333, ans=0.125 2023-06-15 11:34:07,166 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 11:34:18,428 INFO [train.py:988] (3/4) Epoch 36, batch 300, loss[loss=0.2194, simple_loss=0.3101, pruned_loss=0.06431, over 16475.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2924, pruned_loss=0.06726, over 2944551.00 frames. ], batch size: 52, lr: 9.13e-03, grad_scale: 16.0 2023-06-15 11:34:49,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=126293.33333333333, ans=0.2 2023-06-15 11:35:20,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=126426.66666666667, ans=0.0 2023-06-15 11:35:22,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=126426.66666666667, ans=0.125 2023-06-15 11:35:45,211 INFO [train.py:988] (3/4) Epoch 36, batch 350, loss[loss=0.2147, simple_loss=0.2963, pruned_loss=0.06651, over 19814.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2922, pruned_loss=0.06728, over 3118709.29 frames. ], batch size: 115, lr: 9.12e-03, grad_scale: 16.0 2023-06-15 11:35:52,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=126560.0, ans=0.125 2023-06-15 11:36:09,730 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 11:36:31,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=126693.33333333333, ans=0.125 2023-06-15 11:36:50,193 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.822e+02 2.085e+02 2.267e+02 3.723e+02, threshold=4.169e+02, percent-clipped=0.0 2023-06-15 11:36:55,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=126826.66666666667, ans=0.125 2023-06-15 11:37:04,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=126826.66666666667, ans=0.0 2023-06-15 11:37:13,988 INFO [train.py:988] (3/4) Epoch 36, batch 400, loss[loss=0.2326, simple_loss=0.2706, pruned_loss=0.09731, over 17203.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2925, pruned_loss=0.06768, over 3258895.60 frames. 
], batch size: 391, lr: 9.11e-03, grad_scale: 32.0 2023-06-15 11:37:24,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=126893.33333333333, ans=0.0 2023-06-15 11:37:44,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=126960.0, ans=0.2 2023-06-15 11:38:07,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=127093.33333333333, ans=0.1 2023-06-15 11:38:12,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=127093.33333333333, ans=0.09899494936611666 2023-06-15 11:38:21,891 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.82 vs. limit=15.0 2023-06-15 11:38:40,670 INFO [train.py:988] (3/4) Epoch 36, batch 450, loss[loss=0.1995, simple_loss=0.2494, pruned_loss=0.07483, over 16707.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2923, pruned_loss=0.06737, over 3353364.51 frames. ], batch size: 391, lr: 9.10e-03, grad_scale: 32.0 2023-06-15 11:38:57,228 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.18 vs. limit=15.0 2023-06-15 11:39:21,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=127360.0, ans=0.95 2023-06-15 11:39:34,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=127426.66666666667, ans=0.0 2023-06-15 11:39:37,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=127426.66666666667, ans=0.125 2023-06-15 11:39:43,814 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.789e+02 1.972e+02 2.291e+02 3.379e+02, threshold=3.945e+02, percent-clipped=0.0 2023-06-15 11:40:05,543 INFO [train.py:988] (3/4) Epoch 36, batch 500, loss[loss=0.2095, simple_loss=0.2914, pruned_loss=0.06382, over 19536.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2922, pruned_loss=0.06744, over 3435669.06 frames. ], batch size: 102, lr: 9.09e-03, grad_scale: 32.0 2023-06-15 11:40:24,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=127626.66666666667, ans=0.1 2023-06-15 11:41:22,923 INFO [train.py:988] (3/4) Epoch 37, batch 0, loss[loss=0.2287, simple_loss=0.3215, pruned_loss=0.06798, over 18320.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3215, pruned_loss=0.06798, over 18320.00 frames. ], batch size: 72, lr: 8.96e-03, grad_scale: 32.0 2023-06-15 11:41:22,923 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 11:41:29,080 INFO [train.py:1020] (3/4) Epoch 37, validation: loss=0.2017, simple_loss=0.3019, pruned_loss=0.05073, over 143649.00 frames. 
2023-06-15 11:41:29,081 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 11:41:32,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=127780.0, ans=0.125 2023-06-15 11:41:50,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=127846.66666666667, ans=0.125 2023-06-15 11:41:55,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=127846.66666666667, ans=0.0 2023-06-15 11:41:59,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=127846.66666666667, ans=0.0 2023-06-15 11:42:01,450 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.79 vs. limit=15.0 2023-06-15 11:42:13,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=127913.33333333333, ans=0.2 2023-06-15 11:42:22,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=127980.0, ans=0.1 2023-06-15 11:42:50,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=128046.66666666667, ans=0.0 2023-06-15 11:42:56,856 INFO [train.py:988] (3/4) Epoch 37, batch 50, loss[loss=0.2456, simple_loss=0.3248, pruned_loss=0.08315, over 16280.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2937, pruned_loss=0.06787, over 855877.77 frames. ], batch size: 52, lr: 8.95e-03, grad_scale: 32.0 2023-06-15 11:43:01,648 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.50 vs. 
limit=12.0 2023-06-15 11:43:04,207 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.676e+02 1.887e+02 2.171e+02 3.433e+02, threshold=3.773e+02, percent-clipped=0.0 2023-06-15 11:43:12,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=128180.0, ans=0.125 2023-06-15 11:43:20,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=128180.0, ans=0.0 2023-06-15 11:43:20,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=128180.0, ans=0.0 2023-06-15 11:43:21,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=128180.0, ans=0.125 2023-06-15 11:43:25,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=128180.0, ans=0.125 2023-06-15 11:43:32,424 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 11:43:34,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=128246.66666666667, ans=0.5 2023-06-15 11:44:17,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=128380.0, ans=0.0 2023-06-15 11:44:17,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=128380.0, ans=0.125 2023-06-15 11:44:23,367 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.78 vs. limit=12.0 2023-06-15 11:44:24,268 INFO [train.py:988] (3/4) Epoch 37, batch 100, loss[loss=0.2124, simple_loss=0.2943, pruned_loss=0.06527, over 18775.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2922, pruned_loss=0.06712, over 1498037.00 frames. ], batch size: 83, lr: 8.94e-03, grad_scale: 32.0 2023-06-15 11:44:33,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=128446.66666666667, ans=0.2 2023-06-15 11:45:14,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=128580.0, ans=0.0 2023-06-15 11:45:39,550 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.63 vs. limit=15.0 2023-06-15 11:45:52,392 INFO [train.py:988] (3/4) Epoch 37, batch 150, loss[loss=0.2042, simple_loss=0.2914, pruned_loss=0.05847, over 19227.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2909, pruned_loss=0.06616, over 2012783.42 frames. ], batch size: 92, lr: 8.93e-03, grad_scale: 32.0 2023-06-15 11:45:52,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=128780.0, ans=0.125 2023-06-15 11:45:54,946 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.59 vs. 
limit=15.0 2023-06-15 11:45:58,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=128780.0, ans=0.1 2023-06-15 11:45:59,536 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 1.837e+02 2.114e+02 2.317e+02 3.549e+02, threshold=4.229e+02, percent-clipped=0.0 2023-06-15 11:46:30,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=128913.33333333333, ans=0.1 2023-06-15 11:46:42,880 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 11:46:51,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=128980.0, ans=0.0 2023-06-15 11:47:07,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=129046.66666666667, ans=0.125 2023-06-15 11:47:07,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=129046.66666666667, ans=0.5 2023-06-15 11:47:20,620 INFO [train.py:988] (3/4) Epoch 37, batch 200, loss[loss=0.2125, simple_loss=0.2997, pruned_loss=0.06264, over 18801.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2917, pruned_loss=0.06717, over 2398759.35 frames. ], batch size: 83, lr: 8.92e-03, grad_scale: 32.0 2023-06-15 11:47:32,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=129113.33333333333, ans=0.2 2023-06-15 11:48:19,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=129313.33333333333, ans=0.2 2023-06-15 11:48:33,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=129380.0, ans=0.0 2023-06-15 11:48:48,315 INFO [train.py:988] (3/4) Epoch 37, batch 250, loss[loss=0.2097, simple_loss=0.2997, pruned_loss=0.05983, over 18443.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2912, pruned_loss=0.06647, over 2717886.40 frames. ], batch size: 77, lr: 8.91e-03, grad_scale: 32.0 2023-06-15 11:48:54,953 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.981e+02 2.345e+02 2.837e+02 3.921e+02, threshold=4.691e+02, percent-clipped=0.0 2023-06-15 11:49:10,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=129513.33333333333, ans=0.0 2023-06-15 11:49:10,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=129513.33333333333, ans=0.05 2023-06-15 11:49:21,698 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.79 vs. 
limit=6.0 2023-06-15 11:49:51,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=129646.66666666667, ans=0.0 2023-06-15 11:49:56,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=129646.66666666667, ans=0.0 2023-06-15 11:50:02,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=129713.33333333333, ans=0.0 2023-06-15 11:50:16,948 INFO [train.py:988] (3/4) Epoch 37, batch 300, loss[loss=0.2056, simple_loss=0.2745, pruned_loss=0.06838, over 20277.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2919, pruned_loss=0.06678, over 2950600.66 frames. ], batch size: 239, lr: 8.90e-03, grad_scale: 32.0 2023-06-15 11:50:39,255 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.89 vs. limit=15.0 2023-06-15 11:51:05,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=129913.33333333333, ans=0.125 2023-06-15 11:51:11,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=129980.0, ans=0.0 2023-06-15 11:51:13,250 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.26 vs. limit=22.5 2023-06-15 11:51:27,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=130046.66666666667, ans=0.1 2023-06-15 11:51:31,549 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.71 vs. limit=12.0 2023-06-15 11:51:45,589 INFO [train.py:988] (3/4) Epoch 37, batch 350, loss[loss=0.2143, simple_loss=0.2913, pruned_loss=0.06859, over 20275.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2914, pruned_loss=0.06678, over 3133804.01 frames. ], batch size: 149, lr: 8.89e-03, grad_scale: 32.0 2023-06-15 11:51:47,752 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=130113.33333333333, ans=0.125 2023-06-15 11:51:51,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=130113.33333333333, ans=0.125 2023-06-15 11:51:52,290 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.921e+02 2.094e+02 2.447e+02 3.479e+02, threshold=4.189e+02, percent-clipped=0.0 2023-06-15 11:51:59,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=130113.33333333333, ans=0.1 2023-06-15 11:52:06,303 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.81 vs. limit=15.0 2023-06-15 11:52:11,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=130180.0, ans=0.125 2023-06-15 11:52:11,422 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 11:52:28,747 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.76 vs. 
limit=15.0 2023-06-15 11:53:00,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=130380.0, ans=0.2 2023-06-15 11:53:06,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=130380.0, ans=0.2 2023-06-15 11:53:13,093 INFO [train.py:988] (3/4) Epoch 37, batch 400, loss[loss=0.2191, simple_loss=0.2872, pruned_loss=0.07551, over 20226.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2911, pruned_loss=0.06668, over 3298342.71 frames. ], batch size: 239, lr: 8.88e-03, grad_scale: 32.0 2023-06-15 11:53:25,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=130446.66666666667, ans=0.0 2023-06-15 11:53:34,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=130513.33333333333, ans=0.125 2023-06-15 11:53:44,696 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.39 vs. limit=15.0 2023-06-15 11:53:49,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=130580.0, ans=0.2 2023-06-15 11:54:42,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=130780.0, ans=0.0 2023-06-15 11:54:43,358 INFO [train.py:988] (3/4) Epoch 37, batch 450, loss[loss=0.2067, simple_loss=0.2911, pruned_loss=0.06119, over 19110.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2909, pruned_loss=0.06624, over 3406549.12 frames. ], batch size: 94, lr: 8.87e-03, grad_scale: 16.0 2023-06-15 11:54:51,582 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.554e+02 1.773e+02 2.041e+02 2.312e+02 3.124e+02, threshold=4.082e+02, percent-clipped=0.0 2023-06-15 11:54:52,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=130780.0, ans=0.1 2023-06-15 11:54:53,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=130780.0, ans=0.125 2023-06-15 11:55:13,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=130846.66666666667, ans=0.2 2023-06-15 11:55:29,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=130913.33333333333, ans=0.125 2023-06-15 11:55:48,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=130980.0, ans=0.125 2023-06-15 11:56:09,222 INFO [train.py:988] (3/4) Epoch 37, batch 500, loss[loss=0.2078, simple_loss=0.2978, pruned_loss=0.05893, over 19093.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2912, pruned_loss=0.06632, over 3492850.31 frames. 
], batch size: 94, lr: 8.86e-03, grad_scale: 16.0 2023-06-15 11:56:17,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=131113.33333333334, ans=0.1 2023-06-15 11:56:37,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=131180.0, ans=0.0 2023-06-15 11:56:46,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=131246.66666666666, ans=0.125 2023-06-15 11:56:46,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=131246.66666666666, ans=0.0 2023-06-15 11:57:24,717 INFO [train.py:988] (3/4) Epoch 38, batch 0, loss[loss=0.2178, simple_loss=0.3068, pruned_loss=0.06438, over 14816.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.3068, pruned_loss=0.06438, over 14816.00 frames. ], batch size: 42, lr: 8.73e-03, grad_scale: 32.0 2023-06-15 11:57:24,717 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 11:57:31,186 INFO [train.py:1020] (3/4) Epoch 38, validation: loss=0.2046, simple_loss=0.3024, pruned_loss=0.05337, over 143649.00 frames. 2023-06-15 11:57:31,187 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 11:57:46,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=131326.66666666666, ans=0.0 2023-06-15 11:57:49,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=131393.33333333334, ans=0.07 2023-06-15 11:58:07,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=131460.0, ans=0.2 2023-06-15 11:58:13,983 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.55 vs. limit=10.0 2023-06-15 11:58:14,479 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.554e+02 1.971e+02 2.221e+02 2.654e+02 3.969e+02, threshold=4.441e+02, percent-clipped=0.0 2023-06-15 11:58:44,281 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=22.5 2023-06-15 11:58:49,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=131593.33333333334, ans=0.0 2023-06-15 11:59:00,145 INFO [train.py:988] (3/4) Epoch 38, batch 50, loss[loss=0.2094, simple_loss=0.2739, pruned_loss=0.0724, over 20130.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2862, pruned_loss=0.06761, over 878926.53 frames. 
], batch size: 239, lr: 8.72e-03, grad_scale: 16.0 2023-06-15 11:59:07,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=131660.0, ans=0.0 2023-06-15 11:59:19,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=131726.66666666666, ans=0.2 2023-06-15 11:59:26,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=131726.66666666666, ans=0.0 2023-06-15 11:59:28,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=131726.66666666666, ans=0.07 2023-06-15 11:59:28,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=131726.66666666666, ans=0.0 2023-06-15 11:59:39,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=131793.33333333334, ans=0.2 2023-06-15 11:59:54,638 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.27 vs. limit=10.0 2023-06-15 11:59:59,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=131860.0, ans=0.1 2023-06-15 12:00:22,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=131926.66666666666, ans=0.0 2023-06-15 12:00:26,390 INFO [train.py:988] (3/4) Epoch 38, batch 100, loss[loss=0.1875, simple_loss=0.2694, pruned_loss=0.05283, over 19060.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2915, pruned_loss=0.06777, over 1512725.79 frames. ], batch size: 89, lr: 8.71e-03, grad_scale: 16.0 2023-06-15 12:00:45,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=132060.0, ans=0.2 2023-06-15 12:00:46,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=132060.0, ans=0.1 2023-06-15 12:00:57,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=132060.0, ans=0.125 2023-06-15 12:01:07,824 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.839e+02 2.066e+02 2.376e+02 4.240e+02, threshold=4.131e+02, percent-clipped=0.0 2023-06-15 12:01:24,421 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.18 vs. limit=6.0 2023-06-15 12:01:28,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=132193.33333333334, ans=0.1 2023-06-15 12:01:42,818 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.63 vs. limit=10.0 2023-06-15 12:01:52,155 INFO [train.py:988] (3/4) Epoch 38, batch 150, loss[loss=0.2224, simple_loss=0.3131, pruned_loss=0.06582, over 13099.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2905, pruned_loss=0.06587, over 2016603.68 frames. 
], batch size: 37, lr: 8.70e-03, grad_scale: 16.0 2023-06-15 12:01:59,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=132326.66666666666, ans=0.0 2023-06-15 12:02:05,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=132326.66666666666, ans=0.125 2023-06-15 12:02:38,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=132460.0, ans=0.025 2023-06-15 12:03:17,127 INFO [train.py:988] (3/4) Epoch 38, batch 200, loss[loss=0.1844, simple_loss=0.2751, pruned_loss=0.04683, over 19217.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2911, pruned_loss=0.06581, over 2387718.92 frames. ], batch size: 92, lr: 8.69e-03, grad_scale: 16.0 2023-06-15 12:03:58,555 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.792e+02 1.952e+02 2.227e+02 3.208e+02, threshold=3.904e+02, percent-clipped=0.0 2023-06-15 12:04:13,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=132860.0, ans=0.04949747468305833 2023-06-15 12:04:43,043 INFO [train.py:988] (3/4) Epoch 38, batch 250, loss[loss=0.2082, simple_loss=0.2964, pruned_loss=0.05999, over 19494.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2904, pruned_loss=0.06566, over 2719723.07 frames. ], batch size: 105, lr: 8.68e-03, grad_scale: 16.0 2023-06-15 12:05:09,862 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=12.0 2023-06-15 12:05:37,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=133193.33333333334, ans=0.125 2023-06-15 12:05:53,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=133260.0, ans=0.125 2023-06-15 12:06:00,000 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.43 vs. limit=22.5 2023-06-15 12:06:04,441 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-15 12:06:10,256 INFO [train.py:988] (3/4) Epoch 38, batch 300, loss[loss=0.2035, simple_loss=0.2946, pruned_loss=0.05618, over 18466.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2902, pruned_loss=0.06571, over 2953112.66 frames. ], batch size: 77, lr: 8.67e-03, grad_scale: 16.0 2023-06-15 12:06:28,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=133326.66666666666, ans=0.125 2023-06-15 12:06:54,533 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.862e+02 2.054e+02 2.308e+02 3.545e+02, threshold=4.107e+02, percent-clipped=0.0 2023-06-15 12:07:33,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=133593.33333333334, ans=0.2 2023-06-15 12:07:39,130 INFO [train.py:988] (3/4) Epoch 38, batch 350, loss[loss=0.2099, simple_loss=0.2932, pruned_loss=0.06334, over 19119.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2905, pruned_loss=0.06535, over 3153870.09 frames. 
], batch size: 94, lr: 8.66e-03, grad_scale: 16.0 2023-06-15 12:07:54,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=133660.0, ans=0.0 2023-06-15 12:08:41,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=133860.0, ans=0.0 2023-06-15 12:08:57,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=133926.66666666666, ans=0.2 2023-06-15 12:09:05,468 INFO [train.py:988] (3/4) Epoch 38, batch 400, loss[loss=0.2267, simple_loss=0.2665, pruned_loss=0.09341, over 16786.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2903, pruned_loss=0.065, over 3307699.48 frames. ], batch size: 391, lr: 8.65e-03, grad_scale: 32.0 2023-06-15 12:09:20,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=133993.33333333334, ans=0.1 2023-06-15 12:09:20,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=133993.33333333334, ans=0.2 2023-06-15 12:09:49,033 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.764e+02 2.036e+02 2.337e+02 3.432e+02, threshold=4.071e+02, percent-clipped=0.0 2023-06-15 12:10:15,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=134260.0, ans=0.0 2023-06-15 12:10:32,634 INFO [train.py:988] (3/4) Epoch 38, batch 450, loss[loss=0.2241, simple_loss=0.2667, pruned_loss=0.09073, over 16733.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2896, pruned_loss=0.06476, over 3424487.11 frames. ], batch size: 391, lr: 8.65e-03, grad_scale: 16.0 2023-06-15 12:10:32,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=134326.66666666666, ans=0.125 2023-06-15 12:11:12,837 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.17 vs. limit=22.5 2023-06-15 12:11:15,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=134460.0, ans=0.0 2023-06-15 12:11:27,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=134526.66666666666, ans=0.5 2023-06-15 12:11:38,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=134593.33333333334, ans=0.125 2023-06-15 12:11:42,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=134593.33333333334, ans=0.2 2023-06-15 12:11:50,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=134593.33333333334, ans=0.0 2023-06-15 12:11:51,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=134593.33333333334, ans=0.125 2023-06-15 12:11:56,101 INFO [train.py:988] (3/4) Epoch 38, batch 500, loss[loss=0.1988, simple_loss=0.2833, pruned_loss=0.05721, over 19353.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2904, pruned_loss=0.06493, over 3504308.88 frames. 
], batch size: 98, lr: 8.64e-03, grad_scale: 16.0 2023-06-15 12:12:13,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=134726.66666666666, ans=10.0 2023-06-15 12:12:37,187 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.567e+02 1.813e+02 2.102e+02 2.344e+02 3.588e+02, threshold=4.203e+02, percent-clipped=0.0 2023-06-15 12:13:08,846 INFO [train.py:988] (3/4) Epoch 39, batch 0, loss[loss=0.2201, simple_loss=0.3162, pruned_loss=0.06195, over 18329.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.3162, pruned_loss=0.06195, over 18329.00 frames. ], batch size: 72, lr: 8.52e-03, grad_scale: 32.0 2023-06-15 12:13:08,847 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 12:13:14,995 INFO [train.py:1020] (3/4) Epoch 39, validation: loss=0.2008, simple_loss=0.3008, pruned_loss=0.05042, over 143649.00 frames. 2023-06-15 12:13:14,996 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 12:13:37,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=134940.0, ans=0.0 2023-06-15 12:14:02,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=135006.66666666666, ans=0.125 2023-06-15 12:14:07,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=135073.33333333334, ans=0.2 2023-06-15 12:14:08,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=135073.33333333334, ans=0.125 2023-06-15 12:14:42,572 INFO [train.py:988] (3/4) Epoch 39, batch 50, loss[loss=0.2046, simple_loss=0.2908, pruned_loss=0.05921, over 18940.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2913, pruned_loss=0.06469, over 863221.73 frames. ], batch size: 86, lr: 8.51e-03, grad_scale: 16.0 2023-06-15 12:14:56,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=135206.66666666666, ans=0.125 2023-06-15 12:14:58,345 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.08 vs. limit=6.0 2023-06-15 12:15:24,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=135340.0, ans=0.125 2023-06-15 12:15:39,628 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.14 vs. limit=12.0 2023-06-15 12:15:58,871 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.823e+02 2.055e+02 2.289e+02 2.991e+02, threshold=4.109e+02, percent-clipped=0.0 2023-06-15 12:16:06,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=135473.33333333334, ans=0.1 2023-06-15 12:16:09,601 INFO [train.py:988] (3/4) Epoch 39, batch 100, loss[loss=0.1946, simple_loss=0.2814, pruned_loss=0.05387, over 19530.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2881, pruned_loss=0.06397, over 1521210.72 frames. ], batch size: 102, lr: 8.50e-03, grad_scale: 16.0 2023-06-15 12:16:57,024 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=2.89 vs. 
limit=12.0 2023-06-15 12:17:35,691 INFO [train.py:988] (3/4) Epoch 39, batch 150, loss[loss=0.1878, simple_loss=0.2765, pruned_loss=0.04949, over 19056.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2891, pruned_loss=0.06438, over 2019839.08 frames. ], batch size: 89, lr: 8.49e-03, grad_scale: 16.0 2023-06-15 12:17:46,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=135873.33333333334, ans=0.125 2023-06-15 12:17:46,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=135873.33333333334, ans=0.2 2023-06-15 12:17:49,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=135873.33333333334, ans=0.125 2023-06-15 12:18:16,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=136006.66666666666, ans=0.0 2023-06-15 12:18:27,387 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.17 vs. limit=15.0 2023-06-15 12:18:34,515 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.62 vs. limit=22.5 2023-06-15 12:18:52,162 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.763e+02 2.014e+02 2.292e+02 3.195e+02, threshold=4.028e+02, percent-clipped=0.0 2023-06-15 12:19:03,519 INFO [train.py:988] (3/4) Epoch 39, batch 200, loss[loss=0.1874, simple_loss=0.2746, pruned_loss=0.05009, over 19822.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2885, pruned_loss=0.06438, over 2393622.77 frames. ], batch size: 115, lr: 8.48e-03, grad_scale: 16.0 2023-06-15 12:20:20,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=136473.33333333334, ans=0.2 2023-06-15 12:20:31,129 INFO [train.py:988] (3/4) Epoch 39, batch 250, loss[loss=0.1939, simple_loss=0.282, pruned_loss=0.05287, over 19818.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2897, pruned_loss=0.0644, over 2680504.34 frames. ], batch size: 115, lr: 8.47e-03, grad_scale: 16.0 2023-06-15 12:21:01,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=136606.66666666666, ans=0.125 2023-06-15 12:21:11,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=136673.33333333334, ans=0.125 2023-06-15 12:21:17,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=136673.33333333334, ans=0.125 2023-06-15 12:21:48,748 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.779e+02 1.952e+02 2.172e+02 3.259e+02, threshold=3.903e+02, percent-clipped=0.0 2023-06-15 12:21:52,598 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 12:21:56,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=136806.66666666666, ans=0.025 2023-06-15 12:21:59,997 INFO [train.py:988] (3/4) Epoch 39, batch 300, loss[loss=0.2071, simple_loss=0.281, pruned_loss=0.06661, over 20579.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2892, pruned_loss=0.06454, over 2945810.26 frames. 
], batch size: 173, lr: 8.46e-03, grad_scale: 16.0 2023-06-15 12:22:07,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=136873.33333333334, ans=0.025 2023-06-15 12:22:23,208 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.40 vs. limit=15.0 2023-06-15 12:22:45,962 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.94 vs. limit=15.0 2023-06-15 12:22:51,690 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.39 vs. limit=15.0 2023-06-15 12:22:54,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=137073.33333333334, ans=0.125 2023-06-15 12:22:58,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=137073.33333333334, ans=0.2 2023-06-15 12:23:01,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=137073.33333333334, ans=0.0 2023-06-15 12:23:20,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=137140.0, ans=0.125 2023-06-15 12:23:25,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1.whitening_limit, batch_count=137206.66666666666, ans=10.0 2023-06-15 12:23:26,368 INFO [train.py:988] (3/4) Epoch 39, batch 350, loss[loss=0.2241, simple_loss=0.2923, pruned_loss=0.07793, over 20150.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.289, pruned_loss=0.06432, over 3138463.77 frames. ], batch size: 239, lr: 8.45e-03, grad_scale: 16.0 2023-06-15 12:23:34,757 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.47 vs. 
limit=22.5 2023-06-15 12:23:38,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=137206.66666666666, ans=0.0 2023-06-15 12:23:45,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=137273.33333333334, ans=0.125 2023-06-15 12:23:46,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=137273.33333333334, ans=0.05 2023-06-15 12:24:04,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=137340.0, ans=0.125 2023-06-15 12:24:06,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=137340.0, ans=0.0 2023-06-15 12:24:17,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=137340.0, ans=0.125 2023-06-15 12:24:36,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=137473.33333333334, ans=0.125 2023-06-15 12:24:44,647 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.611e+02 1.816e+02 2.065e+02 2.366e+02 3.841e+02, threshold=4.130e+02, percent-clipped=0.0 2023-06-15 12:24:46,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=137473.33333333334, ans=0.125 2023-06-15 12:24:55,386 INFO [train.py:988] (3/4) Epoch 39, batch 400, loss[loss=0.2095, simple_loss=0.2885, pruned_loss=0.0653, over 20095.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2891, pruned_loss=0.06416, over 3278165.02 frames. ], batch size: 133, lr: 8.44e-03, grad_scale: 32.0 2023-06-15 12:25:05,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=137540.0, ans=0.05 2023-06-15 12:25:08,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=137540.0, ans=10.0 2023-06-15 12:25:35,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=137673.33333333334, ans=0.0 2023-06-15 12:26:24,762 INFO [train.py:988] (3/4) Epoch 39, batch 450, loss[loss=0.2115, simple_loss=0.2923, pruned_loss=0.06539, over 20291.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.289, pruned_loss=0.06414, over 3393959.29 frames. ], batch size: 149, lr: 8.44e-03, grad_scale: 16.0 2023-06-15 12:26:28,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=137873.33333333334, ans=0.1 2023-06-15 12:26:43,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=137940.0, ans=0.0 2023-06-15 12:26:50,481 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.69 vs. limit=6.0 2023-06-15 12:27:13,342 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.94 vs. 
limit=6.0 2023-06-15 12:27:15,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=138073.33333333334, ans=0.1 2023-06-15 12:27:34,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=138140.0, ans=0.125 2023-06-15 12:27:41,491 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.826e+02 2.179e+02 2.454e+02 3.798e+02, threshold=4.358e+02, percent-clipped=0.0 2023-06-15 12:27:49,798 INFO [train.py:988] (3/4) Epoch 39, batch 500, loss[loss=0.2041, simple_loss=0.277, pruned_loss=0.0656, over 20567.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2885, pruned_loss=0.06397, over 3477669.22 frames. ], batch size: 189, lr: 8.43e-03, grad_scale: 16.0 2023-06-15 12:28:28,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=138340.0, ans=0.125 2023-06-15 12:28:30,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=138340.0, ans=0.04949747468305833 2023-06-15 12:29:07,669 INFO [train.py:988] (3/4) Epoch 40, batch 0, loss[loss=0.2078, simple_loss=0.2548, pruned_loss=0.0804, over 16672.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2548, pruned_loss=0.0804, over 16672.00 frames. ], batch size: 391, lr: 8.31e-03, grad_scale: 32.0 2023-06-15 12:29:07,670 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 12:29:13,809 INFO [train.py:1020] (3/4) Epoch 40, validation: loss=0.2011, simple_loss=0.3008, pruned_loss=0.05073, over 143649.00 frames. 2023-06-15 12:29:13,812 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 12:29:19,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=138420.0, ans=0.07 2023-06-15 12:29:48,458 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.09 vs. limit=15.0 2023-06-15 12:29:49,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=138553.33333333334, ans=0.125 2023-06-15 12:30:00,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=138553.33333333334, ans=0.95 2023-06-15 12:30:15,857 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.90 vs. limit=15.0 2023-06-15 12:30:22,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=138686.66666666666, ans=0.1 2023-06-15 12:30:27,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=138686.66666666666, ans=0.0 2023-06-15 12:30:42,348 INFO [train.py:988] (3/4) Epoch 40, batch 50, loss[loss=0.222, simple_loss=0.2674, pruned_loss=0.08827, over 17040.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2875, pruned_loss=0.06411, over 865412.82 frames. 
], batch size: 391, lr: 8.31e-03, grad_scale: 32.0 2023-06-15 12:31:05,863 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.755e+02 2.069e+02 2.333e+02 3.346e+02, threshold=4.138e+02, percent-clipped=0.0 2023-06-15 12:31:28,378 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.94 vs. limit=15.0 2023-06-15 12:31:29,489 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.18 vs. limit=15.0 2023-06-15 12:31:34,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=138953.33333333334, ans=0.0 2023-06-15 12:32:12,093 INFO [train.py:988] (3/4) Epoch 40, batch 100, loss[loss=0.2169, simple_loss=0.3137, pruned_loss=0.06012, over 17040.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2891, pruned_loss=0.06356, over 1496222.91 frames. ], batch size: 60, lr: 8.30e-03, grad_scale: 32.0 2023-06-15 12:32:27,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=139153.33333333334, ans=0.05 2023-06-15 12:32:39,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=139153.33333333334, ans=0.0 2023-06-15 12:32:52,808 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.28 vs. limit=15.0 2023-06-15 12:33:23,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=139353.33333333334, ans=0.0 2023-06-15 12:33:32,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=139353.33333333334, ans=0.2 2023-06-15 12:33:38,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=139353.33333333334, ans=0.125 2023-06-15 12:33:39,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=139420.0, ans=0.125 2023-06-15 12:33:41,065 INFO [train.py:988] (3/4) Epoch 40, batch 150, loss[loss=0.1918, simple_loss=0.2761, pruned_loss=0.05377, over 18777.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2883, pruned_loss=0.06294, over 2012381.52 frames. ], batch size: 83, lr: 8.29e-03, grad_scale: 32.0 2023-06-15 12:33:41,837 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.81 vs. limit=22.5 2023-06-15 12:33:53,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=139420.0, ans=0.5 2023-06-15 12:33:59,342 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.00 vs. 
limit=15.0 2023-06-15 12:34:03,167 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.849e+02 1.991e+02 2.229e+02 4.188e+02, threshold=3.982e+02, percent-clipped=1.0 2023-06-15 12:34:35,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=139620.0, ans=0.0 2023-06-15 12:34:40,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=139620.0, ans=0.0 2023-06-15 12:35:09,542 INFO [train.py:988] (3/4) Epoch 40, batch 200, loss[loss=0.2022, simple_loss=0.2991, pruned_loss=0.05267, over 17025.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.288, pruned_loss=0.06335, over 2411944.17 frames. ], batch size: 60, lr: 8.28e-03, grad_scale: 32.0 2023-06-15 12:35:18,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=139753.33333333334, ans=0.0 2023-06-15 12:35:40,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=139820.0, ans=0.125 2023-06-15 12:36:29,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=140020.0, ans=0.125 2023-06-15 12:36:33,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=140020.0, ans=0.0 2023-06-15 12:36:36,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=140086.66666666666, ans=0.125 2023-06-15 12:36:37,825 INFO [train.py:988] (3/4) Epoch 40, batch 250, loss[loss=0.2015, simple_loss=0.2861, pruned_loss=0.05845, over 19790.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2894, pruned_loss=0.0634, over 2717712.31 frames. ], batch size: 115, lr: 8.27e-03, grad_scale: 32.0 2023-06-15 12:36:45,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=140086.66666666666, ans=0.0 2023-06-15 12:36:52,700 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0 2023-06-15 12:37:01,102 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.820e+02 2.078e+02 2.421e+02 4.152e+02, threshold=4.155e+02, percent-clipped=1.0 2023-06-15 12:37:06,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=140153.33333333334, ans=0.0 2023-06-15 12:37:08,373 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 12:37:22,050 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.35 vs. 
limit=15.0 2023-06-15 12:37:31,680 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 12:37:42,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=140286.66666666666, ans=0.1 2023-06-15 12:37:56,825 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 12:38:04,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=140353.33333333334, ans=0.125 2023-06-15 12:38:08,090 INFO [train.py:988] (3/4) Epoch 40, batch 300, loss[loss=0.2046, simple_loss=0.2885, pruned_loss=0.06038, over 19817.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2896, pruned_loss=0.06341, over 2962330.63 frames. ], batch size: 115, lr: 8.26e-03, grad_scale: 32.0 2023-06-15 12:38:11,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=140420.0, ans=0.1 2023-06-15 12:38:54,199 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.70 vs. limit=15.0 2023-06-15 12:39:19,041 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.58 vs. limit=22.5 2023-06-15 12:39:38,299 INFO [train.py:988] (3/4) Epoch 40, batch 350, loss[loss=0.2234, simple_loss=0.3184, pruned_loss=0.0642, over 17068.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.29, pruned_loss=0.06421, over 3142603.99 frames. ], batch size: 60, lr: 8.25e-03, grad_scale: 32.0 2023-06-15 12:39:42,723 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.53 vs. limit=6.0 2023-06-15 12:39:56,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=140820.0, ans=0.2 2023-06-15 12:40:01,665 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.757e+02 1.917e+02 2.241e+02 2.935e+02, threshold=3.834e+02, percent-clipped=0.0 2023-06-15 12:40:15,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=140886.66666666666, ans=0.125 2023-06-15 12:40:27,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=140886.66666666666, ans=0.0 2023-06-15 12:40:28,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=140886.66666666666, ans=0.125 2023-06-15 12:41:05,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=141020.0, ans=0.0 2023-06-15 12:41:08,530 INFO [train.py:988] (3/4) Epoch 40, batch 400, loss[loss=0.2252, simple_loss=0.2674, pruned_loss=0.09149, over 17098.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2898, pruned_loss=0.06417, over 3288779.00 frames. 
], batch size: 391, lr: 8.24e-03, grad_scale: 32.0 2023-06-15 12:41:10,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=141086.66666666666, ans=0.1 2023-06-15 12:41:11,371 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.48 vs. limit=10.0 2023-06-15 12:41:25,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=141153.33333333334, ans=0.0 2023-06-15 12:41:40,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=141153.33333333334, ans=0.125 2023-06-15 12:41:43,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=141220.0, ans=0.2 2023-06-15 12:42:00,785 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.83 vs. limit=12.0 2023-06-15 12:42:36,461 INFO [train.py:988] (3/4) Epoch 40, batch 450, loss[loss=0.1975, simple_loss=0.2842, pruned_loss=0.05539, over 19551.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2896, pruned_loss=0.0636, over 3394661.39 frames. ], batch size: 102, lr: 8.24e-03, grad_scale: 32.0 2023-06-15 12:42:59,286 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.768e+02 1.885e+02 2.206e+02 3.327e+02, threshold=3.770e+02, percent-clipped=0.0 2023-06-15 12:43:14,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=141553.33333333334, ans=0.2 2023-06-15 12:43:15,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=141553.33333333334, ans=0.125 2023-06-15 12:43:20,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=141553.33333333334, ans=0.0 2023-06-15 12:43:53,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=141686.66666666666, ans=0.125 2023-06-15 12:43:55,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=141686.66666666666, ans=0.125 2023-06-15 12:44:02,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=141753.33333333334, ans=0.1 2023-06-15 12:44:03,585 INFO [train.py:988] (3/4) Epoch 40, batch 500, loss[loss=0.23, simple_loss=0.3148, pruned_loss=0.07258, over 17078.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2895, pruned_loss=0.06353, over 3479901.44 frames. ], batch size: 60, lr: 8.23e-03, grad_scale: 32.0 2023-06-15 12:44:12,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=141753.33333333334, ans=0.125 2023-06-15 12:44:15,933 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.66 vs. 
limit=12.0 2023-06-15 12:44:17,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=141753.33333333334, ans=0.125 2023-06-15 12:44:28,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=141820.0, ans=0.125 2023-06-15 12:44:30,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=141820.0, ans=0.0 2023-06-15 12:44:33,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=141820.0, ans=0.1 2023-06-15 12:44:45,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=141886.66666666666, ans=0.125 2023-06-15 12:45:20,962 INFO [train.py:988] (3/4) Epoch 41, batch 0, loss[loss=0.1953, simple_loss=0.2778, pruned_loss=0.05643, over 19319.00 frames. ], tot_loss[loss=0.1953, simple_loss=0.2778, pruned_loss=0.05643, over 19319.00 frames. ], batch size: 98, lr: 8.12e-03, grad_scale: 32.0 2023-06-15 12:45:20,962 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 12:45:28,023 INFO [train.py:1020] (3/4) Epoch 41, validation: loss=0.2002, simple_loss=0.2999, pruned_loss=0.05026, over 143649.00 frames. 2023-06-15 12:45:28,024 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 12:46:04,628 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. limit=6.0 2023-06-15 12:46:18,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=142106.66666666666, ans=0.2 2023-06-15 12:46:21,183 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.355e+02 1.817e+02 2.110e+02 2.443e+02 3.477e+02, threshold=4.219e+02, percent-clipped=0.0 2023-06-15 12:46:23,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=142173.33333333334, ans=22.5 2023-06-15 12:46:43,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=142240.0, ans=0.125 2023-06-15 12:46:57,304 INFO [train.py:988] (3/4) Epoch 41, batch 50, loss[loss=0.1968, simple_loss=0.2835, pruned_loss=0.05505, over 19127.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2865, pruned_loss=0.06338, over 844726.08 frames. ], batch size: 94, lr: 8.11e-03, grad_scale: 32.0 2023-06-15 12:47:22,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=142373.33333333334, ans=0.125 2023-06-15 12:48:20,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=142573.33333333334, ans=0.0 2023-06-15 12:48:25,566 INFO [train.py:988] (3/4) Epoch 41, batch 100, loss[loss=0.2022, simple_loss=0.2868, pruned_loss=0.0588, over 19069.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2875, pruned_loss=0.06303, over 1496551.51 frames. 
], batch size: 89, lr: 8.10e-03, grad_scale: 32.0 2023-06-15 12:48:49,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=142706.66666666666, ans=0.125 2023-06-15 12:48:58,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=142706.66666666666, ans=0.05 2023-06-15 12:49:01,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=142773.33333333334, ans=0.0 2023-06-15 12:49:10,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=142773.33333333334, ans=0.0 2023-06-15 12:49:10,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=142773.33333333334, ans=0.1 2023-06-15 12:49:17,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=142840.0, ans=0.2 2023-06-15 12:49:18,940 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.448e+02 1.848e+02 2.100e+02 2.504e+02 3.647e+02, threshold=4.200e+02, percent-clipped=0.0 2023-06-15 12:49:28,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=142840.0, ans=0.1 2023-06-15 12:49:54,451 INFO [train.py:988] (3/4) Epoch 41, batch 150, loss[loss=0.1836, simple_loss=0.276, pruned_loss=0.04555, over 19114.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2863, pruned_loss=0.06374, over 2005389.12 frames. ], batch size: 94, lr: 8.09e-03, grad_scale: 32.0 2023-06-15 12:50:02,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=142973.33333333334, ans=0.015 2023-06-15 12:50:40,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=143106.66666666666, ans=0.125 2023-06-15 12:51:24,270 INFO [train.py:988] (3/4) Epoch 41, batch 200, loss[loss=0.1929, simple_loss=0.27, pruned_loss=0.05788, over 20658.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2877, pruned_loss=0.06336, over 2400157.80 frames. ], batch size: 211, lr: 8.09e-03, grad_scale: 32.0 2023-06-15 12:51:41,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=143373.33333333334, ans=0.0 2023-06-15 12:52:03,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=143440.0, ans=0.0 2023-06-15 12:52:09,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=143440.0, ans=0.1 2023-06-15 12:52:11,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=143440.0, ans=0.125 2023-06-15 12:52:13,303 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.74 vs. limit=22.5 2023-06-15 12:52:18,421 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.761e+02 1.969e+02 2.341e+02 3.526e+02, threshold=3.938e+02, percent-clipped=0.0 2023-06-15 12:52:54,196 INFO [train.py:988] (3/4) Epoch 41, batch 250, loss[loss=0.1813, simple_loss=0.2677, pruned_loss=0.04741, over 19836.00 frames. 
], tot_loss[loss=0.2073, simple_loss=0.2883, pruned_loss=0.06312, over 2712550.39 frames. ], batch size: 120, lr: 8.08e-03, grad_scale: 32.0 2023-06-15 12:52:56,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=143640.0, ans=0.125 2023-06-15 12:53:30,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=143773.33333333334, ans=0.2 2023-06-15 12:53:57,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=143840.0, ans=0.0 2023-06-15 12:54:18,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=143906.66666666666, ans=0.0 2023-06-15 12:54:24,713 INFO [train.py:988] (3/4) Epoch 41, batch 300, loss[loss=0.1972, simple_loss=0.2721, pruned_loss=0.06112, over 20579.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2877, pruned_loss=0.06301, over 2952562.07 frames. ], batch size: 173, lr: 8.07e-03, grad_scale: 32.0 2023-06-15 12:55:19,253 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.842e+02 2.030e+02 2.348e+02 3.359e+02, threshold=4.059e+02, percent-clipped=0.0 2023-06-15 12:55:26,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=144173.33333333334, ans=0.2 2023-06-15 12:55:41,660 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.37 vs. limit=22.5 2023-06-15 12:55:44,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=144240.0, ans=0.2 2023-06-15 12:55:54,948 INFO [train.py:988] (3/4) Epoch 41, batch 350, loss[loss=0.217, simple_loss=0.306, pruned_loss=0.064, over 16146.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2878, pruned_loss=0.06266, over 3127502.12 frames. 
], batch size: 51, lr: 8.06e-03, grad_scale: 32.0 2023-06-15 12:55:55,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=144306.66666666666, ans=0.125 2023-06-15 12:56:02,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=144306.66666666666, ans=0.2 2023-06-15 12:56:32,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=144440.0, ans=0.0 2023-06-15 12:56:45,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=144440.0, ans=0.125 2023-06-15 12:56:53,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=144506.66666666666, ans=0.125 2023-06-15 12:57:18,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=144573.33333333334, ans=0.125 2023-06-15 12:57:20,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=144573.33333333334, ans=0.125 2023-06-15 12:57:20,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=144573.33333333334, ans=0.125 2023-06-15 12:57:25,281 INFO [train.py:988] (3/4) Epoch 41, batch 400, loss[loss=0.2083, simple_loss=0.3015, pruned_loss=0.0575, over 18473.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2872, pruned_loss=0.06217, over 3284306.46 frames. ], batch size: 77, lr: 8.05e-03, grad_scale: 32.0 2023-06-15 12:57:25,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=144640.0, ans=0.125 2023-06-15 12:57:32,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=144640.0, ans=0.0 2023-06-15 12:57:55,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=144706.66666666666, ans=0.05 2023-06-15 12:58:03,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten.whitening_limit, batch_count=144773.33333333334, ans=15.0 2023-06-15 12:58:17,158 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.16 vs. limit=15.0 2023-06-15 12:58:17,899 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.778e+02 1.933e+02 2.211e+02 3.033e+02, threshold=3.866e+02, percent-clipped=0.0 2023-06-15 12:58:32,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=144840.0, ans=0.2 2023-06-15 12:58:34,807 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.94 vs. limit=6.0 2023-06-15 12:58:53,431 INFO [train.py:988] (3/4) Epoch 41, batch 450, loss[loss=0.1968, simple_loss=0.2797, pruned_loss=0.05695, over 18926.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2876, pruned_loss=0.0621, over 3382944.17 frames. 
], batch size: 86, lr: 8.04e-03, grad_scale: 32.0 2023-06-15 13:00:03,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=145240.0, ans=0.0 2023-06-15 13:00:07,763 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.28 vs. limit=15.0 2023-06-15 13:00:17,857 INFO [train.py:988] (3/4) Epoch 41, batch 500, loss[loss=0.1992, simple_loss=0.2825, pruned_loss=0.05794, over 18649.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2875, pruned_loss=0.06215, over 3470824.10 frames. ], batch size: 80, lr: 8.04e-03, grad_scale: 32.0 2023-06-15 13:00:25,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=145306.66666666666, ans=0.1 2023-06-15 13:00:54,112 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.98 vs. limit=22.5 2023-06-15 13:01:07,583 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.500e+02 1.783e+02 1.958e+02 2.209e+02 2.904e+02, threshold=3.915e+02, percent-clipped=0.0 2023-06-15 13:01:34,508 INFO [train.py:988] (3/4) Epoch 42, batch 0, loss[loss=0.1958, simple_loss=0.2724, pruned_loss=0.05961, over 20580.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2724, pruned_loss=0.05961, over 20580.00 frames. ], batch size: 173, lr: 7.93e-03, grad_scale: 32.0 2023-06-15 13:01:34,508 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 13:01:40,649 INFO [train.py:1020] (3/4) Epoch 42, validation: loss=0.1999, simple_loss=0.2992, pruned_loss=0.05028, over 143649.00 frames. 2023-06-15 13:01:40,651 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 13:01:51,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=145520.0, ans=0.1 2023-06-15 13:01:57,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=145586.66666666666, ans=0.1 2023-06-15 13:02:08,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=145586.66666666666, ans=0.125 2023-06-15 13:02:19,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=145653.33333333334, ans=0.2 2023-06-15 13:02:59,840 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.15 vs. limit=15.0 2023-06-15 13:03:07,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=145786.66666666666, ans=0.125 2023-06-15 13:03:10,721 INFO [train.py:988] (3/4) Epoch 42, batch 50, loss[loss=0.2102, simple_loss=0.3058, pruned_loss=0.05733, over 18323.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2877, pruned_loss=0.06274, over 848241.11 frames. 
], batch size: 72, lr: 7.93e-03, grad_scale: 32.0 2023-06-15 13:03:19,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=145853.33333333334, ans=0.0 2023-06-15 13:03:32,566 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 13:03:44,246 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.53 vs. limit=15.0 2023-06-15 13:04:08,850 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.66 vs. limit=15.0 2023-06-15 13:04:15,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=146053.33333333334, ans=0.125 2023-06-15 13:04:35,832 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.815e+02 2.002e+02 2.267e+02 3.167e+02, threshold=4.003e+02, percent-clipped=0.0 2023-06-15 13:04:39,182 INFO [train.py:988] (3/4) Epoch 42, batch 100, loss[loss=0.1838, simple_loss=0.2728, pruned_loss=0.04738, over 19506.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2856, pruned_loss=0.06197, over 1500800.98 frames. ], batch size: 102, lr: 7.92e-03, grad_scale: 32.0 2023-06-15 13:06:08,108 INFO [train.py:988] (3/4) Epoch 42, batch 150, loss[loss=0.1969, simple_loss=0.2932, pruned_loss=0.05032, over 13320.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2869, pruned_loss=0.06232, over 2007014.37 frames. ], batch size: 38, lr: 7.91e-03, grad_scale: 32.0 2023-06-15 13:06:10,749 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. limit=6.0 2023-06-15 13:06:26,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=146586.66666666666, ans=0.95 2023-06-15 13:06:26,602 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=22.5 2023-06-15 13:06:36,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=146586.66666666666, ans=0.0 2023-06-15 13:07:12,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=146720.0, ans=0.125 2023-06-15 13:07:34,325 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.782e+02 1.960e+02 2.224e+02 3.500e+02, threshold=3.921e+02, percent-clipped=0.0 2023-06-15 13:07:37,705 INFO [train.py:988] (3/4) Epoch 42, batch 200, loss[loss=0.2066, simple_loss=0.2936, pruned_loss=0.0598, over 18293.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2873, pruned_loss=0.06282, over 2391041.37 frames. ], batch size: 74, lr: 7.90e-03, grad_scale: 32.0 2023-06-15 13:07:56,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=146920.0, ans=0.0 2023-06-15 13:08:25,449 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.70 vs. 
limit=15.0 2023-06-15 13:08:26,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=146986.66666666666, ans=0.0 2023-06-15 13:08:49,002 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 13:08:53,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=147120.0, ans=0.1 2023-06-15 13:09:07,768 INFO [train.py:988] (3/4) Epoch 42, batch 250, loss[loss=0.2057, simple_loss=0.2864, pruned_loss=0.0625, over 19227.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2875, pruned_loss=0.06247, over 2694666.26 frames. ], batch size: 92, lr: 7.89e-03, grad_scale: 32.0 2023-06-15 13:09:30,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=147253.33333333334, ans=0.0 2023-06-15 13:09:32,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=147253.33333333334, ans=0.125 2023-06-15 13:09:37,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=147253.33333333334, ans=0.125 2023-06-15 13:09:54,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=147320.0, ans=0.0 2023-06-15 13:10:07,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=147386.66666666666, ans=0.0 2023-06-15 13:10:14,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=147386.66666666666, ans=0.0 2023-06-15 13:10:22,110 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 13:10:28,409 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.30 vs. limit=12.0 2023-06-15 13:10:32,055 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.737e+02 1.907e+02 2.072e+02 2.821e+02, threshold=3.814e+02, percent-clipped=0.0 2023-06-15 13:10:36,343 INFO [train.py:988] (3/4) Epoch 42, batch 300, loss[loss=0.2231, simple_loss=0.3096, pruned_loss=0.06835, over 15488.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2868, pruned_loss=0.06224, over 2933774.35 frames. ], batch size: 44, lr: 7.88e-03, grad_scale: 32.0 2023-06-15 13:10:36,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=147520.0, ans=0.1 2023-06-15 13:10:42,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=147520.0, ans=0.125 2023-06-15 13:10:43,378 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.62 vs. limit=5.0 2023-06-15 13:10:57,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=147586.66666666666, ans=0.0 2023-06-15 13:11:07,684 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.93 vs. 
limit=12.0 2023-06-15 13:12:04,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=147853.33333333334, ans=0.0 2023-06-15 13:12:05,687 INFO [train.py:988] (3/4) Epoch 42, batch 350, loss[loss=0.1908, simple_loss=0.2741, pruned_loss=0.05369, over 19648.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2866, pruned_loss=0.06177, over 3140286.79 frames. ], batch size: 110, lr: 7.88e-03, grad_scale: 32.0 2023-06-15 13:12:33,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=147920.0, ans=0.2 2023-06-15 13:12:44,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=147986.66666666666, ans=0.0 2023-06-15 13:13:00,063 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.47 vs. limit=15.0 2023-06-15 13:13:30,807 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.764e+02 1.957e+02 2.256e+02 2.981e+02, threshold=3.914e+02, percent-clipped=0.0 2023-06-15 13:13:34,291 INFO [train.py:988] (3/4) Epoch 42, batch 400, loss[loss=0.1955, simple_loss=0.2891, pruned_loss=0.05089, over 19084.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2873, pruned_loss=0.06195, over 3274322.48 frames. ], batch size: 94, lr: 7.87e-03, grad_scale: 32.0 2023-06-15 13:13:50,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=148253.33333333334, ans=0.125 2023-06-15 13:13:56,418 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.46 vs. limit=15.0 2023-06-15 13:14:14,068 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 13:14:15,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=148320.0, ans=0.2 2023-06-15 13:14:47,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=148453.33333333334, ans=0.125 2023-06-15 13:15:00,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=148453.33333333334, ans=0.125 2023-06-15 13:15:03,347 INFO [train.py:988] (3/4) Epoch 42, batch 450, loss[loss=0.2142, simple_loss=0.2898, pruned_loss=0.06927, over 20316.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.287, pruned_loss=0.06192, over 3397242.74 frames. ], batch size: 149, lr: 7.86e-03, grad_scale: 32.0 2023-06-15 13:15:17,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=148520.0, ans=0.0 2023-06-15 13:15:21,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=148586.66666666666, ans=0.0 2023-06-15 13:15:25,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=148586.66666666666, ans=0.125 2023-06-15 13:15:41,123 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. 
limit=6.0 2023-06-15 13:15:54,717 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.98 vs. limit=15.0 2023-06-15 13:16:11,303 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 13:16:15,881 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.01 vs. limit=10.0 2023-06-15 13:16:26,114 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.894e+02 2.076e+02 2.350e+02 3.042e+02, threshold=4.151e+02, percent-clipped=0.0 2023-06-15 13:16:28,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=148853.33333333334, ans=0.1 2023-06-15 13:16:29,385 INFO [train.py:988] (3/4) Epoch 42, batch 500, loss[loss=0.2006, simple_loss=0.2785, pruned_loss=0.06138, over 19338.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2876, pruned_loss=0.06213, over 3467084.93 frames. ], batch size: 98, lr: 7.85e-03, grad_scale: 32.0 2023-06-15 13:16:31,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=148853.33333333334, ans=0.125 2023-06-15 13:16:33,922 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.84 vs. limit=12.0 2023-06-15 13:17:06,086 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.03 vs. limit=22.5 2023-06-15 13:17:08,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=148986.66666666666, ans=0.0 2023-06-15 13:17:51,430 INFO [train.py:988] (3/4) Epoch 43, batch 0, loss[loss=0.2105, simple_loss=0.2979, pruned_loss=0.06159, over 18477.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2979, pruned_loss=0.06159, over 18477.00 frames. ], batch size: 77, lr: 7.76e-03, grad_scale: 32.0 2023-06-15 13:17:51,431 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 13:17:57,697 INFO [train.py:1020] (3/4) Epoch 43, validation: loss=0.2014, simple_loss=0.3004, pruned_loss=0.05115, over 143649.00 frames. 2023-06-15 13:17:57,698 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 13:17:59,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=149073.33333333334, ans=0.07 2023-06-15 13:18:18,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=149140.0, ans=0.1 2023-06-15 13:18:42,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=149206.66666666666, ans=0.125 2023-06-15 13:19:26,769 INFO [train.py:988] (3/4) Epoch 43, batch 50, loss[loss=0.2055, simple_loss=0.2899, pruned_loss=0.06057, over 18464.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2864, pruned_loss=0.06265, over 864444.56 frames. 
], batch size: 77, lr: 7.75e-03, grad_scale: 32.0 2023-06-15 13:19:53,434 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.770e+02 1.939e+02 2.280e+02 3.061e+02, threshold=3.878e+02, percent-clipped=0.0 2023-06-15 13:20:16,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=149540.0, ans=0.0 2023-06-15 13:20:20,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=149606.66666666666, ans=0.125 2023-06-15 13:20:46,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=149673.33333333334, ans=0.125 2023-06-15 13:20:55,021 INFO [train.py:988] (3/4) Epoch 43, batch 100, loss[loss=0.2041, simple_loss=0.2991, pruned_loss=0.0546, over 17626.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2857, pruned_loss=0.06197, over 1521682.71 frames. ], batch size: 67, lr: 7.74e-03, grad_scale: 32.0 2023-06-15 13:22:14,361 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.97 vs. limit=15.0 2023-06-15 13:22:23,153 INFO [train.py:988] (3/4) Epoch 43, batch 150, loss[loss=0.1796, simple_loss=0.2709, pruned_loss=0.04418, over 18762.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2859, pruned_loss=0.06092, over 2022854.17 frames. ], batch size: 83, lr: 7.73e-03, grad_scale: 32.0 2023-06-15 13:22:37,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=150073.33333333334, ans=0.125 2023-06-15 13:22:41,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=150140.0, ans=0.2 2023-06-15 13:22:50,683 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 1.767e+02 1.914e+02 2.114e+02 3.326e+02, threshold=3.828e+02, percent-clipped=0.0 2023-06-15 13:22:52,641 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 13:23:13,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=150206.66666666666, ans=0.0 2023-06-15 13:23:49,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=150340.0, ans=0.0 2023-06-15 13:23:51,946 INFO [train.py:988] (3/4) Epoch 43, batch 200, loss[loss=0.212, simple_loss=0.2952, pruned_loss=0.06445, over 19554.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2869, pruned_loss=0.06081, over 2419304.69 frames. ], batch size: 102, lr: 7.72e-03, grad_scale: 32.0 2023-06-15 13:23:57,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=150406.66666666666, ans=0.125 2023-06-15 13:23:59,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=150406.66666666666, ans=0.1 2023-06-15 13:24:08,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=150473.33333333334, ans=0.125 2023-06-15 13:24:13,036 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.52 vs. 
limit=12.0 2023-06-15 13:24:43,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=150606.66666666666, ans=0.125 2023-06-15 13:24:45,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=150606.66666666666, ans=0.125 2023-06-15 13:24:51,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=150606.66666666666, ans=0.1 2023-06-15 13:25:21,378 INFO [train.py:988] (3/4) Epoch 43, batch 250, loss[loss=0.2, simple_loss=0.2814, pruned_loss=0.05932, over 20302.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2873, pruned_loss=0.06146, over 2715493.91 frames. ], batch size: 141, lr: 7.72e-03, grad_scale: 32.0 2023-06-15 13:25:47,980 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.795e+02 2.042e+02 2.211e+02 3.400e+02, threshold=4.084e+02, percent-clipped=0.0 2023-06-15 13:26:11,480 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 13:26:31,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=151006.66666666666, ans=0.125 2023-06-15 13:26:45,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=151006.66666666666, ans=0.125 2023-06-15 13:26:50,343 INFO [train.py:988] (3/4) Epoch 43, batch 300, loss[loss=0.1985, simple_loss=0.2705, pruned_loss=0.06327, over 20532.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2867, pruned_loss=0.06108, over 2945836.64 frames. ], batch size: 173, lr: 7.71e-03, grad_scale: 32.0 2023-06-15 13:26:57,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=151073.33333333334, ans=0.1 2023-06-15 13:27:28,057 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.38 vs. limit=15.0 2023-06-15 13:27:31,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=151206.66666666666, ans=0.125 2023-06-15 13:27:57,130 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.26 vs. limit=15.0 2023-06-15 13:28:02,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=151340.0, ans=0.07 2023-06-15 13:28:04,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=151340.0, ans=0.0 2023-06-15 13:28:17,953 INFO [train.py:988] (3/4) Epoch 43, batch 350, loss[loss=0.1994, simple_loss=0.2777, pruned_loss=0.06059, over 20743.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2868, pruned_loss=0.06127, over 3123785.75 frames. 
], batch size: 211, lr: 7.70e-03, grad_scale: 64.0 2023-06-15 13:28:20,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=151406.66666666666, ans=0.0 2023-06-15 13:28:44,431 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.378e+02 1.733e+02 1.920e+02 2.080e+02 2.736e+02, threshold=3.841e+02, percent-clipped=0.0 2023-06-15 13:28:59,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=151540.0, ans=0.125 2023-06-15 13:29:47,162 INFO [train.py:988] (3/4) Epoch 43, batch 400, loss[loss=0.198, simple_loss=0.2931, pruned_loss=0.05143, over 18799.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2861, pruned_loss=0.06093, over 3283058.62 frames. ], batch size: 83, lr: 7.69e-03, grad_scale: 32.0 2023-06-15 13:30:42,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=151940.0, ans=0.05 2023-06-15 13:30:52,528 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 13:31:07,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=152006.66666666666, ans=0.0 2023-06-15 13:31:16,475 INFO [train.py:988] (3/4) Epoch 43, batch 450, loss[loss=0.2001, simple_loss=0.2829, pruned_loss=0.05865, over 20512.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2865, pruned_loss=0.06105, over 3395485.35 frames. ], batch size: 160, lr: 7.69e-03, grad_scale: 32.0 2023-06-15 13:31:28,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=152073.33333333334, ans=0.0 2023-06-15 13:31:40,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=152140.0, ans=0.1 2023-06-15 13:31:43,607 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.899e+02 2.101e+02 2.447e+02 3.803e+02, threshold=4.202e+02, percent-clipped=0.0 2023-06-15 13:32:13,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=152273.33333333334, ans=0.0 2023-06-15 13:32:18,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=152273.33333333334, ans=0.1 2023-06-15 13:32:42,692 INFO [train.py:988] (3/4) Epoch 43, batch 500, loss[loss=0.1904, simple_loss=0.2745, pruned_loss=0.0531, over 18783.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2859, pruned_loss=0.06123, over 3500880.58 frames. ], batch size: 83, lr: 7.68e-03, grad_scale: 32.0 2023-06-15 13:32:52,301 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.14 vs. limit=15.0 2023-06-15 13:32:52,541 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.04 vs. limit=15.0 2023-06-15 13:32:57,457 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.16 vs. 
limit=15.0 2023-06-15 13:33:05,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=152473.33333333334, ans=0.125 2023-06-15 13:33:08,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=152473.33333333334, ans=0.125 2023-06-15 13:34:02,654 INFO [train.py:988] (3/4) Epoch 44, batch 0, loss[loss=0.208, simple_loss=0.2981, pruned_loss=0.05893, over 18319.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2981, pruned_loss=0.05893, over 18319.00 frames. ], batch size: 74, lr: 7.58e-03, grad_scale: 32.0 2023-06-15 13:34:02,655 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 13:34:08,923 INFO [train.py:1020] (3/4) Epoch 44, validation: loss=0.204, simple_loss=0.3011, pruned_loss=0.05343, over 143649.00 frames. 2023-06-15 13:34:08,924 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 13:34:38,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=152693.33333333334, ans=0.1 2023-06-15 13:35:00,232 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.41 vs. limit=15.0 2023-06-15 13:35:03,326 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=22.5 2023-06-15 13:35:06,269 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.836e+02 2.115e+02 2.307e+02 4.215e+02, threshold=4.230e+02, percent-clipped=1.0 2023-06-15 13:35:08,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=152826.66666666666, ans=0.125 2023-06-15 13:35:34,929 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.55 vs. limit=10.0 2023-06-15 13:35:35,456 INFO [train.py:988] (3/4) Epoch 44, batch 50, loss[loss=0.2209, simple_loss=0.2996, pruned_loss=0.07108, over 20323.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2881, pruned_loss=0.061, over 853741.97 frames. ], batch size: 149, lr: 7.58e-03, grad_scale: 32.0 2023-06-15 13:35:40,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=152960.0, ans=0.1 2023-06-15 13:35:45,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=152960.0, ans=0.0 2023-06-15 13:35:47,892 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.15 vs. 
limit=22.5 2023-06-15 13:35:49,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=152960.0, ans=0.125 2023-06-15 13:35:57,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=153026.66666666666, ans=0.1 2023-06-15 13:36:23,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=153093.33333333334, ans=0.125 2023-06-15 13:36:37,128 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.86 vs. limit=15.0 2023-06-15 13:36:41,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=153160.0, ans=0.125 2023-06-15 13:36:44,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=153226.66666666666, ans=0.2 2023-06-15 13:37:03,143 INFO [train.py:988] (3/4) Epoch 44, batch 100, loss[loss=0.1995, simple_loss=0.2949, pruned_loss=0.05206, over 18321.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.285, pruned_loss=0.05928, over 1512043.40 frames. ], batch size: 72, lr: 7.57e-03, grad_scale: 32.0 2023-06-15 13:37:05,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=153293.33333333334, ans=0.1 2023-06-15 13:37:23,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=153360.0, ans=0.2 2023-06-15 13:37:26,850 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.46 vs. limit=15.0 2023-06-15 13:37:33,829 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.07 vs. limit=10.0 2023-06-15 13:37:36,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=153360.0, ans=0.125 2023-06-15 13:37:56,472 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.68 vs. limit=10.0 2023-06-15 13:38:01,110 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.94 vs. limit=10.0 2023-06-15 13:38:03,099 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.855e+02 2.123e+02 2.461e+02 3.591e+02, threshold=4.246e+02, percent-clipped=0.0 2023-06-15 13:38:05,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=153493.33333333334, ans=0.125 2023-06-15 13:38:19,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=153560.0, ans=0.125 2023-06-15 13:38:31,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=153626.66666666666, ans=0.2 2023-06-15 13:38:32,993 INFO [train.py:988] (3/4) Epoch 44, batch 150, loss[loss=0.2155, simple_loss=0.3033, pruned_loss=0.0638, over 17659.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2854, pruned_loss=0.06038, over 2017560.98 frames. 
], batch size: 67, lr: 7.56e-03, grad_scale: 32.0 2023-06-15 13:38:57,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=153693.33333333334, ans=0.0 2023-06-15 13:39:17,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=153760.0, ans=0.125 2023-06-15 13:39:21,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=153760.0, ans=0.125 2023-06-15 13:40:02,725 INFO [train.py:988] (3/4) Epoch 44, batch 200, loss[loss=0.1791, simple_loss=0.2607, pruned_loss=0.04877, over 10927.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2851, pruned_loss=0.06029, over 2411363.16 frames. ], batch size: 30, lr: 7.56e-03, grad_scale: 32.0 2023-06-15 13:40:02,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=153960.0, ans=0.125 2023-06-15 13:40:34,907 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.41 vs. limit=15.0 2023-06-15 13:40:39,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=154093.33333333334, ans=0.0 2023-06-15 13:41:00,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=154160.0, ans=0.0 2023-06-15 13:41:02,099 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.737e+02 1.890e+02 2.056e+02 2.866e+02, threshold=3.780e+02, percent-clipped=0.0 2023-06-15 13:41:32,039 INFO [train.py:988] (3/4) Epoch 44, batch 250, loss[loss=0.2095, simple_loss=0.2908, pruned_loss=0.06415, over 19560.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2846, pruned_loss=0.06056, over 2724157.36 frames. ], batch size: 102, lr: 7.55e-03, grad_scale: 32.0 2023-06-15 13:41:39,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=154293.33333333334, ans=0.1 2023-06-15 13:42:29,658 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.67 vs. limit=10.0 2023-06-15 13:42:39,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=154493.33333333334, ans=0.0 2023-06-15 13:42:43,795 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.54 vs. limit=15.0 2023-06-15 13:43:00,296 INFO [train.py:988] (3/4) Epoch 44, batch 300, loss[loss=0.1926, simple_loss=0.2868, pruned_loss=0.04925, over 18317.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2852, pruned_loss=0.06091, over 2946891.69 frames. ], batch size: 74, lr: 7.54e-03, grad_scale: 32.0 2023-06-15 13:43:14,711 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.52 vs. 
limit=22.5 2023-06-15 13:43:24,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=154693.33333333334, ans=0.125 2023-06-15 13:43:34,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=154760.0, ans=0.0 2023-06-15 13:44:00,071 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.826e+02 2.061e+02 2.441e+02 3.264e+02, threshold=4.122e+02, percent-clipped=0.0 2023-06-15 13:44:16,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=154893.33333333334, ans=0.0 2023-06-15 13:44:24,620 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.95 vs. limit=15.0 2023-06-15 13:44:30,138 INFO [train.py:988] (3/4) Epoch 44, batch 350, loss[loss=0.2117, simple_loss=0.3033, pruned_loss=0.06007, over 16185.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2847, pruned_loss=0.06067, over 3150027.69 frames. ], batch size: 52, lr: 7.53e-03, grad_scale: 32.0 2023-06-15 13:44:59,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=155026.66666666666, ans=0.125 2023-06-15 13:45:04,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=155093.33333333334, ans=0.0 2023-06-15 13:45:06,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=155093.33333333334, ans=0.125 2023-06-15 13:45:15,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=155093.33333333334, ans=0.125 2023-06-15 13:45:18,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=155093.33333333334, ans=0.125 2023-06-15 13:45:20,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=155093.33333333334, ans=0.05 2023-06-15 13:45:33,417 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.47 vs. limit=15.0 2023-06-15 13:45:35,466 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.59 vs. limit=15.0 2023-06-15 13:45:36,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=155160.0, ans=0.07 2023-06-15 13:45:59,538 INFO [train.py:988] (3/4) Epoch 44, batch 400, loss[loss=0.1966, simple_loss=0.2828, pruned_loss=0.05522, over 18786.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2847, pruned_loss=0.06077, over 3292228.55 frames. ], batch size: 83, lr: 7.53e-03, grad_scale: 32.0 2023-06-15 13:46:09,042 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.34 vs. 
limit=22.5 2023-06-15 13:46:13,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=155293.33333333334, ans=0.0 2023-06-15 13:46:21,349 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.47 vs. limit=15.0 2023-06-15 13:46:57,028 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.791e+02 1.946e+02 2.236e+02 4.124e+02, threshold=3.892e+02, percent-clipped=1.0 2023-06-15 13:47:26,922 INFO [train.py:988] (3/4) Epoch 44, batch 450, loss[loss=0.1834, simple_loss=0.2674, pruned_loss=0.04971, over 19820.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2851, pruned_loss=0.06061, over 3406417.72 frames. ], batch size: 115, lr: 7.52e-03, grad_scale: 32.0 2023-06-15 13:47:45,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=155693.33333333334, ans=0.125 2023-06-15 13:47:46,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=155693.33333333334, ans=0.125 2023-06-15 13:47:48,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=155693.33333333334, ans=0.1 2023-06-15 13:48:01,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=155760.0, ans=0.2 2023-06-15 13:48:03,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=155760.0, ans=0.125 2023-06-15 13:48:36,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=155893.33333333334, ans=0.125 2023-06-15 13:48:52,707 INFO [train.py:988] (3/4) Epoch 44, batch 500, loss[loss=0.1996, simple_loss=0.2832, pruned_loss=0.05799, over 19880.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2849, pruned_loss=0.06065, over 3498480.05 frames. ], batch size: 120, lr: 7.51e-03, grad_scale: 32.0 2023-06-15 13:49:00,686 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.09 vs. limit=15.0 2023-06-15 13:49:27,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=156093.33333333334, ans=0.0 2023-06-15 13:49:32,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=156093.33333333334, ans=0.1 2023-06-15 13:50:05,162 INFO [train.py:988] (3/4) Epoch 45, batch 0, loss[loss=0.1986, simple_loss=0.2849, pruned_loss=0.05613, over 18622.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2849, pruned_loss=0.05613, over 18622.00 frames. ], batch size: 80, lr: 7.42e-03, grad_scale: 32.0 2023-06-15 13:50:05,163 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 13:50:12,416 INFO [train.py:1020] (3/4) Epoch 45, validation: loss=0.2006, simple_loss=0.2992, pruned_loss=0.05105, over 143649.00 frames. 
2023-06-15 13:50:12,417 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 13:50:14,101 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.838e+02 2.044e+02 2.323e+02 3.630e+02, threshold=4.088e+02, percent-clipped=0.0 2023-06-15 13:50:27,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.whiten.whitening_limit, batch_count=156173.33333333334, ans=12.0 2023-06-15 13:50:29,039 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.72 vs. limit=22.5 2023-06-15 13:51:18,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=156373.33333333334, ans=0.05 2023-06-15 13:51:22,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=156440.0, ans=0.2 2023-06-15 13:51:22,807 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.56 vs. limit=15.0 2023-06-15 13:51:41,660 INFO [train.py:988] (3/4) Epoch 45, batch 50, loss[loss=0.2023, simple_loss=0.2861, pruned_loss=0.05928, over 20349.00 frames. ], tot_loss[loss=0.1994, simple_loss=0.2833, pruned_loss=0.0578, over 864105.86 frames. ], batch size: 149, lr: 7.41e-03, grad_scale: 32.0 2023-06-15 13:51:48,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=156506.66666666666, ans=0.2 2023-06-15 13:51:51,326 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.99 vs. limit=15.0 2023-06-15 13:52:09,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=156573.33333333334, ans=0.0 2023-06-15 13:52:13,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=156573.33333333334, ans=0.05 2023-06-15 13:52:22,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=156640.0, ans=0.125 2023-06-15 13:52:44,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=156706.66666666666, ans=0.125 2023-06-15 13:53:06,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=156773.33333333334, ans=0.2 2023-06-15 13:53:10,660 INFO [train.py:988] (3/4) Epoch 45, batch 100, loss[loss=0.2011, simple_loss=0.2855, pruned_loss=0.05834, over 19217.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2838, pruned_loss=0.06015, over 1503022.52 frames. 
], batch size: 92, lr: 7.41e-03, grad_scale: 32.0 2023-06-15 13:53:12,189 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.891e+02 2.087e+02 2.341e+02 3.228e+02, threshold=4.175e+02, percent-clipped=0.0 2023-06-15 13:53:34,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=156906.66666666666, ans=0.0 2023-06-15 13:53:48,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=156973.33333333334, ans=0.125 2023-06-15 13:53:58,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=156973.33333333334, ans=0.0 2023-06-15 13:54:03,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=157040.0, ans=0.04949747468305833 2023-06-15 13:54:25,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=157106.66666666666, ans=0.0 2023-06-15 13:54:36,164 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.61 vs. limit=15.0 2023-06-15 13:54:38,933 INFO [train.py:988] (3/4) Epoch 45, batch 150, loss[loss=0.2054, simple_loss=0.2934, pruned_loss=0.05871, over 18624.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2829, pruned_loss=0.05978, over 2014216.98 frames. ], batch size: 80, lr: 7.40e-03, grad_scale: 32.0 2023-06-15 13:55:33,304 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.04 vs. limit=10.0 2023-06-15 13:55:46,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=157373.33333333334, ans=0.125 2023-06-15 13:55:57,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=157440.0, ans=0.025 2023-06-15 13:56:07,408 INFO [train.py:988] (3/4) Epoch 45, batch 200, loss[loss=0.2034, simple_loss=0.2763, pruned_loss=0.06526, over 20721.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2837, pruned_loss=0.06011, over 2400282.41 frames. ], batch size: 211, lr: 7.39e-03, grad_scale: 32.0 2023-06-15 13:56:09,110 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.822e+02 1.971e+02 2.216e+02 3.677e+02, threshold=3.943e+02, percent-clipped=0.0 2023-06-15 13:56:15,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=157506.66666666666, ans=0.125 2023-06-15 13:56:49,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=157640.0, ans=0.125 2023-06-15 13:56:49,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=157640.0, ans=0.125 2023-06-15 13:57:06,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=157706.66666666666, ans=0.0 2023-06-15 13:57:29,539 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.40 vs. 
limit=15.0 2023-06-15 13:57:32,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=157773.33333333334, ans=0.2 2023-06-15 13:57:35,483 INFO [train.py:988] (3/4) Epoch 45, batch 250, loss[loss=0.1818, simple_loss=0.2662, pruned_loss=0.04867, over 19082.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.2847, pruned_loss=0.05991, over 2703158.69 frames. ], batch size: 89, lr: 7.39e-03, grad_scale: 32.0 2023-06-15 13:57:54,812 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.19 vs. limit=22.5 2023-06-15 13:57:55,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=157906.66666666666, ans=0.125 2023-06-15 13:58:09,471 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.38 vs. limit=15.0 2023-06-15 13:59:02,817 INFO [train.py:988] (3/4) Epoch 45, batch 300, loss[loss=0.209, simple_loss=0.2933, pruned_loss=0.06237, over 19557.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2834, pruned_loss=0.05972, over 2952039.66 frames. ], batch size: 102, lr: 7.38e-03, grad_scale: 32.0 2023-06-15 13:59:03,638 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.53 vs. limit=15.0 2023-06-15 13:59:04,376 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.818e+02 2.016e+02 2.367e+02 3.194e+02, threshold=4.032e+02, percent-clipped=0.0 2023-06-15 13:59:18,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=158173.33333333334, ans=0.125 2023-06-15 13:59:21,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=158240.0, ans=0.125 2023-06-15 13:59:29,294 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.52 vs. limit=15.0 2023-06-15 13:59:44,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=158306.66666666666, ans=0.125 2023-06-15 14:00:27,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=158440.0, ans=0.09899494936611666 2023-06-15 14:00:31,872 INFO [train.py:988] (3/4) Epoch 45, batch 350, loss[loss=0.1991, simple_loss=0.285, pruned_loss=0.05665, over 19087.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2843, pruned_loss=0.05938, over 3136172.10 frames. ], batch size: 89, lr: 7.37e-03, grad_scale: 32.0 2023-06-15 14:00:36,216 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.14 vs. limit=15.0 2023-06-15 14:00:43,379 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.24 vs. limit=15.0 2023-06-15 14:01:03,098 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.22 vs. 
limit=15.0 2023-06-15 14:01:14,292 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.44 vs. limit=15.0 2023-06-15 14:01:33,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=158706.66666666666, ans=0.125 2023-06-15 14:01:34,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=158706.66666666666, ans=0.2 2023-06-15 14:01:37,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=158706.66666666666, ans=0.125 2023-06-15 14:01:41,010 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.34 vs. limit=15.0 2023-06-15 14:02:00,899 INFO [train.py:988] (3/4) Epoch 45, batch 400, loss[loss=0.2076, simple_loss=0.3083, pruned_loss=0.05341, over 18330.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2839, pruned_loss=0.05907, over 3292599.68 frames. ], batch size: 72, lr: 7.36e-03, grad_scale: 32.0 2023-06-15 14:02:02,514 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.377e+02 1.797e+02 1.967e+02 2.284e+02 3.128e+02, threshold=3.934e+02, percent-clipped=0.0 2023-06-15 14:02:04,768 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.43 vs. limit=12.0 2023-06-15 14:02:23,446 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-15 14:03:00,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=159040.0, ans=0.1 2023-06-15 14:03:28,288 INFO [train.py:988] (3/4) Epoch 45, batch 450, loss[loss=0.2109, simple_loss=0.3065, pruned_loss=0.05764, over 18305.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2841, pruned_loss=0.05911, over 3411151.02 frames. ], batch size: 72, lr: 7.36e-03, grad_scale: 16.0 2023-06-15 14:03:41,573 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-15 14:03:57,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=159240.0, ans=0.0 2023-06-15 14:04:43,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=159440.0, ans=0.125 2023-06-15 14:04:52,857 INFO [train.py:988] (3/4) Epoch 45, batch 500, loss[loss=0.1769, simple_loss=0.2634, pruned_loss=0.04514, over 19847.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2841, pruned_loss=0.05908, over 3494728.35 frames. ], batch size: 120, lr: 7.35e-03, grad_scale: 16.0 2023-06-15 14:04:56,040 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.833e+02 2.042e+02 2.435e+02 3.752e+02, threshold=4.085e+02, percent-clipped=0.0 2023-06-15 14:05:22,622 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.66 vs. 
limit=15.0 2023-06-15 14:05:43,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=159706.66666666666, ans=0.125 2023-06-15 14:06:05,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=159726.66666666666, ans=0.0 2023-06-15 14:06:16,353 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.29 vs. limit=15.0 2023-06-15 14:06:17,146 INFO [train.py:988] (3/4) Epoch 46, batch 0, loss[loss=0.2083, simple_loss=0.2863, pruned_loss=0.06516, over 20266.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2863, pruned_loss=0.06516, over 20266.00 frames. ], batch size: 141, lr: 7.27e-03, grad_scale: 32.0 2023-06-15 14:06:17,146 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 14:06:23,569 INFO [train.py:1020] (3/4) Epoch 46, validation: loss=0.2018, simple_loss=0.3001, pruned_loss=0.05177, over 143649.00 frames. 2023-06-15 14:06:23,570 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 14:06:35,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=159726.66666666666, ans=0.125 2023-06-15 14:06:40,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=159793.33333333334, ans=0.04949747468305833 2023-06-15 14:06:44,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=159793.33333333334, ans=0.125 2023-06-15 14:06:47,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=159793.33333333334, ans=0.1 2023-06-15 14:07:03,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=159860.0, ans=0.125 2023-06-15 14:07:14,514 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.31 vs. limit=15.0 2023-06-15 14:07:40,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=159993.33333333334, ans=0.1 2023-06-15 14:07:53,067 INFO [train.py:988] (3/4) Epoch 46, batch 50, loss[loss=0.2037, simple_loss=0.2846, pruned_loss=0.06137, over 20067.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2839, pruned_loss=0.06002, over 853099.68 frames. ], batch size: 133, lr: 7.26e-03, grad_scale: 32.0 2023-06-15 14:08:21,212 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.18 vs. limit=15.0 2023-06-15 14:08:25,950 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.799e+02 2.021e+02 2.409e+02 3.297e+02, threshold=4.042e+02, percent-clipped=0.0 2023-06-15 14:08:29,301 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.46 vs. limit=22.5 2023-06-15 14:09:04,587 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=2.80 vs. 
limit=12.0 2023-06-15 14:09:19,621 INFO [train.py:988] (3/4) Epoch 46, batch 100, loss[loss=0.1982, simple_loss=0.2876, pruned_loss=0.0544, over 19226.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2859, pruned_loss=0.05912, over 1498595.58 frames. ], batch size: 92, lr: 7.25e-03, grad_scale: 16.0 2023-06-15 14:09:21,962 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-06-15 14:09:40,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=160460.0, ans=0.2 2023-06-15 14:10:29,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=160660.0, ans=0.0 2023-06-15 14:10:45,276 INFO [train.py:988] (3/4) Epoch 46, batch 150, loss[loss=0.1953, simple_loss=0.2767, pruned_loss=0.0569, over 19235.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.286, pruned_loss=0.05976, over 1996852.20 frames. ], batch size: 92, lr: 7.24e-03, grad_scale: 16.0 2023-06-15 14:10:46,247 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.35 vs. limit=15.0 2023-06-15 14:10:54,773 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.56 vs. limit=15.0 2023-06-15 14:11:04,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=160793.33333333334, ans=0.125 2023-06-15 14:11:08,674 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.97 vs. limit=15.0 2023-06-15 14:11:19,629 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.724e+02 1.880e+02 2.096e+02 2.711e+02, threshold=3.761e+02, percent-clipped=0.0 2023-06-15 14:11:19,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=160860.0, ans=0.2 2023-06-15 14:11:20,645 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.32 vs. limit=15.0 2023-06-15 14:11:41,445 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.90 vs. limit=12.0 2023-06-15 14:12:03,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=160993.33333333334, ans=0.125 2023-06-15 14:12:11,664 INFO [train.py:988] (3/4) Epoch 46, batch 200, loss[loss=0.212, simple_loss=0.2905, pruned_loss=0.06677, over 20299.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2845, pruned_loss=0.05887, over 2393972.22 frames. 
], batch size: 149, lr: 7.24e-03, grad_scale: 16.0 2023-06-15 14:12:27,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=161126.66666666666, ans=0.125 2023-06-15 14:12:33,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=161126.66666666666, ans=0.0 2023-06-15 14:12:51,758 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.73 vs. limit=15.0 2023-06-15 14:12:54,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=161193.33333333334, ans=0.0 2023-06-15 14:13:23,663 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.67 vs. limit=22.5 2023-06-15 14:13:24,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=161326.66666666666, ans=0.125 2023-06-15 14:13:37,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=161326.66666666666, ans=0.2 2023-06-15 14:13:38,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=161393.33333333334, ans=0.07 2023-06-15 14:13:40,044 INFO [train.py:988] (3/4) Epoch 46, batch 250, loss[loss=0.2165, simple_loss=0.2924, pruned_loss=0.07031, over 20078.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2834, pruned_loss=0.05925, over 2719854.89 frames. ], batch size: 133, lr: 7.23e-03, grad_scale: 16.0 2023-06-15 14:13:47,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=161393.33333333334, ans=0.125 2023-06-15 14:14:04,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=161460.0, ans=0.1 2023-06-15 14:14:06,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=161460.0, ans=0.125 2023-06-15 14:14:15,416 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.802e+02 2.040e+02 2.477e+02 3.551e+02, threshold=4.080e+02, percent-clipped=0.0 2023-06-15 14:14:27,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=161526.66666666666, ans=0.0 2023-06-15 14:14:35,246 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.59 vs. limit=10.0 2023-06-15 14:14:45,496 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=5.81 vs. limit=15.0 2023-06-15 14:15:07,652 INFO [train.py:988] (3/4) Epoch 46, batch 300, loss[loss=0.2245, simple_loss=0.2982, pruned_loss=0.07541, over 20089.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2829, pruned_loss=0.05914, over 2953287.06 frames. 
], batch size: 133, lr: 7.22e-03, grad_scale: 16.0 2023-06-15 14:15:23,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=161793.33333333334, ans=0.125 2023-06-15 14:15:30,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=161793.33333333334, ans=0.125 2023-06-15 14:15:49,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=161860.0, ans=0.04949747468305833 2023-06-15 14:15:59,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=161926.66666666666, ans=0.2 2023-06-15 14:16:17,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=161993.33333333334, ans=0.5 2023-06-15 14:16:23,927 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.57 vs. limit=15.0 2023-06-15 14:16:35,645 INFO [train.py:988] (3/4) Epoch 46, batch 350, loss[loss=0.213, simple_loss=0.2809, pruned_loss=0.07251, over 20211.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2817, pruned_loss=0.05873, over 3143331.07 frames. ], batch size: 239, lr: 7.22e-03, grad_scale: 16.0 2023-06-15 14:16:42,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=162060.0, ans=0.1 2023-06-15 14:16:44,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=162060.0, ans=0.125 2023-06-15 14:16:58,788 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.16 vs. limit=15.0 2023-06-15 14:17:10,340 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.494e+02 1.820e+02 1.979e+02 2.259e+02 3.140e+02, threshold=3.959e+02, percent-clipped=0.0 2023-06-15 14:17:23,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=162193.33333333334, ans=0.0 2023-06-15 14:17:25,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=162193.33333333334, ans=0.125 2023-06-15 14:17:34,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=162260.0, ans=0.0 2023-06-15 14:18:01,940 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.36 vs. limit=15.0 2023-06-15 14:18:02,070 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.83 vs. limit=15.0 2023-06-15 14:18:04,282 INFO [train.py:988] (3/4) Epoch 46, batch 400, loss[loss=0.2151, simple_loss=0.2838, pruned_loss=0.07316, over 20582.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2816, pruned_loss=0.05878, over 3291060.73 frames. 
], batch size: 189, lr: 7.21e-03, grad_scale: 32.0 2023-06-15 14:18:09,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=162393.33333333334, ans=0.05 2023-06-15 14:18:49,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=162526.66666666666, ans=0.05 2023-06-15 14:19:02,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=162593.33333333334, ans=0.0 2023-06-15 14:19:21,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=162660.0, ans=0.2 2023-06-15 14:19:33,240 INFO [train.py:988] (3/4) Epoch 46, batch 450, loss[loss=0.2059, simple_loss=0.2826, pruned_loss=0.06464, over 20250.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2822, pruned_loss=0.05897, over 3400025.87 frames. ], batch size: 141, lr: 7.20e-03, grad_scale: 32.0 2023-06-15 14:19:36,221 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.76 vs. limit=12.0 2023-06-15 14:19:40,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=162726.66666666666, ans=0.1 2023-06-15 14:19:42,860 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=29.42 vs. limit=22.5 2023-06-15 14:19:57,964 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.52 vs. limit=12.0 2023-06-15 14:20:07,076 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.857e+02 2.090e+02 2.387e+02 3.299e+02, threshold=4.180e+02, percent-clipped=0.0 2023-06-15 14:20:14,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=162860.0, ans=0.5 2023-06-15 14:20:57,285 INFO [train.py:988] (3/4) Epoch 46, batch 500, loss[loss=0.2009, simple_loss=0.2873, pruned_loss=0.05725, over 18795.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2828, pruned_loss=0.05891, over 3477714.22 frames. ], batch size: 83, lr: 7.20e-03, grad_scale: 32.0 2023-06-15 14:21:38,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=163193.33333333334, ans=0.125 2023-06-15 14:21:40,311 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.01 vs. limit=22.5 2023-06-15 14:22:12,892 INFO [train.py:988] (3/4) Epoch 47, batch 0, loss[loss=0.2043, simple_loss=0.2751, pruned_loss=0.06676, over 19866.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2751, pruned_loss=0.06676, over 19866.00 frames. ], batch size: 293, lr: 7.11e-03, grad_scale: 32.0 2023-06-15 14:22:12,893 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 14:22:19,328 INFO [train.py:1020] (3/4) Epoch 47, validation: loss=0.2046, simple_loss=0.3006, pruned_loss=0.05427, over 143649.00 frames. 
2023-06-15 14:22:19,329 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 14:22:54,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=163413.33333333334, ans=0.0 2023-06-15 14:23:23,396 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.765e+02 1.978e+02 2.218e+02 3.606e+02, threshold=3.956e+02, percent-clipped=0.0 2023-06-15 14:23:32,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=163546.66666666666, ans=0.125 2023-06-15 14:23:46,069 INFO [train.py:988] (3/4) Epoch 47, batch 50, loss[loss=0.1974, simple_loss=0.2837, pruned_loss=0.0556, over 19430.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2833, pruned_loss=0.0587, over 859288.09 frames. ], batch size: 105, lr: 7.11e-03, grad_scale: 32.0 2023-06-15 14:24:11,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=163680.0, ans=0.0 2023-06-15 14:24:18,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=163680.0, ans=0.0 2023-06-15 14:24:29,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=163746.66666666666, ans=0.035 2023-06-15 14:24:42,659 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.70 vs. limit=15.0 2023-06-15 14:24:45,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=163813.33333333334, ans=0.025 2023-06-15 14:24:47,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=163813.33333333334, ans=0.0 2023-06-15 14:24:55,278 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.98 vs. limit=15.0 2023-06-15 14:25:03,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=163880.0, ans=0.125 2023-06-15 14:25:13,301 INFO [train.py:988] (3/4) Epoch 47, batch 100, loss[loss=0.2094, simple_loss=0.2982, pruned_loss=0.0603, over 18775.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2825, pruned_loss=0.05853, over 1518836.04 frames. ], batch size: 83, lr: 7.10e-03, grad_scale: 32.0 2023-06-15 14:25:59,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=164080.0, ans=0.2 2023-06-15 14:26:03,123 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.88 vs. limit=10.0 2023-06-15 14:26:04,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=164146.66666666666, ans=0.125 2023-06-15 14:26:05,034 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.22 vs. 
limit=15.0 2023-06-15 14:26:05,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=164146.66666666666, ans=0.1 2023-06-15 14:26:05,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=164146.66666666666, ans=0.125 2023-06-15 14:26:10,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=164146.66666666666, ans=0.125 2023-06-15 14:26:16,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=164146.66666666666, ans=0.125 2023-06-15 14:26:17,524 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.736e+02 1.928e+02 2.284e+02 3.416e+02, threshold=3.856e+02, percent-clipped=0.0 2023-06-15 14:26:24,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=164213.33333333334, ans=0.1 2023-06-15 14:26:41,357 INFO [train.py:988] (3/4) Epoch 47, batch 150, loss[loss=0.2027, simple_loss=0.2879, pruned_loss=0.05871, over 19454.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2819, pruned_loss=0.05824, over 2011491.72 frames. ], batch size: 105, lr: 7.10e-03, grad_scale: 32.0 2023-06-15 14:26:41,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=164280.0, ans=0.125 2023-06-15 14:26:42,416 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.22 vs. limit=22.5 2023-06-15 14:26:46,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=164280.0, ans=0.09899494936611666 2023-06-15 14:27:00,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=164346.66666666666, ans=0.1 2023-06-15 14:27:27,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=164413.33333333334, ans=0.1 2023-06-15 14:27:30,106 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.44 vs. limit=15.0 2023-06-15 14:27:37,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=164480.0, ans=0.1 2023-06-15 14:27:44,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=164480.0, ans=0.125 2023-06-15 14:28:08,848 INFO [train.py:988] (3/4) Epoch 47, batch 200, loss[loss=0.2073, simple_loss=0.2722, pruned_loss=0.07121, over 19910.00 frames. ], tot_loss[loss=0.199, simple_loss=0.282, pruned_loss=0.05799, over 2407626.02 frames. ], batch size: 293, lr: 7.09e-03, grad_scale: 32.0 2023-06-15 14:28:45,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=164746.66666666666, ans=0.125 2023-06-15 14:29:13,211 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.84 vs. 
limit=15.0 2023-06-15 14:29:14,096 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.808e+02 2.037e+02 2.330e+02 4.045e+02, threshold=4.073e+02, percent-clipped=1.0 2023-06-15 14:29:36,284 INFO [train.py:988] (3/4) Epoch 47, batch 250, loss[loss=0.2087, simple_loss=0.2761, pruned_loss=0.0707, over 19942.00 frames. ], tot_loss[loss=0.199, simple_loss=0.2817, pruned_loss=0.05816, over 2725910.29 frames. ], batch size: 293, lr: 7.08e-03, grad_scale: 32.0 2023-06-15 14:29:54,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=165013.33333333334, ans=0.1 2023-06-15 14:29:54,940 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.21 vs. limit=15.0 2023-06-15 14:29:58,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=165013.33333333334, ans=0.125 2023-06-15 14:31:04,897 INFO [train.py:988] (3/4) Epoch 47, batch 300, loss[loss=0.2035, simple_loss=0.2942, pruned_loss=0.05642, over 16727.00 frames. ], tot_loss[loss=0.1991, simple_loss=0.2824, pruned_loss=0.05793, over 2957892.05 frames. ], batch size: 59, lr: 7.08e-03, grad_scale: 32.0 2023-06-15 14:31:14,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=165280.0, ans=0.0 2023-06-15 14:31:25,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=165346.66666666666, ans=0.0 2023-06-15 14:31:41,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=165413.33333333334, ans=0.2 2023-06-15 14:31:59,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=165480.0, ans=0.0 2023-06-15 14:32:11,340 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.863e+02 2.079e+02 2.322e+02 4.081e+02, threshold=4.157e+02, percent-clipped=1.0 2023-06-15 14:32:25,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=165546.66666666666, ans=0.125 2023-06-15 14:32:33,699 INFO [train.py:988] (3/4) Epoch 47, batch 350, loss[loss=0.2188, simple_loss=0.2895, pruned_loss=0.07402, over 20286.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2827, pruned_loss=0.05817, over 3146861.14 frames. ], batch size: 239, lr: 7.07e-03, grad_scale: 32.0 2023-06-15 14:32:48,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=165613.33333333334, ans=0.5 2023-06-15 14:32:59,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=165680.0, ans=0.125 2023-06-15 14:33:42,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=165813.33333333334, ans=0.125 2023-06-15 14:34:03,137 INFO [train.py:988] (3/4) Epoch 47, batch 400, loss[loss=0.1788, simple_loss=0.2644, pruned_loss=0.04661, over 19860.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2833, pruned_loss=0.05863, over 3279264.42 frames. 
], batch size: 120, lr: 7.06e-03, grad_scale: 32.0 2023-06-15 14:34:03,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=165946.66666666666, ans=0.1 2023-06-15 14:34:11,291 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.53 vs. limit=15.0 2023-06-15 14:34:16,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=165946.66666666666, ans=0.0 2023-06-15 14:34:42,404 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.49 vs. limit=15.0 2023-06-15 14:35:05,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=166146.66666666666, ans=0.125 2023-06-15 14:35:08,765 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.617e+02 1.781e+02 2.023e+02 2.326e+02 3.205e+02, threshold=4.047e+02, percent-clipped=0.0 2023-06-15 14:35:17,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=166213.33333333334, ans=0.0 2023-06-15 14:35:23,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=166213.33333333334, ans=0.1 2023-06-15 14:35:32,249 INFO [train.py:988] (3/4) Epoch 47, batch 450, loss[loss=0.1929, simple_loss=0.2474, pruned_loss=0.06917, over 17126.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2818, pruned_loss=0.05769, over 3404106.55 frames. ], batch size: 392, lr: 7.06e-03, grad_scale: 32.0 2023-06-15 14:36:01,977 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.75 vs. limit=15.0 2023-06-15 14:36:14,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=166413.33333333334, ans=0.035 2023-06-15 14:36:31,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=166480.0, ans=0.0 2023-06-15 14:36:42,201 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.73 vs. limit=15.0 2023-06-15 14:36:46,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=166546.66666666666, ans=0.125 2023-06-15 14:36:49,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=166546.66666666666, ans=0.125 2023-06-15 14:36:53,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=166546.66666666666, ans=0.2 2023-06-15 14:36:58,231 INFO [train.py:988] (3/4) Epoch 47, batch 500, loss[loss=0.2176, simple_loss=0.3024, pruned_loss=0.06642, over 16987.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2816, pruned_loss=0.05812, over 3502861.71 frames. 
], batch size: 60, lr: 7.05e-03, grad_scale: 32.0 2023-06-15 14:37:16,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=166680.0, ans=0.125 2023-06-15 14:37:39,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=166746.66666666666, ans=0.2 2023-06-15 14:37:45,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=166813.33333333334, ans=0.1 2023-06-15 14:38:07,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=166826.66666666666, ans=0.125 2023-06-15 14:38:12,617 INFO [train.py:988] (3/4) Epoch 48, batch 0, loss[loss=0.1911, simple_loss=0.272, pruned_loss=0.05512, over 19103.00 frames. ], tot_loss[loss=0.1911, simple_loss=0.272, pruned_loss=0.05512, over 19103.00 frames. ], batch size: 94, lr: 6.97e-03, grad_scale: 32.0 2023-06-15 14:38:12,617 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 14:38:18,657 INFO [train.py:1020] (3/4) Epoch 48, validation: loss=0.1998, simple_loss=0.298, pruned_loss=0.05082, over 143649.00 frames. 2023-06-15 14:38:18,657 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 14:38:21,386 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.41 vs. limit=6.0 2023-06-15 14:38:26,905 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.746e+02 1.946e+02 2.285e+02 3.541e+02, threshold=3.892e+02, percent-clipped=0.0 2023-06-15 14:38:47,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=166893.33333333334, ans=0.02 2023-06-15 14:38:58,010 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.73 vs. limit=6.0 2023-06-15 14:39:08,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=166960.0, ans=0.0 2023-06-15 14:39:11,878 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.13 vs. limit=15.0 2023-06-15 14:39:35,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=167093.33333333334, ans=0.0 2023-06-15 14:39:47,291 INFO [train.py:988] (3/4) Epoch 48, batch 50, loss[loss=0.1856, simple_loss=0.2719, pruned_loss=0.04962, over 19861.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2826, pruned_loss=0.05727, over 864315.12 frames. 
], batch size: 120, lr: 6.96e-03, grad_scale: 32.0 2023-06-15 14:39:58,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=167160.0, ans=0.125 2023-06-15 14:40:21,591 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 14:40:24,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=167293.33333333334, ans=0.2 2023-06-15 14:40:28,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=167293.33333333334, ans=0.125 2023-06-15 14:40:41,550 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.01 vs. limit=22.5 2023-06-15 14:40:44,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=167360.0, ans=0.125 2023-06-15 14:40:51,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=167360.0, ans=0.125 2023-06-15 14:40:51,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=167360.0, ans=0.0 2023-06-15 14:41:00,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.out_whiten.whitening_limit, batch_count=167426.66666666666, ans=8.0 2023-06-15 14:41:15,603 INFO [train.py:988] (3/4) Epoch 48, batch 100, loss[loss=0.1788, simple_loss=0.267, pruned_loss=0.04534, over 19082.00 frames. ], tot_loss[loss=0.1962, simple_loss=0.2807, pruned_loss=0.05581, over 1513550.98 frames. ], batch size: 89, lr: 6.96e-03, grad_scale: 32.0 2023-06-15 14:41:25,085 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.839e+02 2.012e+02 2.249e+02 3.194e+02, threshold=4.023e+02, percent-clipped=0.0 2023-06-15 14:41:30,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=167493.33333333334, ans=0.125 2023-06-15 14:41:36,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=167560.0, ans=0.0 2023-06-15 14:41:38,420 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.91 vs. limit=15.0 2023-06-15 14:41:40,280 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.65 vs. limit=12.0 2023-06-15 14:41:41,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=167560.0, ans=0.125 2023-06-15 14:41:48,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=167560.0, ans=0.5 2023-06-15 14:42:43,938 INFO [train.py:988] (3/4) Epoch 48, batch 150, loss[loss=0.1853, simple_loss=0.2648, pruned_loss=0.05292, over 20057.00 frames. ], tot_loss[loss=0.1966, simple_loss=0.2807, pruned_loss=0.05623, over 2026990.44 frames. 
], batch size: 133, lr: 6.95e-03, grad_scale: 32.0 2023-06-15 14:42:51,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=167826.66666666666, ans=0.0 2023-06-15 14:43:11,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=167893.33333333334, ans=0.2 2023-06-15 14:43:17,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=167893.33333333334, ans=0.025 2023-06-15 14:43:19,141 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.51 vs. limit=15.0 2023-06-15 14:43:23,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=167960.0, ans=0.125 2023-06-15 14:43:23,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=167960.0, ans=0.125 2023-06-15 14:43:47,290 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 14:44:13,000 INFO [train.py:988] (3/4) Epoch 48, batch 200, loss[loss=0.1896, simple_loss=0.2661, pruned_loss=0.05653, over 20281.00 frames. ], tot_loss[loss=0.1979, simple_loss=0.2815, pruned_loss=0.05717, over 2416959.59 frames. ], batch size: 239, lr: 6.95e-03, grad_scale: 32.0 2023-06-15 14:44:21,894 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.752e+02 1.958e+02 2.186e+02 2.989e+02, threshold=3.915e+02, percent-clipped=0.0 2023-06-15 14:44:36,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=168226.66666666666, ans=0.0 2023-06-15 14:44:55,975 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 14:45:13,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=168360.0, ans=10.0 2023-06-15 14:45:25,338 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.21 vs. limit=15.0 2023-06-15 14:45:41,716 INFO [train.py:988] (3/4) Epoch 48, batch 250, loss[loss=0.1926, simple_loss=0.276, pruned_loss=0.05458, over 19940.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2819, pruned_loss=0.05776, over 2727452.75 frames. ], batch size: 126, lr: 6.94e-03, grad_scale: 32.0 2023-06-15 14:45:53,017 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0 2023-06-15 14:45:54,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=168493.33333333334, ans=0.2 2023-06-15 14:45:54,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=168493.33333333334, ans=10.0 2023-06-15 14:45:56,568 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.17 vs. 
limit=12.0 2023-06-15 14:46:12,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=168560.0, ans=0.1 2023-06-15 14:46:32,350 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.85 vs. limit=6.0 2023-06-15 14:47:00,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=168760.0, ans=0.0 2023-06-15 14:47:02,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=168760.0, ans=0.04949747468305833 2023-06-15 14:47:10,261 INFO [train.py:988] (3/4) Epoch 48, batch 300, loss[loss=0.1853, simple_loss=0.2763, pruned_loss=0.04711, over 18617.00 frames. ], tot_loss[loss=0.1991, simple_loss=0.2828, pruned_loss=0.05766, over 2950435.25 frames. ], batch size: 80, lr: 6.93e-03, grad_scale: 32.0 2023-06-15 14:47:16,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=168826.66666666666, ans=0.125 2023-06-15 14:47:19,384 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.766e+02 2.086e+02 2.478e+02 4.078e+02, threshold=4.173e+02, percent-clipped=1.0 2023-06-15 14:47:31,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=168893.33333333334, ans=0.125 2023-06-15 14:47:51,642 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 14:48:38,584 INFO [train.py:988] (3/4) Epoch 48, batch 350, loss[loss=0.2229, simple_loss=0.3123, pruned_loss=0.06677, over 16335.00 frames. ], tot_loss[loss=0.1982, simple_loss=0.282, pruned_loss=0.05723, over 3139444.80 frames. ], batch size: 52, lr: 6.93e-03, grad_scale: 32.0 2023-06-15 14:48:43,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=169160.0, ans=0.2 2023-06-15 14:49:06,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=169226.66666666666, ans=0.125 2023-06-15 14:50:04,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=169493.33333333334, ans=0.0 2023-06-15 14:50:05,726 INFO [train.py:988] (3/4) Epoch 48, batch 400, loss[loss=0.1889, simple_loss=0.2702, pruned_loss=0.05382, over 19075.00 frames. ], tot_loss[loss=0.1982, simple_loss=0.2819, pruned_loss=0.05725, over 3278840.59 frames. 
], batch size: 89, lr: 6.92e-03, grad_scale: 32.0 2023-06-15 14:50:13,971 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.792e+02 1.982e+02 2.238e+02 3.664e+02, threshold=3.964e+02, percent-clipped=0.0 2023-06-15 14:50:58,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=169693.33333333334, ans=0.09899494936611666 2023-06-15 14:51:05,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=169693.33333333334, ans=0.0 2023-06-15 14:51:22,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=169760.0, ans=0.125 2023-06-15 14:51:32,757 INFO [train.py:988] (3/4) Epoch 48, batch 450, loss[loss=0.1976, simple_loss=0.2832, pruned_loss=0.05601, over 18753.00 frames. ], tot_loss[loss=0.1979, simple_loss=0.2812, pruned_loss=0.05724, over 3391705.97 frames. ], batch size: 83, lr: 6.91e-03, grad_scale: 32.0 2023-06-15 14:51:45,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=169826.66666666666, ans=0.1 2023-06-15 14:51:56,382 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.38 vs. limit=15.0 2023-06-15 14:52:17,426 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.33 vs. limit=15.0 2023-06-15 14:52:26,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=170026.66666666666, ans=0.1 2023-06-15 14:52:48,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=170093.33333333334, ans=0.125 2023-06-15 14:52:57,836 INFO [train.py:988] (3/4) Epoch 48, batch 500, loss[loss=0.2294, simple_loss=0.297, pruned_loss=0.08092, over 19923.00 frames. ], tot_loss[loss=0.1982, simple_loss=0.2807, pruned_loss=0.05785, over 3477697.14 frames. ], batch size: 126, lr: 6.91e-03, grad_scale: 32.0 2023-06-15 14:53:06,278 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.838e+02 2.045e+02 2.451e+02 3.381e+02, threshold=4.090e+02, percent-clipped=0.0 2023-06-15 14:53:08,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=170160.0, ans=0.1 2023-06-15 14:53:16,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=170226.66666666666, ans=0.2 2023-06-15 14:53:32,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=170293.33333333334, ans=0.1 2023-06-15 14:53:39,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=170293.33333333334, ans=0.05 2023-06-15 14:54:12,511 INFO [train.py:988] (3/4) Epoch 49, batch 0, loss[loss=0.2006, simple_loss=0.2856, pruned_loss=0.05783, over 19117.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2856, pruned_loss=0.05783, over 19117.00 frames. 
], batch size: 94, lr: 6.83e-03, grad_scale: 32.0 2023-06-15 14:54:12,511 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 14:54:19,023 INFO [train.py:1020] (3/4) Epoch 49, validation: loss=0.2025, simple_loss=0.2999, pruned_loss=0.05253, over 143649.00 frames. 2023-06-15 14:54:19,024 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 14:54:33,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=170373.33333333334, ans=0.125 2023-06-15 14:54:40,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=170440.0, ans=0.125 2023-06-15 14:54:43,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=170440.0, ans=0.0 2023-06-15 14:54:50,981 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 14:55:03,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=170506.66666666666, ans=0.125 2023-06-15 14:55:15,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=170573.33333333334, ans=0.5 2023-06-15 14:55:44,835 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 14:55:47,864 INFO [train.py:988] (3/4) Epoch 49, batch 50, loss[loss=0.19, simple_loss=0.2808, pruned_loss=0.04955, over 19349.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2838, pruned_loss=0.05704, over 857736.40 frames. ], batch size: 98, lr: 6.83e-03, grad_scale: 32.0 2023-06-15 14:56:22,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=170840.0, ans=0.1 2023-06-15 14:56:24,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=170840.0, ans=0.125 2023-06-15 14:56:29,115 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 14:56:31,046 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.705e+02 1.886e+02 2.163e+02 3.210e+02, threshold=3.772e+02, percent-clipped=0.0 2023-06-15 14:56:39,644 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.74 vs. limit=15.0 2023-06-15 14:57:16,253 INFO [train.py:988] (3/4) Epoch 49, batch 100, loss[loss=0.1968, simple_loss=0.2851, pruned_loss=0.05423, over 18295.00 frames. ], tot_loss[loss=0.1977, simple_loss=0.2823, pruned_loss=0.05656, over 1512098.93 frames. 
], batch size: 74, lr: 6.82e-03, grad_scale: 32.0 2023-06-15 14:57:16,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=171040.0, ans=0.125 2023-06-15 14:57:23,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=171040.0, ans=0.125 2023-06-15 14:57:32,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=171106.66666666666, ans=0.07 2023-06-15 14:57:50,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=171173.33333333334, ans=0.125 2023-06-15 14:57:53,003 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.13 vs. limit=15.0 2023-06-15 14:58:01,711 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.71 vs. limit=22.5 2023-06-15 14:58:05,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=171173.33333333334, ans=0.1 2023-06-15 14:58:19,907 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.22 vs. limit=15.0 2023-06-15 14:58:29,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=171306.66666666666, ans=0.09899494936611666 2023-06-15 14:58:44,424 INFO [train.py:988] (3/4) Epoch 49, batch 150, loss[loss=0.1867, simple_loss=0.28, pruned_loss=0.04664, over 19339.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.2821, pruned_loss=0.05638, over 2011121.76 frames. ], batch size: 98, lr: 6.81e-03, grad_scale: 32.0 2023-06-15 14:58:58,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=171373.33333333334, ans=0.0 2023-06-15 14:59:13,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=171440.0, ans=0.125 2023-06-15 14:59:27,442 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.782e+02 1.923e+02 2.190e+02 3.146e+02, threshold=3.845e+02, percent-clipped=0.0 2023-06-15 15:00:13,683 INFO [train.py:988] (3/4) Epoch 49, batch 200, loss[loss=0.1824, simple_loss=0.2647, pruned_loss=0.05006, over 19346.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.2813, pruned_loss=0.0561, over 2405296.73 frames. ], batch size: 98, lr: 6.81e-03, grad_scale: 32.0 2023-06-15 15:00:23,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=171706.66666666666, ans=0.125 2023-06-15 15:00:38,336 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 15:01:05,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=171906.66666666666, ans=0.1 2023-06-15 15:01:41,436 INFO [train.py:988] (3/4) Epoch 49, batch 250, loss[loss=0.1959, simple_loss=0.2748, pruned_loss=0.05854, over 20352.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.2817, pruned_loss=0.05631, over 2706120.86 frames. 
], batch size: 149, lr: 6.80e-03, grad_scale: 32.0 2023-06-15 15:02:08,604 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.59 vs. limit=15.0 2023-06-15 15:02:22,910 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.787e+02 2.024e+02 2.612e+02 4.231e+02, threshold=4.048e+02, percent-clipped=3.0 2023-06-15 15:02:23,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=172173.33333333334, ans=0.125 2023-06-15 15:02:56,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=172306.66666666666, ans=0.125 2023-06-15 15:03:08,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=172373.33333333334, ans=0.125 2023-06-15 15:03:09,724 INFO [train.py:988] (3/4) Epoch 49, batch 300, loss[loss=0.184, simple_loss=0.2757, pruned_loss=0.04609, over 19786.00 frames. ], tot_loss[loss=0.1978, simple_loss=0.2818, pruned_loss=0.05687, over 2947361.24 frames. ], batch size: 115, lr: 6.80e-03, grad_scale: 32.0 2023-06-15 15:04:38,776 INFO [train.py:988] (3/4) Epoch 49, batch 350, loss[loss=0.2056, simple_loss=0.2842, pruned_loss=0.06354, over 19337.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.2809, pruned_loss=0.05682, over 3148602.01 frames. ], batch size: 98, lr: 6.79e-03, grad_scale: 32.0 2023-06-15 15:04:40,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=172706.66666666666, ans=0.125 2023-06-15 15:04:51,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=172706.66666666666, ans=0.125 2023-06-15 15:05:09,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=172773.33333333334, ans=0.2 2023-06-15 15:05:22,050 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.759e+02 1.916e+02 2.182e+02 3.623e+02, threshold=3.831e+02, percent-clipped=0.0 2023-06-15 15:06:07,873 INFO [train.py:988] (3/4) Epoch 49, batch 400, loss[loss=0.1949, simple_loss=0.2784, pruned_loss=0.0557, over 20103.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2813, pruned_loss=0.05664, over 3283021.48 frames. ], batch size: 133, lr: 6.78e-03, grad_scale: 32.0 2023-06-15 15:06:38,130 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.65 vs. limit=10.0 2023-06-15 15:06:38,549 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-15 15:06:56,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=173173.33333333334, ans=0.0 2023-06-15 15:07:22,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=173306.66666666666, ans=0.125 2023-06-15 15:07:37,456 INFO [train.py:988] (3/4) Epoch 49, batch 450, loss[loss=0.2085, simple_loss=0.292, pruned_loss=0.06248, over 19203.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.2808, pruned_loss=0.05639, over 3405240.80 frames. 
], batch size: 92, lr: 6.78e-03, grad_scale: 32.0 2023-06-15 15:07:39,588 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 15:08:03,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=173440.0, ans=0.125 2023-06-15 15:08:13,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=173506.66666666666, ans=0.125 2023-06-15 15:08:20,122 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.766e+02 1.952e+02 2.304e+02 4.249e+02, threshold=3.904e+02, percent-clipped=1.0 2023-06-15 15:08:25,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff3.min_abs, batch_count=173506.66666666666, ans=0.2 2023-06-15 15:09:00,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=173640.0, ans=0.125 2023-06-15 15:09:04,005 INFO [train.py:988] (3/4) Epoch 49, batch 500, loss[loss=0.1778, simple_loss=0.2651, pruned_loss=0.04527, over 18907.00 frames. ], tot_loss[loss=0.197, simple_loss=0.2813, pruned_loss=0.05638, over 3483580.60 frames. ], batch size: 86, lr: 6.77e-03, grad_scale: 32.0 2023-06-15 15:09:04,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=173706.66666666666, ans=0.2 2023-06-15 15:09:27,549 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 15:09:46,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=173840.0, ans=0.125 2023-06-15 15:09:48,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=173840.0, ans=0.125 2023-06-15 15:09:49,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=173840.0, ans=0.125 2023-06-15 15:09:55,931 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.84 vs. limit=15.0 2023-06-15 15:10:17,332 INFO [train.py:988] (3/4) Epoch 50, batch 0, loss[loss=0.1973, simple_loss=0.2939, pruned_loss=0.05035, over 15618.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2939, pruned_loss=0.05035, over 15618.00 frames. ], batch size: 44, lr: 6.70e-03, grad_scale: 32.0 2023-06-15 15:10:17,333 INFO [train.py:1011] (3/4) Computing validation loss 2023-06-15 15:10:23,502 INFO [train.py:1020] (3/4) Epoch 50, validation: loss=0.202, simple_loss=0.299, pruned_loss=0.05252, over 143649.00 frames. 2023-06-15 15:10:23,503 INFO [train.py:1021] (3/4) Maximum memory allocated so far is 13500MB 2023-06-15 15:10:33,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=173926.66666666666, ans=0.0 2023-06-15 15:10:52,407 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 15:11:05,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=174060.0, ans=0.07 2023-06-15 15:11:21,387 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.43 vs. 
limit=15.0 2023-06-15 15:11:31,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=174193.33333333334, ans=0.1 2023-06-15 15:11:33,962 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.754e+02 1.986e+02 2.353e+02 3.317e+02, threshold=3.972e+02, percent-clipped=0.0 2023-06-15 15:11:50,942 INFO [train.py:988] (3/4) Epoch 50, batch 50, loss[loss=0.1972, simple_loss=0.2714, pruned_loss=0.06151, over 20657.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.2807, pruned_loss=0.05707, over 847957.56 frames. ], batch size: 211, lr: 6.69e-03, grad_scale: 32.0 2023-06-15 15:13:17,580 INFO [train.py:988] (3/4) Epoch 50, batch 100, loss[loss=0.1953, simple_loss=0.2782, pruned_loss=0.05621, over 20117.00 frames. ], tot_loss[loss=0.1943, simple_loss=0.2789, pruned_loss=0.05487, over 1492486.99 frames. ], batch size: 133, lr: 6.69e-03, grad_scale: 32.0 2023-06-15 15:13:24,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=174593.33333333334, ans=0.2 2023-06-15 15:13:38,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=174660.0, ans=0.125 2023-06-15 15:13:56,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=174726.66666666666, ans=0.125 2023-06-15 15:13:58,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=174726.66666666666, ans=0.1 2023-06-15 15:14:05,138 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.19 vs. limit=15.0 2023-06-15 15:14:16,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=174793.33333333334, ans=0.125 2023-06-15 15:14:28,211 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.801e+02 1.997e+02 2.286e+02 3.614e+02, threshold=3.994e+02, percent-clipped=0.0 2023-06-15 15:14:35,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=174860.0, ans=0.2 2023-06-15 15:14:40,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=174860.0, ans=0.125 2023-06-15 15:14:42,647 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.51 vs. limit=6.0 2023-06-15 15:14:43,480 INFO [train.py:988] (3/4) Epoch 50, batch 150, loss[loss=0.2083, simple_loss=0.2986, pruned_loss=0.05901, over 17660.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2799, pruned_loss=0.05491, over 1999927.63 frames. 
], batch size: 67, lr: 6.68e-03, grad_scale: 32.0 2023-06-15 15:15:00,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=174993.33333333334, ans=0.125 2023-06-15 15:15:38,385 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 15:16:05,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=175193.33333333334, ans=0.2 2023-06-15 15:16:09,939 INFO [train.py:988] (3/4) Epoch 50, batch 200, loss[loss=0.1949, simple_loss=0.2809, pruned_loss=0.05447, over 19719.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2798, pruned_loss=0.05498, over 2382981.44 frames. ], batch size: 110, lr: 6.68e-03, grad_scale: 32.0 2023-06-15 15:16:32,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=175326.66666666666, ans=0.0 2023-06-15 15:16:37,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=175326.66666666666, ans=0.04949747468305833 2023-06-15 15:16:55,343 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.74 vs. limit=22.5 2023-06-15 15:16:55,591 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.30 vs. limit=15.0 2023-06-15 15:17:05,201 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.61 vs. limit=22.5 2023-06-15 15:17:23,278 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.770e+02 1.950e+02 2.283e+02 3.292e+02, threshold=3.901e+02, percent-clipped=0.0 2023-06-15 15:17:28,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=175526.66666666666, ans=0.0 2023-06-15 15:17:35,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=175526.66666666666, ans=0.1 2023-06-15 15:17:38,331 INFO [train.py:988] (3/4) Epoch 50, batch 250, loss[loss=0.1859, simple_loss=0.2606, pruned_loss=0.05555, over 20229.00 frames. ], tot_loss[loss=0.1948, simple_loss=0.2795, pruned_loss=0.05507, over 2692311.98 frames. ], batch size: 239, lr: 6.67e-03, grad_scale: 32.0 2023-06-15 15:18:17,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=175726.66666666666, ans=0.125 2023-06-15 15:18:49,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=175860.0, ans=0.0 2023-06-15 15:18:55,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=175860.0, ans=0.125 2023-06-15 15:19:03,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=175860.0, ans=0.1 2023-06-15 15:19:06,262 INFO [train.py:988] (3/4) Epoch 50, batch 300, loss[loss=0.1816, simple_loss=0.2676, pruned_loss=0.04779, over 19226.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2796, pruned_loss=0.05567, over 2934947.55 frames. 
], batch size: 92, lr: 6.66e-03, grad_scale: 16.0 2023-06-15 15:19:08,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=175926.66666666666, ans=0.2 2023-06-15 15:19:12,503 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 15:19:16,180 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-15 15:19:38,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=175993.33333333334, ans=0.0 2023-06-15 15:19:49,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=176060.0, ans=0.0 2023-06-15 15:20:00,047 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.99 vs. limit=15.0 2023-06-15 15:20:10,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=176126.66666666666, ans=0.125 2023-06-15 15:20:14,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=176126.66666666666, ans=0.125 2023-06-15 15:20:21,168 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.741e+02 2.011e+02 2.290e+02 4.002e+02, threshold=4.022e+02, percent-clipped=1.0 2023-06-15 15:20:34,812 INFO [train.py:988] (3/4) Epoch 50, batch 350, loss[loss=0.1881, simple_loss=0.2763, pruned_loss=0.04992, over 19841.00 frames. ], tot_loss[loss=0.1957, simple_loss=0.2803, pruned_loss=0.05557, over 3127891.03 frames. ], batch size: 115, lr: 6.66e-03, grad_scale: 16.0 2023-06-15 15:20:41,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=176260.0, ans=0.125 2023-06-15 15:20:51,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=176326.66666666666, ans=0.0 2023-06-15 15:21:17,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=176393.33333333334, ans=0.125 2023-06-15 15:21:25,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=176460.0, ans=0.125 2023-06-15 15:21:37,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=176460.0, ans=0.125 2023-06-15 15:21:42,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=176460.0, ans=0.125 2023-06-15 15:22:03,797 INFO [train.py:988] (3/4) Epoch 50, batch 400, loss[loss=0.195, simple_loss=0.2812, pruned_loss=0.0544, over 19513.00 frames. ], tot_loss[loss=0.1956, simple_loss=0.2799, pruned_loss=0.05563, over 3269502.59 frames. 
], batch size: 102, lr: 6.65e-03, grad_scale: 32.0 2023-06-15 15:22:09,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=176593.33333333334, ans=0.0 2023-06-15 15:22:18,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=176593.33333333334, ans=0.1 2023-06-15 15:22:28,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=176660.0, ans=0.0 2023-06-15 15:23:05,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=176793.33333333334, ans=0.125 2023-06-15 15:23:17,299 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.749e+02 1.909e+02 2.121e+02 2.900e+02, threshold=3.818e+02, percent-clipped=0.0 2023-06-15 15:23:32,292 INFO [train.py:988] (3/4) Epoch 50, batch 450, loss[loss=0.1892, simple_loss=0.275, pruned_loss=0.05166, over 19108.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2805, pruned_loss=0.05559, over 3381598.23 frames. ], batch size: 94, lr: 6.65e-03, grad_scale: 32.0 2023-06-15 15:23:35,024 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.96 vs. limit=15.0 2023-06-15 15:23:57,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=176993.33333333334, ans=0.0 2023-06-15 15:24:01,216 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.16 vs. limit=12.0 2023-06-15 15:24:09,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=177060.0, ans=0.0 2023-06-15 15:24:19,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=177060.0, ans=0.125 2023-06-15 15:24:57,648 INFO [train.py:988] (3/4) Epoch 50, batch 500, loss[loss=0.2123, simple_loss=0.2915, pruned_loss=0.06652, over 20512.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.2802, pruned_loss=0.05584, over 3490406.91 frames. ], batch size: 160, lr: 6.64e-03, grad_scale: 32.0 2023-06-15 15:25:01,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=177260.0, ans=0.2 2023-06-15 15:25:20,715 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=12.0 2023-06-15 15:25:23,125 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 15:25:41,858 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.66 vs. limit=10.0 2023-06-15 15:25:51,662 INFO [train.py:1201] (3/4) Done!
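
Note on reading the recurring "optim.py:471" entries above (e.g. "Clipping_scale=2.0, grad-norm quartiles ... threshold=..., percent-clipped=..."): they report five quantiles of recently observed gradient norms, a clipping threshold, and the fraction of batches clipped so far. The snippet below is a minimal illustrative sketch of such a quartile-based clipping scheme, written only to make those log fields concrete. It is not the icefall implementation: the class name QuartileGradClipper is hypothetical, and the assumption that the threshold equals clipping_scale times the median of the recent grad-norm history may not match what optim.py actually does.

```python
# Illustrative sketch only (assumed scheme, not icefall's optim.py):
# track recent gradient norms, report their quartiles, and clip when the
# current norm exceeds clipping_scale * median of the history.
from collections import deque

import torch


class QuartileGradClipper:
    def __init__(self, clipping_scale: float = 2.0, history: int = 1000):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=history)  # recent total grad norms
        self.num_batches = 0
        self.num_clipped = 0

    def clip_(self, parameters) -> float:
        params = [p for p in parameters if p.grad is not None]
        total_norm = torch.norm(
            torch.stack([p.grad.detach().norm(2) for p in params]), 2
        ).item()
        self.norms.append(total_norm)
        self.num_batches += 1

        # Five quantiles (min, q1, median, q3, max) of the recent history,
        # analogous to the five numbers printed in the log lines.
        hist = torch.tensor(sorted(self.norms))
        quartiles = [
            hist[int(q * (len(hist) - 1))].item()
            for q in (0.0, 0.25, 0.5, 0.75, 1.0)
        ]

        # Assumption: threshold = clipping_scale * median of recent norms.
        threshold = self.clipping_scale * quartiles[2]
        if total_norm > threshold:
            self.num_clipped += 1
            for p in params:
                p.grad.detach().mul_(threshold / total_norm)

        percent_clipped = 100.0 * self.num_clipped / self.num_batches
        print(
            f"Clipping_scale={self.clipping_scale}, grad-norm quartiles "
            + " ".join(f"{q:.3e}" for q in quartiles)
            + f", threshold={threshold:.3e}, percent-clipped={percent_clipped:.1f}"
        )
        return total_norm
```

Under this sketch, a caller would invoke clipper.clip_(model.parameters()) after loss.backward() and before optimizer.step(); a "percent-clipped=0.0" line then simply means no recent batch exceeded the derived threshold.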