2023-06-15 01:56:32,529 INFO [train.py:1056] (2/4) Training started
2023-06-15 01:56:32,530 INFO [train.py:1066] (2/4) Device: cuda:2
2023-06-15 01:56:32,536 INFO [train.py:1075] (2/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Debug', 'k2-with-cuda': True, 'k2-git-sha1': '38211604d6a24b15f320578a1a38f6c12d7a711c', 'k2-git-date': 'Mon Jun 12 10:59:44 2023', 'lhotse-version': '1.15.0.dev+git.f1fd23d.clean', 'torch-version': '2.0.0+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.8', 'icefall-git-branch': 'ted/zipformer', 'icefall-git-sha1': '323a299-dirty', 'icefall-git-date': 'Tue Jun 13 04:47:15 2023', 'icefall-path': '/exp/draj/jsalt2023/icefall', 'k2-path': '/exp/draj/jsalt2023/k2/k2/python/k2/__init__.py', 'lhotse-path': '/exp/draj/jsalt2023/lhotse/lhotse/__init__.py', 'hostname': 'r2n01', 'IP address': '10.1.2.1'}, 'world_size': 4, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 50, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp/v5'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'base_lr': 0.04, 'lr_batches': 7500, 'lr_epochs': 5.0, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 1, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/manifests'), 'max_duration': 1000, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'blank_id': 0, 'vocab_size': 500}
2023-06-15 01:56:32,537 INFO [train.py:1077] (2/4) About to create model
2023-06-15 01:56:33,454 INFO [train.py:1081] (2/4) Number of model parameters: 65549011
2023-06-15 01:56:46,306 INFO [train.py:1096] (2/4) Using DDP
2023-06-15 01:56:46,643 INFO [asr_datamodule.py:356] (2/4) About to get train cuts
2023-06-15 01:56:46,701 INFO [asr_datamodule.py:185] (2/4) Enable SpecAugment
2023-06-15 01:56:46,701 INFO [asr_datamodule.py:186] (2/4) Time warp factor: 80
2023-06-15 01:56:46,701 INFO [asr_datamodule.py:202] (2/4) About to get Musan cuts
2023-06-15 01:56:46,701 INFO [asr_datamodule.py:205] (2/4) Enable MUSAN
2023-06-15 01:56:48,309 INFO [asr_datamodule.py:227] (2/4) About to create train dataset
2023-06-15 01:56:48,309 INFO [asr_datamodule.py:253] (2/4) Using DynamicBucketingSampler.
2023-06-15 01:56:50,471 INFO [asr_datamodule.py:274] (2/4) About to create train dataloader 2023-06-15 01:56:50,471 INFO [asr_datamodule.py:361] (2/4) About to get dev cuts 2023-06-15 01:56:50,472 INFO [asr_datamodule.py:295] (2/4) About to create dev dataset 2023-06-15 01:56:50,493 INFO [asr_datamodule.py:314] (2/4) About to create dev dataloader 2023-06-15 01:56:50,493 INFO [train.py:1249] (2/4) Sanity check -- see if any of the batches in epoch 1 would cause OOM. 2023-06-15 01:57:37,207 INFO [scaling.py:962] (2/4) Whitening: name=None, num_groups=4, num_channels=128, metric=13.47 vs. limit=3.0 2023-06-15 01:57:37,530 INFO [scaling.py:962] (2/4) Whitening: name=None, num_groups=1, num_channels=256, metric=44.35 vs. limit=5.0 2023-06-15 01:57:37,878 INFO [train.py:1277] (2/4) Maximum memory allocated so far is 8736MB 2023-06-15 01:57:40,147 INFO [train.py:1277] (2/4) Maximum memory allocated so far is 8863MB 2023-06-15 01:57:51,999 INFO [train.py:1277] (2/4) Maximum memory allocated so far is 11507MB 2023-06-15 01:57:58,249 INFO [train.py:1277] (2/4) Maximum memory allocated so far is 11837MB 2023-06-15 01:58:17,635 INFO [train.py:1277] (2/4) Maximum memory allocated so far is 11837MB 2023-06-15 01:58:27,467 INFO [train.py:1277] (2/4) Maximum memory allocated so far is 11988MB 2023-06-15 01:58:50,927 INFO [train.py:988] (2/4) Epoch 1, batch 0, loss[loss=7.847, simple_loss=7.147, pruned_loss=6.981, over 18273.00 frames. ], tot_loss[loss=7.847, simple_loss=7.147, pruned_loss=6.981, over 18273.00 frames. ], batch size: 74, lr: 2.00e-02, grad_scale: 1.0 2023-06-15 01:58:50,928 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 01:58:57,871 INFO [train.py:1020] (2/4) Epoch 1, validation: loss=7.824, simple_loss=7.131, pruned_loss=6.914, over 143649.00 frames. 2023-06-15 01:58:57,871 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 11988MB 2023-06-15 01:59:01,101 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.06 vs. limit=7.5 2023-06-15 01:59:05,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=0.0, ans=0.5 2023-06-15 01:59:09,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=0.0, ans=0.9 2023-06-15 01:59:21,705 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.26 vs. limit=7.5 2023-06-15 01:59:46,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=66.66666666666667, ans=0.1975 2023-06-15 02:00:12,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=133.33333333333334, ans=0.5 2023-06-15 02:00:18,299 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=502.07 vs. limit=7.6 2023-06-15 02:00:27,930 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=279.03 vs. limit=7.55 2023-06-15 02:00:35,890 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=31.64 vs. 
limit=5.05 2023-06-15 02:00:39,833 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=76.79 vs. limit=7.575 2023-06-15 02:01:05,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=266.6666666666667, ans=0.4875 2023-06-15 02:01:12,773 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=116.22 vs. limit=7.7 2023-06-15 02:01:16,037 INFO [train.py:988] (2/4) Epoch 1, batch 50, loss[loss=1.488, simple_loss=1.33, pruned_loss=1.421, over 19971.00 frames. ], tot_loss[loss=3.411, simple_loss=3.141, pruned_loss=2.64, over 871859.79 frames. ], batch size: 126, lr: 2.20e-02, grad_scale: 0.25 2023-06-15 02:01:16,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=333.3333333333333, ans=0.4583333333333333 2023-06-15 02:01:27,363 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=40.62 vs. limit=7.625 2023-06-15 02:01:33,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=333.3333333333333, ans=0.0925 2023-06-15 02:01:33,476 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=59.05 vs. limit=7.75 2023-06-15 02:01:33,737 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=17.34 vs. limit=5.083333333333333 2023-06-15 02:01:41,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=400.0, ans=0.20600000000000002 2023-06-15 02:01:41,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=400.0, ans=0.04875 2023-06-15 02:01:42,371 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=109.83 vs. limit=7.65 2023-06-15 02:01:46,554 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=7.40 vs. limit=4.16 2023-06-15 02:01:58,772 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=206.01 vs. limit=7.65 2023-06-15 02:02:07,317 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=27.51 vs. limit=7.85 2023-06-15 02:02:11,544 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=133.00 vs. limit=7.675 2023-06-15 02:02:23,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=533.3333333333334, ans=0.475 2023-06-15 02:02:24,196 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=120.45 vs. 
limit=7.7 2023-06-15 02:02:26,600 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=499.49 vs. limit=7.7 2023-06-15 02:02:27,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=533.3333333333334, ans=0.08800000000000001 2023-06-15 02:02:52,003 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=15.33 vs. limit=7.725 2023-06-15 02:02:53,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=600.0, ans=0.425 2023-06-15 02:03:07,760 INFO [train.py:988] (2/4) Epoch 1, batch 100, loss[loss=1.214, simple_loss=1.051, pruned_loss=1.303, over 19207.00 frames. ], tot_loss[loss=2.265, simple_loss=2.055, pruned_loss=1.936, over 1525555.96 frames. ], batch size: 92, lr: 2.40e-02, grad_scale: 0.5 2023-06-15 02:03:09,504 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=101.45 vs. limit=5.333333333333333 2023-06-15 02:03:14,162 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.068e+02 1.535e+02 6.068e+02 3.361e+03 1.967e+04, threshold=1.214e+03, percent-clipped=0.0 2023-06-15 02:03:17,582 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=11.82 vs. limit=4.266666666666667 2023-06-15 02:03:31,906 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=81.89 vs. limit=7.775 2023-06-15 02:03:48,966 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=15.42 vs. limit=7.775 2023-06-15 02:04:05,423 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.49 vs. limit=4.32 2023-06-15 02:04:24,887 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=7.825 2023-06-15 02:04:28,452 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=21.62 vs. limit=7.825 2023-06-15 02:04:46,611 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=4.373333333333333 2023-06-15 02:04:51,370 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=353.92 vs. limit=7.85 2023-06-15 02:04:54,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1000.0, ans=0.0775 2023-06-15 02:04:56,163 INFO [train.py:988] (2/4) Epoch 1, batch 150, loss[loss=1.113, simple_loss=0.9518, pruned_loss=1.175, over 18622.00 frames. ], tot_loss[loss=1.795, simple_loss=1.606, pruned_loss=1.632, over 2031428.21 frames. 
], batch size: 80, lr: 2.60e-02, grad_scale: 0.5 2023-06-15 02:05:09,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1000.0, ans=0.453125 2023-06-15 02:05:17,889 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=79.86 vs. limit=7.9 2023-06-15 02:05:22,934 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=20.50 vs. limit=8.3 2023-06-15 02:05:47,967 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.41 vs. limit=4.453333333333333 2023-06-15 02:05:52,598 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=19.48 vs. limit=5.566666666666666 2023-06-15 02:05:52,927 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=64.50 vs. limit=7.925 2023-06-15 02:05:54,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1133.3333333333333, ans=5.283333333333333 2023-06-15 02:05:56,416 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.71 vs. limit=8.35 2023-06-15 02:06:05,074 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=48.92 vs. limit=5.6 2023-06-15 02:06:08,802 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=56.82 vs. limit=7.95 2023-06-15 02:06:24,914 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=77.57 vs. limit=7.975 2023-06-15 02:06:35,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1266.6666666666667, ans=0.035 2023-06-15 02:06:38,471 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.29 vs. limit=5.316666666666666 2023-06-15 02:06:39,384 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=20.74 vs. limit=7.975 2023-06-15 02:06:41,494 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=140.91 vs. limit=7.975 2023-06-15 02:06:44,794 INFO [train.py:988] (2/4) Epoch 1, batch 200, loss[loss=1.023, simple_loss=0.871, pruned_loss=1.026, over 19507.00 frames. ], tot_loss[loss=1.525, simple_loss=1.35, pruned_loss=1.428, over 2411863.73 frames. ], batch size: 102, lr: 2.80e-02, grad_scale: 1.0 2023-06-15 02:06:50,994 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.982e+01 9.369e+01 1.058e+02 1.160e+02 1.378e+03, threshold=2.115e+02, percent-clipped=1.0 2023-06-15 02:06:54,342 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.39 vs. 
limit=8.5 2023-06-15 02:06:54,711 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=32.82 vs. limit=8.0 2023-06-15 02:06:55,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1333.3333333333333, ans=0.4375 2023-06-15 02:07:04,796 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.09 vs. limit=8.55 2023-06-15 02:07:09,040 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.11 vs. limit=8.55 2023-06-15 02:07:13,966 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.66 vs. limit=8.025 2023-06-15 02:07:20,248 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.62 vs. limit=8.55 2023-06-15 02:07:26,519 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=67.03 vs. limit=8.05 2023-06-15 02:07:26,842 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=17.08 vs. limit=8.05 2023-06-15 02:07:28,319 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=29.80 vs. limit=8.05 2023-06-15 02:07:34,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=1466.6666666666667, ans=5.916666666666667 2023-06-15 02:07:51,543 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=16.87 vs. limit=8.075 2023-06-15 02:07:51,700 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=47.15 vs. limit=8.075 2023-06-15 02:08:30,662 INFO [train.py:988] (2/4) Epoch 1, batch 250, loss[loss=0.954, simple_loss=0.8075, pruned_loss=0.9261, over 19959.00 frames. ], tot_loss[loss=1.358, simple_loss=1.191, pruned_loss=1.288, over 2713944.37 frames. ], batch size: 126, lr: 3.00e-02, grad_scale: 1.0 2023-06-15 02:08:31,946 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=81.63 vs. limit=5.833333333333333 2023-06-15 02:08:35,607 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=21.01 vs. limit=8.125 2023-06-15 02:08:46,430 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.67 vs. limit=5.833333333333333 2023-06-15 02:08:48,688 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.62 vs. limit=8.75 2023-06-15 02:08:54,690 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=22.63 vs. 
limit=8.15 2023-06-15 02:08:55,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=1733.3333333333333, ans=6.083333333333333 2023-06-15 02:09:02,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1733.3333333333333, ans=6.083333333333333 2023-06-15 02:09:02,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1733.3333333333333, ans=0.41875 2023-06-15 02:09:16,731 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=46.99 vs. limit=8.175 2023-06-15 02:09:23,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1800.0, ans=0.415625 2023-06-15 02:09:24,342 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.09 vs. limit=8.85 2023-06-15 02:09:30,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1800.0, ans=0.415625 2023-06-15 02:09:30,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1800.0, ans=0.415625 2023-06-15 02:09:32,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1866.6666666666667, ans=0.26666666666666666 2023-06-15 02:09:36,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1866.6666666666667, ans=0.4125 2023-06-15 02:09:45,583 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=26.04 vs. limit=8.2 2023-06-15 02:09:49,632 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=23.87 vs. limit=8.2 2023-06-15 02:09:51,655 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=21.50 vs. limit=8.2 2023-06-15 02:10:01,001 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=37.54 vs. limit=8.225 2023-06-15 02:10:08,019 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=14.10 vs. limit=8.225 2023-06-15 02:10:11,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1933.3333333333333, ans=8.225 2023-06-15 02:10:13,111 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=44.49 vs. limit=8.225 2023-06-15 02:10:16,474 INFO [train.py:988] (2/4) Epoch 1, batch 300, loss[loss=0.9536, simple_loss=0.7988, pruned_loss=0.9129, over 18618.00 frames. ], tot_loss[loss=1.244, simple_loss=1.082, pruned_loss=1.182, over 2955442.91 frames. 
], batch size: 80, lr: 3.20e-02, grad_scale: 2.0 2023-06-15 02:10:20,000 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=27.23 vs. limit=9.0 2023-06-15 02:10:21,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=2000.0, ans=0.055 2023-06-15 02:10:21,967 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=236.78 vs. limit=8.25 2023-06-15 02:10:22,488 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.963e+01 1.052e+02 1.248e+02 1.628e+02 2.864e+02, threshold=2.496e+02, percent-clipped=3.0 2023-06-15 02:10:31,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=2000.0, ans=0.043750000000000004 2023-06-15 02:10:31,585 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=19.22 vs. limit=8.25 2023-06-15 02:10:44,798 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=17.63 vs. limit=5.516666666666667 2023-06-15 02:10:46,332 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.90 vs. limit=5.516666666666667 2023-06-15 02:11:01,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=2133.3333333333335, ans=0.08666666666666667 2023-06-15 02:11:15,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2133.3333333333335, ans=0.8253333333333334 2023-06-15 02:11:31,030 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=16.40 vs. limit=5.55 2023-06-15 02:11:58,913 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.01 vs. limit=9.2 2023-06-15 02:12:02,172 INFO [train.py:988] (2/4) Epoch 1, batch 350, loss[loss=0.907, simple_loss=0.7558, pruned_loss=0.8435, over 19061.00 frames. ], tot_loss[loss=1.164, simple_loss=1.004, pruned_loss=1.101, over 3127421.45 frames. ], batch size: 89, lr: 3.40e-02, grad_scale: 2.0 2023-06-15 02:12:02,620 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=1.044e-01 2023-06-15 02:12:06,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2333.3333333333335, ans=0.1125 2023-06-15 02:12:15,826 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=33.80 vs. limit=8.375 2023-06-15 02:12:19,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=2333.3333333333335, ans=9.25 2023-06-15 02:12:21,472 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.66 vs. 
limit=9.3 2023-06-15 02:12:28,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2400.0, ans=0.5 2023-06-15 02:12:33,132 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=22.37 vs. limit=8.4 2023-06-15 02:12:50,147 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=18.02 vs. limit=8.425 2023-06-15 02:12:51,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2466.6666666666665, ans=0.1075 2023-06-15 02:13:03,780 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.10 vs. limit=9.4 2023-06-15 02:13:05,348 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=24.72 vs. limit=8.45 2023-06-15 02:13:13,350 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=28.18 vs. limit=8.45 2023-06-15 02:13:19,782 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.84 vs. limit=8.45 2023-06-15 02:13:27,777 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=14.02 vs. limit=8.475 2023-06-15 02:13:38,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2600.0, ans=0.27399999999999997 2023-06-15 02:13:38,767 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.78 vs. limit=8.475 2023-06-15 02:13:46,998 INFO [train.py:988] (2/4) Epoch 1, batch 400, loss[loss=1.006, simple_loss=0.8402, pruned_loss=0.8923, over 15956.00 frames. ], tot_loss[loss=1.102, simple_loss=0.9443, pruned_loss=1.031, over 3271381.61 frames. ], batch size: 51, lr: 3.60e-02, grad_scale: 4.0 2023-06-15 02:13:47,925 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.34 vs. limit=8.5 2023-06-15 02:13:52,979 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 9.424e+01 1.333e+02 1.553e+02 2.129e+02 3.991e+02, threshold=3.107e+02, percent-clipped=15.0 2023-06-15 02:14:10,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=2733.3333333333335, ans=6.366666666666667 2023-06-15 02:14:18,522 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.52 vs. limit=9.55 2023-06-15 02:14:49,127 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.95 vs. limit=9.65 2023-06-15 02:14:59,566 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.39 vs. 
limit=9.65 2023-06-15 02:15:02,068 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.98 vs. limit=9.65 2023-06-15 02:15:13,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2933.3333333333335, ans=0.08999999999999998 2023-06-15 02:15:31,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=3000.0, ans=0.359375 2023-06-15 02:15:33,097 INFO [train.py:988] (2/4) Epoch 1, batch 450, loss[loss=0.9155, simple_loss=0.7669, pruned_loss=0.7762, over 19209.00 frames. ], tot_loss[loss=1.056, simple_loss=0.9004, pruned_loss=0.9719, over 3389999.77 frames. ], batch size: 92, lr: 3.80e-02, grad_scale: 4.0 2023-06-15 02:15:45,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=3000.0, ans=0.795 2023-06-15 02:16:01,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=3066.6666666666665, ans=0.2693333333333333 2023-06-15 02:16:12,850 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=9.92 vs. limit=8.675 2023-06-15 02:16:36,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=3200.0, ans=0.35 2023-06-15 02:16:42,968 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=20.93 vs. limit=8.7 2023-06-15 02:16:43,093 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.52 vs. limit=9.9 2023-06-15 02:16:48,589 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.33 vs. limit=6.6 2023-06-15 02:16:58,126 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=8.725 2023-06-15 02:17:00,470 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.49 vs. limit=8.725 2023-06-15 02:17:01,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=3266.6666666666665, ans=0.03979166666666667 2023-06-15 02:17:11,846 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=10.78 vs. limit=10.0 2023-06-15 02:17:12,968 INFO [train.py:988] (2/4) Epoch 1, batch 500, loss[loss=0.8409, simple_loss=0.7134, pruned_loss=0.6662, over 20261.00 frames. ], tot_loss[loss=1.009, simple_loss=0.8588, pruned_loss=0.908, over 3485113.68 frames. ], batch size: 141, lr: 4.00e-02, grad_scale: 8.0 2023-06-15 02:17:15,902 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.15 vs. 
limit=8.75 2023-06-15 02:17:18,741 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 9.868e+01 1.531e+02 1.902e+02 2.695e+02 6.993e+02, threshold=3.804e+02, percent-clipped=16.0 2023-06-15 02:17:25,504 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.10 vs. limit=10.0 2023-06-15 02:17:34,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=3400.0, ans=0.340625 2023-06-15 02:17:34,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=3400.0, ans=0.023499999999999993 2023-06-15 02:17:41,536 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=8.775 2023-06-15 02:17:46,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=3400.0, ans=0.07500000000000001 2023-06-15 02:17:56,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=3466.6666666666665, ans=0.7786666666666667 2023-06-15 02:18:02,049 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.09 vs. limit=8.8 2023-06-15 02:18:08,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=3533.3333333333335, ans=0.26466666666666666 2023-06-15 02:18:43,108 INFO [train.py:988] (2/4) Epoch 2, batch 0, loss[loss=0.9071, simple_loss=0.7732, pruned_loss=0.6969, over 16789.00 frames. ], tot_loss[loss=0.9071, simple_loss=0.7732, pruned_loss=0.6969, over 16789.00 frames. ], batch size: 59, lr: 3.96e-02, grad_scale: 16.0 2023-06-15 02:18:43,108 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 02:18:49,230 INFO [train.py:1020] (2/4) Epoch 2, validation: loss=0.7911, simple_loss=0.6884, pruned_loss=0.5718, over 143649.00 frames. 2023-06-15 02:18:49,231 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 02:19:13,502 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.97 vs. limit=10.215 2023-06-15 02:19:17,140 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.99 vs. limit=10.215 2023-06-15 02:19:22,939 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.50 vs. limit=3.543 2023-06-15 02:19:31,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=3686.6666666666665, ans=0.7868666666666666 2023-06-15 02:19:36,096 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.64 vs. 
limit=8.8825 2023-06-15 02:19:39,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=3686.6666666666665, ans=0.32718749999999996 2023-06-15 02:19:57,337 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=10.35 vs. limit=6.876666666666667 2023-06-15 02:20:28,445 INFO [train.py:988] (2/4) Epoch 2, batch 50, loss[loss=0.7353, simple_loss=0.6351, pruned_loss=0.5319, over 20311.00 frames. ], tot_loss[loss=0.7729, simple_loss=0.6631, pruned_loss=0.576, over 851335.20 frames. ], batch size: 149, lr: 3.95e-02, grad_scale: 16.0 2023-06-15 02:20:36,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=3886.6666666666665, ans=0.012550000000000006 2023-06-15 02:20:58,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=3953.3333333333335, ans=0.3146875 2023-06-15 02:21:07,474 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 2.379e+02 3.485e+02 5.608e+02 1.271e+03, threshold=6.971e+02, percent-clipped=46.0 2023-06-15 02:21:29,108 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.53 vs. limit=6.0216666666666665 2023-06-15 02:22:06,309 INFO [train.py:988] (2/4) Epoch 2, batch 100, loss[loss=0.7284, simple_loss=0.6343, pruned_loss=0.5052, over 19104.00 frames. ], tot_loss[loss=0.748, simple_loss=0.6455, pruned_loss=0.5423, over 1502903.21 frames. ], batch size: 94, lr: 3.95e-02, grad_scale: 8.0 2023-06-15 02:22:08,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=4220.0, ans=0.7523 2023-06-15 02:22:12,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=4220.0, ans=0.3021875 2023-06-15 02:22:17,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=4220.0, ans=0.3021875 2023-06-15 02:22:26,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=4286.666666666667, ans=0.00993768115942029 2023-06-15 02:22:49,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=4353.333333333333, ans=0.29593749999999996 2023-06-15 02:23:01,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=4353.333333333333, ans=0.036395833333333336 2023-06-15 02:23:04,296 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.49 vs. limit=6.088333333333333 2023-06-15 02:23:07,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=4420.0, ans=0.07237500000000001 2023-06-15 02:23:09,729 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.75 vs. limit=9.1575 2023-06-15 02:23:45,693 INFO [train.py:988] (2/4) Epoch 2, batch 150, loss[loss=0.6661, simple_loss=0.5892, pruned_loss=0.4355, over 19196.00 frames. 
], tot_loss[loss=0.7197, simple_loss=0.6252, pruned_loss=0.5072, over 2006342.73 frames. ], batch size: 92, lr: 3.95e-02, grad_scale: 8.0 2023-06-15 02:23:47,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=4553.333333333333, ans=0.04769444444444445 2023-06-15 02:24:15,463 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=10.67 vs. limit=10.965 2023-06-15 02:24:25,605 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.747e+02 2.805e+02 4.216e+02 6.424e+02 2.276e+03, threshold=8.432e+02, percent-clipped=19.0 2023-06-15 02:24:29,746 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=4.451e+00 2023-06-15 02:24:35,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=4686.666666666667, ans=0.28031249999999996 2023-06-15 02:25:20,897 INFO [train.py:988] (2/4) Epoch 2, batch 200, loss[loss=0.66, simple_loss=0.5858, pruned_loss=0.4222, over 19702.00 frames. ], tot_loss[loss=0.6984, simple_loss=0.6104, pruned_loss=0.4793, over 2392840.48 frames. ], batch size: 110, lr: 3.95e-02, grad_scale: 8.0 2023-06-15 02:25:31,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=4886.666666666667, ans=0.04630555555555556 2023-06-15 02:25:31,579 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.68 vs. limit=9.3325 2023-06-15 02:26:22,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=5086.666666666667, ans=0.0 2023-06-15 02:26:34,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=5086.666666666667, ans=0.26156250000000003 2023-06-15 02:26:56,719 INFO [train.py:988] (2/4) Epoch 2, batch 250, loss[loss=0.5735, simple_loss=0.516, pruned_loss=0.3505, over 20566.00 frames. ], tot_loss[loss=0.6763, simple_loss=0.5947, pruned_loss=0.4525, over 2696988.25 frames. 
], batch size: 173, lr: 3.95e-02, grad_scale: 8.0 2023-06-15 02:27:01,617 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 02:27:16,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=5286.666666666667, ans=0.24713333333333332 2023-06-15 02:27:18,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=5286.666666666667, ans=0.2521875 2023-06-15 02:27:28,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=5286.666666666667, ans=0.2521875 2023-06-15 02:27:37,763 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.603e+02 2.822e+02 4.791e+02 8.752e+02 2.397e+03, threshold=9.582e+02, percent-clipped=28.0 2023-06-15 02:28:12,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=5486.666666666667, ans=0.24513333333333331 2023-06-15 02:28:12,644 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.17 vs. limit=7.743333333333334 2023-06-15 02:28:14,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=5486.666666666667, ans=0.2428125 2023-06-15 02:28:32,183 INFO [train.py:988] (2/4) Epoch 2, batch 300, loss[loss=0.5619, simple_loss=0.5029, pruned_loss=0.3447, over 20299.00 frames. ], tot_loss[loss=0.6554, simple_loss=0.5795, pruned_loss=0.4285, over 2942099.93 frames. ], batch size: 239, lr: 3.95e-02, grad_scale: 8.0 2023-06-15 02:28:32,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=5553.333333333333, ans=0.24446666666666667 2023-06-15 02:29:04,768 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 02:29:27,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=5753.333333333333, ans=0.23031249999999998 2023-06-15 02:29:47,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=5820.0, ans=0.04241666666666667 2023-06-15 02:29:57,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=5820.0, ans=0.6963 2023-06-15 02:30:00,094 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.21 vs. limit=7.91 2023-06-15 02:30:06,982 INFO [train.py:988] (2/4) Epoch 2, batch 350, loss[loss=0.5557, simple_loss=0.4948, pruned_loss=0.3421, over 20278.00 frames. ], tot_loss[loss=0.6349, simple_loss=0.5647, pruned_loss=0.4061, over 3111909.97 frames. 
], batch size: 239, lr: 3.95e-02, grad_scale: 8.0 2023-06-15 02:30:13,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=5886.666666666667, ans=0.19113333333333332 2023-06-15 02:30:17,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=5886.666666666667, ans=0.2240625 2023-06-15 02:30:30,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=5953.333333333333, ans=0.2209375 2023-06-15 02:30:46,021 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.693e+02 3.241e+02 6.004e+02 9.437e+02 1.791e+03, threshold=1.201e+03, percent-clipped=24.0 2023-06-15 02:31:05,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=6086.666666666667, ans=0.21468749999999998 2023-06-15 02:31:20,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=6153.333333333333, ans=0.009531884057971014 2023-06-15 02:31:36,748 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 02:31:40,063 INFO [train.py:988] (2/4) Epoch 2, batch 400, loss[loss=0.5398, simple_loss=0.4878, pruned_loss=0.3189, over 20231.00 frames. ], tot_loss[loss=0.6169, simple_loss=0.5519, pruned_loss=0.3859, over 3264752.35 frames. ], batch size: 239, lr: 3.95e-02, grad_scale: 16.0 2023-06-15 02:31:44,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=6220.0, ans=0.2084375 2023-06-15 02:31:48,664 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.35 vs. limit=9.8325 2023-06-15 02:31:58,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.whiten.whitening_limit, batch_count=6286.666666666667, ans=6.514666666666667 2023-06-15 02:32:05,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=6286.666666666667, ans=0.04047222222222222 2023-06-15 02:32:11,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=6286.666666666667, ans=0.2053125 2023-06-15 02:32:23,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=6353.333333333333, ans=0.6776333333333333 2023-06-15 02:32:37,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=6420.0, ans=0.2358 2023-06-15 02:32:54,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=6486.666666666667, ans=0.009459420289855072 2023-06-15 02:32:58,042 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=4.330e-03 2023-06-15 02:33:13,457 INFO [train.py:988] (2/4) Epoch 2, batch 450, loss[loss=0.5575, simple_loss=0.5168, pruned_loss=0.3093, over 19227.00 frames. ], tot_loss[loss=0.6012, simple_loss=0.5408, pruned_loss=0.3687, over 3374980.93 frames. 
], batch size: 92, lr: 3.94e-02, grad_scale: 8.0 2023-06-15 02:33:15,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=6553.333333333333, ans=0.1928125 2023-06-15 02:33:32,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=6620.0, ans=0.0 2023-06-15 02:33:46,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=6620.0, ans=0.6683 2023-06-15 02:33:52,082 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.37 vs. limit=6.674666666666667 2023-06-15 02:33:54,732 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.791e+02 3.705e+02 5.386e+02 8.050e+02 1.837e+03, threshold=1.077e+03, percent-clipped=8.0 2023-06-15 02:34:20,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=6753.333333333333, ans=0.03852777777777778 2023-06-15 02:34:27,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=6820.0, ans=0.07 2023-06-15 02:34:43,342 INFO [train.py:988] (2/4) Epoch 2, batch 500, loss[loss=0.5304, simple_loss=0.4946, pruned_loss=0.2895, over 19546.00 frames. ], tot_loss[loss=0.5865, simple_loss=0.5306, pruned_loss=0.3526, over 3461863.90 frames. ], batch size: 102, lr: 3.94e-02, grad_scale: 8.0 2023-06-15 02:35:01,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=6953.333333333333, ans=0.6566333333333334 2023-06-15 02:35:21,001 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.95 vs. limit=6.808 2023-06-15 02:35:27,357 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=1.649e-01 2023-06-15 02:35:34,205 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.05 vs. limit=10.1575 2023-06-15 02:35:57,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=7100.0, ans=0.037083333333333336 2023-06-15 02:36:00,641 INFO [train.py:988] (2/4) Epoch 3, batch 0, loss[loss=0.5116, simple_loss=0.4618, pruned_loss=0.2981, over 19767.00 frames. ], tot_loss[loss=0.5116, simple_loss=0.4618, pruned_loss=0.2981, over 19767.00 frames. ], batch size: 293, lr: 3.84e-02, grad_scale: 16.0 2023-06-15 02:36:00,641 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 02:36:06,798 INFO [train.py:1020] (2/4) Epoch 3, validation: loss=0.4219, simple_loss=0.4383, pruned_loss=0.1731, over 143649.00 frames. 2023-06-15 02:36:06,798 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 02:36:16,063 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.47 vs. 
limit=12.825 2023-06-15 02:36:37,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=7166.666666666667, ans=0.1640625 2023-06-15 02:36:37,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=7166.666666666667, ans=0.22833333333333333 2023-06-15 02:36:42,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=7233.333333333333, ans=0.22766666666666668 2023-06-15 02:36:47,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=7233.333333333333, ans=0.22766666666666668 2023-06-15 02:36:56,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=7233.333333333333, ans=0.0 2023-06-15 02:37:05,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=7300.0, ans=0.22699999999999998 2023-06-15 02:37:12,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=7300.0, ans=0.15781250000000002 2023-06-15 02:37:17,505 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.758e+02 3.334e+02 5.511e+02 7.843e+02 1.620e+03, threshold=1.102e+03, percent-clipped=11.0 2023-06-15 02:37:36,278 INFO [train.py:988] (2/4) Epoch 3, batch 50, loss[loss=0.5451, simple_loss=0.5055, pruned_loss=0.3, over 19862.00 frames. ], tot_loss[loss=0.522, simple_loss=0.4856, pruned_loss=0.2857, over 843424.79 frames. ], batch size: 120, lr: 3.83e-02, grad_scale: 8.0 2023-06-15 02:38:03,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=7500.0, ans=0.035416666666666666 2023-06-15 02:38:07,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=7500.0, ans=0.035416666666666666 2023-06-15 02:38:09,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=7500.0, ans=0.3125 2023-06-15 02:38:11,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=7566.666666666667, ans=0.6351666666666667 2023-06-15 02:38:15,511 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.23 vs. limit=13.175 2023-06-15 02:38:36,143 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.53 vs. limit=10.3625 2023-06-15 02:39:06,061 INFO [train.py:988] (2/4) Epoch 3, batch 100, loss[loss=0.4805, simple_loss=0.4509, pruned_loss=0.2578, over 20555.00 frames. ], tot_loss[loss=0.5137, simple_loss=0.4807, pruned_loss=0.2776, over 1502690.79 frames. 
], batch size: 173, lr: 3.83e-02, grad_scale: 8.0 2023-06-15 02:39:26,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=7833.333333333333, ans=0.1328125 2023-06-15 02:39:28,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=7833.333333333333, ans=0.03402777777777778 2023-06-15 02:39:35,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=7833.333333333333, ans=0.6258333333333334 2023-06-15 02:39:47,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=7900.0, ans=0.009152173913043479 2023-06-15 02:39:52,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=7900.0, ans=0.0 2023-06-15 02:40:19,696 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.660e+02 2.598e+02 5.024e+02 8.933e+02 2.174e+03, threshold=1.005e+03, percent-clipped=11.0 2023-06-15 02:40:27,332 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.95 vs. limit=7.008333333333333 2023-06-15 02:40:29,768 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.65 vs. limit=13.525 2023-06-15 02:40:32,961 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.63 vs. limit=13.525 2023-06-15 02:40:35,851 INFO [train.py:988] (2/4) Epoch 3, batch 150, loss[loss=0.5331, simple_loss=0.513, pruned_loss=0.2721, over 18323.00 frames. ], tot_loss[loss=0.5076, simple_loss=0.4772, pruned_loss=0.2716, over 2020783.28 frames. ], batch size: 72, lr: 3.83e-02, grad_scale: 8.0 2023-06-15 02:40:50,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=8100.0, ans=0.125 2023-06-15 02:41:06,277 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.10 vs. limit=13.625 2023-06-15 02:41:10,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=8233.333333333334, ans=0.6118333333333335 2023-06-15 02:41:19,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=8233.333333333334, ans=0.009079710144927537 2023-06-15 02:41:26,013 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=10.5875 2023-06-15 02:41:45,254 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=10.6375 2023-06-15 02:42:01,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=8366.666666666666, ans=0.125 2023-06-15 02:42:04,804 INFO [train.py:988] (2/4) Epoch 3, batch 200, loss[loss=0.4484, simple_loss=0.4337, pruned_loss=0.2268, over 20533.00 frames. ], tot_loss[loss=0.4994, simple_loss=0.4715, pruned_loss=0.2649, over 2424245.59 frames. 
], batch size: 160, lr: 3.83e-02, grad_scale: 8.0 2023-06-15 02:42:17,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=8433.333333333334, ans=0.009036231884057971 2023-06-15 02:42:33,676 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.83 vs. limit=10.6875 2023-06-15 02:42:54,184 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.47 vs. limit=10.712499999999999 2023-06-15 02:43:03,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=8633.333333333334, ans=0.21366666666666667 2023-06-15 02:43:16,971 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 3.306e+02 5.233e+02 8.261e+02 1.948e+03, threshold=1.047e+03, percent-clipped=15.0 2023-06-15 02:43:17,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=8700.0, ans=0.008978260869565217 2023-06-15 02:43:33,618 INFO [train.py:988] (2/4) Epoch 3, batch 250, loss[loss=0.4747, simple_loss=0.4528, pruned_loss=0.2469, over 20601.00 frames. ], tot_loss[loss=0.494, simple_loss=0.4685, pruned_loss=0.2597, over 2713892.02 frames. ], batch size: 189, lr: 3.83e-02, grad_scale: 8.0 2023-06-15 02:43:45,506 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.25 vs. limit=10.7875 2023-06-15 02:43:58,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=8833.333333333334, ans=0.02986111111111111 2023-06-15 02:44:17,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=8900.0, ans=0.125 2023-06-15 02:44:20,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=8900.0, ans=0.025 2023-06-15 02:44:55,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=9033.333333333334, ans=0.20966666666666667 2023-06-15 02:45:03,142 INFO [train.py:988] (2/4) Epoch 3, batch 300, loss[loss=0.4403, simple_loss=0.434, pruned_loss=0.2162, over 18796.00 frames. ], tot_loss[loss=0.4893, simple_loss=0.4663, pruned_loss=0.2552, over 2942429.20 frames. ], batch size: 83, lr: 3.82e-02, grad_scale: 8.0 2023-06-15 02:45:03,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=9100.0, ans=0.008891304347826087 2023-06-15 02:45:29,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=9166.666666666666, ans=0.125 2023-06-15 02:45:31,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=9166.666666666666, ans=0.008876811594202899 2023-06-15 02:45:57,091 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 02:45:59,279 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.43 vs. 
limit=10.9875 2023-06-15 02:46:00,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=9300.0, ans=0.02791666666666667 2023-06-15 02:46:16,354 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.842e+02 2.844e+02 4.784e+02 7.568e+02 1.827e+03, threshold=9.568e+02, percent-clipped=19.0 2023-06-15 02:46:16,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=9366.666666666666, ans=0.125 2023-06-15 02:46:23,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=9366.666666666666, ans=0.125 2023-06-15 02:46:32,412 INFO [train.py:988] (2/4) Epoch 3, batch 350, loss[loss=0.449, simple_loss=0.4314, pruned_loss=0.231, over 20286.00 frames. ], tot_loss[loss=0.4823, simple_loss=0.4613, pruned_loss=0.25, over 3143063.49 frames. ], batch size: 149, lr: 3.82e-02, grad_scale: 8.0 2023-06-15 02:47:36,782 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.96 vs. limit=11.1125 2023-06-15 02:47:46,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=9700.0, ans=0.203 2023-06-15 02:48:02,108 INFO [train.py:988] (2/4) Epoch 3, batch 400, loss[loss=0.4325, simple_loss=0.4309, pruned_loss=0.21, over 19810.00 frames. ], tot_loss[loss=0.4759, simple_loss=0.4577, pruned_loss=0.2446, over 3293488.47 frames. ], batch size: 115, lr: 3.82e-02, grad_scale: 16.0 2023-06-15 02:48:10,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=9766.666666666666, ans=0.125 2023-06-15 02:48:21,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=9833.333333333334, ans=0.20166666666666666 2023-06-15 02:48:35,563 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.60 vs. limit=11.1875 2023-06-15 02:49:02,668 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.63 vs. limit=9.983333333333333 2023-06-15 02:49:15,710 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.932e+02 2.793e+02 4.517e+02 6.574e+02 1.219e+03, threshold=9.033e+02, percent-clipped=7.0 2023-06-15 02:49:31,992 INFO [train.py:988] (2/4) Epoch 3, batch 450, loss[loss=0.4436, simple_loss=0.4382, pruned_loss=0.2192, over 18943.00 frames. ], tot_loss[loss=0.4718, simple_loss=0.4555, pruned_loss=0.2411, over 3403655.07 frames. ], batch size: 86, lr: 3.82e-02, grad_scale: 16.0 2023-06-15 02:49:52,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=10166.666666666666, ans=0.02430555555555556 2023-06-15 02:50:26,143 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.32 vs. 
limit=11.3625 2023-06-15 02:50:44,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=10366.666666666666, ans=0.0 2023-06-15 02:50:59,009 INFO [train.py:988] (2/4) Epoch 3, batch 500, loss[loss=0.4747, simple_loss=0.4796, pruned_loss=0.2273, over 16615.00 frames. ], tot_loss[loss=0.465, simple_loss=0.4509, pruned_loss=0.2363, over 3502035.86 frames. ], batch size: 59, lr: 3.81e-02, grad_scale: 16.0 2023-06-15 02:51:30,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=10566.666666666666, ans=0.10225833333333335 2023-06-15 02:51:43,991 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.89 vs. limit=11.4625 2023-06-15 02:51:47,313 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.82 vs. limit=11.4875 2023-06-15 02:52:16,085 INFO [train.py:988] (2/4) Epoch 4, batch 0, loss[loss=0.4678, simple_loss=0.4618, pruned_loss=0.2326, over 19111.00 frames. ], tot_loss[loss=0.4678, simple_loss=0.4618, pruned_loss=0.2326, over 19111.00 frames. ], batch size: 94, lr: 3.66e-02, grad_scale: 32.0 2023-06-15 02:52:16,085 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 02:52:22,222 INFO [train.py:1020] (2/4) Epoch 4, validation: loss=0.3338, simple_loss=0.3946, pruned_loss=0.1182, over 143649.00 frames. 2023-06-15 02:52:22,222 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 02:52:26,868 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.10 vs. limit=15.485 2023-06-15 02:52:38,302 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 2.821e+02 4.565e+02 6.318e+02 1.774e+03, threshold=9.130e+02, percent-clipped=10.0 2023-06-15 02:53:27,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=10846.666666666666, ans=0.19153333333333333 2023-06-15 02:53:30,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=10846.666666666666, ans=11.567499999999999 2023-06-15 02:53:39,646 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=11.592500000000001 2023-06-15 02:53:49,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=10913.333333333334, ans=0.125 2023-06-15 02:53:52,043 INFO [train.py:988] (2/4) Epoch 4, batch 50, loss[loss=0.4326, simple_loss=0.4376, pruned_loss=0.2083, over 12570.00 frames. ], tot_loss[loss=0.4388, simple_loss=0.4358, pruned_loss=0.2165, over 838396.20 frames. ], batch size: 35, lr: 3.66e-02, grad_scale: 16.0 2023-06-15 02:54:22,559 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.45 vs. 
limit=11.6425 2023-06-15 02:54:47,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=11180.0, ans=0.125 2023-06-15 02:55:13,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=11246.666666666666, ans=0.5063666666666667 2023-06-15 02:55:23,236 INFO [train.py:988] (2/4) Epoch 4, batch 100, loss[loss=0.4286, simple_loss=0.4335, pruned_loss=0.2073, over 19679.00 frames. ], tot_loss[loss=0.4339, simple_loss=0.4322, pruned_loss=0.2136, over 1513486.30 frames. ], batch size: 110, lr: 3.66e-02, grad_scale: 16.0 2023-06-15 02:55:23,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=11313.333333333334, ans=0.019527777777777776 2023-06-15 02:55:29,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=11313.333333333334, ans=0.18686666666666668 2023-06-15 02:55:37,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=11313.333333333334, ans=0.008410144927536231 2023-06-15 02:55:42,250 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 3.113e+02 4.609e+02 7.608e+02 1.612e+03, threshold=9.219e+02, percent-clipped=13.0 2023-06-15 02:55:42,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=11380.0, ans=0.008395652173913044 2023-06-15 02:55:55,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=11380.0, ans=0.125 2023-06-15 02:56:13,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=11446.666666666666, ans=0.035 2023-06-15 02:56:18,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=11513.333333333334, ans=0.18486666666666668 2023-06-15 02:56:53,626 INFO [train.py:988] (2/4) Epoch 4, batch 150, loss[loss=0.4284, simple_loss=0.4558, pruned_loss=0.1937, over 18327.00 frames. ], tot_loss[loss=0.4321, simple_loss=0.4345, pruned_loss=0.2106, over 2011815.39 frames. ], batch size: 72, lr: 3.66e-02, grad_scale: 16.0 2023-06-15 02:57:04,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=11646.666666666666, ans=0.125 2023-06-15 02:57:21,885 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.34 vs. limit=16.285 2023-06-15 02:57:32,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=11780.0, ans=0.09899494936611666 2023-06-15 02:58:09,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=11913.333333333334, ans=0.0 2023-06-15 02:58:16,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=11913.333333333334, ans=0.125 2023-06-15 02:58:23,380 INFO [train.py:988] (2/4) Epoch 4, batch 200, loss[loss=0.4353, simple_loss=0.4415, pruned_loss=0.2115, over 19331.00 frames. ], tot_loss[loss=0.4284, simple_loss=0.4323, pruned_loss=0.2082, over 2405828.55 frames. 
], batch size: 98, lr: 3.65e-02, grad_scale: 16.0 2023-06-15 02:58:30,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=11980.0, ans=0.0 2023-06-15 02:58:37,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=11980.0, ans=0.18019999999999997 2023-06-15 02:58:40,780 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.791e+02 4.180e+02 6.641e+02 1.358e+03, threshold=8.360e+02, percent-clipped=6.0 2023-06-15 02:58:42,039 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.03 vs. limit=8.011666666666667 2023-06-15 02:58:52,630 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 02:58:57,395 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.63 vs. limit=16.535 2023-06-15 02:59:04,078 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 02:59:07,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=12113.333333333334, ans=0.05 2023-06-15 02:59:38,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=12246.666666666666, ans=0.125 2023-06-15 02:59:48,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=12246.666666666666, ans=0.125 2023-06-15 02:59:55,830 INFO [train.py:988] (2/4) Epoch 4, batch 250, loss[loss=0.4496, simple_loss=0.4532, pruned_loss=0.2209, over 18801.00 frames. ], tot_loss[loss=0.4265, simple_loss=0.4306, pruned_loss=0.2076, over 2718963.08 frames. ], batch size: 83, lr: 3.65e-02, grad_scale: 16.0 2023-06-15 03:00:30,606 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.27 vs. limit=5.0 2023-06-15 03:00:33,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=12446.666666666666, ans=0.008163768115942029 2023-06-15 03:00:58,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=12513.333333333334, ans=0.46203333333333335 2023-06-15 03:01:26,973 INFO [train.py:988] (2/4) Epoch 4, batch 300, loss[loss=0.4204, simple_loss=0.4322, pruned_loss=0.2025, over 19441.00 frames. ], tot_loss[loss=0.4228, simple_loss=0.4295, pruned_loss=0.2049, over 2952144.04 frames. ], batch size: 105, lr: 3.65e-02, grad_scale: 16.0 2023-06-15 03:01:39,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=12646.666666666666, ans=0.013972222222222226 2023-06-15 03:01:44,411 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.894e+02 2.980e+02 4.845e+02 6.504e+02 1.050e+03, threshold=9.691e+02, percent-clipped=10.0 2023-06-15 03:02:00,570 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.33 vs. 
limit=12.2675 2023-06-15 03:02:29,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=12846.666666666666, ans=0.125 2023-06-15 03:02:46,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=12913.333333333334, ans=0.17086666666666667 2023-06-15 03:02:59,273 INFO [train.py:988] (2/4) Epoch 4, batch 350, loss[loss=0.4122, simple_loss=0.4153, pruned_loss=0.2039, over 20276.00 frames. ], tot_loss[loss=0.4174, simple_loss=0.4263, pruned_loss=0.2016, over 3162504.20 frames. ], batch size: 239, lr: 3.64e-02, grad_scale: 16.0 2023-06-15 03:03:06,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=12980.0, ans=0.012583333333333335 2023-06-15 03:03:20,090 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.60 vs. limit=12.3925 2023-06-15 03:03:24,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=13046.666666666666, ans=0.125 2023-06-15 03:03:41,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=13113.333333333334, ans=0.008018840579710146 2023-06-15 03:04:29,343 INFO [train.py:988] (2/4) Epoch 4, batch 400, loss[loss=0.3992, simple_loss=0.4035, pruned_loss=0.1974, over 20204.00 frames. ], tot_loss[loss=0.4132, simple_loss=0.4244, pruned_loss=0.1989, over 3297361.35 frames. ], batch size: 239, lr: 3.64e-02, grad_scale: 32.0 2023-06-15 03:04:40,737 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=12.4925 2023-06-15 03:04:46,859 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 3.163e+02 4.789e+02 6.291e+02 1.274e+03, threshold=9.578e+02, percent-clipped=4.0 2023-06-15 03:04:47,683 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.54 vs. limit=8.345 2023-06-15 03:05:40,125 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.92 vs. limit=12.592500000000001 2023-06-15 03:05:59,016 INFO [train.py:988] (2/4) Epoch 4, batch 450, loss[loss=0.4047, simple_loss=0.4199, pruned_loss=0.1947, over 20261.00 frames. ], tot_loss[loss=0.4094, simple_loss=0.4222, pruned_loss=0.1967, over 3409477.45 frames. ], batch size: 141, lr: 3.64e-02, grad_scale: 16.0 2023-06-15 03:06:38,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=13780.0, ans=0.125 2023-06-15 03:06:43,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=13780.0, ans=10.0 2023-06-15 03:06:45,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=13780.0, ans=0.00787391304347826 2023-06-15 03:07:08,706 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.27 vs. 
limit=11.956666666666667 2023-06-15 03:07:21,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=13913.333333333334, ans=0.125 2023-06-15 03:07:24,280 INFO [train.py:988] (2/4) Epoch 4, batch 500, loss[loss=0.4092, simple_loss=0.4325, pruned_loss=0.1929, over 10689.00 frames. ], tot_loss[loss=0.4062, simple_loss=0.422, pruned_loss=0.194, over 3478310.05 frames. ], batch size: 30, lr: 3.63e-02, grad_scale: 16.0 2023-06-15 03:07:29,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=13980.0, ans=0.125 2023-06-15 03:07:32,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=13980.0, ans=0.007830434782608696 2023-06-15 03:07:33,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=13980.0, ans=0.8897999999999999 2023-06-15 03:07:37,237 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.35 vs. limit=8.495000000000001 2023-06-15 03:07:42,946 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.873e+02 4.186e+02 6.544e+02 1.200e+03, threshold=8.372e+02, percent-clipped=10.0 2023-06-15 03:07:56,635 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 03:08:13,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=14180.0, ans=0.125 2023-06-15 03:08:43,602 INFO [train.py:988] (2/4) Epoch 5, batch 0, loss[loss=0.382, simple_loss=0.3973, pruned_loss=0.1833, over 20695.00 frames. ], tot_loss[loss=0.382, simple_loss=0.3973, pruned_loss=0.1833, over 20695.00 frames. ], batch size: 211, lr: 3.47e-02, grad_scale: 32.0 2023-06-15 03:08:43,603 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 03:08:49,782 INFO [train.py:1020] (2/4) Epoch 5, validation: loss=0.2868, simple_loss=0.3756, pruned_loss=0.09898, over 143649.00 frames. 2023-06-15 03:08:49,783 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 03:09:46,034 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.74 vs. limit=12.8975 2023-06-15 03:10:18,926 INFO [train.py:988] (2/4) Epoch 5, batch 50, loss[loss=0.3874, simple_loss=0.4056, pruned_loss=0.1846, over 19965.00 frames. ], tot_loss[loss=0.3842, simple_loss=0.4107, pruned_loss=0.1789, over 864097.39 frames. 
], batch size: 126, lr: 3.46e-02, grad_scale: 32.0 2023-06-15 03:10:22,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=14526.666666666666, ans=0.007711594202898551 2023-06-15 03:10:56,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=14660.0, ans=0.125 2023-06-15 03:11:10,144 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.888e+02 2.824e+02 3.862e+02 4.906e+02 1.527e+03, threshold=7.724e+02, percent-clipped=12.0 2023-06-15 03:11:19,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=14726.666666666666, ans=0.125 2023-06-15 03:11:33,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=14793.333333333334, ans=0.3822333333333333 2023-06-15 03:11:48,655 INFO [train.py:988] (2/4) Epoch 5, batch 100, loss[loss=0.4214, simple_loss=0.4536, pruned_loss=0.1946, over 17641.00 frames. ], tot_loss[loss=0.3825, simple_loss=0.4083, pruned_loss=0.1784, over 1508320.31 frames. ], batch size: 67, lr: 3.46e-02, grad_scale: 32.0 2023-06-15 03:11:55,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=14860.0, ans=0.125 2023-06-15 03:11:56,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=14860.0, ans=0.1514 2023-06-15 03:12:11,132 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.05 vs. limit=5.0 2023-06-15 03:12:13,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=14926.666666666666, ans=0.004472222222222225 2023-06-15 03:12:15,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=14926.666666666666, ans=0.125 2023-06-15 03:12:22,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=14993.333333333334, ans=0.125 2023-06-15 03:12:27,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=14993.333333333334, ans=0.10006666666666666 2023-06-15 03:13:00,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=15126.666666666666, ans=0.0 2023-06-15 03:13:17,978 INFO [train.py:988] (2/4) Epoch 5, batch 150, loss[loss=0.3992, simple_loss=0.4363, pruned_loss=0.181, over 16847.00 frames. ], tot_loss[loss=0.3844, simple_loss=0.4094, pruned_loss=0.1797, over 1996337.10 frames. 
], batch size: 59, lr: 3.46e-02, grad_scale: 32.0 2023-06-15 03:13:33,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=15260.0, ans=0.3659 2023-06-15 03:13:42,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=15260.0, ans=0.125 2023-06-15 03:14:10,267 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.930e+02 4.326e+02 6.625e+02 9.040e+02, threshold=8.653e+02, percent-clipped=9.0 2023-06-15 03:14:19,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=15393.333333333334, ans=0.0 2023-06-15 03:14:25,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=15393.333333333334, ans=0.14606666666666665 2023-06-15 03:14:28,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=15460.0, ans=0.07 2023-06-15 03:14:48,005 INFO [train.py:988] (2/4) Epoch 5, batch 200, loss[loss=0.3843, simple_loss=0.397, pruned_loss=0.1859, over 20680.00 frames. ], tot_loss[loss=0.3821, simple_loss=0.4074, pruned_loss=0.1783, over 2390364.45 frames. ], batch size: 211, lr: 3.45e-02, grad_scale: 32.0 2023-06-15 03:15:11,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=15593.333333333334, ans=0.0016944444444444429 2023-06-15 03:15:12,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=15593.333333333334, ans=0.0 2023-06-15 03:15:12,526 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.75 vs. limit=13.3475 2023-06-15 03:15:15,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=15593.333333333334, ans=0.007479710144927536 2023-06-15 03:15:25,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=15660.0, ans=0.125 2023-06-15 03:15:30,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=15660.0, ans=0.007465217391304348 2023-06-15 03:15:54,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=15726.666666666666, ans=0.125 2023-06-15 03:16:14,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=15793.333333333334, ans=0.007436231884057971 2023-06-15 03:16:18,856 INFO [train.py:988] (2/4) Epoch 5, batch 250, loss[loss=0.3891, simple_loss=0.4045, pruned_loss=0.1869, over 20252.00 frames. ], tot_loss[loss=0.3799, simple_loss=0.4056, pruned_loss=0.1771, over 2708756.64 frames. ], batch size: 239, lr: 3.45e-02, grad_scale: 32.0 2023-06-15 03:17:11,398 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 2.885e+02 4.184e+02 6.160e+02 1.201e+03, threshold=8.369e+02, percent-clipped=9.0 2023-06-15 03:17:19,459 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.12 vs. 
limit=13.5225 2023-06-15 03:17:21,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=16060.0, ans=0.125 2023-06-15 03:17:49,368 INFO [train.py:988] (2/4) Epoch 5, batch 300, loss[loss=0.3822, simple_loss=0.4002, pruned_loss=0.1821, over 20723.00 frames. ], tot_loss[loss=0.3777, simple_loss=0.4047, pruned_loss=0.1753, over 2954922.78 frames. ], batch size: 211, lr: 3.45e-02, grad_scale: 32.0 2023-06-15 03:18:10,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=16260.0, ans=0.0 2023-06-15 03:18:32,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=16326.666666666666, ans=0.0 2023-06-15 03:18:36,832 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=13.622499999999999 2023-06-15 03:19:18,968 INFO [train.py:988] (2/4) Epoch 5, batch 350, loss[loss=0.3562, simple_loss=0.3812, pruned_loss=0.1656, over 20661.00 frames. ], tot_loss[loss=0.3744, simple_loss=0.4022, pruned_loss=0.1733, over 3156568.89 frames. ], batch size: 211, lr: 3.44e-02, grad_scale: 32.0 2023-06-15 03:19:31,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=16526.666666666668, ans=0.025 2023-06-15 03:19:41,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=16593.333333333332, ans=0.1340666666666667 2023-06-15 03:19:50,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=16593.333333333332, ans=0.125 2023-06-15 03:19:51,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=16593.333333333332, ans=0.125 2023-06-15 03:20:09,911 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.094e+02 2.834e+02 3.544e+02 5.201e+02 9.187e+02, threshold=7.089e+02, percent-clipped=1.0 2023-06-15 03:20:12,870 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.50 vs. limit=20.045 2023-06-15 03:20:13,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=16726.666666666668, ans=0.0 2023-06-15 03:20:15,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=16726.666666666668, ans=0.125 2023-06-15 03:20:28,047 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.26 vs. limit=10.717333333333332 2023-06-15 03:20:47,504 INFO [train.py:988] (2/4) Epoch 5, batch 400, loss[loss=0.3382, simple_loss=0.3821, pruned_loss=0.1472, over 19362.00 frames. ], tot_loss[loss=0.3755, simple_loss=0.4045, pruned_loss=0.1732, over 3274772.54 frames. 
], batch size: 98, lr: 3.44e-02, grad_scale: 32.0 2023-06-15 03:21:05,637 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 03:21:30,228 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.48 vs. limit=13.872499999999999 2023-06-15 03:22:06,124 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.77 vs. limit=20.345 2023-06-15 03:22:09,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=17126.666666666668, ans=0.125 2023-06-15 03:22:17,711 INFO [train.py:988] (2/4) Epoch 5, batch 450, loss[loss=0.3844, simple_loss=0.4086, pruned_loss=0.1801, over 19514.00 frames. ], tot_loss[loss=0.3737, simple_loss=0.4033, pruned_loss=0.172, over 3399630.04 frames. ], batch size: 102, lr: 3.44e-02, grad_scale: 32.0 2023-06-15 03:22:21,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=17193.333333333332, ans=0.125 2023-06-15 03:22:25,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=17193.333333333332, ans=10.0 2023-06-15 03:22:30,818 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 03:22:31,350 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. limit=5.579000000000001 2023-06-15 03:22:32,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=17193.333333333332, ans=0.125 2023-06-15 03:23:07,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=17326.666666666668, ans=0.125 2023-06-15 03:23:08,866 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 3.161e+02 3.951e+02 6.319e+02 1.120e+03, threshold=7.903e+02, percent-clipped=17.0 2023-06-15 03:23:13,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=17393.333333333332, ans=0.1260666666666667 2023-06-15 03:23:23,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=17393.333333333332, ans=0.29123333333333346 2023-06-15 03:23:27,296 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.28 vs. limit=20.595 2023-06-15 03:23:45,514 INFO [train.py:988] (2/4) Epoch 5, batch 500, loss[loss=0.3769, simple_loss=0.4046, pruned_loss=0.1746, over 19817.00 frames. ], tot_loss[loss=0.3717, simple_loss=0.4023, pruned_loss=0.1705, over 3490612.20 frames. 
], batch size: 115, lr: 3.43e-02, grad_scale: 32.0 2023-06-15 03:23:59,206 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 03:24:02,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=17593.333333333332, ans=0.0 2023-06-15 03:24:29,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=17660.0, ans=10.0 2023-06-15 03:25:04,637 INFO [train.py:988] (2/4) Epoch 6, batch 0, loss[loss=0.3441, simple_loss=0.3926, pruned_loss=0.1479, over 18481.00 frames. ], tot_loss[loss=0.3441, simple_loss=0.3926, pruned_loss=0.1479, over 18481.00 frames. ], batch size: 77, lr: 3.27e-02, grad_scale: 32.0 2023-06-15 03:25:04,638 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 03:25:10,740 INFO [train.py:1020] (2/4) Epoch 6, validation: loss=0.268, simple_loss=0.365, pruned_loss=0.08554, over 143649.00 frames. 2023-06-15 03:25:10,741 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 03:25:16,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=17746.666666666668, ans=0.125 2023-06-15 03:25:39,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=17813.333333333332, ans=0.006997101449275362 2023-06-15 03:25:46,531 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.29 vs. limit=14.205 2023-06-15 03:25:56,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=17880.0, ans=0.2742 2023-06-15 03:26:06,608 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.10 vs. limit=14.23 2023-06-15 03:26:25,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=18013.333333333332, ans=0.125 2023-06-15 03:26:30,211 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.867e+02 2.488e+02 3.063e+02 4.314e+02 9.185e+02, threshold=6.126e+02, percent-clipped=4.0 2023-06-15 03:26:37,322 INFO [train.py:988] (2/4) Epoch 6, batch 50, loss[loss=0.3779, simple_loss=0.397, pruned_loss=0.1794, over 20243.00 frames. ], tot_loss[loss=0.3592, simple_loss=0.3917, pruned_loss=0.1634, over 864146.06 frames. ], batch size: 239, lr: 3.26e-02, grad_scale: 32.0 2023-06-15 03:26:41,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=18080.0, ans=0.1192 2023-06-15 03:27:00,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=18146.666666666668, ans=0.2648666666666667 2023-06-15 03:27:12,374 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.22 vs. limit=14.33 2023-06-15 03:27:45,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=18346.666666666668, ans=0.125 2023-06-15 03:28:02,258 INFO [train.py:988] (2/4) Epoch 6, batch 100, loss[loss=0.3551, simple_loss=0.3972, pruned_loss=0.1565, over 19208.00 frames. 
], tot_loss[loss=0.3587, simple_loss=0.3941, pruned_loss=0.1617, over 1514744.23 frames. ], batch size: 92, lr: 3.26e-02, grad_scale: 16.0 2023-06-15 03:28:13,623 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.06 vs. limit=14.405000000000001 2023-06-15 03:29:08,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=18613.333333333332, ans=0.1138666666666667 2023-06-15 03:29:10,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=18680.0, ans=0.0 2023-06-15 03:29:11,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=18680.0, ans=10.0 2023-06-15 03:29:22,868 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.846e+02 2.570e+02 3.710e+02 4.834e+02 1.052e+03, threshold=7.420e+02, percent-clipped=12.0 2023-06-15 03:29:27,807 INFO [train.py:988] (2/4) Epoch 6, batch 150, loss[loss=0.4321, simple_loss=0.4518, pruned_loss=0.2062, over 16794.00 frames. ], tot_loss[loss=0.3587, simple_loss=0.3938, pruned_loss=0.1617, over 2018065.94 frames. ], batch size: 59, lr: 3.25e-02, grad_scale: 16.0 2023-06-15 03:29:28,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=18746.666666666668, ans=0.0 2023-06-15 03:29:29,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=18746.666666666668, ans=0.04949747468305833 2023-06-15 03:29:38,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=18746.666666666668, ans=0.11253333333333332 2023-06-15 03:29:39,827 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.68 vs. limit=21.560000000000002 2023-06-15 03:30:47,351 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.36 vs. limit=21.759999999999998 2023-06-15 03:30:52,184 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.60 vs. limit=14.629999999999999 2023-06-15 03:30:54,186 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.41 vs. limit=14.655000000000001 2023-06-15 03:30:54,545 INFO [train.py:988] (2/4) Epoch 6, batch 200, loss[loss=0.3487, simple_loss=0.3775, pruned_loss=0.1599, over 20526.00 frames. ], tot_loss[loss=0.3559, simple_loss=0.3925, pruned_loss=0.1597, over 2428993.68 frames. 
], batch size: 173, lr: 3.25e-02, grad_scale: 16.0 2023-06-15 03:30:55,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=19080.0, ans=0.0 2023-06-15 03:31:10,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=19146.666666666668, ans=0.025 2023-06-15 03:31:16,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=19146.666666666668, ans=0.125 2023-06-15 03:31:21,530 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.35 vs. limit=14.573333333333334 2023-06-15 03:31:24,739 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.98 vs. limit=21.86 2023-06-15 03:31:37,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=19213.333333333332, ans=0.1078666666666667 2023-06-15 03:31:51,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=19280.0, ans=0.0 2023-06-15 03:31:53,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=19280.0, ans=0.125 2023-06-15 03:32:04,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=19346.666666666668, ans=0.125 2023-06-15 03:32:15,537 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.887e+02 2.813e+02 3.502e+02 4.540e+02 9.537e+02, threshold=7.003e+02, percent-clipped=3.0 2023-06-15 03:32:16,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=19346.666666666668, ans=0.125 2023-06-15 03:32:19,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=19413.333333333332, ans=0.49119999999999997 2023-06-15 03:32:20,496 INFO [train.py:988] (2/4) Epoch 6, batch 250, loss[loss=0.3472, simple_loss=0.4033, pruned_loss=0.1455, over 16017.00 frames. ], tot_loss[loss=0.3541, simple_loss=0.3916, pruned_loss=0.1582, over 2732089.98 frames. 
], batch size: 51, lr: 3.25e-02, grad_scale: 16.0 2023-06-15 03:32:34,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=19413.333333333332, ans=0.2 2023-06-15 03:32:43,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=19480.0, ans=0.0 2023-06-15 03:33:13,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=19613.333333333332, ans=0.006605797101449276 2023-06-15 03:33:19,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=19613.333333333332, ans=0.125 2023-06-15 03:33:22,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=19613.333333333332, ans=0.21353333333333346 2023-06-15 03:33:31,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=19680.0, ans=0.10320000000000001 2023-06-15 03:33:45,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=19746.666666666668, ans=0.0 2023-06-15 03:33:46,377 INFO [train.py:988] (2/4) Epoch 6, batch 300, loss[loss=0.3546, simple_loss=0.4, pruned_loss=0.1546, over 18441.00 frames. ], tot_loss[loss=0.3547, simple_loss=0.3924, pruned_loss=0.1585, over 2950381.58 frames. ], batch size: 77, lr: 3.24e-02, grad_scale: 16.0 2023-06-15 03:33:50,181 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 03:33:55,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=19746.666666666668, ans=0.125 2023-06-15 03:33:55,548 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.95 vs. limit=11.898666666666667 2023-06-15 03:33:58,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=19746.666666666668, ans=0.0 2023-06-15 03:34:09,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=19813.333333333332, ans=0.125 2023-06-15 03:34:18,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=19813.333333333332, ans=0.125 2023-06-15 03:34:22,127 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.18 vs. limit=14.94 2023-06-15 03:34:53,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=19946.666666666668, ans=0.125 2023-06-15 03:34:57,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=20013.333333333332, ans=0.2 2023-06-15 03:35:08,282 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.795e+02 3.555e+02 5.115e+02 9.485e+02, threshold=7.110e+02, percent-clipped=8.0 2023-06-15 03:35:13,421 INFO [train.py:988] (2/4) Epoch 6, batch 350, loss[loss=0.3536, simple_loss=0.3976, pruned_loss=0.1548, over 19827.00 frames. ], tot_loss[loss=0.3536, simple_loss=0.3907, pruned_loss=0.1583, over 3131483.55 frames. 
], batch size: 115, lr: 3.24e-02, grad_scale: 16.0 2023-06-15 03:35:17,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=20080.0, ans=0.125 2023-06-15 03:35:51,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=20213.333333333332, ans=0.0 2023-06-15 03:36:11,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=20280.0, ans=0.125 2023-06-15 03:36:30,184 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.96 vs. limit=6.0 2023-06-15 03:36:33,667 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.56 vs. limit=6.0 2023-06-15 03:36:40,610 INFO [train.py:988] (2/4) Epoch 6, batch 400, loss[loss=0.3471, simple_loss=0.3852, pruned_loss=0.1545, over 20298.00 frames. ], tot_loss[loss=0.3532, simple_loss=0.3907, pruned_loss=0.1578, over 3275188.42 frames. ], batch size: 149, lr: 3.24e-02, grad_scale: 32.0 2023-06-15 03:36:58,229 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.43 vs. limit=15.0 2023-06-15 03:37:06,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=20480.0, ans=0.0 2023-06-15 03:37:33,054 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=15.0 2023-06-15 03:38:01,337 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.966e+02 2.580e+02 3.237e+02 4.539e+02 1.447e+03, threshold=6.474e+02, percent-clipped=10.0 2023-06-15 03:38:06,914 INFO [train.py:988] (2/4) Epoch 6, batch 450, loss[loss=0.3398, simple_loss=0.372, pruned_loss=0.1538, over 20747.00 frames. ], tot_loss[loss=0.3515, simple_loss=0.39, pruned_loss=0.1565, over 3401080.61 frames. ], batch size: 211, lr: 3.23e-02, grad_scale: 32.0 2023-06-15 03:39:01,047 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.70 vs. limit=6.0 2023-06-15 03:39:14,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=21013.333333333332, ans=0.1 2023-06-15 03:39:31,076 INFO [train.py:988] (2/4) Epoch 6, batch 500, loss[loss=0.3516, simple_loss=0.3702, pruned_loss=0.1665, over 19842.00 frames. ], tot_loss[loss=0.3504, simple_loss=0.3895, pruned_loss=0.1557, over 3484843.34 frames. ], batch size: 294, lr: 3.23e-02, grad_scale: 32.0 2023-06-15 03:39:46,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=21146.666666666668, ans=0.0 2023-06-15 03:39:49,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=21146.666666666668, ans=0.125 2023-06-15 03:40:12,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=21213.333333333332, ans=0.125 2023-06-15 03:40:50,451 INFO [train.py:988] (2/4) Epoch 7, batch 0, loss[loss=0.3621, simple_loss=0.4139, pruned_loss=0.1551, over 17875.00 frames. 
], tot_loss[loss=0.3621, simple_loss=0.4139, pruned_loss=0.1551, over 17875.00 frames. ], batch size: 68, lr: 3.07e-02, grad_scale: 32.0 2023-06-15 03:40:50,451 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 03:40:56,366 INFO [train.py:1020] (2/4) Epoch 7, validation: loss=0.2561, simple_loss=0.3562, pruned_loss=0.07803, over 143649.00 frames. 2023-06-15 03:40:56,368 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 03:41:01,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=21300.0, ans=0.1 2023-06-15 03:41:19,474 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.803e+02 2.547e+02 3.083e+02 4.575e+02 1.238e+03, threshold=6.165e+02, percent-clipped=14.0 2023-06-15 03:41:56,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=21500.0, ans=0.0 2023-06-15 03:42:20,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=21633.333333333332, ans=0.125 2023-06-15 03:42:21,402 INFO [train.py:988] (2/4) Epoch 7, batch 50, loss[loss=0.3285, simple_loss=0.3744, pruned_loss=0.1413, over 19233.00 frames. ], tot_loss[loss=0.3377, simple_loss=0.3797, pruned_loss=0.1478, over 871998.90 frames. ], batch size: 92, lr: 3.07e-02, grad_scale: 32.0 2023-06-15 03:42:28,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=21633.333333333332, ans=0.125 2023-06-15 03:42:36,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=21633.333333333332, ans=0.0061666666666666675 2023-06-15 03:42:52,185 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.35 vs. limit=22.5 2023-06-15 03:43:24,298 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.15 vs. limit=15.0 2023-06-15 03:43:49,159 INFO [train.py:988] (2/4) Epoch 7, batch 100, loss[loss=0.3388, simple_loss=0.3774, pruned_loss=0.1501, over 20518.00 frames. ], tot_loss[loss=0.3401, simple_loss=0.3821, pruned_loss=0.149, over 1525864.79 frames. ], batch size: 173, lr: 3.06e-02, grad_scale: 32.0 2023-06-15 03:44:05,972 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.39 vs. 
limit=12.0 2023-06-15 03:44:13,043 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.700e+02 2.359e+02 2.777e+02 3.733e+02 1.026e+03, threshold=5.554e+02, percent-clipped=6.0 2023-06-15 03:44:42,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=22166.666666666668, ans=0.125 2023-06-15 03:45:01,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=22233.333333333332, ans=0.125 2023-06-15 03:45:09,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=22233.333333333332, ans=0.125 2023-06-15 03:45:15,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=22300.0, ans=0.125 2023-06-15 03:45:16,411 INFO [train.py:988] (2/4) Epoch 7, batch 150, loss[loss=0.3632, simple_loss=0.3973, pruned_loss=0.1645, over 20272.00 frames. ], tot_loss[loss=0.3384, simple_loss=0.3818, pruned_loss=0.1475, over 2028380.19 frames. ], batch size: 141, lr: 3.06e-02, grad_scale: 32.0 2023-06-15 03:45:50,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=22433.333333333332, ans=0.125 2023-06-15 03:46:15,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=22500.0, ans=0.005978260869565218 2023-06-15 03:46:18,833 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.27 vs. limit=22.5 2023-06-15 03:46:26,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=22566.666666666668, ans=0.1 2023-06-15 03:46:45,796 INFO [train.py:988] (2/4) Epoch 7, batch 200, loss[loss=0.3316, simple_loss=0.3661, pruned_loss=0.1486, over 20521.00 frames. ], tot_loss[loss=0.3389, simple_loss=0.3822, pruned_loss=0.1478, over 2409148.29 frames. ], batch size: 189, lr: 3.05e-02, grad_scale: 32.0 2023-06-15 03:47:01,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=22633.333333333332, ans=0.05 2023-06-15 03:47:11,644 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.670e+02 3.468e+02 4.313e+02 7.598e+02, threshold=6.936e+02, percent-clipped=8.0 2023-06-15 03:47:55,165 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-15 03:48:15,522 INFO [train.py:988] (2/4) Epoch 7, batch 250, loss[loss=0.3204, simple_loss=0.3702, pruned_loss=0.1354, over 19892.00 frames. ], tot_loss[loss=0.3374, simple_loss=0.3808, pruned_loss=0.147, over 2737579.60 frames. ], batch size: 120, lr: 3.05e-02, grad_scale: 32.0 2023-06-15 03:48:16,911 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.12 vs. 
limit=10.0 2023-06-15 03:49:02,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=23100.0, ans=0.0 2023-06-15 03:49:31,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=23233.333333333332, ans=0.1 2023-06-15 03:49:36,725 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.82 vs. limit=6.0 2023-06-15 03:49:44,554 INFO [train.py:988] (2/4) Epoch 7, batch 300, loss[loss=0.3752, simple_loss=0.4015, pruned_loss=0.1744, over 20480.00 frames. ], tot_loss[loss=0.3367, simple_loss=0.3803, pruned_loss=0.1465, over 2975502.92 frames. ], batch size: 160, lr: 3.05e-02, grad_scale: 32.0 2023-06-15 03:50:08,799 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.792e+02 2.398e+02 2.879e+02 3.819e+02 6.544e+02, threshold=5.757e+02, percent-clipped=0.0 2023-06-15 03:50:09,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=23366.666666666668, ans=0.125 2023-06-15 03:50:12,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=23366.666666666668, ans=0.125 2023-06-15 03:50:14,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=23366.666666666668, ans=0.125 2023-06-15 03:50:33,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=23433.333333333332, ans=0.5 2023-06-15 03:50:57,647 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.51 vs. limit=6.0 2023-06-15 03:51:12,755 INFO [train.py:988] (2/4) Epoch 7, batch 350, loss[loss=0.3779, simple_loss=0.4178, pruned_loss=0.169, over 18463.00 frames. ], tot_loss[loss=0.3348, simple_loss=0.3791, pruned_loss=0.1453, over 3154573.32 frames. ], batch size: 77, lr: 3.04e-02, grad_scale: 32.0 2023-06-15 03:51:18,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=23633.333333333332, ans=0.005731884057971015 2023-06-15 03:51:37,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=23700.0, ans=0.125 2023-06-15 03:51:37,987 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 03:52:40,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=23966.666666666668, ans=0.125 2023-06-15 03:52:41,276 INFO [train.py:988] (2/4) Epoch 7, batch 400, loss[loss=0.3181, simple_loss=0.3736, pruned_loss=0.1313, over 18453.00 frames. ], tot_loss[loss=0.3346, simple_loss=0.3787, pruned_loss=0.1453, over 3299304.99 frames. 
], batch size: 77, lr: 3.04e-02, grad_scale: 32.0 2023-06-15 03:52:53,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=23966.666666666668, ans=0.125 2023-06-15 03:53:06,051 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.768e+02 3.077e+02 3.837e+02 5.079e+02 9.527e+02, threshold=7.674e+02, percent-clipped=15.0 2023-06-15 03:53:09,054 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.23 vs. limit=15.0 2023-06-15 03:53:12,331 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.99 vs. limit=15.0 2023-06-15 03:53:21,084 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.87 vs. limit=12.0 2023-06-15 03:53:37,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=24166.666666666668, ans=0.025 2023-06-15 03:53:43,542 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.08 vs. limit=22.5 2023-06-15 03:53:54,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=24233.333333333332, ans=0.2 2023-06-15 03:53:55,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=24233.333333333332, ans=0.1 2023-06-15 03:54:09,655 INFO [train.py:988] (2/4) Epoch 7, batch 450, loss[loss=0.3163, simple_loss=0.3592, pruned_loss=0.1367, over 20087.00 frames. ], tot_loss[loss=0.3338, simple_loss=0.3786, pruned_loss=0.1445, over 3396664.63 frames. ], batch size: 133, lr: 3.04e-02, grad_scale: 32.0 2023-06-15 03:54:22,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=24300.0, ans=0.1 2023-06-15 03:54:45,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=24433.333333333332, ans=0.125 2023-06-15 03:54:53,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=24433.333333333332, ans=0.2 2023-06-15 03:54:55,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=24433.333333333332, ans=0.005557971014492754 2023-06-15 03:54:56,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=24433.333333333332, ans=0.0 2023-06-15 03:55:06,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=24500.0, ans=0.125 2023-06-15 03:55:08,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=24500.0, ans=0.125 2023-06-15 03:55:10,921 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.67 vs. limit=10.0 2023-06-15 03:55:35,127 INFO [train.py:988] (2/4) Epoch 7, batch 500, loss[loss=0.3302, simple_loss=0.3674, pruned_loss=0.1465, over 20498.00 frames. 
], tot_loss[loss=0.3321, simple_loss=0.3775, pruned_loss=0.1433, over 3489325.41 frames. ], batch size: 160, lr: 3.03e-02, grad_scale: 32.0 2023-06-15 03:55:45,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=24633.333333333332, ans=0.04949747468305833 2023-06-15 03:55:48,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=24633.333333333332, ans=0.05 2023-06-15 03:55:57,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=24700.0, ans=0.1 2023-06-15 03:55:58,177 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.744e+02 3.251e+02 4.322e+02 7.093e+02, threshold=6.501e+02, percent-clipped=0.0 2023-06-15 03:56:05,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=24700.0, ans=0.125 2023-06-15 03:56:15,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=24766.666666666668, ans=0.1 2023-06-15 03:56:20,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=24766.666666666668, ans=0.0 2023-06-15 03:56:53,604 INFO [train.py:988] (2/4) Epoch 8, batch 0, loss[loss=0.3163, simple_loss=0.3696, pruned_loss=0.1315, over 19770.00 frames. ], tot_loss[loss=0.3163, simple_loss=0.3696, pruned_loss=0.1315, over 19770.00 frames. ], batch size: 115, lr: 2.89e-02, grad_scale: 32.0 2023-06-15 03:56:53,605 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 03:56:59,691 INFO [train.py:1020] (2/4) Epoch 8, validation: loss=0.2482, simple_loss=0.3483, pruned_loss=0.0741, over 143649.00 frames. 2023-06-15 03:56:59,691 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 03:57:29,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=24913.333333333332, ans=0.125 2023-06-15 03:57:49,763 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.04 vs. limit=15.0 2023-06-15 03:58:10,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=25113.333333333332, ans=0.125 2023-06-15 03:58:24,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=25113.333333333332, ans=0.025 2023-06-15 03:58:26,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=25180.0, ans=0.005395652173913044 2023-06-15 03:58:28,210 INFO [train.py:988] (2/4) Epoch 8, batch 50, loss[loss=0.2983, simple_loss=0.3519, pruned_loss=0.1223, over 19469.00 frames. ], tot_loss[loss=0.3213, simple_loss=0.3703, pruned_loss=0.1361, over 854244.43 frames. ], batch size: 105, lr: 2.88e-02, grad_scale: 32.0 2023-06-15 03:58:36,044 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.37 vs. 
limit=15.0 2023-06-15 03:59:25,143 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.798e+02 2.660e+02 2.942e+02 3.457e+02 5.575e+02, threshold=5.885e+02, percent-clipped=0.0 2023-06-15 03:59:30,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=25380.0, ans=0.125 2023-06-15 03:59:53,126 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.51 vs. limit=15.0 2023-06-15 03:59:57,948 INFO [train.py:988] (2/4) Epoch 8, batch 100, loss[loss=0.335, simple_loss=0.3826, pruned_loss=0.1437, over 18644.00 frames. ], tot_loss[loss=0.3245, simple_loss=0.373, pruned_loss=0.138, over 1504643.00 frames. ], batch size: 80, lr: 2.88e-02, grad_scale: 32.0 2023-06-15 04:00:00,731 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.11 vs. limit=15.0 2023-06-15 04:00:05,718 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.57 vs. limit=15.0 2023-06-15 04:00:15,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=25580.0, ans=0.0 2023-06-15 04:00:28,735 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.29 vs. limit=6.0 2023-06-15 04:00:40,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=25646.666666666668, ans=0.125 2023-06-15 04:01:16,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=25780.0, ans=0.125 2023-06-15 04:01:18,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=25780.0, ans=0.125 2023-06-15 04:01:27,340 INFO [train.py:988] (2/4) Epoch 8, batch 150, loss[loss=0.3221, simple_loss=0.3846, pruned_loss=0.1298, over 17047.00 frames. ], tot_loss[loss=0.3262, simple_loss=0.3734, pruned_loss=0.1394, over 2004021.90 frames. 
], batch size: 60, lr: 2.87e-02, grad_scale: 32.0 2023-06-15 04:01:27,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=25846.666666666668, ans=0.07 2023-06-15 04:01:46,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=25913.333333333332, ans=0.125 2023-06-15 04:01:57,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=25913.333333333332, ans=0.07 2023-06-15 04:02:23,614 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.658e+02 2.617e+02 3.387e+02 4.495e+02 9.103e+02, threshold=6.774e+02, percent-clipped=7.0 2023-06-15 04:02:32,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=26046.666666666668, ans=0.005207246376811594 2023-06-15 04:02:50,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=26113.333333333332, ans=0.125 2023-06-15 04:02:55,959 INFO [train.py:988] (2/4) Epoch 8, batch 200, loss[loss=0.2924, simple_loss=0.3571, pruned_loss=0.1138, over 19665.00 frames. ], tot_loss[loss=0.3247, simple_loss=0.3719, pruned_loss=0.1387, over 2403316.98 frames. ], batch size: 110, lr: 2.87e-02, grad_scale: 32.0 2023-06-15 04:03:20,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=26246.666666666668, ans=0.0 2023-06-15 04:03:29,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=26313.333333333332, ans=0.125 2023-06-15 04:03:45,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=26313.333333333332, ans=0.005149275362318841 2023-06-15 04:03:46,680 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.48 vs. limit=22.5 2023-06-15 04:03:46,844 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.56 vs. limit=15.0 2023-06-15 04:03:51,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=26380.0, ans=0.125 2023-06-15 04:03:56,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=26380.0, ans=0.5 2023-06-15 04:03:57,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=26380.0, ans=0.125 2023-06-15 04:04:06,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=26446.666666666668, ans=0.0 2023-06-15 04:04:25,057 INFO [train.py:988] (2/4) Epoch 8, batch 250, loss[loss=0.3492, simple_loss=0.3689, pruned_loss=0.1648, over 20028.00 frames. ], tot_loss[loss=0.3237, simple_loss=0.3718, pruned_loss=0.1378, over 2712264.07 frames. ], batch size: 293, lr: 2.87e-02, grad_scale: 32.0 2023-06-15 04:05:10,787 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.30 vs. 
limit=15.0 2023-06-15 04:05:11,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=26646.666666666668, ans=0.005076811594202898 2023-06-15 04:05:25,750 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 2.237e+02 2.856e+02 3.826e+02 6.923e+02, threshold=5.713e+02, percent-clipped=1.0 2023-06-15 04:05:30,128 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.73 vs. limit=12.0 2023-06-15 04:05:32,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=26713.333333333332, ans=10.0 2023-06-15 04:05:57,926 INFO [train.py:988] (2/4) Epoch 8, batch 300, loss[loss=0.3239, simple_loss=0.3752, pruned_loss=0.1363, over 19516.00 frames. ], tot_loss[loss=0.3244, simple_loss=0.372, pruned_loss=0.1384, over 2922775.04 frames. ], batch size: 102, lr: 2.86e-02, grad_scale: 32.0 2023-06-15 04:06:41,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=26980.0, ans=0.025 2023-06-15 04:06:51,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=27046.666666666668, ans=0.1 2023-06-15 04:07:01,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=27046.666666666668, ans=0.2 2023-06-15 04:07:08,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=27113.333333333332, ans=0.125 2023-06-15 04:07:19,674 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.35 vs. limit=12.0 2023-06-15 04:07:27,554 INFO [train.py:988] (2/4) Epoch 8, batch 350, loss[loss=0.3217, simple_loss=0.3644, pruned_loss=0.1395, over 20289.00 frames. ], tot_loss[loss=0.3227, simple_loss=0.3709, pruned_loss=0.1372, over 3112173.88 frames. ], batch size: 149, lr: 2.86e-02, grad_scale: 32.0 2023-06-15 04:07:44,224 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.83 vs. limit=15.0 2023-06-15 04:08:12,577 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.55 vs. limit=22.5 2023-06-15 04:08:17,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=27313.333333333332, ans=0.125 2023-06-15 04:08:24,609 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.830e+02 2.602e+02 3.097e+02 4.121e+02 7.485e+02, threshold=6.195e+02, percent-clipped=4.0 2023-06-15 04:08:26,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=27380.0, ans=0.0 2023-06-15 04:08:35,938 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.92 vs. limit=15.0 2023-06-15 04:08:56,006 INFO [train.py:988] (2/4) Epoch 8, batch 400, loss[loss=0.3135, simple_loss=0.3208, pruned_loss=0.1531, over 17151.00 frames. ], tot_loss[loss=0.3228, simple_loss=0.3713, pruned_loss=0.1372, over 3247544.84 frames. 
], batch size: 391, lr: 2.85e-02, grad_scale: 32.0 2023-06-15 04:08:58,836 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.95 vs. limit=22.5 2023-06-15 04:09:08,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=27513.333333333332, ans=0.2 2023-06-15 04:09:09,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=27513.333333333332, ans=6.0 2023-06-15 04:09:20,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=27580.0, ans=0.1 2023-06-15 04:09:33,804 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.66 vs. limit=6.0 2023-06-15 04:09:59,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=27713.333333333332, ans=0.004844927536231884 2023-06-15 04:10:02,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=27713.333333333332, ans=0.07 2023-06-15 04:10:17,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=27780.0, ans=0.125 2023-06-15 04:10:26,993 INFO [train.py:988] (2/4) Epoch 8, batch 450, loss[loss=0.3234, simple_loss=0.3669, pruned_loss=0.1399, over 19955.00 frames. ], tot_loss[loss=0.323, simple_loss=0.371, pruned_loss=0.1375, over 3357218.83 frames. ], batch size: 126, lr: 2.85e-02, grad_scale: 32.0 2023-06-15 04:10:51,822 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.83 vs. limit=15.0 2023-06-15 04:11:05,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=27980.0, ans=0.125 2023-06-15 04:11:06,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=27980.0, ans=0.125 2023-06-15 04:11:06,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=27980.0, ans=0.1 2023-06-15 04:11:23,226 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.630e+02 2.599e+02 3.032e+02 3.753e+02 5.594e+02, threshold=6.064e+02, percent-clipped=0.0 2023-06-15 04:11:47,175 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=15.0 2023-06-15 04:11:47,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.whiten.whitening_limit, batch_count=28113.333333333332, ans=12.0 2023-06-15 04:11:53,594 INFO [train.py:988] (2/4) Epoch 8, batch 500, loss[loss=0.3074, simple_loss=0.3664, pruned_loss=0.1242, over 18293.00 frames. ], tot_loss[loss=0.3224, simple_loss=0.371, pruned_loss=0.1369, over 3432438.51 frames. 
], batch size: 74, lr: 2.85e-02, grad_scale: 32.0 2023-06-15 04:11:55,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=28180.0, ans=0.2 2023-06-15 04:12:44,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=28380.0, ans=0.04949747468305833 2023-06-15 04:13:14,339 INFO [train.py:988] (2/4) Epoch 9, batch 0, loss[loss=0.3123, simple_loss=0.3693, pruned_loss=0.1277, over 19785.00 frames. ], tot_loss[loss=0.3123, simple_loss=0.3693, pruned_loss=0.1277, over 19785.00 frames. ], batch size: 115, lr: 2.72e-02, grad_scale: 32.0 2023-06-15 04:13:14,340 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 04:13:20,336 INFO [train.py:1020] (2/4) Epoch 9, validation: loss=0.2394, simple_loss=0.343, pruned_loss=0.06786, over 143649.00 frames. 2023-06-15 04:13:20,337 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 04:13:54,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=28526.666666666668, ans=0.004668115942028986 2023-06-15 04:14:10,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=28526.666666666668, ans=0.125 2023-06-15 04:14:14,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=28593.333333333332, ans=0.025 2023-06-15 04:14:36,498 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.66 vs. limit=22.5 2023-06-15 04:14:45,476 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=15.0 2023-06-15 04:14:49,777 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.336e+02 2.823e+02 3.585e+02 6.203e+02, threshold=5.645e+02, percent-clipped=2.0 2023-06-15 04:14:49,823 INFO [train.py:988] (2/4) Epoch 9, batch 50, loss[loss=0.3233, simple_loss=0.3729, pruned_loss=0.1368, over 20013.00 frames. ], tot_loss[loss=0.3177, simple_loss=0.3682, pruned_loss=0.1335, over 865128.17 frames. ], batch size: 126, lr: 2.71e-02, grad_scale: 32.0 2023-06-15 04:15:00,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=28726.666666666668, ans=0.09899494936611666 2023-06-15 04:15:01,433 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.97 vs. limit=15.0 2023-06-15 04:15:23,539 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=12.0 2023-06-15 04:15:47,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=28926.666666666668, ans=0.004581159420289855 2023-06-15 04:16:11,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=28993.333333333332, ans=0.125 2023-06-15 04:16:17,497 INFO [train.py:988] (2/4) Epoch 9, batch 100, loss[loss=0.3065, simple_loss=0.3601, pruned_loss=0.1265, over 19801.00 frames. ], tot_loss[loss=0.3168, simple_loss=0.3676, pruned_loss=0.133, over 1498741.18 frames. 
], batch size: 115, lr: 2.71e-02, grad_scale: 32.0 2023-06-15 04:17:04,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=29193.333333333332, ans=0.0 2023-06-15 04:17:06,152 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 04:17:06,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=29193.333333333332, ans=0.0 2023-06-15 04:17:09,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=29260.0, ans=0.125 2023-06-15 04:17:24,058 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.83 vs. limit=12.0 2023-06-15 04:17:44,303 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.757e+02 2.451e+02 3.023e+02 4.117e+02 8.643e+02, threshold=6.045e+02, percent-clipped=4.0 2023-06-15 04:17:44,349 INFO [train.py:988] (2/4) Epoch 9, batch 150, loss[loss=0.3155, simple_loss=0.3658, pruned_loss=0.1326, over 19109.00 frames. ], tot_loss[loss=0.3181, simple_loss=0.3686, pruned_loss=0.1338, over 1997663.42 frames. ], batch size: 94, lr: 2.70e-02, grad_scale: 32.0 2023-06-15 04:18:06,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=29460.0, ans=0.5 2023-06-15 04:19:03,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=29660.0, ans=0.125 2023-06-15 04:19:12,158 INFO [train.py:988] (2/4) Epoch 9, batch 200, loss[loss=0.3042, simple_loss=0.3559, pruned_loss=0.1263, over 19093.00 frames. ], tot_loss[loss=0.3169, simple_loss=0.3671, pruned_loss=0.1334, over 2399363.57 frames. ], batch size: 89, lr: 2.70e-02, grad_scale: 32.0 2023-06-15 04:19:22,357 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.03 vs. limit=22.5 2023-06-15 04:20:01,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=29860.0, ans=0.0 2023-06-15 04:20:10,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=29926.666666666668, ans=0.004363768115942028 2023-06-15 04:20:12,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=29926.666666666668, ans=0.125 2023-06-15 04:20:20,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=29926.666666666668, ans=0.05 2023-06-15 04:20:41,734 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.780e+02 2.377e+02 2.781e+02 3.474e+02 5.000e+02, threshold=5.562e+02, percent-clipped=0.0 2023-06-15 04:20:41,782 INFO [train.py:988] (2/4) Epoch 9, batch 250, loss[loss=0.3009, simple_loss=0.3594, pruned_loss=0.1212, over 19533.00 frames. ], tot_loss[loss=0.3148, simple_loss=0.3649, pruned_loss=0.1323, over 2709004.73 frames. 
], batch size: 102, lr: 2.70e-02, grad_scale: 32.0 2023-06-15 04:20:59,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=30126.666666666668, ans=0.004320289855072463 2023-06-15 04:21:07,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=30126.666666666668, ans=0.2 2023-06-15 04:21:26,575 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 04:21:28,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=30193.333333333332, ans=0.004305797101449275 2023-06-15 04:21:54,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=30326.666666666668, ans=0.2 2023-06-15 04:21:56,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=30326.666666666668, ans=0.1 2023-06-15 04:22:09,940 INFO [train.py:988] (2/4) Epoch 9, batch 300, loss[loss=0.3336, simple_loss=0.3949, pruned_loss=0.1361, over 18302.00 frames. ], tot_loss[loss=0.3133, simple_loss=0.3643, pruned_loss=0.1312, over 2956089.65 frames. ], batch size: 72, lr: 2.69e-02, grad_scale: 32.0 2023-06-15 04:22:12,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=30393.333333333332, ans=0.00426231884057971 2023-06-15 04:22:19,773 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-06-15 04:22:29,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=30460.0, ans=0.125 2023-06-15 04:22:44,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=30526.666666666668, ans=0.2 2023-06-15 04:22:55,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=30526.666666666668, ans=0.125 2023-06-15 04:22:59,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=30526.666666666668, ans=0.004233333333333333 2023-06-15 04:23:14,175 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 04:23:29,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=30660.0, ans=0.125 2023-06-15 04:23:32,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=30660.0, ans=0.05 2023-06-15 04:23:40,483 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.682e+02 3.189e+02 4.179e+02 7.690e+02, threshold=6.378e+02, percent-clipped=10.0 2023-06-15 04:23:40,532 INFO [train.py:988] (2/4) Epoch 9, batch 350, loss[loss=0.3065, simple_loss=0.3476, pruned_loss=0.1327, over 20667.00 frames. ], tot_loss[loss=0.3126, simple_loss=0.363, pruned_loss=0.1311, over 3154884.59 frames. 
], batch size: 211, lr: 2.69e-02, grad_scale: 32.0 2023-06-15 04:23:40,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=30726.666666666668, ans=0.125 2023-06-15 04:23:52,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=30726.666666666668, ans=0.125 2023-06-15 04:24:09,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=30793.333333333332, ans=0.125 2023-06-15 04:24:16,243 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.90 vs. limit=22.5 2023-06-15 04:24:19,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=30860.0, ans=0.2 2023-06-15 04:24:24,043 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2023-06-15 04:24:24,091 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.71 vs. limit=12.0 2023-06-15 04:25:05,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=30993.333333333332, ans=0.1 2023-06-15 04:25:09,341 INFO [train.py:988] (2/4) Epoch 9, batch 400, loss[loss=0.2838, simple_loss=0.3405, pruned_loss=0.1136, over 19499.00 frames. ], tot_loss[loss=0.3114, simple_loss=0.3625, pruned_loss=0.1301, over 3303476.27 frames. ], batch size: 105, lr: 2.68e-02, grad_scale: 32.0 2023-06-15 04:25:21,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=31060.0, ans=0.09899494936611666 2023-06-15 04:25:27,041 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.09 vs. limit=15.0 2023-06-15 04:25:36,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=31126.666666666668, ans=0.004102898550724637 2023-06-15 04:25:58,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=31193.333333333332, ans=0.125 2023-06-15 04:26:36,144 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.846e+02 2.307e+02 2.949e+02 3.902e+02 6.879e+02, threshold=5.899e+02, percent-clipped=2.0 2023-06-15 04:26:36,188 INFO [train.py:988] (2/4) Epoch 9, batch 450, loss[loss=0.328, simple_loss=0.3957, pruned_loss=0.1302, over 16994.00 frames. ], tot_loss[loss=0.3116, simple_loss=0.3635, pruned_loss=0.1299, over 3387961.08 frames. 
], batch size: 60, lr: 2.68e-02, grad_scale: 32.0 2023-06-15 04:26:50,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=31393.333333333332, ans=0.0 2023-06-15 04:27:07,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.max_abs, batch_count=31460.0, ans=10.0 2023-06-15 04:27:12,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=31526.666666666668, ans=0.1 2023-06-15 04:27:41,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=31593.333333333332, ans=0.1 2023-06-15 04:27:43,378 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.24 vs. limit=15.0 2023-06-15 04:28:01,907 INFO [train.py:988] (2/4) Epoch 9, batch 500, loss[loss=0.2836, simple_loss=0.3436, pruned_loss=0.1117, over 19814.00 frames. ], tot_loss[loss=0.3107, simple_loss=0.3628, pruned_loss=0.1293, over 3466630.27 frames. ], batch size: 115, lr: 2.68e-02, grad_scale: 64.0 2023-06-15 04:28:04,441 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.64 vs. limit=22.5 2023-06-15 04:28:15,885 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.19 vs. limit=15.0 2023-06-15 04:28:19,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=31793.333333333332, ans=0.1 2023-06-15 04:28:21,050 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.95 vs. limit=15.0 2023-06-15 04:28:25,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=31793.333333333332, ans=0.125 2023-06-15 04:28:25,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=31793.333333333332, ans=0.1 2023-06-15 04:28:33,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=31860.0, ans=0.125 2023-06-15 04:29:21,617 INFO [train.py:988] (2/4) Epoch 10, batch 0, loss[loss=0.3177, simple_loss=0.3731, pruned_loss=0.1311, over 19808.00 frames. ], tot_loss[loss=0.3177, simple_loss=0.3731, pruned_loss=0.1311, over 19808.00 frames. ], batch size: 120, lr: 2.56e-02, grad_scale: 64.0 2023-06-15 04:29:21,617 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 04:29:28,494 INFO [train.py:1020] (2/4) Epoch 10, validation: loss=0.2327, simple_loss=0.3375, pruned_loss=0.06395, over 143649.00 frames. 2023-06-15 04:29:28,495 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 04:29:40,513 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.18 vs. 
limit=6.0 2023-06-15 04:29:50,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=32006.666666666668, ans=0.125 2023-06-15 04:30:01,792 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.729e+02 2.277e+02 2.643e+02 3.234e+02 5.475e+02, threshold=5.286e+02, percent-clipped=0.0 2023-06-15 04:30:02,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=32006.666666666668, ans=0.09899494936611666 2023-06-15 04:30:07,827 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.33 vs. limit=15.0 2023-06-15 04:30:14,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=32073.333333333332, ans=0.1 2023-06-15 04:30:18,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=32073.333333333332, ans=0.0 2023-06-15 04:30:45,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=32206.666666666668, ans=0.125 2023-06-15 04:30:46,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=32206.666666666668, ans=0.0 2023-06-15 04:30:58,906 INFO [train.py:988] (2/4) Epoch 10, batch 50, loss[loss=0.2912, simple_loss=0.348, pruned_loss=0.1172, over 19862.00 frames. ], tot_loss[loss=0.3085, simple_loss=0.3605, pruned_loss=0.1282, over 854467.96 frames. ], batch size: 120, lr: 2.56e-02, grad_scale: 64.0 2023-06-15 04:31:48,989 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.20 vs. limit=15.0 2023-06-15 04:32:02,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=32473.333333333332, ans=0.07 2023-06-15 04:32:04,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=32473.333333333332, ans=0.125 2023-06-15 04:32:07,812 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.01 vs. limit=15.0 2023-06-15 04:32:28,861 INFO [train.py:988] (2/4) Epoch 10, batch 100, loss[loss=0.3166, simple_loss=0.3552, pruned_loss=0.139, over 20697.00 frames. ], tot_loss[loss=0.3075, simple_loss=0.3605, pruned_loss=0.1273, over 1515871.20 frames. ], batch size: 211, lr: 2.55e-02, grad_scale: 64.0 2023-06-15 04:32:59,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=32673.333333333332, ans=0.1 2023-06-15 04:33:02,078 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.681e+02 2.450e+02 2.873e+02 3.278e+02 7.765e+02, threshold=5.745e+02, percent-clipped=3.0 2023-06-15 04:33:42,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=32873.333333333336, ans=0.2 2023-06-15 04:33:59,940 INFO [train.py:988] (2/4) Epoch 10, batch 150, loss[loss=0.2801, simple_loss=0.3407, pruned_loss=0.1097, over 10630.00 frames. ], tot_loss[loss=0.3076, simple_loss=0.3599, pruned_loss=0.1276, over 2008791.87 frames. 
], batch size: 30, lr: 2.55e-02, grad_scale: 64.0 2023-06-15 04:34:08,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=32940.0, ans=0.2 2023-06-15 04:34:52,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=33140.0, ans=0.125 2023-06-15 04:35:09,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=33140.0, ans=0.125 2023-06-15 04:35:30,169 INFO [train.py:988] (2/4) Epoch 10, batch 200, loss[loss=0.3188, simple_loss=0.3748, pruned_loss=0.1314, over 18945.00 frames. ], tot_loss[loss=0.3073, simple_loss=0.3605, pruned_loss=0.127, over 2400717.52 frames. ], batch size: 86, lr: 2.54e-02, grad_scale: 64.0 2023-06-15 04:35:39,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=33273.333333333336, ans=0.125 2023-06-15 04:35:39,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=33273.333333333336, ans=0.125 2023-06-15 04:35:40,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=33273.333333333336, ans=0.125 2023-06-15 04:35:56,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=33340.0, ans=0.2 2023-06-15 04:36:02,395 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 2.312e+02 2.745e+02 3.367e+02 5.641e+02, threshold=5.490e+02, percent-clipped=0.0 2023-06-15 04:36:04,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=33406.666666666664, ans=0.1 2023-06-15 04:36:20,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=33406.666666666664, ans=0.09899494936611666 2023-06-15 04:36:38,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=33473.333333333336, ans=0.0 2023-06-15 04:36:55,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=33540.0, ans=0.125 2023-06-15 04:36:59,958 INFO [train.py:988] (2/4) Epoch 10, batch 250, loss[loss=0.304, simple_loss=0.3597, pruned_loss=0.1242, over 19347.00 frames. ], tot_loss[loss=0.3067, simple_loss=0.3606, pruned_loss=0.1265, over 2712449.55 frames. ], batch size: 98, lr: 2.54e-02, grad_scale: 64.0 2023-06-15 04:37:09,877 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.95 vs. limit=6.0 2023-06-15 04:37:19,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=33673.333333333336, ans=0.125 2023-06-15 04:37:22,032 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.75 vs. limit=12.0 2023-06-15 04:37:46,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=33740.0, ans=0.0035347826086956514 2023-06-15 04:38:21,815 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.97 vs. 
limit=12.0 2023-06-15 04:38:27,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=33940.0, ans=0.125 2023-06-15 04:38:29,340 INFO [train.py:988] (2/4) Epoch 10, batch 300, loss[loss=0.2863, simple_loss=0.3497, pruned_loss=0.1115, over 19807.00 frames. ], tot_loss[loss=0.3057, simple_loss=0.36, pruned_loss=0.1257, over 2963039.21 frames. ], batch size: 115, lr: 2.54e-02, grad_scale: 64.0 2023-06-15 04:38:38,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=33940.0, ans=0.0 2023-06-15 04:39:01,816 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.638e+02 2.483e+02 2.953e+02 3.724e+02 5.914e+02, threshold=5.906e+02, percent-clipped=1.0 2023-06-15 04:39:03,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=34073.333333333336, ans=0.125 2023-06-15 04:39:13,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=34073.333333333336, ans=0.1 2023-06-15 04:39:14,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=34073.333333333336, ans=0.00346231884057971 2023-06-15 04:39:23,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=34140.0, ans=0.003447826086956521 2023-06-15 04:39:37,547 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 04:39:41,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=34206.666666666664, ans=0.2 2023-06-15 04:39:59,124 INFO [train.py:988] (2/4) Epoch 10, batch 350, loss[loss=0.2801, simple_loss=0.3423, pruned_loss=0.1089, over 19701.00 frames. ], tot_loss[loss=0.3049, simple_loss=0.3588, pruned_loss=0.1255, over 3151602.21 frames. ], batch size: 110, lr: 2.53e-02, grad_scale: 64.0 2023-06-15 04:40:06,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=34273.333333333336, ans=0.125 2023-06-15 04:40:25,136 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.03 vs. limit=15.0 2023-06-15 04:40:53,332 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=12.0 2023-06-15 04:40:54,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=34473.333333333336, ans=0.125 2023-06-15 04:41:07,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=34473.333333333336, ans=0.0 2023-06-15 04:41:26,618 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.79 vs. limit=6.0 2023-06-15 04:41:29,043 INFO [train.py:988] (2/4) Epoch 10, batch 400, loss[loss=0.3242, simple_loss=0.3852, pruned_loss=0.1316, over 16360.00 frames. ], tot_loss[loss=0.3044, simple_loss=0.3581, pruned_loss=0.1254, over 3301177.43 frames. 
], batch size: 52, lr: 2.53e-02, grad_scale: 32.0 2023-06-15 04:41:36,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=34606.666666666664, ans=0.1 2023-06-15 04:42:00,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=34673.333333333336, ans=0.125 2023-06-15 04:42:02,404 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 2.389e+02 2.906e+02 3.855e+02 6.206e+02, threshold=5.812e+02, percent-clipped=1.0 2023-06-15 04:42:10,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=34740.0, ans=0.125 2023-06-15 04:42:19,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=34740.0, ans=0.125 2023-06-15 04:42:20,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=34806.666666666664, ans=0.003302898550724638 2023-06-15 04:42:21,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=34806.666666666664, ans=0.003302898550724638 2023-06-15 04:42:55,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=34873.333333333336, ans=0.125 2023-06-15 04:42:58,374 INFO [train.py:988] (2/4) Epoch 10, batch 450, loss[loss=0.3043, simple_loss=0.3563, pruned_loss=0.1261, over 19839.00 frames. ], tot_loss[loss=0.3039, simple_loss=0.3583, pruned_loss=0.1247, over 3409797.70 frames. ], batch size: 120, lr: 2.52e-02, grad_scale: 32.0 2023-06-15 04:43:03,041 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.36 vs. limit=10.0 2023-06-15 04:43:13,121 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 04:43:21,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=35006.666666666664, ans=0.125 2023-06-15 04:44:13,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=35206.666666666664, ans=0.125 2023-06-15 04:44:14,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=35206.666666666664, ans=0.125 2023-06-15 04:44:24,816 INFO [train.py:988] (2/4) Epoch 10, batch 500, loss[loss=0.284, simple_loss=0.3398, pruned_loss=0.1141, over 20382.00 frames. ], tot_loss[loss=0.3041, simple_loss=0.359, pruned_loss=0.1246, over 3489998.57 frames. ], batch size: 149, lr: 2.52e-02, grad_scale: 32.0 2023-06-15 04:44:30,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=35273.333333333336, ans=0.2 2023-06-15 04:44:37,351 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.63 vs. 
limit=15.0 2023-06-15 04:44:46,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=35340.0, ans=0.125 2023-06-15 04:44:56,327 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.694e+02 2.429e+02 2.839e+02 3.294e+02 4.521e+02, threshold=5.678e+02, percent-clipped=0.0 2023-06-15 04:45:08,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=35406.666666666664, ans=0.125 2023-06-15 04:45:44,016 INFO [train.py:988] (2/4) Epoch 11, batch 0, loss[loss=0.2895, simple_loss=0.3471, pruned_loss=0.116, over 19655.00 frames. ], tot_loss[loss=0.2895, simple_loss=0.3471, pruned_loss=0.116, over 19655.00 frames. ], batch size: 110, lr: 2.42e-02, grad_scale: 32.0 2023-06-15 04:45:44,016 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 04:45:50,096 INFO [train.py:1020] (2/4) Epoch 11, validation: loss=0.2306, simple_loss=0.3357, pruned_loss=0.06271, over 143649.00 frames. 2023-06-15 04:45:50,097 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 04:45:58,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=35493.333333333336, ans=0.2 2023-06-15 04:46:26,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=35626.666666666664, ans=0.125 2023-06-15 04:46:29,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=35626.666666666664, ans=0.0 2023-06-15 04:46:43,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=35693.333333333336, ans=0.125 2023-06-15 04:46:45,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=35693.333333333336, ans=0.5 2023-06-15 04:47:19,227 INFO [train.py:988] (2/4) Epoch 11, batch 50, loss[loss=0.2765, simple_loss=0.3417, pruned_loss=0.1057, over 19081.00 frames. ], tot_loss[loss=0.3015, simple_loss=0.3543, pruned_loss=0.1244, over 871998.28 frames. 
], batch size: 89, lr: 2.41e-02, grad_scale: 32.0 2023-06-15 04:47:31,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=35826.666666666664, ans=0.125 2023-06-15 04:47:34,030 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 04:47:35,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=35893.333333333336, ans=0.035 2023-06-15 04:47:35,622 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 04:47:55,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=35960.0, ans=0.125 2023-06-15 04:48:22,827 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.697e+02 2.367e+02 2.815e+02 3.714e+02 5.103e+02, threshold=5.629e+02, percent-clipped=0.0 2023-06-15 04:48:36,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=36093.333333333336, ans=0.1 2023-06-15 04:48:47,414 INFO [train.py:988] (2/4) Epoch 11, batch 100, loss[loss=0.3151, simple_loss=0.385, pruned_loss=0.1226, over 16384.00 frames. ], tot_loss[loss=0.2989, simple_loss=0.3547, pruned_loss=0.1215, over 1508385.04 frames. ], batch size: 52, lr: 2.41e-02, grad_scale: 32.0 2023-06-15 04:49:00,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=36160.0, ans=0.2 2023-06-15 04:49:24,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=36293.333333333336, ans=0.07 2023-06-15 04:49:26,554 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.50 vs. limit=22.5 2023-06-15 04:49:58,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=36426.666666666664, ans=0.125 2023-06-15 04:50:02,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=36426.666666666664, ans=0.0029507246376811597 2023-06-15 04:50:18,757 INFO [train.py:988] (2/4) Epoch 11, batch 150, loss[loss=0.2971, simple_loss=0.3428, pruned_loss=0.1257, over 20567.00 frames. ], tot_loss[loss=0.2983, simple_loss=0.3538, pruned_loss=0.1213, over 2023251.13 frames. ], batch size: 189, lr: 2.40e-02, grad_scale: 32.0 2023-06-15 04:50:58,791 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.21 vs. 
limit=10.0 2023-06-15 04:51:18,904 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 04:51:22,611 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.663e+02 2.254e+02 2.489e+02 3.022e+02 4.758e+02, threshold=4.979e+02, percent-clipped=0.0 2023-06-15 04:51:23,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff3.min_abs, batch_count=36693.333333333336, ans=0.2 2023-06-15 04:51:28,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=36760.0, ans=0.0 2023-06-15 04:51:47,732 INFO [train.py:988] (2/4) Epoch 11, batch 200, loss[loss=0.2807, simple_loss=0.3449, pruned_loss=0.1083, over 18631.00 frames. ], tot_loss[loss=0.296, simple_loss=0.3522, pruned_loss=0.1199, over 2413365.01 frames. ], batch size: 80, lr: 2.40e-02, grad_scale: 32.0 2023-06-15 04:51:48,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=36826.666666666664, ans=0.95 2023-06-15 04:51:52,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=36826.666666666664, ans=0.2 2023-06-15 04:52:16,129 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.71 vs. limit=12.0 2023-06-15 04:52:20,375 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.19 vs. limit=22.5 2023-06-15 04:52:47,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=37026.666666666664, ans=0.1 2023-06-15 04:52:49,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=37026.666666666664, ans=0.125 2023-06-15 04:52:51,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=37026.666666666664, ans=0.1 2023-06-15 04:53:15,279 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.92 vs. limit=22.5 2023-06-15 04:53:17,724 INFO [train.py:988] (2/4) Epoch 11, batch 250, loss[loss=0.3163, simple_loss=0.3835, pruned_loss=0.1246, over 16942.00 frames. ], tot_loss[loss=0.2964, simple_loss=0.3541, pruned_loss=0.1194, over 2704838.46 frames. 
], batch size: 60, lr: 2.40e-02, grad_scale: 32.0 2023-06-15 04:53:31,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=37160.0, ans=0.125 2023-06-15 04:54:04,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=37293.333333333336, ans=0.05 2023-06-15 04:54:04,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=37293.333333333336, ans=0.1 2023-06-15 04:54:11,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=37360.0, ans=0.07 2023-06-15 04:54:22,489 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.637e+02 2.173e+02 2.592e+02 3.214e+02 4.591e+02, threshold=5.183e+02, percent-clipped=0.0 2023-06-15 04:54:24,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=37360.0, ans=0.0 2023-06-15 04:54:48,148 INFO [train.py:988] (2/4) Epoch 11, batch 300, loss[loss=0.2894, simple_loss=0.3451, pruned_loss=0.1168, over 20483.00 frames. ], tot_loss[loss=0.2959, simple_loss=0.3533, pruned_loss=0.1192, over 2931877.22 frames. ], batch size: 160, lr: 2.39e-02, grad_scale: 32.0 2023-06-15 04:54:57,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=37493.333333333336, ans=0.0 2023-06-15 04:55:29,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=37626.666666666664, ans=0.125 2023-06-15 04:55:37,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=37626.666666666664, ans=0.1 2023-06-15 04:56:18,500 INFO [train.py:988] (2/4) Epoch 11, batch 350, loss[loss=0.2773, simple_loss=0.3367, pruned_loss=0.109, over 19897.00 frames. ], tot_loss[loss=0.2949, simple_loss=0.3524, pruned_loss=0.1187, over 3119674.21 frames. ], batch size: 120, lr: 2.39e-02, grad_scale: 32.0 2023-06-15 04:56:40,383 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-15 04:56:52,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=37960.0, ans=0.0026173913043478257 2023-06-15 04:56:56,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=37960.0, ans=0.125 2023-06-15 04:57:11,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=38026.666666666664, ans=0.0 2023-06-15 04:57:20,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=38026.666666666664, ans=0.125 2023-06-15 04:57:23,390 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.398e+02 2.846e+02 3.298e+02 5.496e+02, threshold=5.692e+02, percent-clipped=3.0 2023-06-15 04:57:48,719 INFO [train.py:988] (2/4) Epoch 11, batch 400, loss[loss=0.3022, simple_loss=0.3554, pruned_loss=0.1245, over 20098.00 frames. ], tot_loss[loss=0.2951, simple_loss=0.3521, pruned_loss=0.1191, over 3284739.60 frames. 
], batch size: 133, lr: 2.38e-02, grad_scale: 32.0 2023-06-15 04:57:59,114 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.19 vs. limit=10.0 2023-06-15 04:58:14,408 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.566e-02 2023-06-15 04:58:21,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=38226.666666666664, ans=0.1 2023-06-15 04:58:35,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_na.min_abs, batch_count=38293.333333333336, ans=0.02 2023-06-15 04:58:51,586 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.47 vs. limit=6.0 2023-06-15 04:59:06,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=38426.666666666664, ans=0.2 2023-06-15 04:59:18,284 INFO [train.py:988] (2/4) Epoch 11, batch 450, loss[loss=0.2951, simple_loss=0.351, pruned_loss=0.1196, over 20269.00 frames. ], tot_loss[loss=0.2952, simple_loss=0.3518, pruned_loss=0.1194, over 3399639.97 frames. ], batch size: 141, lr: 2.38e-02, grad_scale: 32.0 2023-06-15 05:00:15,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=38693.333333333336, ans=0.0 2023-06-15 05:00:21,917 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.767e+02 2.134e+02 2.582e+02 3.273e+02 5.590e+02, threshold=5.163e+02, percent-clipped=0.0 2023-06-15 05:00:37,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=38760.0, ans=0.125 2023-06-15 05:00:45,633 INFO [train.py:988] (2/4) Epoch 11, batch 500, loss[loss=0.27, simple_loss=0.3453, pruned_loss=0.09731, over 16777.00 frames. ], tot_loss[loss=0.2934, simple_loss=0.3505, pruned_loss=0.1181, over 3478255.70 frames. ], batch size: 59, lr: 2.38e-02, grad_scale: 32.0 2023-06-15 05:00:52,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=38826.666666666664, ans=0.0 2023-06-15 05:01:09,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=38893.333333333336, ans=0.125 2023-06-15 05:02:04,580 INFO [train.py:988] (2/4) Epoch 12, batch 0, loss[loss=0.2915, simple_loss=0.3546, pruned_loss=0.1142, over 19079.00 frames. ], tot_loss[loss=0.2915, simple_loss=0.3546, pruned_loss=0.1142, over 19079.00 frames. ], batch size: 94, lr: 2.28e-02, grad_scale: 32.0 2023-06-15 05:02:04,580 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 05:02:10,666 INFO [train.py:1020] (2/4) Epoch 12, validation: loss=0.2286, simple_loss=0.3321, pruned_loss=0.06259, over 143649.00 frames. 
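The loss values printed in the epoch/validation summaries above are consistent with a fixed weighted sum of the two transducer losses logged next to them. A minimal sketch of that combination, assuming the weights below (they are inferred from the printed numbers rather than read out of train.py, which may also vary them during warm-up):

```python
# Hedged sketch: "loss" in these log entries matches a weighted sum of
# simple_loss and pruned_loss. The 0.5 / 1.0 weights are inferred from the
# printed values; the real train.py may scale them differently early in training.
def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_loss_scale: float = 0.5,
                  pruned_loss_scale: float = 1.0) -> float:
    return simple_loss_scale * simple_loss + pruned_loss_scale * pruned_loss

# Reproduces the Epoch 12 validation entry above: 0.5 * 0.3321 + 0.06259 ~= 0.2286
print(round(combined_loss(0.3321, 0.06259), 4))
```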
2023-06-15 05:02:10,667 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 05:02:32,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=39106.666666666664, ans=0.125 2023-06-15 05:02:53,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=39173.333333333336, ans=0.2 2023-06-15 05:03:39,945 INFO [train.py:988] (2/4) Epoch 12, batch 50, loss[loss=0.3122, simple_loss=0.3633, pruned_loss=0.1305, over 19341.00 frames. ], tot_loss[loss=0.2915, simple_loss=0.3501, pruned_loss=0.1164, over 842789.87 frames. ], batch size: 98, lr: 2.28e-02, grad_scale: 32.0 2023-06-15 05:03:41,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=39373.333333333336, ans=0.04949747468305833 2023-06-15 05:03:47,410 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.648e+02 2.229e+02 2.614e+02 3.246e+02 5.755e+02, threshold=5.228e+02, percent-clipped=1.0 2023-06-15 05:03:51,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=39373.333333333336, ans=0.002310144927536232 2023-06-15 05:04:03,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=39440.0, ans=0.1 2023-06-15 05:04:09,749 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.82 vs. limit=22.5 2023-06-15 05:04:12,060 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.51 vs. limit=22.5 2023-06-15 05:04:35,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=39573.333333333336, ans=0.125 2023-06-15 05:04:37,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=39573.333333333336, ans=0.2 2023-06-15 05:05:03,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=39640.0, ans=0.04949747468305833 2023-06-15 05:05:09,983 INFO [train.py:988] (2/4) Epoch 12, batch 100, loss[loss=0.3009, simple_loss=0.364, pruned_loss=0.1189, over 16211.00 frames. ], tot_loss[loss=0.2913, simple_loss=0.3494, pruned_loss=0.1166, over 1500492.40 frames. ], batch size: 52, lr: 2.28e-02, grad_scale: 32.0 2023-06-15 05:05:10,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=39706.666666666664, ans=0.07 2023-06-15 05:05:17,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=39706.666666666664, ans=0.5 2023-06-15 05:05:18,393 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.07 vs. 
limit=22.5 2023-06-15 05:05:28,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=39773.333333333336, ans=0.0022231884057971017 2023-06-15 05:05:44,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=39840.0, ans=0.0 2023-06-15 05:06:03,087 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.87 vs. limit=22.5 2023-06-15 05:06:27,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=39973.333333333336, ans=0.125 2023-06-15 05:06:40,210 INFO [train.py:988] (2/4) Epoch 12, batch 150, loss[loss=0.325, simple_loss=0.3815, pruned_loss=0.1343, over 19216.00 frames. ], tot_loss[loss=0.2919, simple_loss=0.3493, pruned_loss=0.1172, over 2014440.32 frames. ], batch size: 92, lr: 2.27e-02, grad_scale: 32.0 2023-06-15 05:06:46,885 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.645e+02 2.288e+02 2.648e+02 3.128e+02 5.617e+02, threshold=5.296e+02, percent-clipped=1.0 2023-06-15 05:07:20,439 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.69 vs. limit=12.0 2023-06-15 05:07:41,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=40240.0, ans=0.0 2023-06-15 05:07:50,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=40306.666666666664, ans=0.0 2023-06-15 05:08:09,673 INFO [train.py:988] (2/4) Epoch 12, batch 200, loss[loss=0.2925, simple_loss=0.3631, pruned_loss=0.1109, over 18284.00 frames. ], tot_loss[loss=0.2917, simple_loss=0.349, pruned_loss=0.1172, over 2405573.85 frames. ], batch size: 74, lr: 2.27e-02, grad_scale: 32.0 2023-06-15 05:08:27,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=40440.0, ans=0.125 2023-06-15 05:08:33,633 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 05:08:35,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=40440.0, ans=0.0 2023-06-15 05:09:02,520 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=16.56 vs. limit=15.0 2023-06-15 05:09:20,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=40640.0, ans=0.1 2023-06-15 05:09:39,245 INFO [train.py:988] (2/4) Epoch 12, batch 250, loss[loss=0.2683, simple_loss=0.3313, pruned_loss=0.1026, over 19213.00 frames. ], tot_loss[loss=0.2903, simple_loss=0.3477, pruned_loss=0.1164, over 2715108.50 frames. 
], batch size: 92, lr: 2.27e-02, grad_scale: 32.0 2023-06-15 05:09:41,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=40706.666666666664, ans=0.002020289855072464 2023-06-15 05:09:46,505 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 2.177e+02 2.544e+02 3.093e+02 5.809e+02, threshold=5.088e+02, percent-clipped=2.0 2023-06-15 05:09:46,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=40706.666666666664, ans=0.1 2023-06-15 05:10:37,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=40906.666666666664, ans=0.125 2023-06-15 05:10:51,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=40973.333333333336, ans=10.0 2023-06-15 05:10:52,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=40973.333333333336, ans=0.2 2023-06-15 05:11:09,728 INFO [train.py:988] (2/4) Epoch 12, batch 300, loss[loss=0.2807, simple_loss=0.3408, pruned_loss=0.1103, over 20495.00 frames. ], tot_loss[loss=0.2911, simple_loss=0.3484, pruned_loss=0.1169, over 2946067.18 frames. ], batch size: 160, lr: 2.26e-02, grad_scale: 32.0 2023-06-15 05:11:21,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=41040.0, ans=0.1 2023-06-15 05:12:23,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=41306.666666666664, ans=0.125 2023-06-15 05:12:40,660 INFO [train.py:988] (2/4) Epoch 12, batch 350, loss[loss=0.2774, simple_loss=0.3548, pruned_loss=0.1, over 18308.00 frames. ], tot_loss[loss=0.2907, simple_loss=0.3483, pruned_loss=0.1165, over 3124717.84 frames. ], batch size: 72, lr: 2.26e-02, grad_scale: 32.0 2023-06-15 05:12:47,498 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.640e+02 2.134e+02 2.417e+02 3.017e+02 4.561e+02, threshold=4.834e+02, percent-clipped=0.0 2023-06-15 05:12:57,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=41440.0, ans=0.125 2023-06-15 05:14:10,555 INFO [train.py:988] (2/4) Epoch 12, batch 400, loss[loss=0.2794, simple_loss=0.3397, pruned_loss=0.1096, over 19477.00 frames. ], tot_loss[loss=0.2903, simple_loss=0.3475, pruned_loss=0.1165, over 3267628.43 frames. ], batch size: 105, lr: 2.25e-02, grad_scale: 32.0 2023-06-15 05:14:15,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=41706.666666666664, ans=0.125 2023-06-15 05:14:29,936 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.76 vs. limit=15.0 2023-06-15 05:15:02,132 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.01 vs. limit=15.0 2023-06-15 05:15:13,676 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.68 vs. limit=15.0 2023-06-15 05:15:25,678 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.18 vs. 
limit=12.0 2023-06-15 05:15:37,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=41973.333333333336, ans=0.0 2023-06-15 05:15:40,744 INFO [train.py:988] (2/4) Epoch 12, batch 450, loss[loss=0.2981, simple_loss=0.3505, pruned_loss=0.1229, over 19476.00 frames. ], tot_loss[loss=0.2902, simple_loss=0.3479, pruned_loss=0.1162, over 3390462.00 frames. ], batch size: 105, lr: 2.25e-02, grad_scale: 32.0 2023-06-15 05:15:48,079 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.759e+02 2.312e+02 2.678e+02 3.302e+02 6.342e+02, threshold=5.355e+02, percent-clipped=8.0 2023-06-15 05:16:06,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=42106.666666666664, ans=0.125 2023-06-15 05:16:49,602 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 05:17:08,279 INFO [train.py:988] (2/4) Epoch 12, batch 500, loss[loss=0.3112, simple_loss=0.3759, pruned_loss=0.1233, over 17640.00 frames. ], tot_loss[loss=0.2891, simple_loss=0.3479, pruned_loss=0.1152, over 3484149.47 frames. ], batch size: 67, lr: 2.25e-02, grad_scale: 32.0 2023-06-15 05:17:12,391 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.49 vs. limit=22.5 2023-06-15 05:17:30,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=42440.0, ans=0.05 2023-06-15 05:17:40,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=42506.666666666664, ans=0.0016289855072463763 2023-06-15 05:17:40,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=42506.666666666664, ans=0.0 2023-06-15 05:18:28,984 INFO [train.py:988] (2/4) Epoch 13, batch 0, loss[loss=0.2919, simple_loss=0.3487, pruned_loss=0.1176, over 18805.00 frames. ], tot_loss[loss=0.2919, simple_loss=0.3487, pruned_loss=0.1176, over 18805.00 frames. ], batch size: 83, lr: 2.16e-02, grad_scale: 32.0 2023-06-15 05:18:28,984 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 05:18:35,085 INFO [train.py:1020] (2/4) Epoch 13, validation: loss=0.2246, simple_loss=0.3282, pruned_loss=0.06053, over 143649.00 frames. 
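The recurring [optim.py:471] entries report running statistics of the per-batch gradient norm: five quantiles (min, 25%, median, 75%, max), a clipping threshold, and the share of recent batches that exceeded it. The printed values suggest threshold = Clipping_scale * median of recent norms (for example 2.0 * 2.678e+02 ~ 5.355e+02 in the entry above). A minimal sketch of such bookkeeping, with illustrative class and parameter names and window size; it is not the actual optim.py/ScaledAdam code:

```python
# Hypothetical gradient-norm bookkeeping that reproduces the quantities these log
# lines print; the class name, window size and exact clipping rule are assumptions.
from collections import deque
import torch

class GradNormStats:
    def __init__(self, clipping_scale: float = 2.0, window: int = 1000):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)  # recent per-batch gradient norms

    def update(self, model: torch.nn.Module) -> str:
        # Total L2 norm of all parameter gradients for this batch.
        total = torch.norm(torch.stack([
            p.grad.detach().norm() for p in model.parameters() if p.grad is not None
        ]))
        self.norms.append(float(total))
        hist = torch.tensor(list(self.norms))
        q = torch.quantile(hist, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.clipping_scale * float(q[2])  # scale * median (inferred)
        pct = float((hist > threshold).float().mean()) * 100.0
        return (f"grad-norm quartiles {' '.join(f'{v:.3e}' for v in q.tolist())}, "
                f"threshold={threshold:.3e}, percent-clipped={pct:.1f}")
```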
2023-06-15 05:18:35,086 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 05:18:35,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=42593.333333333336, ans=0.125 2023-06-15 05:18:39,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=42593.333333333336, ans=0.125 2023-06-15 05:18:41,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=42593.333333333336, ans=0.125 2023-06-15 05:18:58,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=42660.0, ans=22.5 2023-06-15 05:19:10,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=42726.666666666664, ans=0.0 2023-06-15 05:19:13,299 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.726e+02 2.235e+02 2.660e+02 3.477e+02 4.514e+02, threshold=5.320e+02, percent-clipped=0.0 2023-06-15 05:19:24,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=42726.666666666664, ans=10.0 2023-06-15 05:19:35,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=42793.333333333336, ans=6.0 2023-06-15 05:19:49,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=42860.0, ans=0.125 2023-06-15 05:20:04,770 INFO [train.py:988] (2/4) Epoch 13, batch 50, loss[loss=0.2714, simple_loss=0.3441, pruned_loss=0.09939, over 18298.00 frames. ], tot_loss[loss=0.2836, simple_loss=0.3444, pruned_loss=0.1115, over 852243.71 frames. ], batch size: 74, lr: 2.16e-02, grad_scale: 32.0 2023-06-15 05:20:34,614 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.46 vs. limit=15.0 2023-06-15 05:20:37,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=42993.333333333336, ans=0.125 2023-06-15 05:20:41,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=43060.0, ans=0.0 2023-06-15 05:20:44,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=43060.0, ans=0.0 2023-06-15 05:20:46,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=43060.0, ans=0.035 2023-06-15 05:21:30,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=43193.333333333336, ans=0.125 2023-06-15 05:21:30,706 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.99 vs. limit=22.5 2023-06-15 05:21:33,300 INFO [train.py:988] (2/4) Epoch 13, batch 100, loss[loss=0.2672, simple_loss=0.3298, pruned_loss=0.1023, over 19340.00 frames. ], tot_loss[loss=0.2843, simple_loss=0.3454, pruned_loss=0.1116, over 1504551.81 frames. 
], batch size: 98, lr: 2.16e-02, grad_scale: 32.0 2023-06-15 05:21:35,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=43260.0, ans=0.125 2023-06-15 05:21:42,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=43260.0, ans=0.0 2023-06-15 05:21:46,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=43260.0, ans=0.1 2023-06-15 05:21:49,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=43326.666666666664, ans=0.125 2023-06-15 05:21:49,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=43326.666666666664, ans=0.125 2023-06-15 05:21:58,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=43326.666666666664, ans=0.00145072463768116 2023-06-15 05:22:10,402 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.673e+02 1.949e+02 2.269e+02 2.644e+02 4.836e+02, threshold=4.538e+02, percent-clipped=0.0 2023-06-15 05:22:13,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=43393.333333333336, ans=0.125 2023-06-15 05:22:41,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=43460.0, ans=0.2 2023-06-15 05:22:44,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=43526.666666666664, ans=0.1 2023-06-15 05:23:00,739 INFO [train.py:988] (2/4) Epoch 13, batch 150, loss[loss=0.2758, simple_loss=0.3438, pruned_loss=0.1039, over 19620.00 frames. ], tot_loss[loss=0.2847, simple_loss=0.3456, pruned_loss=0.1119, over 2001700.92 frames. ], batch size: 110, lr: 2.15e-02, grad_scale: 32.0 2023-06-15 05:23:01,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=43593.333333333336, ans=0.0013927536231884054 2023-06-15 05:23:03,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=43593.333333333336, ans=0.1 2023-06-15 05:23:22,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=43660.0, ans=0.125 2023-06-15 05:23:58,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=43793.333333333336, ans=0.125 2023-06-15 05:24:28,505 INFO [train.py:988] (2/4) Epoch 13, batch 200, loss[loss=0.28, simple_loss=0.3458, pruned_loss=0.1071, over 19207.00 frames. ], tot_loss[loss=0.2838, simple_loss=0.3443, pruned_loss=0.1117, over 2393129.35 frames. ], batch size: 92, lr: 2.15e-02, grad_scale: 32.0 2023-06-15 05:24:41,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=43926.666666666664, ans=0.05 2023-06-15 05:24:45,057 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.22 vs. 
limit=22.5 2023-06-15 05:25:05,899 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.630e+02 2.172e+02 2.424e+02 2.924e+02 5.184e+02, threshold=4.848e+02, percent-clipped=5.0 2023-06-15 05:25:10,134 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.32 vs. limit=6.0 2023-06-15 05:25:12,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=44060.0, ans=0.5 2023-06-15 05:25:14,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=44060.0, ans=0.125 2023-06-15 05:25:56,669 INFO [train.py:988] (2/4) Epoch 13, batch 250, loss[loss=0.2888, simple_loss=0.3419, pruned_loss=0.1178, over 19926.00 frames. ], tot_loss[loss=0.2848, simple_loss=0.3441, pruned_loss=0.1127, over 2701246.71 frames. ], batch size: 126, lr: 2.15e-02, grad_scale: 16.0 2023-06-15 05:26:15,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=44326.666666666664, ans=0.0 2023-06-15 05:26:32,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=44393.333333333336, ans=0.2 2023-06-15 05:26:58,752 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.89 vs. limit=15.0 2023-06-15 05:27:05,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=44526.666666666664, ans=0.125 2023-06-15 05:27:16,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=44526.666666666664, ans=0.0011898550724637694 2023-06-15 05:27:21,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=44526.666666666664, ans=0.1 2023-06-15 05:27:22,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=44593.333333333336, ans=0.125 2023-06-15 05:27:24,266 INFO [train.py:988] (2/4) Epoch 13, batch 300, loss[loss=0.2848, simple_loss=0.3447, pruned_loss=0.1124, over 19218.00 frames. ], tot_loss[loss=0.2846, simple_loss=0.3438, pruned_loss=0.1127, over 2942113.68 frames. ], batch size: 92, lr: 2.14e-02, grad_scale: 16.0 2023-06-15 05:27:50,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=44660.0, ans=0.1 2023-06-15 05:27:52,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=44660.0, ans=0.0 2023-06-15 05:28:03,424 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.717e+02 2.137e+02 2.490e+02 3.192e+02 5.767e+02, threshold=4.980e+02, percent-clipped=3.0 2023-06-15 05:28:14,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=44726.666666666664, ans=0.0 2023-06-15 05:28:52,645 INFO [train.py:988] (2/4) Epoch 13, batch 350, loss[loss=0.2988, simple_loss=0.3701, pruned_loss=0.1138, over 15342.00 frames. ], tot_loss[loss=0.2849, simple_loss=0.3447, pruned_loss=0.1125, over 3119771.18 frames. 
], batch size: 43, lr: 2.14e-02, grad_scale: 16.0 2023-06-15 05:28:56,504 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 05:29:05,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=44926.666666666664, ans=0.0 2023-06-15 05:29:15,869 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.27 vs. limit=15.0 2023-06-15 05:29:33,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=45060.0, ans=0.125 2023-06-15 05:29:38,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=45060.0, ans=0.0 2023-06-15 05:30:06,694 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.23 vs. limit=15.0 2023-06-15 05:30:19,948 INFO [train.py:988] (2/4) Epoch 13, batch 400, loss[loss=0.257, simple_loss=0.3324, pruned_loss=0.09081, over 17106.00 frames. ], tot_loss[loss=0.2841, simple_loss=0.3445, pruned_loss=0.1118, over 3264503.50 frames. ], batch size: 60, lr: 2.14e-02, grad_scale: 32.0 2023-06-15 05:30:20,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=45260.0, ans=0.95 2023-06-15 05:30:23,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=45260.0, ans=0.125 2023-06-15 05:30:50,784 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.86 vs. limit=22.5 2023-06-15 05:30:59,582 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.082e+02 2.371e+02 2.770e+02 5.646e+02, threshold=4.742e+02, percent-clipped=0.0 2023-06-15 05:31:05,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=45393.333333333336, ans=0.2 2023-06-15 05:31:05,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=45393.333333333336, ans=0.125 2023-06-15 05:31:08,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=45393.333333333336, ans=0.0 2023-06-15 05:31:09,523 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.10 vs. limit=12.0 2023-06-15 05:31:10,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=45393.333333333336, ans=10.0 2023-06-15 05:31:15,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=45460.0, ans=0.00098695652173913 2023-06-15 05:31:48,743 INFO [train.py:988] (2/4) Epoch 13, batch 450, loss[loss=0.2687, simple_loss=0.3239, pruned_loss=0.1068, over 20306.00 frames. ], tot_loss[loss=0.283, simple_loss=0.3427, pruned_loss=0.1117, over 3387596.62 frames. 
], batch size: 141, lr: 2.13e-02, grad_scale: 32.0 2023-06-15 05:31:49,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=45593.333333333336, ans=0.1 2023-06-15 05:31:51,434 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.35 vs. limit=15.0 2023-06-15 05:32:45,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=45793.333333333336, ans=0.0 2023-06-15 05:32:45,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=45793.333333333336, ans=0.125 2023-06-15 05:33:01,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=45860.0, ans=0.125 2023-06-15 05:33:04,100 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.17 vs. limit=6.0 2023-06-15 05:33:13,468 INFO [train.py:988] (2/4) Epoch 13, batch 500, loss[loss=0.2771, simple_loss=0.3405, pruned_loss=0.1068, over 19858.00 frames. ], tot_loss[loss=0.2827, simple_loss=0.3426, pruned_loss=0.1114, over 3481221.76 frames. ], batch size: 115, lr: 2.13e-02, grad_scale: 32.0 2023-06-15 05:33:20,872 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.37 vs. limit=15.0 2023-06-15 05:33:26,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=45926.666666666664, ans=0.2 2023-06-15 05:33:49,314 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.752e+02 2.112e+02 2.450e+02 3.177e+02 4.704e+02, threshold=4.901e+02, percent-clipped=1.0 2023-06-15 05:33:54,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=46060.0, ans=0.125 2023-06-15 05:33:56,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=46060.0, ans=0.2 2023-06-15 05:34:30,990 INFO [train.py:988] (2/4) Epoch 14, batch 0, loss[loss=0.2728, simple_loss=0.3395, pruned_loss=0.1031, over 18756.00 frames. ], tot_loss[loss=0.2728, simple_loss=0.3395, pruned_loss=0.1031, over 18756.00 frames. ], batch size: 83, lr: 2.05e-02, grad_scale: 32.0 2023-06-15 05:34:30,991 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 05:34:37,026 INFO [train.py:1020] (2/4) Epoch 14, validation: loss=0.2205, simple_loss=0.3248, pruned_loss=0.05804, over 143649.00 frames. 2023-06-15 05:34:37,026 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 05:34:53,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=46206.666666666664, ans=0.0 2023-06-15 05:35:54,290 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.81 vs. 
limit=15.0 2023-06-15 05:35:58,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=46406.666666666664, ans=0.125 2023-06-15 05:36:03,646 INFO [train.py:988] (2/4) Epoch 14, batch 50, loss[loss=0.2712, simple_loss=0.3273, pruned_loss=0.1075, over 20114.00 frames. ], tot_loss[loss=0.2843, simple_loss=0.3425, pruned_loss=0.1131, over 876254.64 frames. ], batch size: 133, lr: 2.05e-02, grad_scale: 32.0 2023-06-15 05:36:13,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=46473.333333333336, ans=0.0007666666666666672 2023-06-15 05:36:23,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=46540.0, ans=0.0007521739130434777 2023-06-15 05:36:24,589 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-15 05:36:32,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=46540.0, ans=0.1 2023-06-15 05:36:52,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=46606.666666666664, ans=0.125 2023-06-15 05:36:52,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=46606.666666666664, ans=0.2 2023-06-15 05:37:09,311 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.59 vs. limit=15.0 2023-06-15 05:37:14,165 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.20 vs. limit=10.0 2023-06-15 05:37:14,888 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.706e+02 2.122e+02 2.332e+02 2.601e+02 5.252e+02, threshold=4.663e+02, percent-clipped=1.0 2023-06-15 05:37:15,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=46740.0, ans=0.0 2023-06-15 05:37:21,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=46740.0, ans=0.125 2023-06-15 05:37:32,788 INFO [train.py:988] (2/4) Epoch 14, batch 100, loss[loss=0.2657, simple_loss=0.3356, pruned_loss=0.09785, over 19530.00 frames. ], tot_loss[loss=0.2788, simple_loss=0.3399, pruned_loss=0.1089, over 1534007.96 frames. ], batch size: 102, lr: 2.05e-02, grad_scale: 32.0 2023-06-15 05:37:48,265 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.11 vs. limit=6.0 2023-06-15 05:37:49,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=46873.333333333336, ans=0.125 2023-06-15 05:38:00,169 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.43 vs. 
limit=22.5 2023-06-15 05:38:01,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=46873.333333333336, ans=0.1 2023-06-15 05:38:06,203 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.87 vs. limit=22.5 2023-06-15 05:38:31,138 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.57 vs. limit=15.0 2023-06-15 05:39:01,366 INFO [train.py:988] (2/4) Epoch 14, batch 150, loss[loss=0.2808, simple_loss=0.3347, pruned_loss=0.1134, over 20641.00 frames. ], tot_loss[loss=0.2801, simple_loss=0.3411, pruned_loss=0.1096, over 2033990.61 frames. ], batch size: 173, lr: 2.04e-02, grad_scale: 32.0 2023-06-15 05:39:01,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=47140.0, ans=0.125 2023-06-15 05:39:08,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=47140.0, ans=0.07 2023-06-15 05:39:28,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=47206.666666666664, ans=0.07 2023-06-15 05:39:34,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=47273.333333333336, ans=0.0 2023-06-15 05:40:01,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=47340.0, ans=0.125 2023-06-15 05:40:01,145 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 05:40:09,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=47406.666666666664, ans=0.05 2023-06-15 05:40:11,162 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.584e+02 2.106e+02 2.352e+02 2.722e+02 4.842e+02, threshold=4.704e+02, percent-clipped=2.0 2023-06-15 05:40:11,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=47406.666666666664, ans=0.125 2023-06-15 05:40:16,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=47406.666666666664, ans=0.125 2023-06-15 05:40:20,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=47406.666666666664, ans=0.2 2023-06-15 05:40:25,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=47406.666666666664, ans=0.125 2023-06-15 05:40:28,790 INFO [train.py:988] (2/4) Epoch 14, batch 200, loss[loss=0.2736, simple_loss=0.3366, pruned_loss=0.1053, over 19691.00 frames. ], tot_loss[loss=0.2793, simple_loss=0.3408, pruned_loss=0.1089, over 2419315.48 frames. 
], batch size: 110, lr: 2.04e-02, grad_scale: 32.0 2023-06-15 05:40:50,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=47540.0, ans=0.2 2023-06-15 05:40:56,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=47540.0, ans=0.1 2023-06-15 05:41:36,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=47673.333333333336, ans=0.0 2023-06-15 05:41:36,752 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.57 vs. limit=15.0 2023-06-15 05:41:56,671 INFO [train.py:988] (2/4) Epoch 14, batch 250, loss[loss=0.2961, simple_loss=0.3406, pruned_loss=0.1258, over 20295.00 frames. ], tot_loss[loss=0.2785, simple_loss=0.3407, pruned_loss=0.1082, over 2729042.68 frames. ], batch size: 239, lr: 2.04e-02, grad_scale: 32.0 2023-06-15 05:42:14,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=47873.333333333336, ans=0.00046231884057970976 2023-06-15 05:42:42,941 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.23 vs. limit=15.0 2023-06-15 05:42:58,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=48006.666666666664, ans=0.125 2023-06-15 05:43:06,528 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 2.140e+02 2.401e+02 2.980e+02 6.123e+02, threshold=4.801e+02, percent-clipped=4.0 2023-06-15 05:43:07,739 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.04 vs. limit=22.5 2023-06-15 05:43:09,115 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.36 vs. limit=15.0 2023-06-15 05:43:23,498 INFO [train.py:988] (2/4) Epoch 14, batch 300, loss[loss=0.2975, simple_loss=0.3166, pruned_loss=0.1392, over 17017.00 frames. ], tot_loss[loss=0.2789, simple_loss=0.3408, pruned_loss=0.1085, over 2970304.21 frames. ], batch size: 392, lr: 2.03e-02, grad_scale: 32.0 2023-06-15 05:43:25,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=48140.0, ans=0.2 2023-06-15 05:44:36,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=48406.666666666664, ans=0.125 2023-06-15 05:44:50,838 INFO [train.py:988] (2/4) Epoch 14, batch 350, loss[loss=0.2864, simple_loss=0.3541, pruned_loss=0.1093, over 16938.00 frames. ], tot_loss[loss=0.2772, simple_loss=0.3391, pruned_loss=0.1076, over 3153276.08 frames. ], batch size: 60, lr: 2.03e-02, grad_scale: 32.0 2023-06-15 05:44:52,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=48473.333333333336, ans=0.125 2023-06-15 05:44:58,049 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.00 vs. 
limit=22.5 2023-06-15 05:45:56,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=48673.333333333336, ans=0.00028840579710144934 2023-06-15 05:45:59,717 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 2.238e+02 2.722e+02 3.535e+02 5.292e+02, threshold=5.444e+02, percent-clipped=1.0 2023-06-15 05:46:16,361 INFO [train.py:988] (2/4) Epoch 14, batch 400, loss[loss=0.2691, simple_loss=0.3358, pruned_loss=0.1012, over 18632.00 frames. ], tot_loss[loss=0.2778, simple_loss=0.3396, pruned_loss=0.108, over 3298743.63 frames. ], batch size: 80, lr: 2.03e-02, grad_scale: 32.0 2023-06-15 05:46:19,925 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.24 vs. limit=5.0 2023-06-15 05:46:52,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=48940.0, ans=0.0002304347826086947 2023-06-15 05:46:58,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=48940.0, ans=0.125 2023-06-15 05:47:12,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=49006.666666666664, ans=0.05 2023-06-15 05:47:13,461 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.05 vs. limit=15.0 2023-06-15 05:47:17,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=49006.666666666664, ans=0.125 2023-06-15 05:47:27,121 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.85 vs. limit=15.0 2023-06-15 05:47:39,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=49073.333333333336, ans=0.1 2023-06-15 05:47:42,468 INFO [train.py:988] (2/4) Epoch 14, batch 450, loss[loss=0.2828, simple_loss=0.3455, pruned_loss=0.1101, over 18799.00 frames. ], tot_loss[loss=0.2773, simple_loss=0.3395, pruned_loss=0.1075, over 3406804.92 frames. ], batch size: 83, lr: 2.02e-02, grad_scale: 32.0 2023-06-15 05:47:53,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=49140.0, ans=0.125 2023-06-15 05:47:59,632 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.57 vs. limit=15.0 2023-06-15 05:48:02,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=49206.666666666664, ans=0.035 2023-06-15 05:48:50,098 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.664e+02 2.110e+02 2.476e+02 3.051e+02 4.850e+02, threshold=4.953e+02, percent-clipped=0.0 2023-06-15 05:49:06,429 INFO [train.py:988] (2/4) Epoch 14, batch 500, loss[loss=0.2789, simple_loss=0.3436, pruned_loss=0.1071, over 19087.00 frames. ], tot_loss[loss=0.2764, simple_loss=0.3392, pruned_loss=0.1068, over 3498158.51 frames. 
], batch size: 89, lr: 2.02e-02, grad_scale: 32.0 2023-06-15 05:49:28,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=49540.0, ans=0.5 2023-06-15 05:49:38,838 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.64 vs. limit=15.0 2023-06-15 05:49:40,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=49606.666666666664, ans=0.2 2023-06-15 05:49:45,016 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.00 vs. limit=15.0 2023-06-15 05:50:25,274 INFO [train.py:988] (2/4) Epoch 15, batch 0, loss[loss=0.2668, simple_loss=0.3297, pruned_loss=0.102, over 18648.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3297, pruned_loss=0.102, over 18648.00 frames. ], batch size: 80, lr: 1.95e-02, grad_scale: 32.0 2023-06-15 05:50:25,275 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 05:50:31,405 INFO [train.py:1020] (2/4) Epoch 15, validation: loss=0.2189, simple_loss=0.3232, pruned_loss=0.05727, over 143649.00 frames. 2023-06-15 05:50:31,406 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 05:50:36,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=49693.333333333336, ans=0.1 2023-06-15 05:50:45,642 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.46 vs. limit=15.0 2023-06-15 05:50:46,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=49760.0, ans=0.0 2023-06-15 05:51:12,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=49826.666666666664, ans=0.2 2023-06-15 05:51:44,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=49960.0, ans=0.125 2023-06-15 05:51:58,709 INFO [train.py:988] (2/4) Epoch 15, batch 50, loss[loss=0.3007, simple_loss=0.3597, pruned_loss=0.1209, over 20136.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.3344, pruned_loss=0.1022, over 856357.08 frames. ], batch size: 133, lr: 1.95e-02, grad_scale: 32.0 2023-06-15 05:52:10,526 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.697e+02 2.143e+02 2.454e+02 2.855e+02 6.420e+02, threshold=4.907e+02, percent-clipped=3.0 2023-06-15 05:52:37,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=50160.0, ans=0.125 2023-06-15 05:52:55,754 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.01 vs. limit=22.5 2023-06-15 05:53:10,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=50293.333333333336, ans=0.1 2023-06-15 05:53:26,705 INFO [train.py:988] (2/4) Epoch 15, batch 100, loss[loss=0.297, simple_loss=0.3638, pruned_loss=0.1151, over 16967.00 frames. ], tot_loss[loss=0.2715, simple_loss=0.3358, pruned_loss=0.1036, over 1502632.60 frames. 
], batch size: 60, lr: 1.95e-02, grad_scale: 32.0 2023-06-15 05:53:30,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=50360.0, ans=0.0 2023-06-15 05:53:32,239 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.17 vs. limit=15.0 2023-06-15 05:53:40,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=50360.0, ans=0.125 2023-06-15 05:53:42,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=50426.666666666664, ans=0.125 2023-06-15 05:53:55,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=50426.666666666664, ans=0.09899494936611666 2023-06-15 05:54:54,019 INFO [train.py:988] (2/4) Epoch 15, batch 150, loss[loss=0.2574, simple_loss=0.3292, pruned_loss=0.09282, over 18309.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3352, pruned_loss=0.1029, over 2011814.65 frames. ], batch size: 74, lr: 1.94e-02, grad_scale: 32.0 2023-06-15 05:55:06,223 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.747e+02 2.090e+02 2.403e+02 2.891e+02 4.165e+02, threshold=4.806e+02, percent-clipped=0.0 2023-06-15 05:55:33,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=50826.666666666664, ans=0.09899494936611666 2023-06-15 05:55:48,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=50893.333333333336, ans=0.0 2023-06-15 05:55:52,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=50893.333333333336, ans=0.1 2023-06-15 05:55:55,985 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.23 vs. limit=15.0 2023-06-15 05:56:22,231 INFO [train.py:988] (2/4) Epoch 15, batch 200, loss[loss=0.2564, simple_loss=0.3279, pruned_loss=0.0924, over 19097.00 frames. ], tot_loss[loss=0.2713, simple_loss=0.3355, pruned_loss=0.1036, over 2415568.82 frames. ], batch size: 94, lr: 1.94e-02, grad_scale: 32.0 2023-06-15 05:56:30,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=51026.666666666664, ans=0.07 2023-06-15 05:57:11,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=51160.0, ans=0.2 2023-06-15 05:57:11,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=51160.0, ans=0.95 2023-06-15 05:57:30,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=51293.333333333336, ans=0.0 2023-06-15 05:57:40,631 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=15.0 2023-06-15 05:57:50,254 INFO [train.py:988] (2/4) Epoch 15, batch 250, loss[loss=0.278, simple_loss=0.3384, pruned_loss=0.1088, over 18939.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3353, pruned_loss=0.1034, over 2720471.76 frames. 
], batch size: 86, lr: 1.94e-02, grad_scale: 32.0 2023-06-15 05:58:03,088 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 1.983e+02 2.255e+02 2.708e+02 4.170e+02, threshold=4.510e+02, percent-clipped=0.0 2023-06-15 05:58:07,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=51426.666666666664, ans=0.1 2023-06-15 05:58:16,573 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.93 vs. limit=15.0 2023-06-15 05:58:33,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=51493.333333333336, ans=0.0 2023-06-15 05:58:41,735 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.42 vs. limit=15.0 2023-06-15 05:58:52,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=51560.0, ans=0.0 2023-06-15 05:59:19,777 INFO [train.py:988] (2/4) Epoch 15, batch 300, loss[loss=0.2708, simple_loss=0.33, pruned_loss=0.1058, over 20300.00 frames. ], tot_loss[loss=0.2724, simple_loss=0.3359, pruned_loss=0.1045, over 2964867.71 frames. ], batch size: 149, lr: 1.93e-02, grad_scale: 32.0 2023-06-15 05:59:23,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=51693.333333333336, ans=0.2 2023-06-15 05:59:35,628 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 05:59:37,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=51760.0, ans=0.0 2023-06-15 05:59:49,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=51760.0, ans=0.125 2023-06-15 06:00:12,096 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.47 vs. limit=12.0 2023-06-15 06:00:18,497 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.64 vs. limit=15.0 2023-06-15 06:00:31,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=51960.0, ans=0.0 2023-06-15 06:00:47,518 INFO [train.py:988] (2/4) Epoch 15, batch 350, loss[loss=0.2682, simple_loss=0.3383, pruned_loss=0.09904, over 19689.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.3353, pruned_loss=0.1042, over 3140170.73 frames. ], batch size: 110, lr: 1.93e-02, grad_scale: 32.0 2023-06-15 06:00:52,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=52026.666666666664, ans=0.125 2023-06-15 06:01:00,525 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.637e+02 2.062e+02 2.429e+02 2.907e+02 4.781e+02, threshold=4.857e+02, percent-clipped=2.0 2023-06-15 06:01:41,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=52226.666666666664, ans=0.0 2023-06-15 06:02:16,543 INFO [train.py:988] (2/4) Epoch 15, batch 400, loss[loss=0.2561, simple_loss=0.3178, pruned_loss=0.09724, over 20316.00 frames. 
], tot_loss[loss=0.2718, simple_loss=0.3347, pruned_loss=0.1044, over 3279009.80 frames. ], batch size: 149, lr: 1.93e-02, grad_scale: 32.0 2023-06-15 06:02:37,918 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.35 vs. limit=12.0 2023-06-15 06:02:39,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=52426.666666666664, ans=0.125 2023-06-15 06:03:29,419 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.64 vs. limit=22.5 2023-06-15 06:03:43,723 INFO [train.py:988] (2/4) Epoch 15, batch 450, loss[loss=0.2756, simple_loss=0.325, pruned_loss=0.113, over 20561.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3346, pruned_loss=0.1041, over 3378155.56 frames. ], batch size: 189, lr: 1.92e-02, grad_scale: 32.0 2023-06-15 06:03:51,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=52693.333333333336, ans=0.0 2023-06-15 06:03:56,851 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.715e+02 2.130e+02 2.421e+02 3.094e+02 4.907e+02, threshold=4.841e+02, percent-clipped=1.0 2023-06-15 06:04:03,300 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=12.0 2023-06-15 06:04:04,419 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 06:05:10,621 INFO [train.py:988] (2/4) Epoch 15, batch 500, loss[loss=0.2673, simple_loss=0.3162, pruned_loss=0.1092, over 20265.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.3335, pruned_loss=0.1041, over 3467952.93 frames. ], batch size: 239, lr: 1.92e-02, grad_scale: 32.0 2023-06-15 06:05:55,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=53160.0, ans=0.95 2023-06-15 06:06:00,556 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.06 vs. limit=15.0 2023-06-15 06:06:28,441 INFO [train.py:988] (2/4) Epoch 16, batch 0, loss[loss=0.2586, simple_loss=0.3288, pruned_loss=0.09415, over 19320.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3288, pruned_loss=0.09415, over 19320.00 frames. ], batch size: 98, lr: 1.86e-02, grad_scale: 32.0 2023-06-15 06:06:28,442 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 06:06:34,509 INFO [train.py:1020] (2/4) Epoch 16, validation: loss=0.2134, simple_loss=0.3194, pruned_loss=0.05367, over 143649.00 frames. 2023-06-15 06:06:34,510 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 06:06:44,059 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.85 vs. limit=15.0 2023-06-15 06:07:17,554 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.09 vs. 
limit=15.0 2023-06-15 06:07:19,731 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.212e+02 2.676e+02 3.191e+02 5.269e+02, threshold=5.353e+02, percent-clipped=1.0 2023-06-15 06:08:03,130 INFO [train.py:988] (2/4) Epoch 16, batch 50, loss[loss=0.2519, simple_loss=0.3315, pruned_loss=0.08612, over 18282.00 frames. ], tot_loss[loss=0.2674, simple_loss=0.332, pruned_loss=0.1014, over 861810.90 frames. ], batch size: 74, lr: 1.86e-02, grad_scale: 32.0 2023-06-15 06:08:08,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=53573.333333333336, ans=0.125 2023-06-15 06:08:10,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=53573.333333333336, ans=0.125 2023-06-15 06:09:19,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=53840.0, ans=0.1 2023-06-15 06:09:29,413 INFO [train.py:988] (2/4) Epoch 16, batch 100, loss[loss=0.2782, simple_loss=0.3398, pruned_loss=0.1083, over 19520.00 frames. ], tot_loss[loss=0.2662, simple_loss=0.3315, pruned_loss=0.1005, over 1505461.62 frames. ], batch size: 102, lr: 1.85e-02, grad_scale: 32.0 2023-06-15 06:09:37,563 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=4.59 vs. limit=15.0 2023-06-15 06:10:06,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=54040.0, ans=0.2 2023-06-15 06:10:07,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=54040.0, ans=0.0 2023-06-15 06:10:12,485 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.622e+02 1.999e+02 2.214e+02 2.667e+02 3.874e+02, threshold=4.428e+02, percent-clipped=0.0 2023-06-15 06:10:20,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=54106.666666666664, ans=0.125 2023-06-15 06:10:55,311 INFO [train.py:988] (2/4) Epoch 16, batch 150, loss[loss=0.2584, simple_loss=0.3232, pruned_loss=0.09673, over 19119.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.331, pruned_loss=0.1004, over 2014652.07 frames. ], batch size: 94, lr: 1.85e-02, grad_scale: 32.0 2023-06-15 06:10:58,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=54240.0, ans=0.125 2023-06-15 06:11:22,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=54306.666666666664, ans=0.125 2023-06-15 06:11:29,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=54373.333333333336, ans=0.0 2023-06-15 06:11:50,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=54440.0, ans=22.5 2023-06-15 06:12:12,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=54506.666666666664, ans=0.125 2023-06-15 06:12:22,651 INFO [train.py:988] (2/4) Epoch 16, batch 200, loss[loss=0.3036, simple_loss=0.3781, pruned_loss=0.1146, over 17615.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.3314, pruned_loss=0.1019, over 2403125.57 frames. 
], batch size: 67, lr: 1.85e-02, grad_scale: 32.0 2023-06-15 06:12:56,249 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.21 vs. limit=15.0 2023-06-15 06:13:05,778 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.741e+02 2.168e+02 2.479e+02 3.039e+02 4.350e+02, threshold=4.958e+02, percent-clipped=0.0 2023-06-15 06:13:25,964 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.13 vs. limit=15.0 2023-06-15 06:13:30,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=54840.0, ans=0.125 2023-06-15 06:13:50,153 INFO [train.py:988] (2/4) Epoch 16, batch 250, loss[loss=0.2495, simple_loss=0.3191, pruned_loss=0.08996, over 19069.00 frames. ], tot_loss[loss=0.2674, simple_loss=0.3318, pruned_loss=0.1015, over 2707542.91 frames. ], batch size: 89, lr: 1.85e-02, grad_scale: 32.0 2023-06-15 06:13:56,376 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=15.0 2023-06-15 06:14:11,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=54973.333333333336, ans=15.0 2023-06-15 06:14:18,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=54973.333333333336, ans=0.0 2023-06-15 06:14:19,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=54973.333333333336, ans=0.1 2023-06-15 06:14:28,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=55040.0, ans=0.125 2023-06-15 06:14:43,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=55106.666666666664, ans=0.09899494936611666 2023-06-15 06:14:52,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=55106.666666666664, ans=0.1 2023-06-15 06:14:56,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=55106.666666666664, ans=0.125 2023-06-15 06:15:08,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=55173.333333333336, ans=0.1 2023-06-15 06:15:14,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=55240.0, ans=0.0 2023-06-15 06:15:15,819 INFO [train.py:988] (2/4) Epoch 16, batch 300, loss[loss=0.2846, simple_loss=0.3506, pruned_loss=0.1093, over 16723.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.332, pruned_loss=0.1011, over 2931101.39 frames. 
], batch size: 59, lr: 1.84e-02, grad_scale: 32.0 2023-06-15 06:15:17,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=55240.0, ans=0.0 2023-06-15 06:15:23,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=55240.0, ans=0.125 2023-06-15 06:15:59,334 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.606e+02 2.170e+02 2.637e+02 3.215e+02 4.848e+02, threshold=5.274e+02, percent-clipped=0.0 2023-06-15 06:16:36,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=55506.666666666664, ans=0.5 2023-06-15 06:16:39,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=55506.666666666664, ans=0.125 2023-06-15 06:16:43,040 INFO [train.py:988] (2/4) Epoch 16, batch 350, loss[loss=0.2631, simple_loss=0.3258, pruned_loss=0.1002, over 20650.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3311, pruned_loss=0.1011, over 3132526.00 frames. ], batch size: 211, lr: 1.84e-02, grad_scale: 32.0 2023-06-15 06:17:16,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=55706.666666666664, ans=0.125 2023-06-15 06:18:04,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=55840.0, ans=0.2 2023-06-15 06:18:10,029 INFO [train.py:988] (2/4) Epoch 16, batch 400, loss[loss=0.2606, simple_loss=0.3261, pruned_loss=0.09749, over 19695.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.332, pruned_loss=0.1012, over 3259941.58 frames. ], batch size: 110, lr: 1.84e-02, grad_scale: 32.0 2023-06-15 06:18:31,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=55973.333333333336, ans=0.2 2023-06-15 06:18:50,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=56040.0, ans=0.2 2023-06-15 06:18:53,058 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.31 vs. limit=15.0 2023-06-15 06:18:53,438 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.986e+02 2.238e+02 2.515e+02 3.872e+02, threshold=4.476e+02, percent-clipped=0.0 2023-06-15 06:19:18,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=56173.333333333336, ans=0.2 2023-06-15 06:19:21,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=56173.333333333336, ans=0.125 2023-06-15 06:19:36,766 INFO [train.py:988] (2/4) Epoch 16, batch 450, loss[loss=0.2421, simple_loss=0.3126, pruned_loss=0.08577, over 19924.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3315, pruned_loss=0.1004, over 3382371.04 frames. 
], batch size: 120, lr: 1.83e-02, grad_scale: 32.0 2023-06-15 06:19:45,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=56240.0, ans=0.0 2023-06-15 06:20:01,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=56306.666666666664, ans=0.2 2023-06-15 06:20:28,359 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.79 vs. limit=15.0 2023-06-15 06:20:59,734 INFO [train.py:988] (2/4) Epoch 16, batch 500, loss[loss=0.2505, simple_loss=0.3269, pruned_loss=0.0871, over 19343.00 frames. ], tot_loss[loss=0.2665, simple_loss=0.3316, pruned_loss=0.1007, over 3476071.82 frames. ], batch size: 98, lr: 1.83e-02, grad_scale: 32.0 2023-06-15 06:21:40,069 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 2.020e+02 2.317e+02 2.823e+02 4.617e+02, threshold=4.634e+02, percent-clipped=2.0 2023-06-15 06:22:11,789 INFO [train.py:988] (2/4) Epoch 17, batch 0, loss[loss=0.2449, simple_loss=0.3199, pruned_loss=0.08499, over 19537.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3199, pruned_loss=0.08499, over 19537.00 frames. ], batch size: 102, lr: 1.78e-02, grad_scale: 32.0 2023-06-15 06:22:11,789 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 06:22:17,838 INFO [train.py:1020] (2/4) Epoch 17, validation: loss=0.2144, simple_loss=0.3175, pruned_loss=0.05564, over 143649.00 frames. 2023-06-15 06:22:17,839 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 06:22:53,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=56920.0, ans=0.05 2023-06-15 06:22:55,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=56920.0, ans=0.125 2023-06-15 06:23:25,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=56986.666666666664, ans=0.0 2023-06-15 06:23:45,549 INFO [train.py:988] (2/4) Epoch 17, batch 50, loss[loss=0.2625, simple_loss=0.3267, pruned_loss=0.09917, over 20106.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3289, pruned_loss=0.09876, over 873593.85 frames. ], batch size: 133, lr: 1.77e-02, grad_scale: 32.0 2023-06-15 06:23:49,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=57120.0, ans=0.0 2023-06-15 06:24:07,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=57186.666666666664, ans=0.1 2023-06-15 06:24:14,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=57186.666666666664, ans=0.2 2023-06-15 06:24:33,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=57253.333333333336, ans=0.1 2023-06-15 06:25:01,588 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 2.063e+02 2.327e+02 2.647e+02 3.796e+02, threshold=4.655e+02, percent-clipped=0.0 2023-06-15 06:25:14,211 INFO [train.py:988] (2/4) Epoch 17, batch 100, loss[loss=0.2685, simple_loss=0.3538, pruned_loss=0.09157, over 17633.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3282, pruned_loss=0.1001, over 1521631.70 frames. 
], batch size: 67, lr: 1.77e-02, grad_scale: 32.0 2023-06-15 06:25:36,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=57520.0, ans=0.0 2023-06-15 06:25:49,509 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.08 vs. limit=15.0 2023-06-15 06:26:31,893 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=15.0 2023-06-15 06:26:41,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=57786.666666666664, ans=0.2 2023-06-15 06:26:42,445 INFO [train.py:988] (2/4) Epoch 17, batch 150, loss[loss=0.2552, simple_loss=0.3212, pruned_loss=0.09462, over 19980.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3276, pruned_loss=0.09962, over 2011222.34 frames. ], batch size: 126, lr: 1.77e-02, grad_scale: 64.0 2023-06-15 06:27:32,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=57920.0, ans=0.1 2023-06-15 06:27:56,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=58053.333333333336, ans=12.0 2023-06-15 06:27:57,483 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.716e+02 2.306e+02 2.737e+02 3.252e+02 5.355e+02, threshold=5.474e+02, percent-clipped=3.0 2023-06-15 06:28:09,784 INFO [train.py:988] (2/4) Epoch 17, batch 200, loss[loss=0.258, simple_loss=0.326, pruned_loss=0.09502, over 18485.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3282, pruned_loss=0.09881, over 2395603.59 frames. ], batch size: 77, lr: 1.76e-02, grad_scale: 64.0 2023-06-15 06:28:10,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=58120.0, ans=22.5 2023-06-15 06:28:27,242 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.57 vs. limit=15.0 2023-06-15 06:28:35,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=58186.666666666664, ans=0.05 2023-06-15 06:28:37,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=58186.666666666664, ans=0.1 2023-06-15 06:28:44,421 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.49 vs. 
limit=15.0 2023-06-15 06:28:47,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=58253.333333333336, ans=0.125 2023-06-15 06:29:16,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=58320.0, ans=0.0 2023-06-15 06:29:16,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=58320.0, ans=0.125 2023-06-15 06:29:36,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=58453.333333333336, ans=0.125 2023-06-15 06:29:38,282 INFO [train.py:988] (2/4) Epoch 17, batch 250, loss[loss=0.2858, simple_loss=0.3397, pruned_loss=0.116, over 20294.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3284, pruned_loss=0.0991, over 2707144.49 frames. ], batch size: 141, lr: 1.76e-02, grad_scale: 64.0 2023-06-15 06:29:45,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=58453.333333333336, ans=0.1 2023-06-15 06:30:26,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=58586.666666666664, ans=0.125 2023-06-15 06:30:33,623 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.50 vs. limit=6.0 2023-06-15 06:30:45,110 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.74 vs. limit=6.0 2023-06-15 06:30:51,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=58720.0, ans=0.025 2023-06-15 06:30:52,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=58720.0, ans=0.125 2023-06-15 06:30:54,266 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.584e+02 1.968e+02 2.205e+02 2.465e+02 3.814e+02, threshold=4.411e+02, percent-clipped=0.0 2023-06-15 06:31:06,209 INFO [train.py:988] (2/4) Epoch 17, batch 300, loss[loss=0.2628, simple_loss=0.3254, pruned_loss=0.1001, over 20113.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3284, pruned_loss=0.09865, over 2945903.31 frames. ], batch size: 133, lr: 1.76e-02, grad_scale: 64.0 2023-06-15 06:31:19,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=58786.666666666664, ans=0.0 2023-06-15 06:31:22,473 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.31 vs. limit=15.0 2023-06-15 06:31:40,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=58920.0, ans=0.125 2023-06-15 06:31:47,331 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-06-15 06:31:59,658 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.37 vs. 
limit=15.0 2023-06-15 06:32:20,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=59053.333333333336, ans=0.0 2023-06-15 06:32:33,713 INFO [train.py:988] (2/4) Epoch 17, batch 350, loss[loss=0.2726, simple_loss=0.3311, pruned_loss=0.107, over 20591.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3282, pruned_loss=0.09806, over 3138459.51 frames. ], batch size: 173, lr: 1.76e-02, grad_scale: 64.0 2023-06-15 06:32:54,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=59186.666666666664, ans=0.025 2023-06-15 06:33:06,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=59186.666666666664, ans=0.125 2023-06-15 06:33:17,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=59253.333333333336, ans=0.125 2023-06-15 06:33:45,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.whiten.whitening_limit, batch_count=59386.666666666664, ans=12.0 2023-06-15 06:33:49,928 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.577e+02 2.042e+02 2.346e+02 2.711e+02 3.857e+02, threshold=4.693e+02, percent-clipped=0.0 2023-06-15 06:33:57,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=59386.666666666664, ans=0.125 2023-06-15 06:33:58,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=59386.666666666664, ans=0.1 2023-06-15 06:33:59,352 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.17 vs. limit=15.0 2023-06-15 06:34:01,698 INFO [train.py:988] (2/4) Epoch 17, batch 400, loss[loss=0.2676, simple_loss=0.3182, pruned_loss=0.1086, over 20204.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3279, pruned_loss=0.09801, over 3274540.67 frames. ], batch size: 239, lr: 1.75e-02, grad_scale: 64.0 2023-06-15 06:34:06,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=59453.333333333336, ans=0.1 2023-06-15 06:34:37,086 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.55 vs. limit=15.0 2023-06-15 06:34:46,952 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.54 vs. limit=15.0 2023-06-15 06:34:48,829 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-06-15 06:35:24,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=59720.0, ans=0.125 2023-06-15 06:35:27,602 INFO [train.py:988] (2/4) Epoch 17, batch 450, loss[loss=0.266, simple_loss=0.3254, pruned_loss=0.1033, over 20535.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3291, pruned_loss=0.09863, over 3370041.92 frames. 
], batch size: 173, lr: 1.75e-02, grad_scale: 64.0 2023-06-15 06:35:41,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=59786.666666666664, ans=0.1 2023-06-15 06:35:48,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=59853.333333333336, ans=0.1 2023-06-15 06:35:50,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=59853.333333333336, ans=0.0 2023-06-15 06:36:41,062 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.612e+02 2.138e+02 2.622e+02 3.155e+02 6.039e+02, threshold=5.245e+02, percent-clipped=6.0 2023-06-15 06:36:52,649 INFO [train.py:988] (2/4) Epoch 17, batch 500, loss[loss=0.2668, simple_loss=0.3412, pruned_loss=0.09618, over 18464.00 frames. ], tot_loss[loss=0.2627, simple_loss=0.3292, pruned_loss=0.09813, over 3464778.61 frames. ], batch size: 77, lr: 1.75e-02, grad_scale: 64.0 2023-06-15 06:36:54,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=60120.0, ans=0.0 2023-06-15 06:37:01,065 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.49 vs. limit=22.5 2023-06-15 06:37:05,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=60120.0, ans=0.125 2023-06-15 06:38:08,683 INFO [train.py:988] (2/4) Epoch 18, batch 0, loss[loss=0.2911, simple_loss=0.3616, pruned_loss=0.1103, over 18318.00 frames. ], tot_loss[loss=0.2911, simple_loss=0.3616, pruned_loss=0.1103, over 18318.00 frames. ], batch size: 72, lr: 1.70e-02, grad_scale: 64.0 2023-06-15 06:38:08,683 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 06:38:14,732 INFO [train.py:1020] (2/4) Epoch 18, validation: loss=0.2126, simple_loss=0.3161, pruned_loss=0.05459, over 143649.00 frames. 2023-06-15 06:38:14,733 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 06:38:22,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=60333.333333333336, ans=0.125 2023-06-15 06:38:24,004 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.34 vs. 
limit=22.5 2023-06-15 06:38:34,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=60400.0, ans=0.0 2023-06-15 06:38:51,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=60466.666666666664, ans=0.0 2023-06-15 06:39:19,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=60533.333333333336, ans=0.0 2023-06-15 06:39:32,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=60600.0, ans=0.0 2023-06-15 06:39:35,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=60600.0, ans=0.2 2023-06-15 06:39:40,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=60666.666666666664, ans=0.125 2023-06-15 06:39:42,222 INFO [train.py:988] (2/4) Epoch 18, batch 50, loss[loss=0.2593, simple_loss=0.3451, pruned_loss=0.08681, over 16385.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3247, pruned_loss=0.09309, over 837911.13 frames. ], batch size: 52, lr: 1.69e-02, grad_scale: 64.0 2023-06-15 06:40:00,852 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.965e+02 2.316e+02 2.715e+02 4.312e+02, threshold=4.632e+02, percent-clipped=0.0 2023-06-15 06:40:27,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=60800.0, ans=0.0 2023-06-15 06:40:36,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=60866.666666666664, ans=0.125 2023-06-15 06:40:40,828 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.81 vs. limit=15.0 2023-06-15 06:40:41,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=60866.666666666664, ans=0.125 2023-06-15 06:40:52,189 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.50 vs. limit=5.0 2023-06-15 06:41:10,153 INFO [train.py:988] (2/4) Epoch 18, batch 100, loss[loss=0.2758, simple_loss=0.3303, pruned_loss=0.1106, over 19988.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3266, pruned_loss=0.09602, over 1503205.69 frames. ], batch size: 126, lr: 1.69e-02, grad_scale: 64.0 2023-06-15 06:41:29,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=61066.666666666664, ans=0.125 2023-06-15 06:41:52,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=61133.333333333336, ans=0.125 2023-06-15 06:41:55,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=61133.333333333336, ans=0.0 2023-06-15 06:41:59,954 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.20 vs. 
limit=15.0 2023-06-15 06:42:02,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=61200.0, ans=0.125 2023-06-15 06:42:09,112 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.33 vs. limit=22.5 2023-06-15 06:42:37,680 INFO [train.py:988] (2/4) Epoch 18, batch 150, loss[loss=0.2655, simple_loss=0.3275, pruned_loss=0.1018, over 20294.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3257, pruned_loss=0.09464, over 2018970.54 frames. ], batch size: 141, lr: 1.69e-02, grad_scale: 64.0 2023-06-15 06:42:45,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=61333.333333333336, ans=0.125 2023-06-15 06:42:57,622 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.664e+02 2.047e+02 2.339e+02 2.700e+02 3.981e+02, threshold=4.677e+02, percent-clipped=0.0 2023-06-15 06:43:04,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=61400.0, ans=0.0 2023-06-15 06:43:10,687 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0 2023-06-15 06:43:16,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=61466.666666666664, ans=0.125 2023-06-15 06:43:32,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=61533.333333333336, ans=0.04949747468305833 2023-06-15 06:43:37,148 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.08 vs. limit=15.0 2023-06-15 06:43:45,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=61533.333333333336, ans=10.0 2023-06-15 06:43:53,334 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.09 vs. limit=10.0 2023-06-15 06:43:54,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=61600.0, ans=0.5 2023-06-15 06:43:57,343 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.81 vs. limit=15.0 2023-06-15 06:44:06,114 INFO [train.py:988] (2/4) Epoch 18, batch 200, loss[loss=0.2707, simple_loss=0.3304, pruned_loss=0.1055, over 20476.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3261, pruned_loss=0.0956, over 2398611.15 frames. ], batch size: 160, lr: 1.69e-02, grad_scale: 64.0 2023-06-15 06:45:23,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=61933.333333333336, ans=0.0 2023-06-15 06:45:29,188 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.11 vs. limit=6.0 2023-06-15 06:45:33,959 INFO [train.py:988] (2/4) Epoch 18, batch 250, loss[loss=0.2434, simple_loss=0.3243, pruned_loss=0.08128, over 17601.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3265, pruned_loss=0.09573, over 2701373.30 frames. 
], batch size: 67, lr: 1.68e-02, grad_scale: 64.0 2023-06-15 06:45:53,605 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 2.076e+02 2.248e+02 2.609e+02 3.858e+02, threshold=4.496e+02, percent-clipped=0.0 2023-06-15 06:46:18,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=62133.333333333336, ans=0.0 2023-06-15 06:46:54,087 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.22 vs. limit=12.0 2023-06-15 06:46:58,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=62266.666666666664, ans=0.0 2023-06-15 06:47:00,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=62333.333333333336, ans=0.04949747468305833 2023-06-15 06:47:02,676 INFO [train.py:988] (2/4) Epoch 18, batch 300, loss[loss=0.2685, simple_loss=0.3325, pruned_loss=0.1022, over 19684.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3258, pruned_loss=0.09598, over 2941112.06 frames. ], batch size: 110, lr: 1.68e-02, grad_scale: 64.0 2023-06-15 06:47:27,927 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.48 vs. limit=6.0 2023-06-15 06:47:46,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=62466.666666666664, ans=0.1 2023-06-15 06:48:30,187 INFO [train.py:988] (2/4) Epoch 18, batch 350, loss[loss=0.2455, simple_loss=0.3217, pruned_loss=0.08464, over 18302.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3258, pruned_loss=0.09584, over 3120973.18 frames. ], batch size: 74, lr: 1.68e-02, grad_scale: 32.0 2023-06-15 06:48:46,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=62733.333333333336, ans=0.125 2023-06-15 06:48:51,048 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.624e+02 2.098e+02 2.384e+02 2.713e+02 4.621e+02, threshold=4.767e+02, percent-clipped=2.0 2023-06-15 06:48:57,296 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.24 vs. limit=22.5 2023-06-15 06:49:13,352 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.10 vs. limit=22.5 2023-06-15 06:49:13,786 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.94 vs. limit=15.0 2023-06-15 06:49:45,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=62933.333333333336, ans=0.2 2023-06-15 06:49:55,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=63000.0, ans=0.09899494936611666 2023-06-15 06:49:57,004 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.34 vs. limit=15.0 2023-06-15 06:49:57,684 INFO [train.py:988] (2/4) Epoch 18, batch 400, loss[loss=0.275, simple_loss=0.3565, pruned_loss=0.09674, over 16358.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3262, pruned_loss=0.09576, over 3264464.92 frames. 
], batch size: 52, lr: 1.68e-02, grad_scale: 32.0 2023-06-15 06:49:59,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=63000.0, ans=0.2 2023-06-15 06:50:03,725 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.30 vs. limit=15.0 2023-06-15 06:50:14,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=63066.666666666664, ans=0.125 2023-06-15 06:50:24,874 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.22 vs. limit=22.5 2023-06-15 06:50:44,961 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.13 vs. limit=12.0 2023-06-15 06:51:20,653 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=15.0 2023-06-15 06:51:26,165 INFO [train.py:988] (2/4) Epoch 18, batch 450, loss[loss=0.2451, simple_loss=0.3113, pruned_loss=0.08946, over 20543.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3255, pruned_loss=0.09548, over 3384215.25 frames. ], batch size: 189, lr: 1.67e-02, grad_scale: 32.0 2023-06-15 06:51:30,900 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.30 vs. limit=15.0 2023-06-15 06:51:47,900 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 2.045e+02 2.298e+02 2.779e+02 4.422e+02, threshold=4.596e+02, percent-clipped=0.0 2023-06-15 06:51:48,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=63400.0, ans=0.0 2023-06-15 06:52:30,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=63533.333333333336, ans=0.125 2023-06-15 06:52:45,722 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.63 vs. limit=22.5 2023-06-15 06:52:48,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=63600.0, ans=0.0 2023-06-15 06:52:51,116 INFO [train.py:988] (2/4) Epoch 18, batch 500, loss[loss=0.2318, simple_loss=0.3096, pruned_loss=0.07704, over 11472.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3255, pruned_loss=0.09482, over 3447084.82 frames. 
], batch size: 32, lr: 1.67e-02, grad_scale: 32.0 2023-06-15 06:52:53,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=63666.666666666664, ans=0.2 2023-06-15 06:52:53,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=63666.666666666664, ans=0.0 2023-06-15 06:53:24,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=63800.0, ans=0.125 2023-06-15 06:53:28,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=63800.0, ans=0.1 2023-06-15 06:53:32,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=63800.0, ans=0.2 2023-06-15 06:53:36,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=63800.0, ans=0.1 2023-06-15 06:53:40,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=63866.666666666664, ans=0.125 2023-06-15 06:54:08,033 INFO [train.py:988] (2/4) Epoch 19, batch 0, loss[loss=0.2427, simple_loss=0.3144, pruned_loss=0.08547, over 18766.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3144, pruned_loss=0.08547, over 18766.00 frames. ], batch size: 83, lr: 1.62e-02, grad_scale: 32.0 2023-06-15 06:54:08,034 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 06:54:14,161 INFO [train.py:1020] (2/4) Epoch 19, validation: loss=0.2113, simple_loss=0.3157, pruned_loss=0.05349, over 143649.00 frames. 2023-06-15 06:54:14,162 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 06:54:17,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=63880.0, ans=0.125 2023-06-15 06:54:43,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=63946.666666666664, ans=0.125 2023-06-15 06:55:05,668 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.944e+02 2.133e+02 2.428e+02 3.266e+02, threshold=4.266e+02, percent-clipped=0.0 2023-06-15 06:55:12,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=64080.0, ans=0.1 2023-06-15 06:55:14,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=64080.0, ans=0.125 2023-06-15 06:55:17,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=64080.0, ans=0.1 2023-06-15 06:55:40,303 INFO [train.py:988] (2/4) Epoch 19, batch 50, loss[loss=0.2654, simple_loss=0.3455, pruned_loss=0.09265, over 16462.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3217, pruned_loss=0.09337, over 857864.37 frames. 
], batch size: 52, lr: 1.62e-02, grad_scale: 32.0 2023-06-15 06:56:09,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=64280.0, ans=0.125 2023-06-15 06:56:24,266 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 06:56:31,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=64413.333333333336, ans=0.05 2023-06-15 06:57:08,422 INFO [train.py:988] (2/4) Epoch 19, batch 100, loss[loss=0.2446, simple_loss=0.321, pruned_loss=0.08408, over 19324.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3223, pruned_loss=0.09362, over 1515900.07 frames. ], batch size: 98, lr: 1.62e-02, grad_scale: 32.0 2023-06-15 06:57:14,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=64546.666666666664, ans=0.05 2023-06-15 06:57:27,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=64613.333333333336, ans=0.2 2023-06-15 06:57:32,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=64613.333333333336, ans=0.2 2023-06-15 06:57:39,889 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.27 vs. limit=15.0 2023-06-15 06:57:50,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=64680.0, ans=0.125 2023-06-15 06:58:00,781 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.583e+02 2.073e+02 2.297e+02 2.619e+02 4.375e+02, threshold=4.594e+02, percent-clipped=1.0 2023-06-15 06:58:10,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=64746.666666666664, ans=0.0 2023-06-15 06:58:11,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=64746.666666666664, ans=0.2 2023-06-15 06:58:28,367 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.36 vs. limit=10.0 2023-06-15 06:58:32,060 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.68 vs. limit=22.5 2023-06-15 06:58:36,718 INFO [train.py:988] (2/4) Epoch 19, batch 150, loss[loss=0.2448, simple_loss=0.3017, pruned_loss=0.09393, over 20565.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3228, pruned_loss=0.09386, over 2019272.00 frames. ], batch size: 173, lr: 1.62e-02, grad_scale: 32.0 2023-06-15 06:58:56,505 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.68 vs. 
limit=15.0 2023-06-15 06:58:59,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=64946.666666666664, ans=0.0 2023-06-15 06:59:34,720 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 06:59:39,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=65080.0, ans=0.0 2023-06-15 06:59:41,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=65080.0, ans=0.125 2023-06-15 06:59:41,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=65080.0, ans=0.125 2023-06-15 06:59:46,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=65146.666666666664, ans=0.04949747468305833 2023-06-15 06:59:48,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=65146.666666666664, ans=0.0 2023-06-15 06:59:58,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=65146.666666666664, ans=0.125 2023-06-15 06:59:58,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=65146.666666666664, ans=0.125 2023-06-15 07:00:01,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=65146.666666666664, ans=0.125 2023-06-15 07:00:04,052 INFO [train.py:988] (2/4) Epoch 19, batch 200, loss[loss=0.2525, simple_loss=0.3092, pruned_loss=0.09785, over 20568.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3232, pruned_loss=0.09435, over 2416845.97 frames. ], batch size: 189, lr: 1.61e-02, grad_scale: 32.0 2023-06-15 07:00:04,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=65213.333333333336, ans=0.0 2023-06-15 07:00:18,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=65213.333333333336, ans=0.125 2023-06-15 07:00:28,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=65280.0, ans=0.0 2023-06-15 07:00:55,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=65413.333333333336, ans=0.1 2023-06-15 07:00:56,931 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.647e+02 1.984e+02 2.245e+02 2.601e+02 3.971e+02, threshold=4.489e+02, percent-clipped=0.0 2023-06-15 07:01:11,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=65413.333333333336, ans=0.5 2023-06-15 07:01:32,684 INFO [train.py:988] (2/4) Epoch 19, batch 250, loss[loss=0.2758, simple_loss=0.3287, pruned_loss=0.1115, over 20607.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3228, pruned_loss=0.09447, over 2725665.96 frames. 
], batch size: 189, lr: 1.61e-02, grad_scale: 32.0 2023-06-15 07:01:38,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=65546.66666666667, ans=0.0 2023-06-15 07:01:47,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=65546.66666666667, ans=0.125 2023-06-15 07:01:54,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=65613.33333333333, ans=0.0 2023-06-15 07:02:23,115 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.60 vs. limit=15.0 2023-06-15 07:02:57,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=65813.33333333333, ans=0.125 2023-06-15 07:03:01,180 INFO [train.py:988] (2/4) Epoch 19, batch 300, loss[loss=0.2708, simple_loss=0.3465, pruned_loss=0.09756, over 16746.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3228, pruned_loss=0.09442, over 2952914.34 frames. ], batch size: 59, lr: 1.61e-02, grad_scale: 32.0 2023-06-15 07:03:06,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=65880.0, ans=0.125 2023-06-15 07:03:08,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=65880.0, ans=0.1 2023-06-15 07:03:16,745 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.68 vs. limit=12.0 2023-06-15 07:03:19,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=65946.66666666667, ans=0.2 2023-06-15 07:03:43,388 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:03:45,231 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:03:53,119 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.956e+02 2.157e+02 2.430e+02 3.269e+02, threshold=4.313e+02, percent-clipped=0.0 2023-06-15 07:04:06,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=66080.0, ans=0.125 2023-06-15 07:04:12,050 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.38 vs. limit=15.0 2023-06-15 07:04:25,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=66146.66666666667, ans=0.125 2023-06-15 07:04:28,903 INFO [train.py:988] (2/4) Epoch 19, batch 350, loss[loss=0.2579, simple_loss=0.3313, pruned_loss=0.09221, over 19114.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3229, pruned_loss=0.09483, over 3109081.52 frames. 
], batch size: 94, lr: 1.61e-02, grad_scale: 32.0 2023-06-15 07:04:36,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=66213.33333333333, ans=0.125 2023-06-15 07:05:13,166 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.67 vs. limit=10.0 2023-06-15 07:05:24,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff2.min_abs, batch_count=66413.33333333333, ans=0.1 2023-06-15 07:05:27,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=66413.33333333333, ans=0.125 2023-06-15 07:05:51,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=66480.0, ans=0.2 2023-06-15 07:05:54,460 INFO [train.py:988] (2/4) Epoch 19, batch 400, loss[loss=0.2616, simple_loss=0.3259, pruned_loss=0.09864, over 20490.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3243, pruned_loss=0.0944, over 3247687.77 frames. ], batch size: 160, lr: 1.60e-02, grad_scale: 32.0 2023-06-15 07:06:12,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=66613.33333333333, ans=0.0 2023-06-15 07:06:47,306 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.583e+02 1.968e+02 2.388e+02 2.889e+02 4.258e+02, threshold=4.776e+02, percent-clipped=0.0 2023-06-15 07:07:01,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=66746.66666666667, ans=0.125 2023-06-15 07:07:22,220 INFO [train.py:988] (2/4) Epoch 19, batch 450, loss[loss=0.2513, simple_loss=0.3147, pruned_loss=0.09392, over 20353.00 frames. ], tot_loss[loss=0.256, simple_loss=0.324, pruned_loss=0.09396, over 3353330.84 frames. ], batch size: 149, lr: 1.60e-02, grad_scale: 32.0 2023-06-15 07:07:38,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=66946.66666666667, ans=0.0 2023-06-15 07:07:43,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=66946.66666666667, ans=0.125 2023-06-15 07:08:06,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=67013.33333333333, ans=0.1 2023-06-15 07:08:11,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=67013.33333333333, ans=0.07 2023-06-15 07:08:25,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=67080.0, ans=0.2 2023-06-15 07:08:27,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=67080.0, ans=0.125 2023-06-15 07:08:27,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=67080.0, ans=0.125 2023-06-15 07:08:41,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=67146.66666666667, ans=0.0 2023-06-15 07:08:48,902 INFO [train.py:988] (2/4) Epoch 19, batch 500, loss[loss=0.2827, simple_loss=0.3601, pruned_loss=0.1026, over 16724.00 frames. 
], tot_loss[loss=0.2554, simple_loss=0.3228, pruned_loss=0.094, over 3447144.87 frames. ], batch size: 59, lr: 1.60e-02, grad_scale: 32.0 2023-06-15 07:08:59,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=67213.33333333333, ans=0.09899494936611666 2023-06-15 07:09:07,963 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.49 vs. limit=15.0 2023-06-15 07:09:17,658 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.69 vs. limit=15.0 2023-06-15 07:09:18,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=67280.0, ans=0.2 2023-06-15 07:09:37,678 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.943e+02 2.120e+02 2.445e+02 3.405e+02, threshold=4.239e+02, percent-clipped=0.0 2023-06-15 07:09:39,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=67413.33333333333, ans=0.2 2023-06-15 07:10:07,206 INFO [train.py:988] (2/4) Epoch 20, batch 0, loss[loss=0.2659, simple_loss=0.345, pruned_loss=0.09342, over 18322.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.345, pruned_loss=0.09342, over 18322.00 frames. ], batch size: 72, lr: 1.56e-02, grad_scale: 32.0 2023-06-15 07:10:07,206 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 07:10:13,269 INFO [train.py:1020] (2/4) Epoch 20, validation: loss=0.2092, simple_loss=0.3126, pruned_loss=0.05295, over 143649.00 frames. 2023-06-15 07:10:13,271 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 07:10:24,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=67433.33333333333, ans=0.0 2023-06-15 07:10:35,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=67500.0, ans=0.125 2023-06-15 07:11:12,818 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.22 vs. limit=15.0 2023-06-15 07:11:21,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=67633.33333333333, ans=0.125 2023-06-15 07:11:31,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=67700.0, ans=0.125 2023-06-15 07:11:36,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=67700.0, ans=0.2 2023-06-15 07:11:41,415 INFO [train.py:988] (2/4) Epoch 20, batch 50, loss[loss=0.2562, simple_loss=0.3167, pruned_loss=0.09789, over 20648.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3193, pruned_loss=0.09029, over 857316.17 frames. 
], batch size: 211, lr: 1.55e-02, grad_scale: 32.0 2023-06-15 07:12:02,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=67833.33333333333, ans=0.2 2023-06-15 07:13:03,944 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.494e+02 2.103e+02 2.332e+02 2.749e+02 4.592e+02, threshold=4.664e+02, percent-clipped=1.0 2023-06-15 07:13:09,729 INFO [train.py:988] (2/4) Epoch 20, batch 100, loss[loss=0.258, simple_loss=0.345, pruned_loss=0.0855, over 17620.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3225, pruned_loss=0.09106, over 1505886.02 frames. ], batch size: 67, lr: 1.55e-02, grad_scale: 32.0 2023-06-15 07:13:27,044 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.15 vs. limit=15.0 2023-06-15 07:13:43,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=68233.33333333333, ans=0.95 2023-06-15 07:14:01,676 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.09 vs. limit=15.0 2023-06-15 07:14:11,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=68300.0, ans=0.0 2023-06-15 07:14:17,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=68300.0, ans=0.1 2023-06-15 07:14:17,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=68300.0, ans=0.125 2023-06-15 07:14:27,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=68366.66666666667, ans=0.2 2023-06-15 07:14:37,873 INFO [train.py:988] (2/4) Epoch 20, batch 150, loss[loss=0.26, simple_loss=0.336, pruned_loss=0.09202, over 18769.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3214, pruned_loss=0.09029, over 2014980.96 frames. ], batch size: 83, lr: 1.55e-02, grad_scale: 32.0 2023-06-15 07:14:44,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=68433.33333333333, ans=0.125 2023-06-15 07:14:46,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=68433.33333333333, ans=0.0 2023-06-15 07:15:13,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=68566.66666666667, ans=0.2 2023-06-15 07:15:59,994 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 1.981e+02 2.212e+02 2.517e+02 3.865e+02, threshold=4.424e+02, percent-clipped=0.0 2023-06-15 07:16:05,162 INFO [train.py:988] (2/4) Epoch 20, batch 200, loss[loss=0.2444, simple_loss=0.306, pruned_loss=0.09139, over 20603.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3213, pruned_loss=0.09074, over 2390910.85 frames. ], batch size: 189, lr: 1.55e-02, grad_scale: 32.0 2023-06-15 07:16:06,466 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.82 vs. 
limit=6.0 2023-06-15 07:16:51,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=68900.0, ans=0.125 2023-06-15 07:17:07,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=68966.66666666667, ans=0.07 2023-06-15 07:17:11,365 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:17:14,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=69033.33333333333, ans=0.0 2023-06-15 07:17:23,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=69033.33333333333, ans=0.125 2023-06-15 07:17:28,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=69033.33333333333, ans=0.125 2023-06-15 07:17:33,562 INFO [train.py:988] (2/4) Epoch 20, batch 250, loss[loss=0.233, simple_loss=0.3102, pruned_loss=0.07792, over 19107.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3213, pruned_loss=0.08995, over 2708232.68 frames. ], batch size: 89, lr: 1.54e-02, grad_scale: 32.0 2023-06-15 07:17:39,304 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.84 vs. limit=15.0 2023-06-15 07:17:46,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=69100.0, ans=0.0 2023-06-15 07:18:25,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=69300.0, ans=0.125 2023-06-15 07:18:29,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=69300.0, ans=0.125 2023-06-15 07:18:32,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=69300.0, ans=0.0 2023-06-15 07:18:37,458 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.22 vs. limit=6.0 2023-06-15 07:18:39,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=69300.0, ans=10.0 2023-06-15 07:18:40,896 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.16 vs. limit=10.0 2023-06-15 07:18:49,580 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.49 vs. limit=15.0 2023-06-15 07:18:55,593 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.907e+02 2.171e+02 2.570e+02 4.262e+02, threshold=4.342e+02, percent-clipped=0.0 2023-06-15 07:19:00,690 INFO [train.py:988] (2/4) Epoch 20, batch 300, loss[loss=0.2386, simple_loss=0.3095, pruned_loss=0.08381, over 18930.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3209, pruned_loss=0.09005, over 2956401.98 frames. 
], batch size: 86, lr: 1.54e-02, grad_scale: 32.0 2023-06-15 07:19:08,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=69433.33333333333, ans=0.125 2023-06-15 07:19:40,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=69566.66666666667, ans=0.125 2023-06-15 07:20:14,431 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.06 vs. limit=15.0 2023-06-15 07:20:17,184 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.36 vs. limit=15.0 2023-06-15 07:20:28,564 INFO [train.py:988] (2/4) Epoch 20, batch 350, loss[loss=0.2476, simple_loss=0.3297, pruned_loss=0.08272, over 10599.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3197, pruned_loss=0.0901, over 3118492.69 frames. ], batch size: 30, lr: 1.54e-02, grad_scale: 32.0 2023-06-15 07:20:29,200 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-15 07:20:50,646 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.40 vs. limit=15.0 2023-06-15 07:21:12,496 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-15 07:21:27,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=69966.66666666667, ans=0.1 2023-06-15 07:21:30,881 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.38 vs. limit=15.0 2023-06-15 07:21:33,834 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.57 vs. limit=15.0 2023-06-15 07:21:35,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=69966.66666666667, ans=0.07 2023-06-15 07:21:41,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=70033.33333333333, ans=0.2 2023-06-15 07:21:50,100 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 2.058e+02 2.277e+02 2.664e+02 3.684e+02, threshold=4.554e+02, percent-clipped=0.0 2023-06-15 07:21:55,243 INFO [train.py:988] (2/4) Epoch 20, batch 400, loss[loss=0.2619, simple_loss=0.333, pruned_loss=0.09537, over 18911.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3195, pruned_loss=0.09004, over 3260785.09 frames. 
], batch size: 86, lr: 1.54e-02, grad_scale: 32.0 2023-06-15 07:22:10,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=70100.0, ans=0.125 2023-06-15 07:22:35,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=70233.33333333333, ans=0.0 2023-06-15 07:22:40,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=70233.33333333333, ans=0.125 2023-06-15 07:22:42,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=70233.33333333333, ans=0.125 2023-06-15 07:22:42,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=70233.33333333333, ans=0.125 2023-06-15 07:23:23,991 INFO [train.py:988] (2/4) Epoch 20, batch 450, loss[loss=0.2693, simple_loss=0.3478, pruned_loss=0.09537, over 16245.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3199, pruned_loss=0.0907, over 3371300.88 frames. ], batch size: 52, lr: 1.54e-02, grad_scale: 32.0 2023-06-15 07:23:44,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=70500.0, ans=0.2 2023-06-15 07:23:57,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=70566.66666666667, ans=0.2 2023-06-15 07:24:05,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=70566.66666666667, ans=0.125 2023-06-15 07:24:13,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=70566.66666666667, ans=0.2 2023-06-15 07:24:42,534 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.80 vs. limit=12.0 2023-06-15 07:24:43,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=70700.0, ans=0.125 2023-06-15 07:24:44,753 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.638e+02 2.031e+02 2.228e+02 2.537e+02 4.676e+02, threshold=4.456e+02, percent-clipped=1.0 2023-06-15 07:24:49,643 INFO [train.py:988] (2/4) Epoch 20, batch 500, loss[loss=0.2448, simple_loss=0.3299, pruned_loss=0.07989, over 17648.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3202, pruned_loss=0.09046, over 3464451.10 frames. ], batch size: 67, lr: 1.53e-02, grad_scale: 32.0 2023-06-15 07:26:07,914 INFO [train.py:988] (2/4) Epoch 21, batch 0, loss[loss=0.2393, simple_loss=0.3097, pruned_loss=0.08442, over 18786.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3097, pruned_loss=0.08442, over 18786.00 frames. ], batch size: 83, lr: 1.49e-02, grad_scale: 32.0 2023-06-15 07:26:07,914 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 07:26:14,417 INFO [train.py:1020] (2/4) Epoch 21, validation: loss=0.209, simple_loss=0.3126, pruned_loss=0.05274, over 143649.00 frames. 
2023-06-15 07:26:14,418 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 07:26:26,351 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=1.281e-02 2023-06-15 07:26:35,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=71046.66666666667, ans=0.2 2023-06-15 07:26:50,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=71113.33333333333, ans=15.0 2023-06-15 07:27:07,547 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.97 vs. limit=6.0 2023-06-15 07:27:28,597 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.40 vs. limit=22.5 2023-06-15 07:27:33,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=71246.66666666667, ans=10.0 2023-06-15 07:27:41,938 INFO [train.py:988] (2/4) Epoch 21, batch 50, loss[loss=0.2353, simple_loss=0.3146, pruned_loss=0.07807, over 19544.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3187, pruned_loss=0.0916, over 862867.06 frames. ], batch size: 102, lr: 1.49e-02, grad_scale: 32.0 2023-06-15 07:27:50,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=71313.33333333333, ans=0.0 2023-06-15 07:28:00,040 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.04 vs. limit=15.0 2023-06-15 07:28:07,648 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.701e+02 2.043e+02 2.374e+02 2.878e+02 4.060e+02, threshold=4.748e+02, percent-clipped=0.0 2023-06-15 07:28:10,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=71380.0, ans=0.1 2023-06-15 07:28:29,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=71446.66666666667, ans=0.125 2023-06-15 07:28:38,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=71513.33333333333, ans=0.1 2023-06-15 07:29:09,313 INFO [train.py:988] (2/4) Epoch 21, batch 100, loss[loss=0.2496, simple_loss=0.3085, pruned_loss=0.09539, over 20736.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3174, pruned_loss=0.09032, over 1518654.41 frames. ], batch size: 211, lr: 1.49e-02, grad_scale: 32.0 2023-06-15 07:29:50,837 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:29:52,784 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.46 vs. 
limit=15.0 2023-06-15 07:30:06,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=71846.66666666667, ans=0.125 2023-06-15 07:30:09,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=71846.66666666667, ans=0.1 2023-06-15 07:30:13,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=71846.66666666667, ans=0.07 2023-06-15 07:30:16,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=71913.33333333333, ans=0.0 2023-06-15 07:30:17,228 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.86 vs. limit=15.0 2023-06-15 07:30:30,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=71913.33333333333, ans=0.125 2023-06-15 07:30:35,476 INFO [train.py:988] (2/4) Epoch 21, batch 150, loss[loss=0.2439, simple_loss=0.3224, pruned_loss=0.08267, over 19831.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3179, pruned_loss=0.08877, over 2016261.25 frames. ], batch size: 115, lr: 1.49e-02, grad_scale: 32.0 2023-06-15 07:30:39,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=71980.0, ans=0.5 2023-06-15 07:30:47,575 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.41 vs. limit=15.0 2023-06-15 07:31:01,901 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.683e+02 2.007e+02 2.293e+02 2.740e+02 3.931e+02, threshold=4.586e+02, percent-clipped=0.0 2023-06-15 07:31:09,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=72113.33333333333, ans=0.0 2023-06-15 07:31:11,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=72113.33333333333, ans=0.1 2023-06-15 07:31:23,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=72113.33333333333, ans=0.2 2023-06-15 07:31:40,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=72180.0, ans=0.0 2023-06-15 07:31:40,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=72180.0, ans=0.125 2023-06-15 07:31:42,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=72180.0, ans=0.0 2023-06-15 07:31:42,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=72180.0, ans=0.125 2023-06-15 07:31:45,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=72246.66666666667, ans=0.125 2023-06-15 07:32:01,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=72313.33333333333, ans=0.0 2023-06-15 07:32:02,951 INFO [train.py:988] (2/4) Epoch 21, batch 200, loss[loss=0.2353, simple_loss=0.309, 
pruned_loss=0.08083, over 19081.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3183, pruned_loss=0.08893, over 2403327.16 frames. ], batch size: 89, lr: 1.49e-02, grad_scale: 32.0 2023-06-15 07:32:31,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=72380.0, ans=10.0 2023-06-15 07:32:37,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=72446.66666666667, ans=0.1 2023-06-15 07:32:48,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=72446.66666666667, ans=0.1 2023-06-15 07:32:55,519 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:33:22,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=72580.0, ans=0.2 2023-06-15 07:33:29,793 INFO [train.py:988] (2/4) Epoch 21, batch 250, loss[loss=0.2614, simple_loss=0.3289, pruned_loss=0.09694, over 19120.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3189, pruned_loss=0.08864, over 2682544.03 frames. ], batch size: 94, lr: 1.48e-02, grad_scale: 32.0 2023-06-15 07:33:31,233 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.74 vs. limit=8.0 2023-06-15 07:33:36,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=72646.66666666667, ans=0.125 2023-06-15 07:33:39,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=72646.66666666667, ans=0.125 2023-06-15 07:33:48,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=72713.33333333333, ans=0.025 2023-06-15 07:33:54,718 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.599e+02 1.979e+02 2.224e+02 2.730e+02 4.525e+02, threshold=4.448e+02, percent-clipped=0.0 2023-06-15 07:34:00,748 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:34:07,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=72780.0, ans=0.0 2023-06-15 07:34:55,949 INFO [train.py:988] (2/4) Epoch 21, batch 300, loss[loss=0.2592, simple_loss=0.2923, pruned_loss=0.113, over 17047.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3191, pruned_loss=0.08817, over 2918827.20 frames. 
], batch size: 392, lr: 1.48e-02, grad_scale: 32.0 2023-06-15 07:35:27,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=73046.66666666667, ans=0.1 2023-06-15 07:35:34,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=73113.33333333333, ans=0.0 2023-06-15 07:35:36,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=73113.33333333333, ans=0.0 2023-06-15 07:35:42,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=73113.33333333333, ans=0.1 2023-06-15 07:35:52,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=73180.0, ans=0.1 2023-06-15 07:36:11,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=73246.66666666667, ans=0.2 2023-06-15 07:36:23,383 INFO [train.py:988] (2/4) Epoch 21, batch 350, loss[loss=0.2401, simple_loss=0.3149, pruned_loss=0.08266, over 19784.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3203, pruned_loss=0.08853, over 3101415.50 frames. ], batch size: 115, lr: 1.48e-02, grad_scale: 32.0 2023-06-15 07:36:25,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=73313.33333333333, ans=0.125 2023-06-15 07:36:29,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=73313.33333333333, ans=0.1 2023-06-15 07:36:49,399 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.513e+02 2.024e+02 2.303e+02 3.042e+02 4.564e+02, threshold=4.607e+02, percent-clipped=2.0 2023-06-15 07:37:15,024 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.50 vs. limit=22.5 2023-06-15 07:37:38,525 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.86 vs. limit=12.0 2023-06-15 07:37:46,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=73580.0, ans=0.125 2023-06-15 07:37:49,160 INFO [train.py:988] (2/4) Epoch 21, batch 400, loss[loss=0.2552, simple_loss=0.3196, pruned_loss=0.0954, over 19975.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3191, pruned_loss=0.08789, over 3264611.22 frames. ], batch size: 126, lr: 1.48e-02, grad_scale: 32.0 2023-06-15 07:37:53,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=73646.66666666667, ans=0.125 2023-06-15 07:38:03,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=73646.66666666667, ans=0.125 2023-06-15 07:38:19,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=73713.33333333333, ans=0.0 2023-06-15 07:39:15,485 INFO [train.py:988] (2/4) Epoch 21, batch 450, loss[loss=0.2671, simple_loss=0.3487, pruned_loss=0.09274, over 17627.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3192, pruned_loss=0.08779, over 3372049.52 frames. 
], batch size: 67, lr: 1.47e-02, grad_scale: 32.0 2023-06-15 07:39:30,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=73980.0, ans=0.125 2023-06-15 07:39:37,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=74046.66666666667, ans=0.125 2023-06-15 07:39:39,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=74046.66666666667, ans=0.125 2023-06-15 07:39:42,145 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.623e+02 2.035e+02 2.412e+02 2.747e+02 5.192e+02, threshold=4.824e+02, percent-clipped=2.0 2023-06-15 07:40:04,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=74113.33333333333, ans=0.0 2023-06-15 07:40:21,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=74180.0, ans=0.0 2023-06-15 07:40:36,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=74246.66666666667, ans=0.125 2023-06-15 07:40:40,643 INFO [train.py:988] (2/4) Epoch 21, batch 500, loss[loss=0.2589, simple_loss=0.3375, pruned_loss=0.09021, over 16318.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3193, pruned_loss=0.08828, over 3451973.01 frames. ], batch size: 52, lr: 1.47e-02, grad_scale: 32.0 2023-06-15 07:40:42,897 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.45 vs. limit=22.5 2023-06-15 07:40:48,591 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.33 vs. limit=15.0 2023-06-15 07:40:57,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=74380.0, ans=0.125 2023-06-15 07:41:05,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=74380.0, ans=0.0 2023-06-15 07:41:08,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=74380.0, ans=0.1 2023-06-15 07:42:00,895 INFO [train.py:988] (2/4) Epoch 22, batch 0, loss[loss=0.2572, simple_loss=0.3292, pruned_loss=0.09256, over 18627.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3292, pruned_loss=0.09256, over 18627.00 frames. ], batch size: 80, lr: 1.44e-02, grad_scale: 32.0 2023-06-15 07:42:00,895 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 07:42:07,051 INFO [train.py:1020] (2/4) Epoch 22, validation: loss=0.2075, simple_loss=0.3107, pruned_loss=0.05212, over 143649.00 frames. 
2023-06-15 07:42:07,053 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 07:42:12,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=74533.33333333333, ans=0.125 2023-06-15 07:42:17,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=74533.33333333333, ans=0.2 2023-06-15 07:42:44,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=74666.66666666667, ans=0.02 2023-06-15 07:43:03,104 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.996e+02 2.190e+02 2.519e+02 3.668e+02, threshold=4.380e+02, percent-clipped=0.0 2023-06-15 07:43:15,506 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.45 vs. limit=15.0 2023-06-15 07:43:16,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=74800.0, ans=0.0 2023-06-15 07:43:35,417 INFO [train.py:988] (2/4) Epoch 22, batch 50, loss[loss=0.2356, simple_loss=0.3093, pruned_loss=0.08099, over 18811.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3131, pruned_loss=0.08671, over 865769.32 frames. ], batch size: 83, lr: 1.43e-02, grad_scale: 32.0 2023-06-15 07:44:04,279 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.69 vs. limit=15.0 2023-06-15 07:45:02,472 INFO [train.py:988] (2/4) Epoch 22, batch 100, loss[loss=0.2357, simple_loss=0.3029, pruned_loss=0.0843, over 20527.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3131, pruned_loss=0.08685, over 1509941.96 frames. ], batch size: 189, lr: 1.43e-02, grad_scale: 32.0 2023-06-15 07:45:04,384 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:45:21,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=75266.66666666667, ans=0.0 2023-06-15 07:45:23,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=75266.66666666667, ans=0.125 2023-06-15 07:45:27,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=75266.66666666667, ans=0.125 2023-06-15 07:45:40,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=75333.33333333333, ans=0.05 2023-06-15 07:45:50,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=75333.33333333333, ans=0.0 2023-06-15 07:45:59,428 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.974e+02 2.218e+02 2.504e+02 3.922e+02, threshold=4.437e+02, percent-clipped=0.0 2023-06-15 07:46:13,655 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.43 vs. limit=6.0 2023-06-15 07:46:15,204 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.11 vs. 
limit=10.0 2023-06-15 07:46:24,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=75466.66666666667, ans=0.2 2023-06-15 07:46:26,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=75466.66666666667, ans=0.125 2023-06-15 07:46:31,643 INFO [train.py:988] (2/4) Epoch 22, batch 150, loss[loss=0.2435, simple_loss=0.3284, pruned_loss=0.0793, over 16389.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3137, pruned_loss=0.0857, over 2021827.71 frames. ], batch size: 52, lr: 1.43e-02, grad_scale: 32.0 2023-06-15 07:46:40,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=75533.33333333333, ans=0.125 2023-06-15 07:47:04,518 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.09 vs. limit=22.5 2023-06-15 07:47:19,496 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:47:19,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=75666.66666666667, ans=0.0 2023-06-15 07:47:50,640 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.50 vs. limit=15.0 2023-06-15 07:48:00,063 INFO [train.py:988] (2/4) Epoch 22, batch 200, loss[loss=0.2308, simple_loss=0.3111, pruned_loss=0.07519, over 19504.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3153, pruned_loss=0.08627, over 2393159.26 frames. ], batch size: 105, lr: 1.43e-02, grad_scale: 32.0 2023-06-15 07:48:26,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=75933.33333333333, ans=0.125 2023-06-15 07:48:37,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=76000.0, ans=0.1 2023-06-15 07:48:41,761 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:48:46,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=76000.0, ans=0.125 2023-06-15 07:48:56,271 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.899e+02 2.203e+02 2.409e+02 3.907e+02, threshold=4.406e+02, percent-clipped=0.0 2023-06-15 07:49:07,755 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.74 vs. limit=15.0 2023-06-15 07:49:11,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=76133.33333333333, ans=0.0 2023-06-15 07:49:22,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=76133.33333333333, ans=0.0 2023-06-15 07:49:29,230 INFO [train.py:988] (2/4) Epoch 22, batch 250, loss[loss=0.28, simple_loss=0.3548, pruned_loss=0.1026, over 16764.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3156, pruned_loss=0.08661, over 2697224.50 frames. 
], batch size: 59, lr: 1.43e-02, grad_scale: 64.0 2023-06-15 07:49:40,331 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.64 vs. limit=15.0 2023-06-15 07:49:50,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=76266.66666666667, ans=0.1 2023-06-15 07:49:51,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=76266.66666666667, ans=0.125 2023-06-15 07:49:59,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=76266.66666666667, ans=0.2 2023-06-15 07:50:17,239 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:50:44,694 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.08 vs. limit=8.0 2023-06-15 07:50:57,240 INFO [train.py:988] (2/4) Epoch 22, batch 300, loss[loss=0.2481, simple_loss=0.3218, pruned_loss=0.0872, over 18293.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3151, pruned_loss=0.08689, over 2949029.38 frames. ], batch size: 74, lr: 1.42e-02, grad_scale: 64.0 2023-06-15 07:51:02,967 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-06-15 07:51:05,240 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.75 vs. limit=15.0 2023-06-15 07:51:20,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=76600.0, ans=0.2 2023-06-15 07:51:23,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=76600.0, ans=0.1 2023-06-15 07:51:53,162 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.638e+02 1.887e+02 2.089e+02 2.496e+02 3.491e+02, threshold=4.177e+02, percent-clipped=0.0 2023-06-15 07:52:10,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=76800.0, ans=0.07 2023-06-15 07:52:18,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=76800.0, ans=0.0 2023-06-15 07:52:19,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=76800.0, ans=0.125 2023-06-15 07:52:23,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=76866.66666666667, ans=0.07 2023-06-15 07:52:23,173 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:52:24,351 INFO [train.py:988] (2/4) Epoch 22, batch 350, loss[loss=0.2225, simple_loss=0.2956, pruned_loss=0.07467, over 20008.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3148, pruned_loss=0.08642, over 3130949.76 frames. ], batch size: 126, lr: 1.42e-02, grad_scale: 64.0 2023-06-15 07:52:28,544 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.24 vs. 
limit=15.0 2023-06-15 07:53:26,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=77066.66666666667, ans=0.1 2023-06-15 07:53:40,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=77133.33333333333, ans=0.125 2023-06-15 07:53:54,162 INFO [train.py:988] (2/4) Epoch 22, batch 400, loss[loss=0.2261, simple_loss=0.2977, pruned_loss=0.07725, over 19465.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.314, pruned_loss=0.08575, over 3273585.22 frames. ], batch size: 105, lr: 1.42e-02, grad_scale: 64.0 2023-06-15 07:54:13,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=77266.66666666667, ans=0.125 2023-06-15 07:54:23,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=77266.66666666667, ans=0.0 2023-06-15 07:54:28,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=77333.33333333333, ans=0.125 2023-06-15 07:54:35,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=77333.33333333333, ans=0.0 2023-06-15 07:54:43,665 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.13 vs. limit=22.5 2023-06-15 07:54:50,816 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.618e+02 1.936e+02 2.140e+02 2.519e+02 4.231e+02, threshold=4.281e+02, percent-clipped=1.0 2023-06-15 07:55:10,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=77466.66666666667, ans=0.2 2023-06-15 07:55:14,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=77466.66666666667, ans=0.0 2023-06-15 07:55:15,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=77466.66666666667, ans=0.2 2023-06-15 07:55:22,556 INFO [train.py:988] (2/4) Epoch 22, batch 450, loss[loss=0.2456, simple_loss=0.3298, pruned_loss=0.08067, over 16309.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3154, pruned_loss=0.08646, over 3383169.52 frames. 
], batch size: 52, lr: 1.42e-02, grad_scale: 64.0 2023-06-15 07:55:30,439 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:55:50,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=77600.0, ans=0.0 2023-06-15 07:55:59,406 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:56:10,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=77666.66666666667, ans=0.1 2023-06-15 07:56:42,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=77800.0, ans=0.1 2023-06-15 07:56:44,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=77800.0, ans=0.1 2023-06-15 07:56:49,278 INFO [train.py:988] (2/4) Epoch 22, batch 500, loss[loss=0.2368, simple_loss=0.3051, pruned_loss=0.08427, over 20575.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3147, pruned_loss=0.08631, over 3480574.46 frames. ], batch size: 189, lr: 1.42e-02, grad_scale: 64.0 2023-06-15 07:56:58,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=77866.66666666667, ans=0.125 2023-06-15 07:57:17,056 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.80 vs. limit=15.0 2023-06-15 07:58:03,431 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.17 vs. limit=15.0 2023-06-15 07:58:03,842 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.901e+02 2.129e+02 2.481e+02 3.635e+02, threshold=4.258e+02, percent-clipped=0.0 2023-06-15 07:58:03,890 INFO [train.py:988] (2/4) Epoch 23, batch 0, loss[loss=0.226, simple_loss=0.2977, pruned_loss=0.07715, over 20336.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2977, pruned_loss=0.07715, over 20336.00 frames. ], batch size: 149, lr: 1.38e-02, grad_scale: 64.0 2023-06-15 07:58:03,891 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 07:58:10,163 INFO [train.py:1020] (2/4) Epoch 23, validation: loss=0.2051, simple_loss=0.3092, pruned_loss=0.05051, over 143649.00 frames. 2023-06-15 07:58:10,164 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 07:58:14,372 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.82 vs. limit=15.0 2023-06-15 07:59:02,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=78280.0, ans=0.0 2023-06-15 07:59:11,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=78280.0, ans=0.1 2023-06-15 07:59:39,933 INFO [train.py:988] (2/4) Epoch 23, batch 50, loss[loss=0.2454, simple_loss=0.3222, pruned_loss=0.08432, over 18768.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3134, pruned_loss=0.08582, over 855489.08 frames. 
], batch size: 83, lr: 1.38e-02, grad_scale: 64.0 2023-06-15 07:59:45,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=78413.33333333333, ans=0.125 2023-06-15 08:00:04,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=78480.0, ans=0.125 2023-06-15 08:00:04,869 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.51 vs. limit=15.0 2023-06-15 08:00:09,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=78480.0, ans=0.0 2023-06-15 08:00:20,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=78546.66666666667, ans=0.0 2023-06-15 08:00:22,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=78546.66666666667, ans=0.0 2023-06-15 08:00:38,924 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.64 vs. limit=6.0 2023-06-15 08:01:10,776 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 1.902e+02 2.065e+02 2.419e+02 3.199e+02, threshold=4.129e+02, percent-clipped=0.0 2023-06-15 08:01:10,825 INFO [train.py:988] (2/4) Epoch 23, batch 100, loss[loss=0.2428, simple_loss=0.3113, pruned_loss=0.08719, over 20473.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3137, pruned_loss=0.08416, over 1496320.23 frames. ], batch size: 160, lr: 1.38e-02, grad_scale: 64.0 2023-06-15 08:01:23,387 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.27 vs. limit=15.0 2023-06-15 08:01:30,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=78813.33333333333, ans=0.95 2023-06-15 08:01:32,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=78813.33333333333, ans=0.125 2023-06-15 08:02:38,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=79080.0, ans=0.125 2023-06-15 08:02:40,334 INFO [train.py:988] (2/4) Epoch 23, batch 150, loss[loss=0.2303, simple_loss=0.3078, pruned_loss=0.07636, over 19531.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3124, pruned_loss=0.08343, over 2005761.08 frames. 
], batch size: 102, lr: 1.38e-02, grad_scale: 64.0 2023-06-15 08:02:54,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=79080.0, ans=0.0 2023-06-15 08:03:06,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=79146.66666666667, ans=0.0 2023-06-15 08:03:48,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=79280.0, ans=0.035 2023-06-15 08:04:09,185 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.913e+02 2.150e+02 2.448e+02 4.177e+02, threshold=4.300e+02, percent-clipped=1.0 2023-06-15 08:04:09,230 INFO [train.py:988] (2/4) Epoch 23, batch 200, loss[loss=0.2352, simple_loss=0.3148, pruned_loss=0.07777, over 19804.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3116, pruned_loss=0.08389, over 2412593.60 frames. ], batch size: 115, lr: 1.37e-02, grad_scale: 64.0 2023-06-15 08:04:26,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=79480.0, ans=0.125 2023-06-15 08:04:58,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=79546.66666666667, ans=0.125 2023-06-15 08:05:03,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=79613.33333333333, ans=0.1 2023-06-15 08:05:08,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=79613.33333333333, ans=0.0 2023-06-15 08:05:11,060 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.89 vs. limit=15.0 2023-06-15 08:05:18,995 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.10 vs. limit=5.0 2023-06-15 08:05:25,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=79680.0, ans=0.0 2023-06-15 08:05:31,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=79680.0, ans=0.0 2023-06-15 08:05:36,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=79746.66666666667, ans=0.0 2023-06-15 08:05:37,752 INFO [train.py:988] (2/4) Epoch 23, batch 250, loss[loss=0.2196, simple_loss=0.3012, pruned_loss=0.06898, over 18913.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3126, pruned_loss=0.08425, over 2724280.19 frames. 
], batch size: 86, lr: 1.37e-02, grad_scale: 64.0 2023-06-15 08:05:49,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=79746.66666666667, ans=0.1 2023-06-15 08:05:55,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=79813.33333333333, ans=0.125 2023-06-15 08:06:34,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=79946.66666666667, ans=0.125 2023-06-15 08:06:35,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=79946.66666666667, ans=0.0 2023-06-15 08:06:39,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=79946.66666666667, ans=0.125 2023-06-15 08:06:58,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=80013.33333333333, ans=0.125 2023-06-15 08:07:04,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=80013.33333333333, ans=0.125 2023-06-15 08:07:09,560 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.58 vs. limit=15.0 2023-06-15 08:07:10,025 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.585e+02 1.868e+02 2.154e+02 2.660e+02 5.559e+02, threshold=4.308e+02, percent-clipped=2.0 2023-06-15 08:07:10,071 INFO [train.py:988] (2/4) Epoch 23, batch 300, loss[loss=0.2446, simple_loss=0.3031, pruned_loss=0.09303, over 20592.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3134, pruned_loss=0.08445, over 2941363.77 frames. ], batch size: 173, lr: 1.37e-02, grad_scale: 64.0 2023-06-15 08:07:19,835 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.84 vs. limit=12.0 2023-06-15 08:07:28,220 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.87 vs. limit=15.0 2023-06-15 08:07:31,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=80146.66666666667, ans=0.0 2023-06-15 08:07:53,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=80213.33333333333, ans=0.125 2023-06-15 08:08:08,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=80280.0, ans=0.1 2023-06-15 08:08:13,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=80280.0, ans=0.125 2023-06-15 08:08:31,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=80346.66666666667, ans=0.125 2023-06-15 08:08:38,447 INFO [train.py:988] (2/4) Epoch 23, batch 350, loss[loss=0.2779, simple_loss=0.3018, pruned_loss=0.1271, over 16959.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3138, pruned_loss=0.08521, over 3127592.10 frames. 
], batch size: 392, lr: 1.37e-02, grad_scale: 64.0 2023-06-15 08:08:43,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=80413.33333333333, ans=0.125 2023-06-15 08:09:05,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=80480.0, ans=0.125 2023-06-15 08:09:24,977 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.86 vs. limit=12.0 2023-06-15 08:09:40,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=80613.33333333333, ans=0.125 2023-06-15 08:09:47,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=80680.0, ans=0.0 2023-06-15 08:10:05,638 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.945e+02 2.149e+02 2.617e+02 4.906e+02, threshold=4.297e+02, percent-clipped=1.0 2023-06-15 08:10:05,704 INFO [train.py:988] (2/4) Epoch 23, batch 400, loss[loss=0.235, simple_loss=0.3149, pruned_loss=0.07751, over 18323.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3138, pruned_loss=0.08507, over 3272376.16 frames. ], batch size: 74, lr: 1.37e-02, grad_scale: 64.0 2023-06-15 08:10:18,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=80746.66666666667, ans=0.09899494936611666 2023-06-15 08:10:25,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=80813.33333333333, ans=0.1 2023-06-15 08:10:46,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=80880.0, ans=0.0 2023-06-15 08:10:46,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=80880.0, ans=0.0 2023-06-15 08:11:25,516 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.90 vs. limit=22.5 2023-06-15 08:11:34,898 INFO [train.py:988] (2/4) Epoch 23, batch 450, loss[loss=0.2381, simple_loss=0.3218, pruned_loss=0.07718, over 17644.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3137, pruned_loss=0.08456, over 3382798.16 frames. 
], batch size: 67, lr: 1.36e-02, grad_scale: 64.0 2023-06-15 08:11:40,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=81080.0, ans=0.125 2023-06-15 08:12:14,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=81213.33333333333, ans=0.1 2023-06-15 08:12:17,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=81213.33333333333, ans=0.2 2023-06-15 08:12:37,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=81280.0, ans=0.1 2023-06-15 08:12:37,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=81280.0, ans=0.0 2023-06-15 08:13:01,004 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.601e+02 2.014e+02 2.230e+02 2.721e+02 4.611e+02, threshold=4.461e+02, percent-clipped=1.0 2023-06-15 08:13:01,049 INFO [train.py:988] (2/4) Epoch 23, batch 500, loss[loss=0.251, simple_loss=0.3063, pruned_loss=0.09789, over 20154.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.313, pruned_loss=0.08469, over 3476901.88 frames. ], batch size: 239, lr: 1.36e-02, grad_scale: 64.0 2023-06-15 08:14:20,657 INFO [train.py:988] (2/4) Epoch 24, batch 0, loss[loss=0.2394, simple_loss=0.3055, pruned_loss=0.08664, over 20070.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3055, pruned_loss=0.08664, over 20070.00 frames. ], batch size: 133, lr: 1.33e-02, grad_scale: 64.0 2023-06-15 08:14:20,657 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 08:14:27,203 INFO [train.py:1020] (2/4) Epoch 24, validation: loss=0.2057, simple_loss=0.3089, pruned_loss=0.05123, over 143649.00 frames. 2023-06-15 08:14:27,204 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 08:14:37,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=81626.66666666667, ans=0.0 2023-06-15 08:14:41,732 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.23 vs. limit=22.5 2023-06-15 08:14:55,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=81693.33333333333, ans=0.125 2023-06-15 08:14:58,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=81693.33333333333, ans=0.0 2023-06-15 08:14:58,799 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-15 08:14:59,641 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.34 vs. limit=12.0 2023-06-15 08:15:35,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=81826.66666666667, ans=0.1 2023-06-15 08:15:47,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=81893.33333333333, ans=0.125 2023-06-15 08:15:57,124 INFO [train.py:988] (2/4) Epoch 24, batch 50, loss[loss=0.2251, simple_loss=0.2985, pruned_loss=0.0759, over 18614.00 frames. 
], tot_loss[loss=0.2425, simple_loss=0.3124, pruned_loss=0.08633, over 837439.17 frames. ], batch size: 80, lr: 1.33e-02, grad_scale: 64.0 2023-06-15 08:16:11,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=81960.0, ans=0.125 2023-06-15 08:16:24,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=82026.66666666667, ans=0.0 2023-06-15 08:16:29,077 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.940e+02 2.249e+02 2.625e+02 3.999e+02, threshold=4.499e+02, percent-clipped=0.0 2023-06-15 08:17:00,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=82160.0, ans=0.125 2023-06-15 08:17:06,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=82160.0, ans=22.5 2023-06-15 08:17:06,122 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.15 vs. limit=22.5 2023-06-15 08:17:25,774 INFO [train.py:988] (2/4) Epoch 24, batch 100, loss[loss=0.2277, simple_loss=0.3008, pruned_loss=0.07727, over 19227.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3129, pruned_loss=0.08518, over 1492475.44 frames. ], batch size: 92, lr: 1.33e-02, grad_scale: 64.0 2023-06-15 08:17:39,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=82293.33333333333, ans=0.125 2023-06-15 08:17:48,233 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 08:18:11,719 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.02 vs. limit=15.0 2023-06-15 08:18:27,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=82493.33333333333, ans=0.2 2023-06-15 08:18:33,139 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.93 vs. limit=10.0 2023-06-15 08:18:54,722 INFO [train.py:988] (2/4) Epoch 24, batch 150, loss[loss=0.2278, simple_loss=0.3067, pruned_loss=0.07445, over 19481.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3137, pruned_loss=0.08489, over 2015663.73 frames. ], batch size: 105, lr: 1.33e-02, grad_scale: 64.0 2023-06-15 08:19:27,041 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.862e+02 2.076e+02 2.332e+02 3.767e+02, threshold=4.152e+02, percent-clipped=0.0 2023-06-15 08:19:27,486 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 08:20:10,764 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.27 vs. limit=15.0 2023-06-15 08:20:10,872 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.11 vs. limit=10.0 2023-06-15 08:20:14,892 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.80 vs. 
limit=15.0 2023-06-15 08:20:19,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=82893.33333333333, ans=0.1 2023-06-15 08:20:24,109 INFO [train.py:988] (2/4) Epoch 24, batch 200, loss[loss=0.2511, simple_loss=0.3142, pruned_loss=0.09402, over 20583.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3139, pruned_loss=0.08375, over 2390484.73 frames. ], batch size: 173, lr: 1.32e-02, grad_scale: 64.0 2023-06-15 08:20:37,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=82960.0, ans=0.04949747468305833 2023-06-15 08:21:52,229 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-06-15 08:21:53,196 INFO [train.py:988] (2/4) Epoch 24, batch 250, loss[loss=0.2324, simple_loss=0.3032, pruned_loss=0.08081, over 20091.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3122, pruned_loss=0.08361, over 2709134.20 frames. ], batch size: 133, lr: 1.32e-02, grad_scale: 64.0 2023-06-15 08:22:07,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=83293.33333333333, ans=0.125 2023-06-15 08:22:24,748 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.954e+02 2.132e+02 2.511e+02 4.253e+02, threshold=4.265e+02, percent-clipped=1.0 2023-06-15 08:22:25,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=83360.0, ans=0.0 2023-06-15 08:22:25,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=83360.0, ans=0.125 2023-06-15 08:23:06,455 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-06-15 08:23:21,349 INFO [train.py:988] (2/4) Epoch 24, batch 300, loss[loss=0.23, simple_loss=0.3091, pruned_loss=0.07545, over 19211.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3123, pruned_loss=0.0835, over 2931805.57 frames. ], batch size: 92, lr: 1.32e-02, grad_scale: 64.0 2023-06-15 08:23:42,708 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-15 08:24:02,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=83760.0, ans=0.125 2023-06-15 08:24:08,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=83760.0, ans=0.125 2023-06-15 08:24:09,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=83760.0, ans=0.125 2023-06-15 08:24:10,505 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.54 vs. limit=22.5 2023-06-15 08:24:14,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=83826.66666666667, ans=0.0 2023-06-15 08:24:24,646 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.94 vs. 
limit=15.0 2023-06-15 08:24:34,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=83893.33333333333, ans=0.125 2023-06-15 08:24:38,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=83893.33333333333, ans=0.125 2023-06-15 08:24:50,656 INFO [train.py:988] (2/4) Epoch 24, batch 350, loss[loss=0.2346, simple_loss=0.3074, pruned_loss=0.08086, over 20471.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3121, pruned_loss=0.08322, over 3132649.55 frames. ], batch size: 160, lr: 1.32e-02, grad_scale: 64.0 2023-06-15 08:24:54,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=83960.0, ans=0.0 2023-06-15 08:25:22,588 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 1.988e+02 2.360e+02 2.767e+02 4.250e+02, threshold=4.720e+02, percent-clipped=0.0 2023-06-15 08:25:24,557 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.65 vs. limit=15.0 2023-06-15 08:25:27,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=84093.33333333333, ans=0.1 2023-06-15 08:25:31,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=84093.33333333333, ans=0.125 2023-06-15 08:25:48,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=84160.0, ans=0.1 2023-06-15 08:25:57,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=84160.0, ans=0.125 2023-06-15 08:26:21,074 INFO [train.py:988] (2/4) Epoch 24, batch 400, loss[loss=0.2555, simple_loss=0.3342, pruned_loss=0.08841, over 15474.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3116, pruned_loss=0.08284, over 3282122.24 frames. ], batch size: 44, lr: 1.32e-02, grad_scale: 64.0 2023-06-15 08:26:31,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=84293.33333333333, ans=0.125 2023-06-15 08:26:36,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=84360.0, ans=0.125 2023-06-15 08:26:37,237 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.02 vs. limit=15.0 2023-06-15 08:26:45,264 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.13 vs. 
limit=15.0 2023-06-15 08:27:18,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=84493.33333333333, ans=0.2 2023-06-15 08:27:19,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=84493.33333333333, ans=0.0 2023-06-15 08:27:24,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=84493.33333333333, ans=0.125 2023-06-15 08:27:49,978 INFO [train.py:988] (2/4) Epoch 24, batch 450, loss[loss=0.2533, simple_loss=0.3416, pruned_loss=0.08248, over 17606.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3112, pruned_loss=0.08241, over 3382974.12 frames. ], batch size: 67, lr: 1.31e-02, grad_scale: 64.0 2023-06-15 08:28:10,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=84693.33333333333, ans=0.2 2023-06-15 08:28:21,507 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.787e+02 2.025e+02 2.287e+02 3.179e+02, threshold=4.050e+02, percent-clipped=0.0 2023-06-15 08:28:50,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=84826.66666666667, ans=0.1 2023-06-15 08:29:12,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=84893.33333333333, ans=0.125 2023-06-15 08:29:15,494 INFO [train.py:988] (2/4) Epoch 24, batch 500, loss[loss=0.2418, simple_loss=0.3015, pruned_loss=0.09103, over 20546.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.311, pruned_loss=0.08194, over 3467568.64 frames. ], batch size: 189, lr: 1.31e-02, grad_scale: 32.0 2023-06-15 08:29:29,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=84960.0, ans=0.0 2023-06-15 08:29:33,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=85026.66666666667, ans=0.125 2023-06-15 08:29:45,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=85026.66666666667, ans=0.0 2023-06-15 08:30:05,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=85160.0, ans=0.0 2023-06-15 08:30:31,629 INFO [train.py:988] (2/4) Epoch 25, batch 0, loss[loss=0.2565, simple_loss=0.3381, pruned_loss=0.0874, over 17638.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3381, pruned_loss=0.0874, over 17638.00 frames. ], batch size: 67, lr: 1.29e-02, grad_scale: 32.0 2023-06-15 08:30:31,629 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 08:30:37,736 INFO [train.py:1020] (2/4) Epoch 25, validation: loss=0.205, simple_loss=0.3085, pruned_loss=0.05071, over 143649.00 frames. 2023-06-15 08:30:37,736 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 08:31:31,336 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.75 vs. 
limit=15.0 2023-06-15 08:31:44,635 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.894e+02 2.218e+02 2.492e+02 3.446e+02, threshold=4.437e+02, percent-clipped=0.0 2023-06-15 08:31:55,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=85440.0, ans=0.125 2023-06-15 08:32:07,387 INFO [train.py:988] (2/4) Epoch 25, batch 50, loss[loss=0.2367, simple_loss=0.3192, pruned_loss=0.07708, over 18928.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3103, pruned_loss=0.08196, over 868320.42 frames. ], batch size: 86, lr: 1.28e-02, grad_scale: 32.0 2023-06-15 08:33:00,044 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.79 vs. limit=10.0 2023-06-15 08:33:34,486 INFO [train.py:988] (2/4) Epoch 25, batch 100, loss[loss=0.2424, simple_loss=0.3041, pruned_loss=0.09031, over 20202.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3106, pruned_loss=0.08177, over 1519155.25 frames. ], batch size: 239, lr: 1.28e-02, grad_scale: 32.0 2023-06-15 08:34:39,965 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.831e+02 2.059e+02 2.312e+02 3.649e+02, threshold=4.117e+02, percent-clipped=0.0 2023-06-15 08:35:02,329 INFO [train.py:988] (2/4) Epoch 25, batch 150, loss[loss=0.2451, simple_loss=0.3272, pruned_loss=0.08146, over 17003.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3096, pruned_loss=0.08137, over 2012830.65 frames. ], batch size: 60, lr: 1.28e-02, grad_scale: 32.0 2023-06-15 08:35:27,363 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.21 vs. limit=22.5 2023-06-15 08:35:38,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=86306.66666666667, ans=0.125 2023-06-15 08:35:38,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=86306.66666666667, ans=0.125 2023-06-15 08:35:50,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=86306.66666666667, ans=0.125 2023-06-15 08:36:30,648 INFO [train.py:988] (2/4) Epoch 25, batch 200, loss[loss=0.2474, simple_loss=0.3335, pruned_loss=0.08063, over 18333.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3093, pruned_loss=0.0811, over 2412746.70 frames. ], batch size: 72, lr: 1.28e-02, grad_scale: 32.0 2023-06-15 08:36:55,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=86573.33333333333, ans=0.1 2023-06-15 08:36:59,959 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.08 vs. 
limit=10.0 2023-06-15 08:37:26,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=86706.66666666667, ans=0.0 2023-06-15 08:37:34,639 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.840e+02 2.036e+02 2.337e+02 3.806e+02, threshold=4.072e+02, percent-clipped=0.0 2023-06-15 08:37:35,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=86706.66666666667, ans=0.2 2023-06-15 08:37:50,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=86773.33333333333, ans=0.125 2023-06-15 08:37:58,096 INFO [train.py:988] (2/4) Epoch 25, batch 250, loss[loss=0.244, simple_loss=0.3088, pruned_loss=0.08956, over 20361.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.309, pruned_loss=0.08128, over 2709536.97 frames. ], batch size: 149, lr: 1.28e-02, grad_scale: 32.0 2023-06-15 08:37:58,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=86840.0, ans=0.125 2023-06-15 08:38:11,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=86840.0, ans=0.04949747468305833 2023-06-15 08:38:23,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=86906.66666666667, ans=0.125 2023-06-15 08:38:32,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=86973.33333333333, ans=0.2 2023-06-15 08:39:19,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=87106.66666666667, ans=0.125 2023-06-15 08:39:20,640 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.94 vs. limit=15.0 2023-06-15 08:39:23,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=87106.66666666667, ans=0.2 2023-06-15 08:39:26,113 INFO [train.py:988] (2/4) Epoch 25, batch 300, loss[loss=0.2445, simple_loss=0.3189, pruned_loss=0.08499, over 19128.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3088, pruned_loss=0.08188, over 2964584.31 frames. ], batch size: 94, lr: 1.27e-02, grad_scale: 32.0 2023-06-15 08:39:33,841 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.50 vs. 
limit=15.0 2023-06-15 08:39:40,356 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 08:39:51,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=87240.0, ans=0.0 2023-06-15 08:40:31,127 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.920e+02 2.106e+02 2.371e+02 3.707e+02, threshold=4.212e+02, percent-clipped=0.0 2023-06-15 08:40:31,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=87373.33333333333, ans=0.05 2023-06-15 08:40:34,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=87440.0, ans=0.0 2023-06-15 08:40:52,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=87506.66666666667, ans=0.125 2023-06-15 08:40:54,113 INFO [train.py:988] (2/4) Epoch 25, batch 350, loss[loss=0.2201, simple_loss=0.3151, pruned_loss=0.06252, over 16746.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3085, pruned_loss=0.08192, over 3128335.17 frames. ], batch size: 59, lr: 1.27e-02, grad_scale: 32.0 2023-06-15 08:40:56,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=87506.66666666667, ans=0.0 2023-06-15 08:41:02,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=87506.66666666667, ans=0.1 2023-06-15 08:41:15,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=87573.33333333333, ans=0.0 2023-06-15 08:41:17,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=87573.33333333333, ans=0.125 2023-06-15 08:41:40,648 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.77 vs. limit=15.0 2023-06-15 08:41:59,331 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 08:41:59,989 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.21 vs. limit=15.0 2023-06-15 08:42:15,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=87773.33333333333, ans=0.07 2023-06-15 08:42:21,265 INFO [train.py:988] (2/4) Epoch 25, batch 400, loss[loss=0.2495, simple_loss=0.3319, pruned_loss=0.08354, over 16383.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3097, pruned_loss=0.0821, over 3264771.68 frames. 
], batch size: 52, lr: 1.27e-02, grad_scale: 32.0 2023-06-15 08:42:47,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=87906.66666666667, ans=0.125 2023-06-15 08:42:47,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten.whitening_limit, batch_count=87906.66666666667, ans=22.5 2023-06-15 08:43:12,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=87973.33333333333, ans=0.125 2023-06-15 08:43:20,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=88040.0, ans=0.125 2023-06-15 08:43:26,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=88040.0, ans=0.1 2023-06-15 08:43:27,914 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.933e+02 2.148e+02 2.533e+02 3.587e+02, threshold=4.297e+02, percent-clipped=0.0 2023-06-15 08:43:28,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=88040.0, ans=0.2 2023-06-15 08:43:33,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=88106.66666666667, ans=0.125 2023-06-15 08:43:38,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=88106.66666666667, ans=0.125 2023-06-15 08:43:42,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=88106.66666666667, ans=0.0 2023-06-15 08:43:45,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=88106.66666666667, ans=0.0 2023-06-15 08:43:47,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=88106.66666666667, ans=0.1 2023-06-15 08:43:47,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=88106.66666666667, ans=0.0 2023-06-15 08:43:50,326 INFO [train.py:988] (2/4) Epoch 25, batch 450, loss[loss=0.2504, simple_loss=0.2857, pruned_loss=0.1075, over 16877.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3089, pruned_loss=0.08173, over 3387692.86 frames. ], batch size: 391, lr: 1.27e-02, grad_scale: 32.0 2023-06-15 08:44:15,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=88240.0, ans=0.125 2023-06-15 08:45:01,352 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.84 vs. limit=12.0 2023-06-15 08:45:15,661 INFO [train.py:988] (2/4) Epoch 25, batch 500, loss[loss=0.2188, simple_loss=0.3033, pruned_loss=0.06715, over 18794.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3085, pruned_loss=0.0813, over 3475069.48 frames. 
], batch size: 83, lr: 1.27e-02, grad_scale: 32.0 2023-06-15 08:45:15,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=88506.66666666667, ans=0.0 2023-06-15 08:45:35,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=88573.33333333333, ans=0.1 2023-06-15 08:46:05,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=88706.66666666667, ans=10.0 2023-06-15 08:46:29,688 INFO [train.py:988] (2/4) Epoch 26, batch 0, loss[loss=0.2325, simple_loss=0.3035, pruned_loss=0.08076, over 10876.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3035, pruned_loss=0.08076, over 10876.00 frames. ], batch size: 30, lr: 1.24e-02, grad_scale: 32.0 2023-06-15 08:46:29,688 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 08:46:35,682 INFO [train.py:1020] (2/4) Epoch 26, validation: loss=0.2057, simple_loss=0.3076, pruned_loss=0.05187, over 143649.00 frames. 2023-06-15 08:46:35,683 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 08:46:42,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=88720.0, ans=0.125 2023-06-15 08:46:43,667 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.988e+02 2.148e+02 2.357e+02 3.601e+02, threshold=4.296e+02, percent-clipped=0.0 2023-06-15 08:46:51,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=88786.66666666667, ans=0.0 2023-06-15 08:46:54,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=88786.66666666667, ans=0.0 2023-06-15 08:47:03,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=88786.66666666667, ans=0.2 2023-06-15 08:47:19,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=88853.33333333333, ans=0.125 2023-06-15 08:47:31,393 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.93 vs. limit=15.0 2023-06-15 08:48:02,845 INFO [train.py:988] (2/4) Epoch 26, batch 50, loss[loss=0.2486, simple_loss=0.3369, pruned_loss=0.0802, over 17651.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3072, pruned_loss=0.07932, over 836937.46 frames. ], batch size: 67, lr: 1.24e-02, grad_scale: 32.0 2023-06-15 08:48:15,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=89053.33333333333, ans=0.0 2023-06-15 08:48:18,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=89120.0, ans=0.125 2023-06-15 08:49:03,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=89253.33333333333, ans=0.2 2023-06-15 08:49:30,602 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.11 vs. limit=15.0 2023-06-15 08:49:33,001 INFO [train.py:988] (2/4) Epoch 26, batch 100, loss[loss=0.2282, simple_loss=0.301, pruned_loss=0.07768, over 20098.00 frames. 
], tot_loss[loss=0.2334, simple_loss=0.3066, pruned_loss=0.08013, over 1500285.39 frames. ], batch size: 133, lr: 1.24e-02, grad_scale: 32.0 2023-06-15 08:49:41,549 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.958e+02 2.207e+02 2.501e+02 3.726e+02, threshold=4.413e+02, percent-clipped=0.0 2023-06-15 08:49:51,820 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.00 vs. limit=22.5 2023-06-15 08:50:06,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=89520.0, ans=0.125 2023-06-15 08:50:49,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=89653.33333333333, ans=0.2 2023-06-15 08:51:01,460 INFO [train.py:988] (2/4) Epoch 26, batch 150, loss[loss=0.2414, simple_loss=0.301, pruned_loss=0.09087, over 20691.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3077, pruned_loss=0.08012, over 1994368.46 frames. ], batch size: 211, lr: 1.24e-02, grad_scale: 32.0 2023-06-15 08:51:13,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=89720.0, ans=0.125 2023-06-15 08:51:41,611 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.85 vs. limit=15.0 2023-06-15 08:51:48,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=89853.33333333333, ans=0.2 2023-06-15 08:51:56,429 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 08:51:56,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=89920.0, ans=0.1 2023-06-15 08:52:03,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=89920.0, ans=0.0 2023-06-15 08:52:17,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=89986.66666666667, ans=0.1 2023-06-15 08:52:23,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=89986.66666666667, ans=0.0 2023-06-15 08:52:24,384 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=4.52 vs. limit=15.0 2023-06-15 08:52:29,994 INFO [train.py:988] (2/4) Epoch 26, batch 200, loss[loss=0.2439, simple_loss=0.2807, pruned_loss=0.1035, over 17101.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3079, pruned_loss=0.08016, over 2389860.49 frames. ], batch size: 392, lr: 1.23e-02, grad_scale: 32.0 2023-06-15 08:52:39,008 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.858e+02 1.962e+02 2.261e+02 3.889e+02, threshold=3.924e+02, percent-clipped=0.0 2023-06-15 08:53:13,619 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.04 vs. 
limit=15.0 2023-06-15 08:53:27,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=90253.33333333333, ans=0.0 2023-06-15 08:53:36,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=90253.33333333333, ans=0.125 2023-06-15 08:53:58,446 INFO [train.py:988] (2/4) Epoch 26, batch 250, loss[loss=0.2197, simple_loss=0.3016, pruned_loss=0.06894, over 19671.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3071, pruned_loss=0.07942, over 2705733.36 frames. ], batch size: 110, lr: 1.23e-02, grad_scale: 32.0 2023-06-15 08:53:58,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=90386.66666666667, ans=0.1 2023-06-15 08:54:02,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=90386.66666666667, ans=0.0 2023-06-15 08:54:46,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=90520.0, ans=0.0 2023-06-15 08:54:51,645 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.26 vs. limit=12.0 2023-06-15 08:54:57,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=90586.66666666667, ans=0.125 2023-06-15 08:55:10,215 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.49 vs. limit=15.0 2023-06-15 08:55:26,667 INFO [train.py:988] (2/4) Epoch 26, batch 300, loss[loss=0.2381, simple_loss=0.3088, pruned_loss=0.08372, over 19229.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3072, pruned_loss=0.07933, over 2958855.00 frames. ], batch size: 92, lr: 1.23e-02, grad_scale: 32.0 2023-06-15 08:55:36,604 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.907e+02 2.229e+02 2.657e+02 5.301e+02, threshold=4.457e+02, percent-clipped=1.0 2023-06-15 08:55:46,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=90786.66666666667, ans=0.0 2023-06-15 08:55:59,067 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.10 vs. limit=12.0 2023-06-15 08:56:42,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=90986.66666666667, ans=0.125 2023-06-15 08:56:47,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=90986.66666666667, ans=0.0 2023-06-15 08:56:50,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=90986.66666666667, ans=0.2 2023-06-15 08:56:55,554 INFO [train.py:988] (2/4) Epoch 26, batch 350, loss[loss=0.2295, simple_loss=0.3013, pruned_loss=0.07882, over 20599.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3083, pruned_loss=0.07971, over 3133784.95 frames. 
], batch size: 173, lr: 1.23e-02, grad_scale: 32.0 2023-06-15 08:57:00,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=91053.33333333333, ans=0.0 2023-06-15 08:57:04,358 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-15 08:57:44,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=91186.66666666667, ans=0.5 2023-06-15 08:58:24,526 INFO [train.py:988] (2/4) Epoch 26, batch 400, loss[loss=0.2417, simple_loss=0.3193, pruned_loss=0.08202, over 18278.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3082, pruned_loss=0.07943, over 3255054.22 frames. ], batch size: 74, lr: 1.23e-02, grad_scale: 32.0 2023-06-15 08:58:33,464 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.572e+02 1.996e+02 2.310e+02 2.636e+02 4.226e+02, threshold=4.620e+02, percent-clipped=0.0 2023-06-15 08:58:33,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=91386.66666666667, ans=0.125 2023-06-15 08:58:38,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=91386.66666666667, ans=0.125 2023-06-15 08:59:06,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=91520.0, ans=0.125 2023-06-15 08:59:32,290 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.60 vs. limit=15.0 2023-06-15 08:59:53,756 INFO [train.py:988] (2/4) Epoch 26, batch 450, loss[loss=0.227, simple_loss=0.3071, pruned_loss=0.07348, over 18658.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.308, pruned_loss=0.07948, over 3363971.42 frames. ], batch size: 80, lr: 1.23e-02, grad_scale: 32.0 2023-06-15 08:59:59,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=91720.0, ans=0.025 2023-06-15 09:00:01,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=91720.0, ans=0.125 2023-06-15 09:00:24,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=91786.66666666667, ans=0.125 2023-06-15 09:00:27,917 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.86 vs. limit=10.0 2023-06-15 09:01:06,813 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. limit=6.0 2023-06-15 09:01:11,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=91986.66666666667, ans=0.125 2023-06-15 09:01:20,840 INFO [train.py:988] (2/4) Epoch 26, batch 500, loss[loss=0.2269, simple_loss=0.3002, pruned_loss=0.07681, over 19804.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3076, pruned_loss=0.07939, over 3469709.26 frames. 
], batch size: 115, lr: 1.22e-02, grad_scale: 32.0 2023-06-15 09:01:28,966 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 1.890e+02 2.061e+02 2.503e+02 4.030e+02, threshold=4.121e+02, percent-clipped=0.0 2023-06-15 09:01:39,454 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.93 vs. limit=15.0 2023-06-15 09:02:01,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=92186.66666666667, ans=0.0 2023-06-15 09:02:43,456 INFO [train.py:988] (2/4) Epoch 27, batch 0, loss[loss=0.2459, simple_loss=0.3113, pruned_loss=0.09024, over 20483.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3113, pruned_loss=0.09024, over 20483.00 frames. ], batch size: 160, lr: 1.20e-02, grad_scale: 32.0 2023-06-15 09:02:43,457 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 09:02:52,296 INFO [train.py:1020] (2/4) Epoch 27, validation: loss=0.2009, simple_loss=0.305, pruned_loss=0.04841, over 143649.00 frames. 2023-06-15 09:02:52,296 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 09:03:40,086 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.67 vs. limit=22.5 2023-06-15 09:03:44,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=92473.33333333333, ans=0.125 2023-06-15 09:03:45,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=92473.33333333333, ans=15.0 2023-06-15 09:03:51,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=92473.33333333333, ans=0.125 2023-06-15 09:03:51,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=92473.33333333333, ans=0.2 2023-06-15 09:03:59,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=92473.33333333333, ans=0.125 2023-06-15 09:04:07,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=92540.0, ans=0.2 2023-06-15 09:04:21,901 INFO [train.py:988] (2/4) Epoch 27, batch 50, loss[loss=0.2418, simple_loss=0.3267, pruned_loss=0.07841, over 17877.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3075, pruned_loss=0.08083, over 861945.84 frames. 
], batch size: 68, lr: 1.20e-02, grad_scale: 32.0 2023-06-15 09:04:30,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=92606.66666666667, ans=0.1 2023-06-15 09:04:35,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=92606.66666666667, ans=0.125 2023-06-15 09:04:38,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=92673.33333333333, ans=0.125 2023-06-15 09:04:39,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=92673.33333333333, ans=0.2 2023-06-15 09:04:48,249 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.66 vs. limit=10.0 2023-06-15 09:04:55,607 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.55 vs. limit=15.0 2023-06-15 09:04:59,693 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.519e+02 1.842e+02 2.115e+02 2.294e+02 3.126e+02, threshold=4.230e+02, percent-clipped=0.0 2023-06-15 09:05:13,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=92806.66666666667, ans=0.07 2023-06-15 09:05:23,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=92806.66666666667, ans=0.0 2023-06-15 09:05:31,019 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.08 vs. limit=6.0 2023-06-15 09:05:35,708 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.64 vs. limit=10.0 2023-06-15 09:05:39,256 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.70 vs. limit=22.5 2023-06-15 09:05:47,902 INFO [train.py:988] (2/4) Epoch 27, batch 100, loss[loss=0.2316, simple_loss=0.307, pruned_loss=0.07805, over 19339.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3079, pruned_loss=0.08002, over 1505805.02 frames. ], batch size: 98, lr: 1.20e-02, grad_scale: 32.0 2023-06-15 09:06:02,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=92940.0, ans=0.125 2023-06-15 09:06:16,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=93006.66666666667, ans=0.125 2023-06-15 09:06:39,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=93140.0, ans=0.0 2023-06-15 09:06:41,613 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.17 vs. 
limit=12.0 2023-06-15 09:06:44,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=93140.0, ans=0.1 2023-06-15 09:06:54,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=93140.0, ans=0.2 2023-06-15 09:07:14,881 INFO [train.py:988] (2/4) Epoch 27, batch 150, loss[loss=0.268, simple_loss=0.2957, pruned_loss=0.1201, over 17190.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3064, pruned_loss=0.07924, over 2022208.44 frames. ], batch size: 392, lr: 1.19e-02, grad_scale: 32.0 2023-06-15 09:07:21,835 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.78 vs. limit=6.0 2023-06-15 09:07:51,282 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 09:07:53,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=93406.66666666667, ans=0.125 2023-06-15 09:07:54,429 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.908e+02 2.202e+02 2.518e+02 3.722e+02, threshold=4.404e+02, percent-clipped=0.0 2023-06-15 09:08:03,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=93406.66666666667, ans=0.04949747468305833 2023-06-15 09:08:12,612 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.51 vs. limit=15.0 2023-06-15 09:08:21,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=93473.33333333333, ans=0.07 2023-06-15 09:08:28,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=93540.0, ans=0.125 2023-06-15 09:08:43,714 INFO [train.py:988] (2/4) Epoch 27, batch 200, loss[loss=0.2533, simple_loss=0.3348, pruned_loss=0.08593, over 16740.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3061, pruned_loss=0.07824, over 2417488.23 frames. ], batch size: 59, lr: 1.19e-02, grad_scale: 32.0 2023-06-15 09:09:21,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=93740.0, ans=0.0 2023-06-15 09:10:08,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=93873.33333333333, ans=0.0 2023-06-15 09:10:11,428 INFO [train.py:988] (2/4) Epoch 27, batch 250, loss[loss=0.2076, simple_loss=0.2841, pruned_loss=0.0656, over 19099.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3067, pruned_loss=0.07807, over 2715216.27 frames. ], batch size: 94, lr: 1.19e-02, grad_scale: 32.0 2023-06-15 09:10:36,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=94006.66666666667, ans=0.125 2023-06-15 09:10:39,298 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.05 vs. limit=15.0 2023-06-15 09:10:46,507 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.53 vs. 
limit=15.0 2023-06-15 09:10:50,267 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.782e+02 1.944e+02 2.302e+02 3.570e+02, threshold=3.888e+02, percent-clipped=0.0 2023-06-15 09:11:11,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=94140.0, ans=0.125 2023-06-15 09:11:24,231 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 09:11:25,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=94206.66666666667, ans=0.0 2023-06-15 09:11:28,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=94206.66666666667, ans=0.125 2023-06-15 09:11:39,399 INFO [train.py:988] (2/4) Epoch 27, batch 300, loss[loss=0.2289, simple_loss=0.3031, pruned_loss=0.07736, over 20223.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3061, pruned_loss=0.07829, over 2955634.27 frames. ], batch size: 141, lr: 1.19e-02, grad_scale: 32.0 2023-06-15 09:12:03,488 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.41 vs. limit=15.0 2023-06-15 09:12:07,943 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 09:12:28,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=94406.66666666667, ans=0.1 2023-06-15 09:12:39,461 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.77 vs. limit=12.0 2023-06-15 09:12:52,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=94540.0, ans=0.2 2023-06-15 09:12:53,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=94540.0, ans=0.0 2023-06-15 09:12:55,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=94540.0, ans=0.0 2023-06-15 09:13:06,376 INFO [train.py:988] (2/4) Epoch 27, batch 350, loss[loss=0.2341, simple_loss=0.2957, pruned_loss=0.08627, over 20280.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3068, pruned_loss=0.078, over 3122783.98 frames. 
], batch size: 239, lr: 1.19e-02, grad_scale: 16.0 2023-06-15 09:13:15,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=94606.66666666667, ans=0.125 2023-06-15 09:13:16,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=94606.66666666667, ans=0.125 2023-06-15 09:13:46,988 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.867e+02 2.054e+02 2.333e+02 3.496e+02, threshold=4.108e+02, percent-clipped=0.0 2023-06-15 09:13:48,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=94740.0, ans=0.0 2023-06-15 09:13:49,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=94740.0, ans=0.0 2023-06-15 09:13:50,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=94740.0, ans=0.0 2023-06-15 09:14:00,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=94806.66666666667, ans=0.125 2023-06-15 09:14:07,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=94806.66666666667, ans=0.125 2023-06-15 09:14:33,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=94940.0, ans=0.125 2023-06-15 09:14:35,028 INFO [train.py:988] (2/4) Epoch 27, batch 400, loss[loss=0.219, simple_loss=0.304, pruned_loss=0.06696, over 19207.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3062, pruned_loss=0.07796, over 3270640.49 frames. ], batch size: 92, lr: 1.19e-02, grad_scale: 32.0 2023-06-15 09:14:43,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=94940.0, ans=0.1 2023-06-15 09:14:46,010 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.48 vs. limit=15.0 2023-06-15 09:14:57,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=95006.66666666667, ans=0.0 2023-06-15 09:15:29,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=95140.0, ans=0.2 2023-06-15 09:15:46,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=95206.66666666667, ans=0.125 2023-06-15 09:16:02,639 INFO [train.py:988] (2/4) Epoch 27, batch 450, loss[loss=0.2523, simple_loss=0.3319, pruned_loss=0.08639, over 16263.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3056, pruned_loss=0.07763, over 3384562.47 frames. ], batch size: 52, lr: 1.18e-02, grad_scale: 16.0 2023-06-15 09:16:06,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=95273.33333333333, ans=0.125 2023-06-15 09:16:27,808 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.16 vs. 
limit=5.0 2023-06-15 09:16:37,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=95406.66666666667, ans=0.1 2023-06-15 09:16:43,705 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 1.922e+02 2.176e+02 2.834e+02 5.039e+02, threshold=4.352e+02, percent-clipped=1.0 2023-06-15 09:16:44,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=95406.66666666667, ans=0.09899494936611666 2023-06-15 09:17:22,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=95540.0, ans=0.2 2023-06-15 09:17:27,055 INFO [train.py:988] (2/4) Epoch 27, batch 500, loss[loss=0.1924, simple_loss=0.2753, pruned_loss=0.05481, over 19332.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3056, pruned_loss=0.07766, over 3477447.74 frames. ], batch size: 98, lr: 1.18e-02, grad_scale: 16.0 2023-06-15 09:18:06,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=95740.0, ans=0.125 2023-06-15 09:18:11,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=95740.0, ans=0.0 2023-06-15 09:18:13,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=95740.0, ans=0.125 2023-06-15 09:18:47,753 INFO [train.py:988] (2/4) Epoch 28, batch 0, loss[loss=0.2078, simple_loss=0.2936, pruned_loss=0.06099, over 18605.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2936, pruned_loss=0.06099, over 18605.00 frames. ], batch size: 80, lr: 1.16e-02, grad_scale: 32.0 2023-06-15 09:18:47,754 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 09:18:53,823 INFO [train.py:1020] (2/4) Epoch 28, validation: loss=0.203, simple_loss=0.307, pruned_loss=0.0495, over 143649.00 frames. 2023-06-15 09:18:53,824 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 09:19:56,467 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.19 vs. limit=15.0 2023-06-15 09:19:58,056 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.41 vs. limit=15.0 2023-06-15 09:20:05,434 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.841e+02 2.127e+02 2.560e+02 4.411e+02, threshold=4.254e+02, percent-clipped=1.0 2023-06-15 09:20:21,888 INFO [train.py:988] (2/4) Epoch 28, batch 50, loss[loss=0.2464, simple_loss=0.3303, pruned_loss=0.08122, over 16437.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3025, pruned_loss=0.07828, over 851836.38 frames. ], batch size: 52, lr: 1.16e-02, grad_scale: 32.0 2023-06-15 09:20:29,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=96160.0, ans=0.125 2023-06-15 09:20:34,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=96160.0, ans=0.125 2023-06-15 09:20:45,572 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.17 vs. 
limit=22.5 2023-06-15 09:21:01,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=96293.33333333333, ans=0.0 2023-06-15 09:21:07,081 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.07 vs. limit=15.0 2023-06-15 09:21:31,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=96426.66666666667, ans=0.125 2023-06-15 09:21:47,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=96426.66666666667, ans=0.125 2023-06-15 09:21:51,029 INFO [train.py:988] (2/4) Epoch 28, batch 100, loss[loss=0.2267, simple_loss=0.3193, pruned_loss=0.06707, over 11809.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3031, pruned_loss=0.07648, over 1500155.81 frames. ], batch size: 33, lr: 1.16e-02, grad_scale: 32.0 2023-06-15 09:21:51,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=96493.33333333333, ans=0.125 2023-06-15 09:21:51,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=96493.33333333333, ans=0.2 2023-06-15 09:22:35,961 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.18 vs. limit=15.0 2023-06-15 09:22:50,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=96693.33333333333, ans=0.125 2023-06-15 09:22:58,604 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.65 vs. limit=22.5 2023-06-15 09:22:59,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=96760.0, ans=0.5 2023-06-15 09:22:59,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=96760.0, ans=0.1 2023-06-15 09:23:02,343 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.05 vs. limit=6.0 2023-06-15 09:23:02,539 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.826e+02 2.059e+02 2.321e+02 3.339e+02, threshold=4.117e+02, percent-clipped=0.0 2023-06-15 09:23:18,525 INFO [train.py:988] (2/4) Epoch 28, batch 150, loss[loss=0.231, simple_loss=0.3092, pruned_loss=0.07641, over 18628.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3038, pruned_loss=0.07709, over 2020720.62 frames. ], batch size: 80, lr: 1.16e-02, grad_scale: 32.0 2023-06-15 09:23:46,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=96893.33333333333, ans=0.125 2023-06-15 09:23:58,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=96960.0, ans=0.0 2023-06-15 09:24:21,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=97026.66666666667, ans=0.0 2023-06-15 09:24:46,201 INFO [train.py:988] (2/4) Epoch 28, batch 200, loss[loss=0.2432, simple_loss=0.3097, pruned_loss=0.08835, over 20000.00 frames. 
], tot_loss[loss=0.2292, simple_loss=0.3035, pruned_loss=0.07742, over 2408937.02 frames. ], batch size: 126, lr: 1.15e-02, grad_scale: 32.0 2023-06-15 09:24:46,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=97160.0, ans=0.125 2023-06-15 09:24:49,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=97160.0, ans=0.0 2023-06-15 09:25:10,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=97226.66666666667, ans=0.125 2023-06-15 09:25:22,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=97293.33333333333, ans=0.0 2023-06-15 09:25:44,183 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.23 vs. limit=15.0 2023-06-15 09:25:53,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=97426.66666666667, ans=0.0 2023-06-15 09:25:56,378 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.620e+02 1.821e+02 1.977e+02 2.337e+02 4.646e+02, threshold=3.954e+02, percent-clipped=1.0 2023-06-15 09:26:11,651 INFO [train.py:988] (2/4) Epoch 28, batch 250, loss[loss=0.2251, simple_loss=0.3004, pruned_loss=0.07494, over 19667.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3034, pruned_loss=0.07697, over 2726492.23 frames. ], batch size: 110, lr: 1.15e-02, grad_scale: 32.0 2023-06-15 09:26:31,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=97560.0, ans=0.125 2023-06-15 09:26:52,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=97626.66666666667, ans=0.0 2023-06-15 09:27:12,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=97693.33333333333, ans=0.0 2023-06-15 09:27:23,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=97760.0, ans=0.0 2023-06-15 09:27:30,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=97760.0, ans=0.125 2023-06-15 09:27:41,187 INFO [train.py:988] (2/4) Epoch 28, batch 300, loss[loss=0.2209, simple_loss=0.3026, pruned_loss=0.06956, over 19833.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3029, pruned_loss=0.07713, over 2961903.75 frames. ], batch size: 115, lr: 1.15e-02, grad_scale: 32.0 2023-06-15 09:27:46,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=97826.66666666667, ans=0.125 2023-06-15 09:27:58,718 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.92 vs. 
limit=8.0 2023-06-15 09:28:14,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=97960.0, ans=0.1 2023-06-15 09:28:25,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=97960.0, ans=0.125 2023-06-15 09:28:31,153 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.02 vs. limit=15.0 2023-06-15 09:28:52,629 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.613e+02 1.947e+02 2.273e+02 2.757e+02 4.812e+02, threshold=4.546e+02, percent-clipped=3.0 2023-06-15 09:29:07,345 INFO [train.py:988] (2/4) Epoch 28, batch 350, loss[loss=0.2251, simple_loss=0.301, pruned_loss=0.07458, over 19433.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3028, pruned_loss=0.07645, over 3133788.64 frames. ], batch size: 105, lr: 1.15e-02, grad_scale: 32.0 2023-06-15 09:29:43,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=98293.33333333333, ans=0.125 2023-06-15 09:30:15,733 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.11 vs. limit=15.0 2023-06-15 09:30:23,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=98426.66666666667, ans=0.125 2023-06-15 09:30:29,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=98426.66666666667, ans=0.1 2023-06-15 09:30:34,495 INFO [train.py:988] (2/4) Epoch 28, batch 400, loss[loss=0.2285, simple_loss=0.3109, pruned_loss=0.07306, over 16842.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3036, pruned_loss=0.07596, over 3268487.54 frames. 
], batch size: 59, lr: 1.15e-02, grad_scale: 32.0 2023-06-15 09:30:43,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=98493.33333333333, ans=0.0 2023-06-15 09:30:56,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=98560.0, ans=0.125 2023-06-15 09:31:05,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=98560.0, ans=0.1 2023-06-15 09:31:08,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=98626.66666666667, ans=0.125 2023-06-15 09:31:22,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=98626.66666666667, ans=0.2 2023-06-15 09:31:24,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=98626.66666666667, ans=0.125 2023-06-15 09:31:28,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=98693.33333333333, ans=0.125 2023-06-15 09:31:45,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=98760.0, ans=0.125 2023-06-15 09:31:48,682 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.911e+02 2.084e+02 2.355e+02 3.960e+02, threshold=4.169e+02, percent-clipped=0.0 2023-06-15 09:32:01,892 INFO [train.py:988] (2/4) Epoch 28, batch 450, loss[loss=0.2289, simple_loss=0.2983, pruned_loss=0.07969, over 20632.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3039, pruned_loss=0.07601, over 3382617.39 frames. ], batch size: 189, lr: 1.15e-02, grad_scale: 32.0 2023-06-15 09:32:17,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=98826.66666666667, ans=0.1 2023-06-15 09:32:20,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=98893.33333333333, ans=0.125 2023-06-15 09:32:46,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=98960.0, ans=0.125 2023-06-15 09:32:46,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=98960.0, ans=0.2 2023-06-15 09:33:04,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=99026.66666666667, ans=0.1 2023-06-15 09:33:06,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=99026.66666666667, ans=0.2 2023-06-15 09:33:22,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=99093.33333333333, ans=0.0 2023-06-15 09:33:28,919 INFO [train.py:988] (2/4) Epoch 28, batch 500, loss[loss=0.2543, simple_loss=0.295, pruned_loss=0.1068, over 16874.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3032, pruned_loss=0.07644, over 3484192.77 frames. ], batch size: 391, lr: 1.15e-02, grad_scale: 32.0 2023-06-15 09:33:37,742 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.79 vs. 
limit=12.0 2023-06-15 09:33:39,905 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.18 vs. limit=22.5 2023-06-15 09:33:42,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=99160.0, ans=0.0 2023-06-15 09:33:58,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=99226.66666666667, ans=0.125 2023-06-15 09:34:02,144 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.73 vs. limit=22.5 2023-06-15 09:34:46,538 INFO [train.py:988] (2/4) Epoch 29, batch 0, loss[loss=0.2214, simple_loss=0.2875, pruned_loss=0.0776, over 20574.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2875, pruned_loss=0.0776, over 20574.00 frames. ], batch size: 189, lr: 1.12e-02, grad_scale: 32.0 2023-06-15 09:34:46,539 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 09:34:52,710 INFO [train.py:1020] (2/4) Epoch 29, validation: loss=0.2012, simple_loss=0.3049, pruned_loss=0.04872, over 143649.00 frames. 2023-06-15 09:34:52,712 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 09:35:07,437 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.808e+02 2.025e+02 2.226e+02 3.535e+02, threshold=4.050e+02, percent-clipped=0.0 2023-06-15 09:36:20,599 INFO [train.py:988] (2/4) Epoch 29, batch 50, loss[loss=0.2078, simple_loss=0.2864, pruned_loss=0.06465, over 19798.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3015, pruned_loss=0.07542, over 862713.95 frames. ], batch size: 115, lr: 1.12e-02, grad_scale: 32.0 2023-06-15 09:36:42,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=99780.0, ans=0.025 2023-06-15 09:37:04,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=99846.66666666667, ans=0.125 2023-06-15 09:37:26,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=99913.33333333333, ans=0.0 2023-06-15 09:37:35,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=99980.0, ans=0.125 2023-06-15 09:37:48,024 INFO [train.py:988] (2/4) Epoch 29, batch 100, loss[loss=0.2159, simple_loss=0.2803, pruned_loss=0.07577, over 20661.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3019, pruned_loss=0.07515, over 1513245.58 frames. 
], batch size: 211, lr: 1.12e-02, grad_scale: 32.0 2023-06-15 09:38:03,415 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.850e+02 2.039e+02 2.478e+02 3.886e+02, threshold=4.079e+02, percent-clipped=0.0 2023-06-15 09:38:21,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=100180.0, ans=0.0 2023-06-15 09:38:37,247 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=2.562e-03 2023-06-15 09:38:39,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=100246.66666666667, ans=0.0 2023-06-15 09:38:45,720 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 09:38:47,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=100246.66666666667, ans=0.125 2023-06-15 09:38:59,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=100313.33333333333, ans=0.1 2023-06-15 09:39:15,394 INFO [train.py:988] (2/4) Epoch 29, batch 150, loss[loss=0.2375, simple_loss=0.3129, pruned_loss=0.08103, over 19128.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3021, pruned_loss=0.07557, over 2002468.83 frames. ], batch size: 94, lr: 1.12e-02, grad_scale: 32.0 2023-06-15 09:39:21,498 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.54 vs. limit=15.0 2023-06-15 09:40:27,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=100646.66666666667, ans=0.125 2023-06-15 09:40:38,504 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.38 vs. limit=15.0 2023-06-15 09:40:40,965 INFO [train.py:988] (2/4) Epoch 29, batch 200, loss[loss=0.2197, simple_loss=0.302, pruned_loss=0.06874, over 19215.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3014, pruned_loss=0.07608, over 2393406.19 frames. ], batch size: 92, lr: 1.12e-02, grad_scale: 32.0 2023-06-15 09:40:57,024 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.792e+02 2.001e+02 2.365e+02 3.519e+02, threshold=4.002e+02, percent-clipped=0.0 2023-06-15 09:40:57,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=100780.0, ans=0.125 2023-06-15 09:40:57,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=100780.0, ans=0.125 2023-06-15 09:41:28,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=100846.66666666667, ans=0.0 2023-06-15 09:42:08,477 INFO [train.py:988] (2/4) Epoch 29, batch 250, loss[loss=0.2446, simple_loss=0.3307, pruned_loss=0.07929, over 16335.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3019, pruned_loss=0.07582, over 2685305.17 frames. ], batch size: 52, lr: 1.12e-02, grad_scale: 32.0 2023-06-15 09:42:51,242 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.87 vs. 
limit=15.0 2023-06-15 09:42:58,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=101246.66666666667, ans=0.0 2023-06-15 09:43:15,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=101246.66666666667, ans=0.0 2023-06-15 09:43:35,037 INFO [train.py:988] (2/4) Epoch 29, batch 300, loss[loss=0.2596, simple_loss=0.3363, pruned_loss=0.09146, over 16982.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3022, pruned_loss=0.07566, over 2917338.38 frames. ], batch size: 60, lr: 1.12e-02, grad_scale: 32.0 2023-06-15 09:43:50,908 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.806e+02 2.034e+02 2.286e+02 3.270e+02, threshold=4.068e+02, percent-clipped=0.0 2023-06-15 09:44:03,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=101446.66666666667, ans=0.09899494936611666 2023-06-15 09:44:03,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=101446.66666666667, ans=0.025 2023-06-15 09:44:08,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=101513.33333333333, ans=0.1 2023-06-15 09:44:16,205 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.15 vs. limit=22.5 2023-06-15 09:44:33,474 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.99 vs. limit=15.0 2023-06-15 09:45:02,611 INFO [train.py:988] (2/4) Epoch 29, batch 350, loss[loss=0.2139, simple_loss=0.2965, pruned_loss=0.06565, over 19706.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3024, pruned_loss=0.07574, over 3103024.35 frames. ], batch size: 110, lr: 1.11e-02, grad_scale: 16.0 2023-06-15 09:45:22,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=101780.0, ans=0.125 2023-06-15 09:45:22,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=101780.0, ans=0.125 2023-06-15 09:45:27,342 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.72 vs. limit=15.0 2023-06-15 09:45:30,973 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.05 vs. limit=22.5 2023-06-15 09:45:35,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=101846.66666666667, ans=0.04949747468305833 2023-06-15 09:45:39,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=101846.66666666667, ans=0.125 2023-06-15 09:45:57,260 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.18 vs. 
limit=15.0 2023-06-15 09:46:04,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=101913.33333333333, ans=0.05 2023-06-15 09:46:22,362 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 09:46:29,149 INFO [train.py:988] (2/4) Epoch 29, batch 400, loss[loss=0.2294, simple_loss=0.3011, pruned_loss=0.07883, over 19561.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3023, pruned_loss=0.07565, over 3261511.96 frames. ], batch size: 102, lr: 1.11e-02, grad_scale: 32.0 2023-06-15 09:46:32,916 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 09:46:34,498 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 09:46:46,570 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 1.969e+02 2.274e+02 2.602e+02 3.577e+02, threshold=4.548e+02, percent-clipped=0.0 2023-06-15 09:46:48,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=102113.33333333333, ans=0.0 2023-06-15 09:47:29,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=102246.66666666667, ans=0.125 2023-06-15 09:47:55,567 INFO [train.py:988] (2/4) Epoch 29, batch 450, loss[loss=0.1909, simple_loss=0.2775, pruned_loss=0.05214, over 19843.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3025, pruned_loss=0.07533, over 3379513.78 frames. ], batch size: 120, lr: 1.11e-02, grad_scale: 32.0 2023-06-15 09:47:56,852 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=15.0 2023-06-15 09:48:05,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=102380.0, ans=0.125 2023-06-15 09:48:24,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=102446.66666666667, ans=0.1 2023-06-15 09:48:51,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=102580.0, ans=0.0 2023-06-15 09:49:20,987 INFO [train.py:988] (2/4) Epoch 29, batch 500, loss[loss=0.2362, simple_loss=0.3046, pruned_loss=0.08384, over 20239.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3015, pruned_loss=0.07555, over 3487188.35 frames. ], batch size: 141, lr: 1.11e-02, grad_scale: 32.0 2023-06-15 09:49:37,633 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.900e+02 2.167e+02 2.548e+02 3.918e+02, threshold=4.335e+02, percent-clipped=0.0 2023-06-15 09:49:48,541 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.60 vs. limit=22.5 2023-06-15 09:49:49,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=102780.0, ans=0.125 2023-06-15 09:49:59,157 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.21 vs. 
limit=15.0 2023-06-15 09:50:04,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=102846.66666666667, ans=0.0 2023-06-15 09:50:35,323 INFO [train.py:988] (2/4) Epoch 30, batch 0, loss[loss=0.2155, simple_loss=0.2989, pruned_loss=0.06605, over 19865.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2989, pruned_loss=0.06605, over 19865.00 frames. ], batch size: 120, lr: 1.09e-02, grad_scale: 32.0 2023-06-15 09:50:35,323 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 09:50:40,738 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.8422, 5.1625, 5.4739, 5.0172], device='cuda:2') 2023-06-15 09:50:41,566 INFO [train.py:1020] (2/4) Epoch 30, validation: loss=0.2006, simple_loss=0.3036, pruned_loss=0.04881, over 143649.00 frames. 2023-06-15 09:50:41,567 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 09:51:12,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=102993.33333333333, ans=0.125 2023-06-15 09:51:13,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=103060.0, ans=0.0 2023-06-15 09:51:34,981 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.92 vs. limit=6.0 2023-06-15 09:52:00,853 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.56 vs. limit=15.0 2023-06-15 09:52:08,089 INFO [train.py:988] (2/4) Epoch 30, batch 50, loss[loss=0.2346, simple_loss=0.3175, pruned_loss=0.07582, over 18791.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3022, pruned_loss=0.07467, over 856730.01 frames. ], batch size: 83, lr: 1.09e-02, grad_scale: 32.0 2023-06-15 09:52:16,866 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 09:52:25,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=103326.66666666667, ans=0.0 2023-06-15 09:52:46,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=103393.33333333333, ans=0.2 2023-06-15 09:52:49,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=103393.33333333333, ans=0.2 2023-06-15 09:52:55,897 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.895e+02 2.155e+02 2.482e+02 4.117e+02, threshold=4.310e+02, percent-clipped=0.0 2023-06-15 09:53:14,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=103460.0, ans=0.2 2023-06-15 09:53:16,411 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.38 vs. limit=6.0 2023-06-15 09:53:34,739 INFO [train.py:988] (2/4) Epoch 30, batch 100, loss[loss=0.2094, simple_loss=0.2917, pruned_loss=0.0635, over 19676.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3037, pruned_loss=0.07349, over 1478071.15 frames. 
], batch size: 110, lr: 1.09e-02, grad_scale: 32.0 2023-06-15 09:53:43,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=103593.33333333333, ans=10.0 2023-06-15 09:53:59,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=103660.0, ans=0.125 2023-06-15 09:54:43,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=103860.0, ans=0.1 2023-06-15 09:54:53,980 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.67 vs. limit=15.0 2023-06-15 09:55:01,983 INFO [train.py:988] (2/4) Epoch 30, batch 150, loss[loss=0.2267, simple_loss=0.2999, pruned_loss=0.07678, over 20091.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3024, pruned_loss=0.07358, over 1995562.14 frames. ], batch size: 133, lr: 1.09e-02, grad_scale: 32.0 2023-06-15 09:55:15,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=103926.66666666667, ans=0.2 2023-06-15 09:55:23,745 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-15 09:55:26,857 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 09:55:32,335 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.93 vs. limit=12.0 2023-06-15 09:55:51,197 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.525e+02 1.792e+02 2.087e+02 2.377e+02 3.180e+02, threshold=4.174e+02, percent-clipped=0.0 2023-06-15 09:56:08,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=104126.66666666667, ans=0.125 2023-06-15 09:56:13,079 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.68 vs. limit=22.5 2023-06-15 09:56:17,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=104193.33333333333, ans=0.125 2023-06-15 09:56:19,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=104193.33333333333, ans=0.2 2023-06-15 09:56:22,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=104193.33333333333, ans=0.125 2023-06-15 09:56:22,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=104193.33333333333, ans=0.125 2023-06-15 09:56:26,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=104193.33333333333, ans=0.125 2023-06-15 09:56:29,874 INFO [train.py:988] (2/4) Epoch 30, batch 200, loss[loss=0.224, simple_loss=0.2861, pruned_loss=0.08094, over 20178.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3016, pruned_loss=0.07413, over 2386020.29 frames. 
], batch size: 239, lr: 1.08e-02, grad_scale: 32.0 2023-06-15 09:56:36,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=104260.0, ans=0.0 2023-06-15 09:56:52,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=104326.66666666667, ans=0.125 2023-06-15 09:57:27,868 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.27 vs. limit=15.0 2023-06-15 09:57:48,349 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 09:57:54,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=104593.33333333333, ans=0.0 2023-06-15 09:57:56,079 INFO [train.py:988] (2/4) Epoch 30, batch 250, loss[loss=0.2344, simple_loss=0.3052, pruned_loss=0.08176, over 20120.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3006, pruned_loss=0.07418, over 2698507.06 frames. ], batch size: 133, lr: 1.08e-02, grad_scale: 32.0 2023-06-15 09:57:56,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=104593.33333333333, ans=0.125 2023-06-15 09:58:40,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=104726.66666666667, ans=0.0 2023-06-15 09:58:45,942 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.806e+02 1.918e+02 2.139e+02 3.492e+02, threshold=3.836e+02, percent-clipped=0.0 2023-06-15 09:59:01,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=104793.33333333333, ans=0.0 2023-06-15 09:59:23,442 INFO [train.py:988] (2/4) Epoch 30, batch 300, loss[loss=0.2283, simple_loss=0.3041, pruned_loss=0.07622, over 20290.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3004, pruned_loss=0.074, over 2938922.06 frames. ], batch size: 149, lr: 1.08e-02, grad_scale: 32.0 2023-06-15 09:59:34,770 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.22 vs. limit=15.0 2023-06-15 09:59:42,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=104993.33333333333, ans=0.025 2023-06-15 09:59:44,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=104993.33333333333, ans=0.125 2023-06-15 09:59:48,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=104993.33333333333, ans=0.125 2023-06-15 09:59:53,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=104993.33333333333, ans=0.1 2023-06-15 10:00:27,432 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.04 vs. 
limit=15.0 2023-06-15 10:00:40,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=105193.33333333333, ans=0.125 2023-06-15 10:00:45,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=105193.33333333333, ans=0.1 2023-06-15 10:00:49,822 INFO [train.py:988] (2/4) Epoch 30, batch 350, loss[loss=0.2174, simple_loss=0.3025, pruned_loss=0.06611, over 18648.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3003, pruned_loss=0.07355, over 3125849.31 frames. ], batch size: 80, lr: 1.08e-02, grad_scale: 32.0 2023-06-15 10:00:54,488 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.37 vs. limit=22.5 2023-06-15 10:01:18,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=105326.66666666667, ans=0.2 2023-06-15 10:01:38,550 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.770e+02 1.941e+02 2.233e+02 2.810e+02, threshold=3.882e+02, percent-clipped=0.0 2023-06-15 10:01:50,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=105460.0, ans=0.0 2023-06-15 10:01:52,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=105460.0, ans=0.1 2023-06-15 10:02:04,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=105526.66666666667, ans=0.0 2023-06-15 10:02:15,109 INFO [train.py:988] (2/4) Epoch 30, batch 400, loss[loss=0.2231, simple_loss=0.304, pruned_loss=0.07113, over 18819.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3008, pruned_loss=0.07336, over 3269925.70 frames. ], batch size: 83, lr: 1.08e-02, grad_scale: 32.0 2023-06-15 10:02:35,728 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 10:02:56,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=105726.66666666667, ans=0.125 2023-06-15 10:03:12,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=105793.33333333333, ans=0.125 2023-06-15 10:03:20,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=105793.33333333333, ans=0.125 2023-06-15 10:03:23,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=105860.0, ans=0.2 2023-06-15 10:03:32,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=105860.0, ans=0.1 2023-06-15 10:03:40,276 INFO [train.py:988] (2/4) Epoch 30, batch 450, loss[loss=0.207, simple_loss=0.2945, pruned_loss=0.05975, over 18950.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3001, pruned_loss=0.07301, over 3387512.87 frames. 
], batch size: 86, lr: 1.08e-02, grad_scale: 32.0 2023-06-15 10:04:11,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=105993.33333333333, ans=0.125 2023-06-15 10:04:28,190 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.828e+02 2.027e+02 2.323e+02 3.183e+02, threshold=4.054e+02, percent-clipped=0.0 2023-06-15 10:04:45,233 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.11 vs. limit=6.0 2023-06-15 10:04:52,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=106193.33333333333, ans=0.0 2023-06-15 10:05:02,105 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.69 vs. limit=15.0 2023-06-15 10:05:04,414 INFO [train.py:988] (2/4) Epoch 30, batch 500, loss[loss=0.2286, simple_loss=0.3026, pruned_loss=0.07727, over 19961.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3007, pruned_loss=0.07321, over 3462875.11 frames. ], batch size: 126, lr: 1.08e-02, grad_scale: 32.0 2023-06-15 10:05:14,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=106260.0, ans=0.1 2023-06-15 10:05:18,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=106260.0, ans=0.0 2023-06-15 10:05:38,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=106393.33333333333, ans=0.0 2023-06-15 10:05:42,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=106393.33333333333, ans=0.0 2023-06-15 10:05:42,529 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.24 vs. limit=22.5 2023-06-15 10:05:48,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=106393.33333333333, ans=0.09899494936611666 2023-06-15 10:05:51,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=106460.0, ans=0.0 2023-06-15 10:06:22,311 INFO [train.py:988] (2/4) Epoch 31, batch 0, loss[loss=0.23, simple_loss=0.299, pruned_loss=0.08044, over 20105.00 frames. ], tot_loss[loss=0.23, simple_loss=0.299, pruned_loss=0.08044, over 20105.00 frames. ], batch size: 133, lr: 1.06e-02, grad_scale: 32.0 2023-06-15 10:06:22,311 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 10:06:28,531 INFO [train.py:1020] (2/4) Epoch 31, validation: loss=0.2014, simple_loss=0.3032, pruned_loss=0.0498, over 143649.00 frames. 
2023-06-15 10:06:28,532 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 10:06:34,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=106480.0, ans=0.125 2023-06-15 10:06:40,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=106480.0, ans=0.2 2023-06-15 10:06:42,234 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 10:07:18,808 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.97 vs. limit=8.0 2023-06-15 10:07:22,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=106680.0, ans=0.2 2023-06-15 10:07:25,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=106680.0, ans=0.05 2023-06-15 10:07:38,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=106746.66666666667, ans=0.04949747468305833 2023-06-15 10:07:46,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=106746.66666666667, ans=0.1 2023-06-15 10:07:48,110 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.719e+02 1.932e+02 2.142e+02 3.238e+02, threshold=3.865e+02, percent-clipped=0.0 2023-06-15 10:07:55,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=106813.33333333333, ans=0.95 2023-06-15 10:07:56,259 INFO [train.py:988] (2/4) Epoch 31, batch 50, loss[loss=0.2269, simple_loss=0.3076, pruned_loss=0.0731, over 18908.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3012, pruned_loss=0.07232, over 853161.52 frames. ], batch size: 86, lr: 1.06e-02, grad_scale: 32.0 2023-06-15 10:07:56,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=106813.33333333333, ans=0.0 2023-06-15 10:07:58,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=106813.33333333333, ans=0.0 2023-06-15 10:08:15,186 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 10:08:41,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=106946.66666666667, ans=0.0 2023-06-15 10:08:44,116 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.59 vs. limit=15.0 2023-06-15 10:08:51,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=107013.33333333333, ans=0.0 2023-06-15 10:09:06,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=107080.0, ans=0.125 2023-06-15 10:09:22,796 INFO [train.py:988] (2/4) Epoch 31, batch 100, loss[loss=0.2132, simple_loss=0.2963, pruned_loss=0.06503, over 19823.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2993, pruned_loss=0.07199, over 1510776.99 frames. 
], batch size: 115, lr: 1.05e-02, grad_scale: 32.0 2023-06-15 10:09:32,671 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.74 vs. limit=22.5 2023-06-15 10:09:44,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=107213.33333333333, ans=0.0 2023-06-15 10:09:56,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=107280.0, ans=0.125 2023-06-15 10:10:25,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=107346.66666666667, ans=0.0 2023-06-15 10:10:40,186 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.809e+02 2.076e+02 2.464e+02 3.768e+02, threshold=4.152e+02, percent-clipped=0.0 2023-06-15 10:10:49,496 INFO [train.py:988] (2/4) Epoch 31, batch 150, loss[loss=0.2359, simple_loss=0.3024, pruned_loss=0.08468, over 20662.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2997, pruned_loss=0.07272, over 2022112.46 frames. ], batch size: 211, lr: 1.05e-02, grad_scale: 32.0 2023-06-15 10:11:06,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=107546.66666666667, ans=0.125 2023-06-15 10:12:05,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=107746.66666666667, ans=0.0 2023-06-15 10:12:12,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=107746.66666666667, ans=0.0 2023-06-15 10:12:15,585 INFO [train.py:988] (2/4) Epoch 31, batch 200, loss[loss=0.2209, simple_loss=0.2898, pruned_loss=0.07598, over 20547.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2987, pruned_loss=0.07246, over 2416756.17 frames. ], batch size: 173, lr: 1.05e-02, grad_scale: 32.0 2023-06-15 10:12:30,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=107813.33333333333, ans=0.1 2023-06-15 10:13:09,411 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.22 vs. limit=22.5 2023-06-15 10:13:33,054 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.862e+02 2.218e+02 2.544e+02 4.456e+02, threshold=4.436e+02, percent-clipped=1.0 2023-06-15 10:13:41,908 INFO [train.py:988] (2/4) Epoch 31, batch 250, loss[loss=0.2133, simple_loss=0.2927, pruned_loss=0.06698, over 19990.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2992, pruned_loss=0.07293, over 2721516.55 frames. ], batch size: 126, lr: 1.05e-02, grad_scale: 32.0 2023-06-15 10:13:43,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=108146.66666666667, ans=0.0 2023-06-15 10:13:48,156 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.11 vs. 
limit=15.0 2023-06-15 10:14:00,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=108213.33333333333, ans=0.125 2023-06-15 10:14:23,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=108280.0, ans=0.125 2023-06-15 10:14:39,358 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.78 vs. limit=12.0 2023-06-15 10:14:47,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=108346.66666666667, ans=0.2 2023-06-15 10:14:47,891 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.41 vs. limit=15.0 2023-06-15 10:15:02,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=108413.33333333333, ans=0.2 2023-06-15 10:15:06,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=108413.33333333333, ans=0.125 2023-06-15 10:15:09,066 INFO [train.py:988] (2/4) Epoch 31, batch 300, loss[loss=0.2225, simple_loss=0.2957, pruned_loss=0.0746, over 19641.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2991, pruned_loss=0.07266, over 2963458.62 frames. ], batch size: 110, lr: 1.05e-02, grad_scale: 32.0 2023-06-15 10:15:35,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=108546.66666666667, ans=0.1 2023-06-15 10:15:55,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=108613.33333333333, ans=0.125 2023-06-15 10:16:07,393 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.74 vs. limit=15.0 2023-06-15 10:16:10,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=108680.0, ans=0.2 2023-06-15 10:16:26,650 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.814e+02 2.012e+02 2.294e+02 3.766e+02, threshold=4.023e+02, percent-clipped=0.0 2023-06-15 10:16:34,824 INFO [train.py:988] (2/4) Epoch 31, batch 350, loss[loss=0.2169, simple_loss=0.2865, pruned_loss=0.07366, over 20564.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2989, pruned_loss=0.0722, over 3153316.35 frames. 
], batch size: 189, lr: 1.05e-02, grad_scale: 32.0 2023-06-15 10:16:37,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=108813.33333333333, ans=22.5 2023-06-15 10:16:42,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=108813.33333333333, ans=0.0 2023-06-15 10:16:49,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=108813.33333333333, ans=0.1 2023-06-15 10:17:06,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=108880.0, ans=0.1 2023-06-15 10:18:00,833 INFO [train.py:988] (2/4) Epoch 31, batch 400, loss[loss=0.2277, simple_loss=0.3067, pruned_loss=0.07434, over 19823.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2993, pruned_loss=0.07231, over 3299836.01 frames. ], batch size: 115, lr: 1.05e-02, grad_scale: 32.0 2023-06-15 10:18:19,264 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0 2023-06-15 10:18:45,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=109280.0, ans=0.125 2023-06-15 10:18:55,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=109346.66666666667, ans=0.0 2023-06-15 10:19:06,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=109346.66666666667, ans=0.1 2023-06-15 10:19:20,535 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.873e+02 2.079e+02 2.420e+02 3.208e+02, threshold=4.158e+02, percent-clipped=0.0 2023-06-15 10:19:27,273 INFO [train.py:988] (2/4) Epoch 31, batch 450, loss[loss=0.2123, simple_loss=0.2932, pruned_loss=0.06566, over 19661.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2989, pruned_loss=0.07186, over 3409090.94 frames. ], batch size: 110, lr: 1.05e-02, grad_scale: 32.0 2023-06-15 10:19:31,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=109480.0, ans=0.1 2023-06-15 10:19:31,332 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.82 vs. limit=15.0 2023-06-15 10:19:34,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=109480.0, ans=0.125 2023-06-15 10:20:05,347 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.11 vs. 
limit=15.0 2023-06-15 10:20:06,252 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 10:20:20,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=109680.0, ans=0.125 2023-06-15 10:20:35,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=109746.66666666667, ans=0.5 2023-06-15 10:20:51,577 INFO [train.py:988] (2/4) Epoch 31, batch 500, loss[loss=0.249, simple_loss=0.3244, pruned_loss=0.08681, over 16400.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2986, pruned_loss=0.07138, over 3482290.61 frames. ], batch size: 52, lr: 1.04e-02, grad_scale: 32.0 2023-06-15 10:21:06,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=109880.0, ans=0.125 2023-06-15 10:21:08,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=109880.0, ans=0.125 2023-06-15 10:21:13,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=109880.0, ans=0.0 2023-06-15 10:21:23,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=109946.66666666667, ans=0.1 2023-06-15 10:22:07,555 INFO [train.py:988] (2/4) Epoch 32, batch 0, loss[loss=0.2229, simple_loss=0.2934, pruned_loss=0.07621, over 20599.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2934, pruned_loss=0.07621, over 20599.00 frames. ], batch size: 211, lr: 1.03e-02, grad_scale: 32.0 2023-06-15 10:22:07,555 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 10:22:13,531 INFO [train.py:1020] (2/4) Epoch 32, validation: loss=0.1996, simple_loss=0.3022, pruned_loss=0.04853, over 143649.00 frames. 2023-06-15 10:22:13,532 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 10:22:17,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=110026.66666666667, ans=0.2 2023-06-15 10:22:36,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=110093.33333333333, ans=0.0 2023-06-15 10:22:37,962 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.793e+02 2.026e+02 2.443e+02 4.216e+02, threshold=4.052e+02, percent-clipped=1.0 2023-06-15 10:22:47,047 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.90 vs. 
limit=15.0 2023-06-15 10:22:57,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=110160.0, ans=0.2 2023-06-15 10:22:58,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=110160.0, ans=0.125 2023-06-15 10:23:00,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=110160.0, ans=0.125 2023-06-15 10:23:00,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=110160.0, ans=0.0 2023-06-15 10:23:02,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=110160.0, ans=0.125 2023-06-15 10:23:07,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=110226.66666666667, ans=0.125 2023-06-15 10:23:32,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=110293.33333333333, ans=0.1 2023-06-15 10:23:35,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=110293.33333333333, ans=0.0 2023-06-15 10:23:35,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=110293.33333333333, ans=0.125 2023-06-15 10:23:40,690 INFO [train.py:988] (2/4) Epoch 32, batch 50, loss[loss=0.2144, simple_loss=0.2925, pruned_loss=0.0681, over 19464.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2964, pruned_loss=0.07242, over 867216.10 frames. ], batch size: 105, lr: 1.03e-02, grad_scale: 32.0 2023-06-15 10:23:45,021 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.93 vs. limit=15.0 2023-06-15 10:23:51,289 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.49 vs. limit=15.0 2023-06-15 10:24:29,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=110493.33333333333, ans=0.05 2023-06-15 10:25:08,053 INFO [train.py:988] (2/4) Epoch 32, batch 100, loss[loss=0.2136, simple_loss=0.2964, pruned_loss=0.06537, over 19431.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2983, pruned_loss=0.07116, over 1507098.69 frames. 
], batch size: 105, lr: 1.02e-02, grad_scale: 32.0 2023-06-15 10:25:31,498 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.473e+02 1.727e+02 1.862e+02 2.037e+02 3.271e+02, threshold=3.724e+02, percent-clipped=0.0 2023-06-15 10:25:40,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=110826.66666666667, ans=0.0 2023-06-15 10:25:46,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=110826.66666666667, ans=0.0 2023-06-15 10:25:53,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=110826.66666666667, ans=0.0 2023-06-15 10:26:26,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=110960.0, ans=0.125 2023-06-15 10:26:30,584 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.97 vs. limit=6.0 2023-06-15 10:26:34,700 INFO [train.py:988] (2/4) Epoch 32, batch 150, loss[loss=0.2392, simple_loss=0.3228, pruned_loss=0.07778, over 16870.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2999, pruned_loss=0.07109, over 2001515.26 frames. ], batch size: 59, lr: 1.02e-02, grad_scale: 32.0 2023-06-15 10:26:37,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=111026.66666666667, ans=0.125 2023-06-15 10:27:06,668 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 10:27:24,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=111160.0, ans=0.125 2023-06-15 10:27:56,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=111293.33333333333, ans=0.1 2023-06-15 10:28:00,895 INFO [train.py:988] (2/4) Epoch 32, batch 200, loss[loss=0.2202, simple_loss=0.2998, pruned_loss=0.0703, over 19105.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2999, pruned_loss=0.07102, over 2380888.04 frames. 
], batch size: 89, lr: 1.02e-02, grad_scale: 32.0 2023-06-15 10:28:24,822 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.849e+02 2.097e+02 2.469e+02 3.862e+02, threshold=4.194e+02, percent-clipped=1.0 2023-06-15 10:28:30,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=111426.66666666667, ans=0.1 2023-06-15 10:28:33,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=111493.33333333333, ans=0.07 2023-06-15 10:28:56,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=111560.0, ans=0.125 2023-06-15 10:29:08,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=111626.66666666667, ans=0.125 2023-06-15 10:29:08,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=111626.66666666667, ans=0.05 2023-06-15 10:29:08,850 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.91 vs. limit=12.0 2023-06-15 10:29:26,811 INFO [train.py:988] (2/4) Epoch 32, batch 250, loss[loss=0.2252, simple_loss=0.3112, pruned_loss=0.06959, over 18291.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2998, pruned_loss=0.07118, over 2693758.81 frames. ], batch size: 74, lr: 1.02e-02, grad_scale: 32.0 2023-06-15 10:29:32,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=111693.33333333333, ans=0.125 2023-06-15 10:29:33,475 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.79 vs. limit=10.0 2023-06-15 10:30:00,477 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.47 vs. limit=15.0 2023-06-15 10:30:31,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=111893.33333333333, ans=0.0 2023-06-15 10:30:50,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=111960.0, ans=0.05 2023-06-15 10:30:52,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=112026.66666666667, ans=0.125 2023-06-15 10:30:53,806 INFO [train.py:988] (2/4) Epoch 32, batch 300, loss[loss=0.206, simple_loss=0.2919, pruned_loss=0.06004, over 19078.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2991, pruned_loss=0.0714, over 2935479.42 frames. 
], batch size: 89, lr: 1.02e-02, grad_scale: 32.0 2023-06-15 10:31:00,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=112026.66666666667, ans=15.0 2023-06-15 10:31:18,120 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.817e+02 2.017e+02 2.252e+02 3.365e+02, threshold=4.033e+02, percent-clipped=0.0 2023-06-15 10:31:53,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=112226.66666666667, ans=0.1 2023-06-15 10:31:58,520 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.06 vs. limit=22.5 2023-06-15 10:32:00,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=112293.33333333333, ans=0.2 2023-06-15 10:32:20,598 INFO [train.py:988] (2/4) Epoch 32, batch 350, loss[loss=0.2381, simple_loss=0.3046, pruned_loss=0.0858, over 20703.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2969, pruned_loss=0.07109, over 3140161.63 frames. ], batch size: 211, lr: 1.02e-02, grad_scale: 32.0 2023-06-15 10:33:14,125 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.81 vs. limit=6.0 2023-06-15 10:33:20,881 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.28 vs. limit=15.0 2023-06-15 10:33:45,054 INFO [train.py:988] (2/4) Epoch 32, batch 400, loss[loss=0.2154, simple_loss=0.2988, pruned_loss=0.06597, over 19667.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2971, pruned_loss=0.07161, over 3280145.88 frames. ], batch size: 110, lr: 1.02e-02, grad_scale: 32.0 2023-06-15 10:33:45,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=112693.33333333333, ans=0.1 2023-06-15 10:34:01,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=112760.0, ans=0.0 2023-06-15 10:34:09,525 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.921e+02 2.168e+02 2.475e+02 4.297e+02, threshold=4.337e+02, percent-clipped=1.0 2023-06-15 10:35:04,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=112960.0, ans=0.2 2023-06-15 10:35:11,180 INFO [train.py:988] (2/4) Epoch 32, batch 450, loss[loss=0.2237, simple_loss=0.315, pruned_loss=0.06616, over 18267.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.297, pruned_loss=0.07172, over 3394904.76 frames. ], batch size: 74, lr: 1.02e-02, grad_scale: 32.0 2023-06-15 10:35:32,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=113093.33333333333, ans=0.125 2023-06-15 10:35:34,628 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 10:35:38,792 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.79 vs. 
limit=12.0 2023-06-15 10:36:26,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=113293.33333333333, ans=0.125 2023-06-15 10:36:36,123 INFO [train.py:988] (2/4) Epoch 32, batch 500, loss[loss=0.2093, simple_loss=0.2963, pruned_loss=0.06113, over 18784.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2969, pruned_loss=0.07197, over 3477080.15 frames. ], batch size: 83, lr: 1.01e-02, grad_scale: 32.0 2023-06-15 10:36:46,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=113360.0, ans=0.2 2023-06-15 10:36:49,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=113360.0, ans=0.1 2023-06-15 10:36:59,365 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.786e+02 2.074e+02 2.314e+02 3.487e+02, threshold=4.148e+02, percent-clipped=0.0 2023-06-15 10:37:01,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=113426.66666666667, ans=0.1 2023-06-15 10:37:45,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=113573.33333333333, ans=0.125 2023-06-15 10:37:52,774 INFO [train.py:988] (2/4) Epoch 33, batch 0, loss[loss=0.242, simple_loss=0.3082, pruned_loss=0.08791, over 20483.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3082, pruned_loss=0.08791, over 20483.00 frames. ], batch size: 160, lr: 9.98e-03, grad_scale: 32.0 2023-06-15 10:37:52,774 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 10:37:58,963 INFO [train.py:1020] (2/4) Epoch 33, validation: loss=0.2021, simple_loss=0.3035, pruned_loss=0.05038, over 143649.00 frames. 2023-06-15 10:37:58,963 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 10:38:33,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=113706.66666666667, ans=0.07 2023-06-15 10:39:02,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=113773.33333333333, ans=0.1 2023-06-15 10:39:13,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=113840.0, ans=0.125 2023-06-15 10:39:26,783 INFO [train.py:988] (2/4) Epoch 33, batch 50, loss[loss=0.2462, simple_loss=0.2775, pruned_loss=0.1075, over 16934.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2988, pruned_loss=0.07351, over 865076.51 frames. 
], batch size: 391, lr: 9.96e-03, grad_scale: 32.0 2023-06-15 10:39:27,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=113906.66666666667, ans=0.09899494936611666 2023-06-15 10:39:30,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=113906.66666666667, ans=0.2 2023-06-15 10:39:32,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=113906.66666666667, ans=0.0 2023-06-15 10:40:02,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=114040.0, ans=0.0 2023-06-15 10:40:03,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=114040.0, ans=0.125 2023-06-15 10:40:15,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=114040.0, ans=0.125 2023-06-15 10:40:22,229 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.812e+02 2.020e+02 2.321e+02 4.264e+02, threshold=4.041e+02, percent-clipped=1.0 2023-06-15 10:40:53,378 INFO [train.py:988] (2/4) Epoch 33, batch 100, loss[loss=0.2306, simple_loss=0.3042, pruned_loss=0.07853, over 19098.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2995, pruned_loss=0.07266, over 1483755.03 frames. ], batch size: 94, lr: 9.95e-03, grad_scale: 32.0 2023-06-15 10:41:13,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=114306.66666666667, ans=0.125 2023-06-15 10:41:14,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=114306.66666666667, ans=0.125 2023-06-15 10:41:19,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=114306.66666666667, ans=0.0 2023-06-15 10:41:58,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=114440.0, ans=0.0 2023-06-15 10:42:06,683 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.25 vs. limit=15.0 2023-06-15 10:42:10,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=114506.66666666667, ans=0.125 2023-06-15 10:42:19,604 INFO [train.py:988] (2/4) Epoch 33, batch 150, loss[loss=0.2084, simple_loss=0.2939, pruned_loss=0.06144, over 18260.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2985, pruned_loss=0.07169, over 1977768.83 frames. 
], batch size: 74, lr: 9.94e-03, grad_scale: 32.0 2023-06-15 10:42:33,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=114573.33333333333, ans=0.0 2023-06-15 10:42:44,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=114640.0, ans=0.07 2023-06-15 10:42:47,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=114640.0, ans=0.125 2023-06-15 10:42:49,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=114640.0, ans=0.125 2023-06-15 10:43:00,347 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.68 vs. limit=15.0 2023-06-15 10:43:10,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=114773.33333333333, ans=0.2 2023-06-15 10:43:14,640 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.854e+02 2.036e+02 2.409e+02 3.930e+02, threshold=4.072e+02, percent-clipped=0.0 2023-06-15 10:43:32,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=114840.0, ans=0.2 2023-06-15 10:43:33,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=114840.0, ans=0.0 2023-06-15 10:43:41,495 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.14 vs. limit=15.0 2023-06-15 10:43:45,582 INFO [train.py:988] (2/4) Epoch 33, batch 200, loss[loss=0.2118, simple_loss=0.2937, pruned_loss=0.06501, over 18620.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.299, pruned_loss=0.07137, over 2373202.23 frames. ], batch size: 80, lr: 9.93e-03, grad_scale: 32.0 2023-06-15 10:44:09,020 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.25 vs. limit=22.5 2023-06-15 10:44:24,464 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.89 vs. limit=15.0 2023-06-15 10:44:31,283 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 10:44:50,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=115106.66666666667, ans=0.2 2023-06-15 10:45:08,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=115173.33333333333, ans=0.125 2023-06-15 10:45:11,582 INFO [train.py:988] (2/4) Epoch 33, batch 250, loss[loss=0.2067, simple_loss=0.2865, pruned_loss=0.06344, over 19324.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2977, pruned_loss=0.07057, over 2689651.08 frames. 
], batch size: 98, lr: 9.92e-03, grad_scale: 32.0 2023-06-15 10:45:39,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=115306.66666666667, ans=0.125 2023-06-15 10:45:39,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=115306.66666666667, ans=0.125 2023-06-15 10:45:46,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=115373.33333333333, ans=10.0 2023-06-15 10:45:48,632 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.91 vs. limit=12.0 2023-06-15 10:46:06,694 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.758e+02 1.982e+02 2.404e+02 3.997e+02, threshold=3.964e+02, percent-clipped=0.0 2023-06-15 10:46:07,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=115440.0, ans=0.2 2023-06-15 10:46:37,243 INFO [train.py:988] (2/4) Epoch 33, batch 300, loss[loss=0.2205, simple_loss=0.2905, pruned_loss=0.07525, over 20212.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2968, pruned_loss=0.07078, over 2943852.73 frames. ], batch size: 239, lr: 9.90e-03, grad_scale: 32.0 2023-06-15 10:46:54,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=115640.0, ans=0.125 2023-06-15 10:46:56,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=115640.0, ans=0.125 2023-06-15 10:46:57,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=115640.0, ans=0.0 2023-06-15 10:47:16,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=115706.66666666667, ans=0.125 2023-06-15 10:48:02,182 INFO [train.py:988] (2/4) Epoch 33, batch 350, loss[loss=0.2214, simple_loss=0.2669, pruned_loss=0.08798, over 16717.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2964, pruned_loss=0.0706, over 3120673.75 frames. ], batch size: 391, lr: 9.89e-03, grad_scale: 32.0 2023-06-15 10:48:25,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=115973.33333333333, ans=0.125 2023-06-15 10:48:27,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=115973.33333333333, ans=0.1 2023-06-15 10:48:44,945 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.83 vs. 
limit=22.5 2023-06-15 10:48:57,396 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.859e+02 2.087e+02 2.471e+02 4.224e+02, threshold=4.174e+02, percent-clipped=1.0 2023-06-15 10:49:20,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=116173.33333333333, ans=0.1 2023-06-15 10:49:23,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=116173.33333333333, ans=0.125 2023-06-15 10:49:28,268 INFO [train.py:988] (2/4) Epoch 33, batch 400, loss[loss=0.2131, simple_loss=0.2958, pruned_loss=0.06517, over 19214.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2962, pruned_loss=0.06991, over 3268537.88 frames. ], batch size: 92, lr: 9.88e-03, grad_scale: 32.0 2023-06-15 10:49:42,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=116240.0, ans=0.125 2023-06-15 10:49:43,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=116240.0, ans=0.09899494936611666 2023-06-15 10:49:48,498 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.54 vs. limit=6.0 2023-06-15 10:50:07,965 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.91 vs. limit=6.0 2023-06-15 10:50:09,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=116373.33333333333, ans=0.125 2023-06-15 10:50:17,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=116373.33333333333, ans=0.0 2023-06-15 10:50:37,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=116506.66666666667, ans=0.125 2023-06-15 10:50:51,316 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 10:50:54,723 INFO [train.py:988] (2/4) Epoch 33, batch 450, loss[loss=0.2101, simple_loss=0.3042, pruned_loss=0.05805, over 17028.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.297, pruned_loss=0.06982, over 3380712.77 frames. ], batch size: 60, lr: 9.87e-03, grad_scale: 32.0 2023-06-15 10:51:05,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=116573.33333333333, ans=0.0 2023-06-15 10:51:14,989 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.64 vs. limit=15.0 2023-06-15 10:51:17,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=116640.0, ans=0.125 2023-06-15 10:51:50,041 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.786e+02 1.951e+02 2.072e+02 2.874e+02, threshold=3.901e+02, percent-clipped=0.0 2023-06-15 10:52:03,920 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.32 vs. 
limit=22.5 2023-06-15 10:52:14,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=116840.0, ans=0.125 2023-06-15 10:52:17,731 INFO [train.py:988] (2/4) Epoch 33, batch 500, loss[loss=0.2096, simple_loss=0.2947, pruned_loss=0.06227, over 19128.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2972, pruned_loss=0.06974, over 3470806.68 frames. ], batch size: 94, lr: 9.86e-03, grad_scale: 32.0 2023-06-15 10:52:23,817 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.78 vs. limit=15.0 2023-06-15 10:52:26,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=116906.66666666667, ans=0.0 2023-06-15 10:52:54,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=117040.0, ans=0.1 2023-06-15 10:53:34,838 INFO [train.py:988] (2/4) Epoch 34, batch 0, loss[loss=0.2123, simple_loss=0.2959, pruned_loss=0.06438, over 18937.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2959, pruned_loss=0.06438, over 18937.00 frames. ], batch size: 86, lr: 9.70e-03, grad_scale: 32.0 2023-06-15 10:53:34,839 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 10:53:41,156 INFO [train.py:1020] (2/4) Epoch 34, validation: loss=0.2011, simple_loss=0.3024, pruned_loss=0.04991, over 143649.00 frames. 2023-06-15 10:53:41,157 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 10:54:01,841 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.06 vs. limit=15.0 2023-06-15 10:54:08,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=117193.33333333333, ans=0.0 2023-06-15 10:54:40,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.whiten.whitening_limit, batch_count=117326.66666666667, ans=12.0 2023-06-15 10:54:51,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=117393.33333333333, ans=0.2 2023-06-15 10:55:10,953 INFO [train.py:988] (2/4) Epoch 34, batch 50, loss[loss=0.2266, simple_loss=0.2608, pruned_loss=0.09618, over 17114.00 frames. ], tot_loss[loss=0.215, simple_loss=0.291, pruned_loss=0.06945, over 866894.00 frames. 
], batch size: 391, lr: 9.69e-03, grad_scale: 16.0 2023-06-15 10:55:12,584 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.849e+02 2.100e+02 2.383e+02 3.120e+02, threshold=4.200e+02, percent-clipped=0.0 2023-06-15 10:55:40,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=117526.66666666667, ans=0.0 2023-06-15 10:56:00,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=117593.33333333333, ans=0.125 2023-06-15 10:56:02,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=117593.33333333333, ans=10.0 2023-06-15 10:56:04,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=117660.0, ans=0.0 2023-06-15 10:56:40,478 INFO [train.py:988] (2/4) Epoch 34, batch 100, loss[loss=0.2184, simple_loss=0.2853, pruned_loss=0.07575, over 20699.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2921, pruned_loss=0.07049, over 1528971.13 frames. ], batch size: 211, lr: 9.68e-03, grad_scale: 16.0 2023-06-15 10:56:44,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=117793.33333333333, ans=0.2 2023-06-15 10:56:54,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=117793.33333333333, ans=0.2 2023-06-15 10:56:56,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=117793.33333333333, ans=0.2 2023-06-15 10:57:08,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=117860.0, ans=0.125 2023-06-15 10:57:54,253 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 10:58:01,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=118060.0, ans=0.0 2023-06-15 10:58:02,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=118060.0, ans=0.125 2023-06-15 10:58:09,688 INFO [train.py:988] (2/4) Epoch 34, batch 150, loss[loss=0.2265, simple_loss=0.3036, pruned_loss=0.07469, over 19968.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2936, pruned_loss=0.07011, over 2031717.28 frames. ], batch size: 126, lr: 9.67e-03, grad_scale: 16.0 2023-06-15 10:58:11,356 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.765e+02 1.971e+02 2.282e+02 3.738e+02, threshold=3.942e+02, percent-clipped=0.0 2023-06-15 10:58:55,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=118260.0, ans=0.125 2023-06-15 10:59:18,608 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.12 vs. limit=15.0 2023-06-15 10:59:31,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=118393.33333333333, ans=22.5 2023-06-15 10:59:37,552 INFO [train.py:988] (2/4) Epoch 34, batch 200, loss[loss=0.2201, simple_loss=0.3174, pruned_loss=0.06147, over 18326.00 frames. 
], tot_loss[loss=0.2176, simple_loss=0.295, pruned_loss=0.07014, over 2398033.04 frames. ], batch size: 72, lr: 9.65e-03, grad_scale: 16.0 2023-06-15 10:59:43,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=118460.0, ans=0.125 2023-06-15 10:59:49,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=118460.0, ans=0.0 2023-06-15 10:59:55,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=118526.66666666667, ans=0.1 2023-06-15 11:00:05,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=118526.66666666667, ans=0.125 2023-06-15 11:00:19,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=118593.33333333333, ans=0.2 2023-06-15 11:00:42,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=118660.0, ans=0.0 2023-06-15 11:00:47,837 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.85 vs. limit=15.0 2023-06-15 11:00:58,192 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.19 vs. limit=15.0 2023-06-15 11:01:01,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=118726.66666666667, ans=0.0 2023-06-15 11:01:07,014 INFO [train.py:988] (2/4) Epoch 34, batch 250, loss[loss=0.2406, simple_loss=0.3305, pruned_loss=0.07539, over 15165.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2946, pruned_loss=0.06969, over 2699011.45 frames. ], batch size: 43, lr: 9.64e-03, grad_scale: 16.0 2023-06-15 11:01:09,081 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.911e+02 2.159e+02 2.494e+02 3.715e+02, threshold=4.319e+02, percent-clipped=0.0 2023-06-15 11:01:18,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=118793.33333333333, ans=0.0 2023-06-15 11:01:29,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=118860.0, ans=0.125 2023-06-15 11:01:38,453 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.53 vs. limit=6.0 2023-06-15 11:01:51,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=118926.66666666667, ans=0.125 2023-06-15 11:02:18,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=119060.0, ans=0.125 2023-06-15 11:02:28,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=119060.0, ans=0.125 2023-06-15 11:02:33,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=119126.66666666667, ans=0.0 2023-06-15 11:02:34,414 INFO [train.py:988] (2/4) Epoch 34, batch 300, loss[loss=0.2461, simple_loss=0.3293, pruned_loss=0.08142, over 17637.00 frames. 
], tot_loss[loss=0.2178, simple_loss=0.2953, pruned_loss=0.07016, over 2925611.28 frames. ], batch size: 67, lr: 9.63e-03, grad_scale: 16.0 2023-06-15 11:02:49,979 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.82 vs. limit=6.0 2023-06-15 11:03:22,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=119260.0, ans=0.125 2023-06-15 11:03:53,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=119393.33333333333, ans=0.1 2023-06-15 11:04:03,801 INFO [train.py:988] (2/4) Epoch 34, batch 350, loss[loss=0.2099, simple_loss=0.2919, pruned_loss=0.06394, over 18633.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2945, pruned_loss=0.06957, over 3132779.85 frames. ], batch size: 80, lr: 9.62e-03, grad_scale: 16.0 2023-06-15 11:04:05,495 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.557e+02 1.947e+02 2.157e+02 2.600e+02 3.674e+02, threshold=4.313e+02, percent-clipped=0.0 2023-06-15 11:04:13,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=119460.0, ans=0.125 2023-06-15 11:04:21,506 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.22 vs. limit=15.0 2023-06-15 11:04:52,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=119593.33333333333, ans=0.125 2023-06-15 11:05:34,493 INFO [train.py:988] (2/4) Epoch 34, batch 400, loss[loss=0.2346, simple_loss=0.3067, pruned_loss=0.08124, over 20247.00 frames. ], tot_loss[loss=0.217, simple_loss=0.295, pruned_loss=0.06947, over 3286075.59 frames. ], batch size: 141, lr: 9.61e-03, grad_scale: 32.0 2023-06-15 11:05:50,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=119860.0, ans=0.0 2023-06-15 11:06:17,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=119926.66666666667, ans=10.0 2023-06-15 11:06:21,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=119926.66666666667, ans=0.0 2023-06-15 11:06:41,774 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.99 vs. limit=6.0 2023-06-15 11:06:43,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=119993.33333333333, ans=0.05 2023-06-15 11:06:46,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=120060.0, ans=0.2 2023-06-15 11:07:03,752 INFO [train.py:988] (2/4) Epoch 34, batch 450, loss[loss=0.2223, simple_loss=0.2995, pruned_loss=0.07253, over 18764.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2948, pruned_loss=0.06937, over 3412023.40 frames. 
], batch size: 83, lr: 9.60e-03, grad_scale: 32.0 2023-06-15 11:07:05,368 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.842e+02 2.161e+02 2.491e+02 3.686e+02, threshold=4.322e+02, percent-clipped=0.0 2023-06-15 11:07:05,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=120126.66666666667, ans=0.125 2023-06-15 11:07:35,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=120193.33333333333, ans=0.125 2023-06-15 11:07:54,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=120326.66666666667, ans=0.0 2023-06-15 11:08:21,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=120393.33333333333, ans=0.125 2023-06-15 11:08:29,417 INFO [train.py:988] (2/4) Epoch 34, batch 500, loss[loss=0.2221, simple_loss=0.2964, pruned_loss=0.07393, over 19964.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2953, pruned_loss=0.06941, over 3494234.84 frames. ], batch size: 126, lr: 9.59e-03, grad_scale: 32.0 2023-06-15 11:08:48,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=120526.66666666667, ans=0.0 2023-06-15 11:09:04,847 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-15 11:09:06,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=120593.33333333333, ans=0.05 2023-06-15 11:09:43,651 INFO [train.py:988] (2/4) Epoch 35, batch 0, loss[loss=0.2281, simple_loss=0.3026, pruned_loss=0.0768, over 19203.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3026, pruned_loss=0.0768, over 19203.00 frames. ], batch size: 92, lr: 9.44e-03, grad_scale: 32.0 2023-06-15 11:09:43,652 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 11:09:49,768 INFO [train.py:1020] (2/4) Epoch 35, validation: loss=0.2016, simple_loss=0.3016, pruned_loss=0.05077, over 143649.00 frames. 2023-06-15 11:09:49,768 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 11:09:59,798 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.00 vs. limit=6.0 2023-06-15 11:10:12,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=120746.66666666667, ans=0.0 2023-06-15 11:10:21,596 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.525e+02 1.805e+02 2.012e+02 2.315e+02 3.975e+02, threshold=4.025e+02, percent-clipped=0.0 2023-06-15 11:10:43,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=120880.0, ans=0.0 2023-06-15 11:10:49,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=120880.0, ans=0.125 2023-06-15 11:10:56,586 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.16 vs. 
limit=22.5 2023-06-15 11:11:18,995 INFO [train.py:988] (2/4) Epoch 35, batch 50, loss[loss=0.2073, simple_loss=0.2875, pruned_loss=0.06355, over 18647.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2937, pruned_loss=0.06677, over 867742.72 frames. ], batch size: 80, lr: 9.43e-03, grad_scale: 32.0 2023-06-15 11:11:54,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=121146.66666666667, ans=0.2 2023-06-15 11:12:15,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=121213.33333333333, ans=0.125 2023-06-15 11:12:36,088 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.69 vs. limit=6.0 2023-06-15 11:12:44,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=121280.0, ans=0.125 2023-06-15 11:12:47,370 INFO [train.py:988] (2/4) Epoch 35, batch 100, loss[loss=0.2118, simple_loss=0.2966, pruned_loss=0.06352, over 18616.00 frames. ], tot_loss[loss=0.215, simple_loss=0.295, pruned_loss=0.06747, over 1519402.58 frames. ], batch size: 80, lr: 9.42e-03, grad_scale: 32.0 2023-06-15 11:12:53,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=121346.66666666667, ans=0.125 2023-06-15 11:13:06,129 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.08 vs. limit=15.0 2023-06-15 11:13:07,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=121413.33333333333, ans=0.09899494936611666 2023-06-15 11:13:14,766 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.69 vs. limit=22.5 2023-06-15 11:13:18,556 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.621e+02 1.883e+02 2.096e+02 2.428e+02 4.337e+02, threshold=4.193e+02, percent-clipped=1.0 2023-06-15 11:13:48,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=121546.66666666667, ans=0.015 2023-06-15 11:14:15,095 INFO [train.py:988] (2/4) Epoch 35, batch 150, loss[loss=0.2221, simple_loss=0.2893, pruned_loss=0.07749, over 20337.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2959, pruned_loss=0.06876, over 2031940.88 frames. ], batch size: 240, lr: 9.41e-03, grad_scale: 32.0 2023-06-15 11:14:43,794 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.86 vs. limit=15.0 2023-06-15 11:14:46,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=121746.66666666667, ans=0.0 2023-06-15 11:14:46,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=121746.66666666667, ans=0.0 2023-06-15 11:15:07,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=121880.0, ans=0.125 2023-06-15 11:15:09,016 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. 
limit=6.0 2023-06-15 11:15:19,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=121880.0, ans=0.125 2023-06-15 11:15:19,795 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.05 vs. limit=22.5 2023-06-15 11:15:43,554 INFO [train.py:988] (2/4) Epoch 35, batch 200, loss[loss=0.2229, simple_loss=0.291, pruned_loss=0.07735, over 20701.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2937, pruned_loss=0.06897, over 2438735.91 frames. ], batch size: 211, lr: 9.40e-03, grad_scale: 32.0 2023-06-15 11:16:01,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=122080.0, ans=0.0 2023-06-15 11:16:02,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=122080.0, ans=0.125 2023-06-15 11:16:13,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=122080.0, ans=0.1 2023-06-15 11:16:14,788 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.845e+02 2.048e+02 2.405e+02 3.914e+02, threshold=4.095e+02, percent-clipped=0.0 2023-06-15 11:16:20,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=122146.66666666667, ans=0.0 2023-06-15 11:16:43,092 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.95 vs. limit=15.0 2023-06-15 11:17:00,276 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.88 vs. limit=22.5 2023-06-15 11:17:06,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=122280.0, ans=0.2 2023-06-15 11:17:09,619 INFO [train.py:988] (2/4) Epoch 35, batch 250, loss[loss=0.2176, simple_loss=0.2823, pruned_loss=0.07649, over 20410.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2941, pruned_loss=0.0688, over 2745681.15 frames. ], batch size: 239, lr: 9.38e-03, grad_scale: 32.0 2023-06-15 11:17:09,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=122346.66666666667, ans=0.0 2023-06-15 11:17:31,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=122413.33333333333, ans=0.125 2023-06-15 11:17:36,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=122413.33333333333, ans=0.2 2023-06-15 11:17:38,999 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.88 vs. limit=15.0 2023-06-15 11:17:52,484 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.43 vs. 
limit=15.0 2023-06-15 11:17:55,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=122480.0, ans=0.125 2023-06-15 11:17:56,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=122480.0, ans=0.1 2023-06-15 11:18:06,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=122546.66666666667, ans=0.0 2023-06-15 11:18:07,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=122546.66666666667, ans=0.125 2023-06-15 11:18:14,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=122546.66666666667, ans=0.1 2023-06-15 11:18:27,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=122613.33333333333, ans=0.125 2023-06-15 11:18:36,371 INFO [train.py:988] (2/4) Epoch 35, batch 300, loss[loss=0.2064, simple_loss=0.2841, pruned_loss=0.06433, over 18760.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2939, pruned_loss=0.0682, over 2973868.92 frames. ], batch size: 83, lr: 9.37e-03, grad_scale: 32.0 2023-06-15 11:19:02,686 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 11:19:06,916 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.736e+02 1.889e+02 2.139e+02 2.972e+02, threshold=3.778e+02, percent-clipped=0.0 2023-06-15 11:19:35,821 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.64 vs. limit=12.0 2023-06-15 11:19:39,171 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.71 vs. limit=15.0 2023-06-15 11:19:51,366 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.94 vs. limit=15.0 2023-06-15 11:19:52,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=122946.66666666667, ans=0.125 2023-06-15 11:19:57,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=122946.66666666667, ans=0.1 2023-06-15 11:20:01,984 INFO [train.py:988] (2/4) Epoch 35, batch 350, loss[loss=0.212, simple_loss=0.2907, pruned_loss=0.06661, over 20325.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2938, pruned_loss=0.06824, over 3165916.28 frames. ], batch size: 149, lr: 9.36e-03, grad_scale: 32.0 2023-06-15 11:21:08,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=123213.33333333333, ans=0.1 2023-06-15 11:21:14,862 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0 2023-06-15 11:21:28,656 INFO [train.py:988] (2/4) Epoch 35, batch 400, loss[loss=0.2153, simple_loss=0.2964, pruned_loss=0.0671, over 19674.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2946, pruned_loss=0.06812, over 3290140.06 frames. 
], batch size: 110, lr: 9.35e-03, grad_scale: 32.0 2023-06-15 11:21:59,497 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.852e+02 2.101e+02 2.531e+02 3.269e+02, threshold=4.203e+02, percent-clipped=0.0 2023-06-15 11:22:03,034 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.44 vs. limit=15.0 2023-06-15 11:22:05,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=123480.0, ans=0.125 2023-06-15 11:22:10,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=123480.0, ans=0.125 2023-06-15 11:22:53,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=123680.0, ans=0.125 2023-06-15 11:22:54,575 INFO [train.py:988] (2/4) Epoch 35, batch 450, loss[loss=0.2215, simple_loss=0.2917, pruned_loss=0.07562, over 20613.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2945, pruned_loss=0.06763, over 3411127.67 frames. ], batch size: 189, lr: 9.34e-03, grad_scale: 32.0 2023-06-15 11:23:03,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=123680.0, ans=0.125 2023-06-15 11:23:06,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=123680.0, ans=0.125 2023-06-15 11:23:21,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=123746.66666666667, ans=0.1 2023-06-15 11:23:49,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=123880.0, ans=0.1 2023-06-15 11:24:12,142 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.89 vs. limit=22.5 2023-06-15 11:24:18,938 INFO [train.py:988] (2/4) Epoch 35, batch 500, loss[loss=0.2244, simple_loss=0.2906, pruned_loss=0.07915, over 20209.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2941, pruned_loss=0.06795, over 3492381.47 frames. ], batch size: 239, lr: 9.33e-03, grad_scale: 32.0 2023-06-15 11:24:22,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=124013.33333333333, ans=0.0 2023-06-15 11:24:35,629 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.29 vs. limit=15.0 2023-06-15 11:24:47,761 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.833e+02 2.002e+02 2.181e+02 2.864e+02, threshold=4.004e+02, percent-clipped=0.0 2023-06-15 11:24:48,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=124080.0, ans=0.125 2023-06-15 11:24:54,718 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 11:25:33,808 INFO [train.py:988] (2/4) Epoch 36, batch 0, loss[loss=0.2081, simple_loss=0.2951, pruned_loss=0.06057, over 18287.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2951, pruned_loss=0.06057, over 18287.00 frames. 
], batch size: 74, lr: 9.19e-03, grad_scale: 32.0 2023-06-15 11:25:33,808 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 11:25:39,896 INFO [train.py:1020] (2/4) Epoch 36, validation: loss=0.2014, simple_loss=0.3017, pruned_loss=0.05055, over 143649.00 frames. 2023-06-15 11:25:39,897 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 11:26:01,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=124293.33333333333, ans=0.1 2023-06-15 11:26:08,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=124293.33333333333, ans=0.125 2023-06-15 11:26:16,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=124360.0, ans=0.1 2023-06-15 11:26:54,560 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.21 vs. limit=15.0 2023-06-15 11:27:05,065 INFO [train.py:988] (2/4) Epoch 36, batch 50, loss[loss=0.2233, simple_loss=0.2946, pruned_loss=0.07603, over 20124.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2936, pruned_loss=0.0682, over 852659.92 frames. ], batch size: 133, lr: 9.18e-03, grad_scale: 32.0 2023-06-15 11:27:27,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=124626.66666666667, ans=0.125 2023-06-15 11:27:30,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=124626.66666666667, ans=0.0 2023-06-15 11:27:41,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=124693.33333333333, ans=0.0 2023-06-15 11:28:05,051 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.02 vs. limit=15.0 2023-06-15 11:28:07,141 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.578e+02 1.827e+02 2.009e+02 2.333e+02 3.474e+02, threshold=4.018e+02, percent-clipped=0.0 2023-06-15 11:28:14,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=124826.66666666667, ans=0.0 2023-06-15 11:28:23,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=124826.66666666667, ans=0.2 2023-06-15 11:28:31,240 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.96 vs. limit=15.0 2023-06-15 11:28:31,564 INFO [train.py:988] (2/4) Epoch 36, batch 100, loss[loss=0.2081, simple_loss=0.294, pruned_loss=0.06114, over 18627.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.292, pruned_loss=0.06711, over 1515053.70 frames. ], batch size: 80, lr: 9.17e-03, grad_scale: 32.0 2023-06-15 11:28:44,825 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.53 vs. 
limit=15.0 2023-06-15 11:29:02,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=124960.0, ans=0.125 2023-06-15 11:29:16,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=125026.66666666667, ans=0.125 2023-06-15 11:29:20,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=125026.66666666667, ans=0.2 2023-06-15 11:29:21,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=125093.33333333333, ans=0.0 2023-06-15 11:29:22,170 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=22.5 2023-06-15 11:29:38,245 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.92 vs. limit=12.0 2023-06-15 11:29:41,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=125160.0, ans=0.0 2023-06-15 11:29:46,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=125160.0, ans=0.2 2023-06-15 11:29:54,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=125160.0, ans=0.5 2023-06-15 11:29:58,758 INFO [train.py:988] (2/4) Epoch 36, batch 150, loss[loss=0.2031, simple_loss=0.2818, pruned_loss=0.06221, over 19050.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2927, pruned_loss=0.06721, over 2023715.44 frames. ], batch size: 89, lr: 9.16e-03, grad_scale: 16.0 2023-06-15 11:30:01,238 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.90 vs. limit=10.0 2023-06-15 11:30:13,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=125226.66666666667, ans=0.1 2023-06-15 11:30:14,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=125293.33333333333, ans=0.05 2023-06-15 11:30:16,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=125293.33333333333, ans=0.125 2023-06-15 11:30:19,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=125293.33333333333, ans=0.1 2023-06-15 11:31:02,870 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.954e+02 2.200e+02 2.726e+02 5.615e+02, threshold=4.401e+02, percent-clipped=3.0 2023-06-15 11:31:25,660 INFO [train.py:988] (2/4) Epoch 36, batch 200, loss[loss=0.1988, simple_loss=0.2827, pruned_loss=0.05747, over 19689.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2921, pruned_loss=0.06703, over 2420168.87 frames. 
], batch size: 110, lr: 9.15e-03, grad_scale: 16.0 2023-06-15 11:31:34,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=125560.0, ans=0.125 2023-06-15 11:32:32,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=125826.66666666667, ans=0.125 2023-06-15 11:32:52,603 INFO [train.py:988] (2/4) Epoch 36, batch 250, loss[loss=0.209, simple_loss=0.3039, pruned_loss=0.0571, over 18340.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2921, pruned_loss=0.06654, over 2722853.64 frames. ], batch size: 72, lr: 9.14e-03, grad_scale: 16.0 2023-06-15 11:32:56,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=125893.33333333333, ans=0.04949747468305833 2023-06-15 11:33:56,228 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.785e+02 1.964e+02 2.188e+02 3.452e+02, threshold=3.927e+02, percent-clipped=0.0 2023-06-15 11:34:18,429 INFO [train.py:988] (2/4) Epoch 36, batch 300, loss[loss=0.2129, simple_loss=0.2927, pruned_loss=0.06651, over 19129.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2915, pruned_loss=0.06651, over 2973977.28 frames. ], batch size: 94, lr: 9.13e-03, grad_scale: 16.0 2023-06-15 11:34:44,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=126293.33333333333, ans=0.125 2023-06-15 11:34:51,273 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.78 vs. limit=15.0 2023-06-15 11:35:04,304 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.24 vs. limit=15.0 2023-06-15 11:35:14,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=126426.66666666667, ans=0.0 2023-06-15 11:35:15,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=126426.66666666667, ans=0.125 2023-06-15 11:35:19,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=126426.66666666667, ans=0.2 2023-06-15 11:35:39,378 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.79 vs. limit=22.5 2023-06-15 11:35:45,204 INFO [train.py:988] (2/4) Epoch 36, batch 350, loss[loss=0.2045, simple_loss=0.2772, pruned_loss=0.0659, over 20099.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2914, pruned_loss=0.06614, over 3163210.84 frames. ], batch size: 133, lr: 9.12e-03, grad_scale: 16.0 2023-06-15 11:36:50,182 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.822e+02 2.085e+02 2.267e+02 3.723e+02, threshold=4.169e+02, percent-clipped=0.0 2023-06-15 11:36:52,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=126760.0, ans=0.125 2023-06-15 11:37:13,511 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.63 vs. 
limit=15.0 2023-06-15 11:37:13,988 INFO [train.py:988] (2/4) Epoch 36, batch 400, loss[loss=0.1924, simple_loss=0.2751, pruned_loss=0.05486, over 19697.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2906, pruned_loss=0.06572, over 3322765.78 frames. ], batch size: 110, lr: 9.11e-03, grad_scale: 32.0 2023-06-15 11:37:43,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=126960.0, ans=0.0 2023-06-15 11:37:53,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=127026.66666666667, ans=0.1 2023-06-15 11:38:02,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=127026.66666666667, ans=0.125 2023-06-15 11:38:35,498 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.92 vs. limit=15.0 2023-06-15 11:38:40,678 INFO [train.py:988] (2/4) Epoch 36, batch 450, loss[loss=0.2255, simple_loss=0.3025, pruned_loss=0.07426, over 18623.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2915, pruned_loss=0.06609, over 3435395.47 frames. ], batch size: 80, lr: 9.10e-03, grad_scale: 32.0 2023-06-15 11:39:01,851 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 11:39:15,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=127360.0, ans=10.0 2023-06-15 11:39:20,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=127360.0, ans=0.125 2023-06-15 11:39:27,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=127360.0, ans=0.5 2023-06-15 11:39:30,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=127426.66666666667, ans=0.2 2023-06-15 11:39:39,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=127426.66666666667, ans=0.2 2023-06-15 11:39:43,804 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.789e+02 1.972e+02 2.291e+02 3.379e+02, threshold=3.945e+02, percent-clipped=0.0 2023-06-15 11:39:52,774 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.60 vs. limit=15.0 2023-06-15 11:40:05,543 INFO [train.py:988] (2/4) Epoch 36, batch 500, loss[loss=0.1912, simple_loss=0.2771, pruned_loss=0.05267, over 18781.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2923, pruned_loss=0.06628, over 3515707.50 frames. 
], batch size: 83, lr: 9.09e-03, grad_scale: 32.0 2023-06-15 11:40:17,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=127560.0, ans=0.1 2023-06-15 11:40:33,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=127626.66666666667, ans=0.125 2023-06-15 11:40:39,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=127693.33333333333, ans=0.125 2023-06-15 11:40:49,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=127693.33333333333, ans=0.0 2023-06-15 11:41:22,927 INFO [train.py:988] (2/4) Epoch 37, batch 0, loss[loss=0.24, simple_loss=0.2756, pruned_loss=0.1022, over 16921.00 frames. ], tot_loss[loss=0.24, simple_loss=0.2756, pruned_loss=0.1022, over 16921.00 frames. ], batch size: 392, lr: 8.96e-03, grad_scale: 32.0 2023-06-15 11:41:22,928 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 11:41:29,082 INFO [train.py:1020] (2/4) Epoch 37, validation: loss=0.2017, simple_loss=0.3019, pruned_loss=0.05073, over 143649.00 frames. 2023-06-15 11:41:29,083 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 11:41:34,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=127780.0, ans=0.125 2023-06-15 11:41:34,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=127780.0, ans=0.2 2023-06-15 11:41:42,877 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-15 11:41:55,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=127846.66666666667, ans=0.125 2023-06-15 11:42:00,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=127846.66666666667, ans=0.1 2023-06-15 11:42:03,553 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.94 vs. limit=22.5 2023-06-15 11:42:56,864 INFO [train.py:988] (2/4) Epoch 37, batch 50, loss[loss=0.2135, simple_loss=0.2997, pruned_loss=0.06363, over 18266.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2955, pruned_loss=0.06668, over 852870.15 frames. 
], batch size: 74, lr: 8.95e-03, grad_scale: 32.0 2023-06-15 11:43:04,216 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.676e+02 1.887e+02 2.171e+02 3.433e+02, threshold=3.773e+02, percent-clipped=0.0 2023-06-15 11:43:13,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=128180.0, ans=0.0 2023-06-15 11:43:22,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=128180.0, ans=0.1 2023-06-15 11:43:32,576 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 11:43:34,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=128246.66666666667, ans=0.1 2023-06-15 11:43:58,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=128313.33333333333, ans=0.2 2023-06-15 11:44:17,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=128380.0, ans=0.2 2023-06-15 11:44:19,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=128380.0, ans=0.1 2023-06-15 11:44:24,272 INFO [train.py:988] (2/4) Epoch 37, batch 100, loss[loss=0.239, simple_loss=0.3244, pruned_loss=0.07678, over 16971.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2935, pruned_loss=0.06541, over 1495444.01 frames. ], batch size: 60, lr: 8.94e-03, grad_scale: 32.0 2023-06-15 11:44:24,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=128446.66666666667, ans=0.2 2023-06-15 11:44:38,261 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.97 vs. limit=22.5 2023-06-15 11:44:40,041 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.94 vs. limit=22.5 2023-06-15 11:44:48,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=128513.33333333333, ans=0.125 2023-06-15 11:44:56,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=128513.33333333333, ans=0.0 2023-06-15 11:45:00,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=128580.0, ans=0.2 2023-06-15 11:45:19,741 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.51 vs. limit=15.0 2023-06-15 11:45:52,372 INFO [train.py:988] (2/4) Epoch 37, batch 150, loss[loss=0.2088, simple_loss=0.2785, pruned_loss=0.06955, over 20479.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2928, pruned_loss=0.0663, over 1997362.32 frames. 
], batch size: 189, lr: 8.93e-03, grad_scale: 32.0 2023-06-15 11:45:52,824 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 11:45:59,532 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 1.837e+02 2.114e+02 2.317e+02 3.549e+02, threshold=4.229e+02, percent-clipped=0.0 2023-06-15 11:46:44,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=128980.0, ans=0.125 2023-06-15 11:47:20,592 INFO [train.py:988] (2/4) Epoch 37, batch 200, loss[loss=0.223, simple_loss=0.3087, pruned_loss=0.06866, over 17630.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2937, pruned_loss=0.06689, over 2376374.68 frames. ], batch size: 67, lr: 8.92e-03, grad_scale: 32.0 2023-06-15 11:47:28,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=129113.33333333333, ans=0.125 2023-06-15 11:48:20,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=129313.33333333333, ans=0.5 2023-06-15 11:48:33,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=129380.0, ans=0.07 2023-06-15 11:48:48,307 INFO [train.py:988] (2/4) Epoch 37, batch 250, loss[loss=0.2126, simple_loss=0.2946, pruned_loss=0.06533, over 19218.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2923, pruned_loss=0.06591, over 2699926.33 frames. ], batch size: 92, lr: 8.91e-03, grad_scale: 32.0 2023-06-15 11:48:54,957 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.981e+02 2.345e+02 2.837e+02 3.921e+02, threshold=4.691e+02, percent-clipped=0.0 2023-06-15 11:48:59,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=129446.66666666667, ans=0.0 2023-06-15 11:49:12,723 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.64 vs. limit=12.0 2023-06-15 11:49:51,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=129646.66666666667, ans=0.125 2023-06-15 11:50:16,946 INFO [train.py:988] (2/4) Epoch 37, batch 300, loss[loss=0.2259, simple_loss=0.299, pruned_loss=0.07636, over 20104.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2924, pruned_loss=0.06681, over 2944559.40 frames. ], batch size: 133, lr: 8.90e-03, grad_scale: 32.0 2023-06-15 11:50:20,830 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 11:50:56,156 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.43 vs. 
limit=15.0 2023-06-15 11:50:56,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=129913.33333333333, ans=0.0 2023-06-15 11:51:07,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=129913.33333333333, ans=0.125 2023-06-15 11:51:11,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=129980.0, ans=0.2 2023-06-15 11:51:29,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=130046.66666666667, ans=0.125 2023-06-15 11:51:45,588 INFO [train.py:988] (2/4) Epoch 37, batch 350, loss[loss=0.2165, simple_loss=0.2972, pruned_loss=0.06791, over 18615.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2922, pruned_loss=0.06677, over 3126747.82 frames. ], batch size: 80, lr: 8.89e-03, grad_scale: 32.0 2023-06-15 11:51:49,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=130113.33333333333, ans=0.035 2023-06-15 11:51:52,282 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.921e+02 2.094e+02 2.447e+02 3.479e+02, threshold=4.189e+02, percent-clipped=0.0 2023-06-15 11:52:01,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=130180.0, ans=0.07 2023-06-15 11:52:13,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=130180.0, ans=0.125 2023-06-15 11:52:13,139 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 11:52:40,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=130313.33333333333, ans=0.125 2023-06-15 11:52:48,265 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.61 vs. limit=22.5 2023-06-15 11:53:03,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=130380.0, ans=0.125 2023-06-15 11:53:12,177 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.44 vs. limit=15.0 2023-06-15 11:53:13,090 INFO [train.py:988] (2/4) Epoch 37, batch 400, loss[loss=0.2057, simple_loss=0.2802, pruned_loss=0.06561, over 20648.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2929, pruned_loss=0.06665, over 3247329.31 frames. ], batch size: 211, lr: 8.88e-03, grad_scale: 32.0 2023-06-15 11:53:34,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=130513.33333333333, ans=0.0 2023-06-15 11:53:44,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=130513.33333333333, ans=0.1 2023-06-15 11:53:57,165 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.84 vs. limit=15.0 2023-06-15 11:54:43,367 INFO [train.py:988] (2/4) Epoch 37, batch 450, loss[loss=0.2147, simple_loss=0.2755, pruned_loss=0.07694, over 19943.00 frames. 
], tot_loss[loss=0.2126, simple_loss=0.2924, pruned_loss=0.06637, over 3366407.72 frames. ], batch size: 294, lr: 8.87e-03, grad_scale: 16.0 2023-06-15 11:54:51,578 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.554e+02 1.773e+02 2.041e+02 2.312e+02 3.124e+02, threshold=4.082e+02, percent-clipped=0.0 2023-06-15 11:54:57,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=130780.0, ans=0.2 2023-06-15 11:55:43,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=130980.0, ans=0.125 2023-06-15 11:56:09,212 INFO [train.py:988] (2/4) Epoch 37, batch 500, loss[loss=0.1993, simple_loss=0.2865, pruned_loss=0.05608, over 19787.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2927, pruned_loss=0.066, over 3451276.26 frames. ], batch size: 115, lr: 8.86e-03, grad_scale: 16.0 2023-06-15 11:56:16,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=131113.33333333334, ans=0.125 2023-06-15 11:56:30,134 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-15 11:56:49,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=131246.66666666666, ans=0.125 2023-06-15 11:57:24,718 INFO [train.py:988] (2/4) Epoch 38, batch 0, loss[loss=0.2134, simple_loss=0.2783, pruned_loss=0.07423, over 20274.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2783, pruned_loss=0.07423, over 20274.00 frames. ], batch size: 239, lr: 8.73e-03, grad_scale: 32.0 2023-06-15 11:57:24,718 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 11:57:31,183 INFO [train.py:1020] (2/4) Epoch 38, validation: loss=0.2046, simple_loss=0.3024, pruned_loss=0.05337, over 143649.00 frames. 2023-06-15 11:57:31,184 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 11:57:40,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=131326.66666666666, ans=0.1 2023-06-15 11:58:00,021 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.63 vs. limit=15.0 2023-06-15 11:58:07,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=131460.0, ans=0.2 2023-06-15 11:58:14,480 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.554e+02 1.971e+02 2.221e+02 2.654e+02 3.969e+02, threshold=4.441e+02, percent-clipped=0.0 2023-06-15 11:58:59,408 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.16 vs. limit=15.0 2023-06-15 11:59:00,148 INFO [train.py:988] (2/4) Epoch 38, batch 50, loss[loss=0.2501, simple_loss=0.3318, pruned_loss=0.08416, over 16469.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2925, pruned_loss=0.06614, over 853401.67 frames. 
], batch size: 52, lr: 8.72e-03, grad_scale: 16.0 2023-06-15 11:59:04,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=131660.0, ans=0.2 2023-06-15 11:59:23,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=131726.66666666666, ans=0.125 2023-06-15 11:59:24,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=131726.66666666666, ans=0.0 2023-06-15 11:59:24,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=131726.66666666666, ans=0.2 2023-06-15 11:59:35,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=131793.33333333334, ans=0.125 2023-06-15 11:59:55,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=131860.0, ans=0.0 2023-06-15 12:00:07,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=131926.66666666666, ans=0.0 2023-06-15 12:00:17,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=131926.66666666666, ans=0.05 2023-06-15 12:00:26,386 INFO [train.py:988] (2/4) Epoch 38, batch 100, loss[loss=0.1867, simple_loss=0.2704, pruned_loss=0.0515, over 19546.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2902, pruned_loss=0.06479, over 1512621.17 frames. ], batch size: 102, lr: 8.71e-03, grad_scale: 16.0 2023-06-15 12:00:35,265 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 12:00:36,067 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-15 12:00:41,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=132060.0, ans=0.125 2023-06-15 12:01:01,243 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.81 vs. limit=15.0 2023-06-15 12:01:07,826 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.839e+02 2.066e+02 2.376e+02 4.240e+02, threshold=4.131e+02, percent-clipped=0.0 2023-06-15 12:01:23,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=132193.33333333334, ans=0.1 2023-06-15 12:01:23,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=132193.33333333334, ans=0.125 2023-06-15 12:01:29,398 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-15 12:01:52,136 INFO [train.py:988] (2/4) Epoch 38, batch 150, loss[loss=0.2307, simple_loss=0.3013, pruned_loss=0.08007, over 20259.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2922, pruned_loss=0.0666, over 2013036.90 frames. 
], batch size: 141, lr: 8.70e-03, grad_scale: 16.0 2023-06-15 12:02:00,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=132326.66666666666, ans=0.125 2023-06-15 12:02:14,014 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.07 vs. limit=22.5 2023-06-15 12:02:28,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=132460.0, ans=0.125 2023-06-15 12:02:35,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=132460.0, ans=0.125 2023-06-15 12:02:45,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=132526.66666666666, ans=0.1 2023-06-15 12:03:17,129 INFO [train.py:988] (2/4) Epoch 38, batch 200, loss[loss=0.2, simple_loss=0.2897, pruned_loss=0.05514, over 18793.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2918, pruned_loss=0.06572, over 2391688.66 frames. ], batch size: 83, lr: 8.69e-03, grad_scale: 16.0 2023-06-15 12:03:43,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=132726.66666666666, ans=0.0 2023-06-15 12:03:58,559 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.792e+02 1.952e+02 2.227e+02 3.208e+02, threshold=3.904e+02, percent-clipped=0.0 2023-06-15 12:04:17,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=132860.0, ans=0.125 2023-06-15 12:04:17,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=132860.0, ans=0.0 2023-06-15 12:04:43,045 INFO [train.py:988] (2/4) Epoch 38, batch 250, loss[loss=0.2156, simple_loss=0.2935, pruned_loss=0.0688, over 19530.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2902, pruned_loss=0.06517, over 2713498.37 frames. ], batch size: 102, lr: 8.68e-03, grad_scale: 16.0 2023-06-15 12:05:19,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=133126.66666666666, ans=0.125 2023-06-15 12:05:37,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=133193.33333333334, ans=0.0 2023-06-15 12:05:40,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=133193.33333333334, ans=0.125 2023-06-15 12:06:10,262 INFO [train.py:988] (2/4) Epoch 38, batch 300, loss[loss=0.1946, simple_loss=0.2837, pruned_loss=0.05277, over 19826.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2897, pruned_loss=0.06527, over 2949005.85 frames. ], batch size: 115, lr: 8.67e-03, grad_scale: 16.0 2023-06-15 12:06:21,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=133326.66666666666, ans=0.0 2023-06-15 12:06:54,536 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.862e+02 2.054e+02 2.308e+02 3.545e+02, threshold=4.107e+02, percent-clipped=0.0 2023-06-15 12:07:39,140 INFO [train.py:988] (2/4) Epoch 38, batch 350, loss[loss=0.222, simple_loss=0.3059, pruned_loss=0.06901, over 18468.00 frames. 
], tot_loss[loss=0.2104, simple_loss=0.2896, pruned_loss=0.06561, over 3115626.04 frames. ], batch size: 77, lr: 8.66e-03, grad_scale: 16.0 2023-06-15 12:07:57,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=133726.66666666666, ans=0.125 2023-06-15 12:08:16,240 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-15 12:08:16,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=133793.33333333334, ans=0.0 2023-06-15 12:08:28,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=133793.33333333334, ans=0.125 2023-06-15 12:08:33,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=133860.0, ans=0.125 2023-06-15 12:08:35,651 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.10 vs. limit=15.0 2023-06-15 12:08:44,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=133860.0, ans=0.2 2023-06-15 12:08:46,847 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.05 vs. limit=10.0 2023-06-15 12:09:00,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=133926.66666666666, ans=0.125 2023-06-15 12:09:05,466 INFO [train.py:988] (2/4) Epoch 38, batch 400, loss[loss=0.1995, simple_loss=0.2845, pruned_loss=0.05727, over 19525.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2894, pruned_loss=0.06525, over 3275884.02 frames. ], batch size: 102, lr: 8.65e-03, grad_scale: 32.0 2023-06-15 12:09:25,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=134060.0, ans=0.0 2023-06-15 12:09:25,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=134060.0, ans=0.125 2023-06-15 12:09:25,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=134060.0, ans=0.1 2023-06-15 12:09:30,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=134060.0, ans=0.1 2023-06-15 12:09:42,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=134126.66666666666, ans=0.0 2023-06-15 12:09:49,035 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.764e+02 2.036e+02 2.337e+02 3.432e+02, threshold=4.071e+02, percent-clipped=0.0 2023-06-15 12:10:14,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=134260.0, ans=0.1 2023-06-15 12:10:20,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=134260.0, ans=0.0 2023-06-15 12:10:32,631 INFO [train.py:988] (2/4) Epoch 38, batch 450, loss[loss=0.2083, simple_loss=0.2934, pruned_loss=0.06157, over 18456.00 frames. 
], tot_loss[loss=0.21, simple_loss=0.2892, pruned_loss=0.06544, over 3393651.02 frames. ], batch size: 77, lr: 8.65e-03, grad_scale: 16.0 2023-06-15 12:10:34,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=134326.66666666666, ans=0.125 2023-06-15 12:11:03,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=134393.33333333334, ans=0.125 2023-06-15 12:11:09,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=134460.0, ans=0.125 2023-06-15 12:11:17,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=134460.0, ans=0.125 2023-06-15 12:11:28,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=134526.66666666666, ans=0.125 2023-06-15 12:11:35,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=134526.66666666666, ans=0.125 2023-06-15 12:11:40,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=134593.33333333334, ans=0.125 2023-06-15 12:11:42,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=134593.33333333334, ans=0.2 2023-06-15 12:11:51,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=134593.33333333334, ans=0.0 2023-06-15 12:11:56,107 INFO [train.py:988] (2/4) Epoch 38, batch 500, loss[loss=0.2182, simple_loss=0.3092, pruned_loss=0.0636, over 17572.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2887, pruned_loss=0.065, over 3489281.80 frames. ], batch size: 67, lr: 8.64e-03, grad_scale: 16.0 2023-06-15 12:12:04,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=134660.0, ans=0.125 2023-06-15 12:12:09,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=134660.0, ans=0.5 2023-06-15 12:12:37,176 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.567e+02 1.813e+02 2.102e+02 2.344e+02 3.588e+02, threshold=4.203e+02, percent-clipped=0.0 2023-06-15 12:12:39,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=134793.33333333334, ans=0.125 2023-06-15 12:13:08,850 INFO [train.py:988] (2/4) Epoch 39, batch 0, loss[loss=0.2135, simple_loss=0.2986, pruned_loss=0.06423, over 19647.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2986, pruned_loss=0.06423, over 19647.00 frames. ], batch size: 110, lr: 8.52e-03, grad_scale: 32.0 2023-06-15 12:13:08,851 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 12:13:14,989 INFO [train.py:1020] (2/4) Epoch 39, validation: loss=0.2008, simple_loss=0.3008, pruned_loss=0.05042, over 143649.00 frames. 
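The optim.py:471 entries that recur through this log print five grad-norm statistics followed by a clipping threshold. From the logged numbers these five values appear to be the minimum, 25th percentile, median, 75th percentile, and maximum of recently observed gradient norms, and the threshold appears to equal Clipping_scale (2.0) times the median: 2.0 * 2.048e+02 = 4.096e+02 matches the logged threshold 4.095e+02 up to rounding, and 2.0 * 1.889e+02 = 3.778e+02 matches exactly. The sketch below only illustrates that reading of the log; the class and helper names (MedianGradClipper) are made up for illustration and are not the actual icefall optim.py code.

import torch
from collections import deque

class MedianGradClipper:
    # Illustrative only: clip gradients to clipping_scale * running median of recent grad norms,
    # mirroring the "grad-norm quartiles ... threshold ... percent-clipped" lines in this log.
    def __init__(self, clipping_scale=2.0, window=128):
        self.clipping_scale = clipping_scale
        self.recent_norms = deque(maxlen=window)  # recent total grad norms

    def __call__(self, parameters):
        params = [p for p in parameters if p.grad is not None]
        # total L2 norm over all parameter gradients for this step
        total_norm = torch.norm(torch.stack([p.grad.detach().norm(2) for p in params]), 2)
        self.recent_norms.append(total_norm.item())
        norms = sorted(self.recent_norms)
        quartiles = [norms[int(q * (len(norms) - 1))] for q in (0.0, 0.25, 0.5, 0.75, 1.0)]
        threshold = self.clipping_scale * quartiles[2]  # 2.0 * median, as the logged values suggest
        if total_norm.item() > threshold:
            for p in params:
                p.grad.mul_(threshold / total_norm)
        return quartiles, threshold

Similarly, the tot_loss[...] values in the train.py:988 entries are each reported "over N frames", with N growing across batches within an epoch; this is consistent with a frame-weighted running average of the per-batch losses (loss, simple_loss, pruned_loss) accumulated since the start of the epoch. A minimal sketch of that bookkeeping, again with hypothetical names rather than the real tracker used by train.py:

class RunningLoss:
    # Illustrative only: accumulate frame-weighted losses so that
    # tot_loss = sum(loss_i * frames_i) / sum(frames_i), matching the "over N frames" reporting.
    def __init__(self):
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss, batch_frames):
        self.loss_sum += batch_loss * batch_frames
        self.frames += batch_frames

    @property
    def tot_loss(self):
        return self.loss_sum / max(self.frames, 1.0)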
2023-06-15 12:13:14,990 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 12:13:37,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=134940.0, ans=0.1 2023-06-15 12:14:07,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=135073.33333333334, ans=0.125 2023-06-15 12:14:16,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=135073.33333333334, ans=0.125 2023-06-15 12:14:18,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=135073.33333333334, ans=0.125 2023-06-15 12:14:25,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=135140.0, ans=0.0 2023-06-15 12:14:42,560 INFO [train.py:988] (2/4) Epoch 39, batch 50, loss[loss=0.2044, simple_loss=0.2889, pruned_loss=0.05993, over 19470.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2884, pruned_loss=0.06336, over 873392.34 frames. ], batch size: 105, lr: 8.51e-03, grad_scale: 16.0 2023-06-15 12:15:18,016 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0 2023-06-15 12:15:58,875 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.823e+02 2.055e+02 2.289e+02 2.991e+02, threshold=4.109e+02, percent-clipped=0.0 2023-06-15 12:16:07,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=135540.0, ans=0.0 2023-06-15 12:16:09,594 INFO [train.py:988] (2/4) Epoch 39, batch 100, loss[loss=0.206, simple_loss=0.2912, pruned_loss=0.06038, over 18288.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2882, pruned_loss=0.0642, over 1524853.37 frames. ], batch size: 74, lr: 8.50e-03, grad_scale: 16.0 2023-06-15 12:16:22,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=135540.0, ans=0.07 2023-06-15 12:17:35,693 INFO [train.py:988] (2/4) Epoch 39, batch 150, loss[loss=0.214, simple_loss=0.3052, pruned_loss=0.06134, over 16285.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2907, pruned_loss=0.06406, over 2004999.70 frames. ], batch size: 52, lr: 8.49e-03, grad_scale: 16.0 2023-06-15 12:17:46,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=135873.33333333334, ans=0.04949747468305833 2023-06-15 12:18:17,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=136006.66666666666, ans=0.0 2023-06-15 12:18:32,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=136073.33333333334, ans=0.1 2023-06-15 12:18:52,161 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.763e+02 2.014e+02 2.292e+02 3.195e+02, threshold=4.028e+02, percent-clipped=0.0 2023-06-15 12:19:03,517 INFO [train.py:988] (2/4) Epoch 39, batch 200, loss[loss=0.197, simple_loss=0.2877, pruned_loss=0.05315, over 19850.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.289, pruned_loss=0.06298, over 2403219.27 frames. 
], batch size: 115, lr: 8.48e-03, grad_scale: 16.0 2023-06-15 12:19:12,557 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 12:20:04,683 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.85 vs. limit=15.0 2023-06-15 12:20:31,128 INFO [train.py:988] (2/4) Epoch 39, batch 250, loss[loss=0.1962, simple_loss=0.2847, pruned_loss=0.05388, over 19463.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.29, pruned_loss=0.06319, over 2694138.32 frames. ], batch size: 105, lr: 8.47e-03, grad_scale: 16.0 2023-06-15 12:21:04,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=136673.33333333334, ans=0.125 2023-06-15 12:21:13,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=136673.33333333334, ans=0.0 2023-06-15 12:21:19,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=136673.33333333334, ans=0.0 2023-06-15 12:21:28,323 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 12:21:48,758 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.779e+02 1.952e+02 2.172e+02 3.259e+02, threshold=3.903e+02, percent-clipped=0.0 2023-06-15 12:21:59,146 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.91 vs. limit=6.0 2023-06-15 12:21:59,985 INFO [train.py:988] (2/4) Epoch 39, batch 300, loss[loss=0.2064, simple_loss=0.2528, pruned_loss=0.07998, over 16606.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2901, pruned_loss=0.06358, over 2940799.26 frames. ], batch size: 392, lr: 8.46e-03, grad_scale: 16.0 2023-06-15 12:22:26,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=136940.0, ans=0.2 2023-06-15 12:22:57,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=137073.33333333334, ans=0.125 2023-06-15 12:23:00,032 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.74 vs. limit=22.5 2023-06-15 12:23:23,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=137140.0, ans=0.125 2023-06-15 12:23:26,381 INFO [train.py:988] (2/4) Epoch 39, batch 350, loss[loss=0.2087, simple_loss=0.276, pruned_loss=0.07064, over 20234.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2906, pruned_loss=0.06369, over 3118959.90 frames. ], batch size: 239, lr: 8.45e-03, grad_scale: 16.0 2023-06-15 12:23:41,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=137206.66666666666, ans=0.1 2023-06-15 12:23:48,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=137273.33333333334, ans=0.0 2023-06-15 12:23:52,290 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.56 vs. 
limit=12.0 2023-06-15 12:24:07,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=137340.0, ans=0.0 2023-06-15 12:24:09,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=137340.0, ans=0.2 2023-06-15 12:24:40,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=137473.33333333334, ans=0.1 2023-06-15 12:24:41,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=137473.33333333334, ans=0.125 2023-06-15 12:24:44,645 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.611e+02 1.816e+02 2.065e+02 2.366e+02 3.841e+02, threshold=4.130e+02, percent-clipped=0.0 2023-06-15 12:24:55,375 INFO [train.py:988] (2/4) Epoch 39, batch 400, loss[loss=0.2053, simple_loss=0.2804, pruned_loss=0.06505, over 19941.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2898, pruned_loss=0.06393, over 3282593.31 frames. ], batch size: 126, lr: 8.44e-03, grad_scale: 32.0 2023-06-15 12:25:29,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=137673.33333333334, ans=0.125 2023-06-15 12:25:42,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=137673.33333333334, ans=0.0 2023-06-15 12:26:23,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=137873.33333333334, ans=0.0 2023-06-15 12:26:24,764 INFO [train.py:988] (2/4) Epoch 39, batch 450, loss[loss=0.2012, simple_loss=0.2736, pruned_loss=0.06434, over 20641.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2892, pruned_loss=0.06444, over 3403202.05 frames. ], batch size: 211, lr: 8.44e-03, grad_scale: 16.0 2023-06-15 12:26:28,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=137873.33333333334, ans=0.125 2023-06-15 12:26:36,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=137873.33333333334, ans=0.125 2023-06-15 12:26:47,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=137940.0, ans=0.125 2023-06-15 12:26:49,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=137940.0, ans=0.125 2023-06-15 12:27:18,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=138073.33333333334, ans=0.015 2023-06-15 12:27:20,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=138073.33333333334, ans=0.0 2023-06-15 12:27:41,490 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.826e+02 2.179e+02 2.454e+02 3.798e+02, threshold=4.358e+02, percent-clipped=0.0 2023-06-15 12:27:45,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=138140.0, ans=0.125 2023-06-15 12:27:49,797 INFO [train.py:988] (2/4) Epoch 39, batch 500, loss[loss=0.2062, simple_loss=0.2862, pruned_loss=0.06311, over 18782.00 frames. 
], tot_loss[loss=0.2092, simple_loss=0.2899, pruned_loss=0.06427, over 3488256.76 frames. ], batch size: 83, lr: 8.43e-03, grad_scale: 16.0 2023-06-15 12:28:22,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=138340.0, ans=0.125 2023-06-15 12:28:35,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=138340.0, ans=0.07 2023-06-15 12:29:07,679 INFO [train.py:988] (2/4) Epoch 40, batch 0, loss[loss=0.2209, simple_loss=0.3026, pruned_loss=0.06958, over 19821.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3026, pruned_loss=0.06958, over 19821.00 frames. ], batch size: 115, lr: 8.31e-03, grad_scale: 32.0 2023-06-15 12:29:07,680 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 12:29:13,804 INFO [train.py:1020] (2/4) Epoch 40, validation: loss=0.2011, simple_loss=0.3008, pruned_loss=0.05073, over 143649.00 frames. 2023-06-15 12:29:13,806 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 12:30:08,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=138620.0, ans=0.1 2023-06-15 12:30:38,118 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.91 vs. limit=22.5 2023-06-15 12:30:42,330 INFO [train.py:988] (2/4) Epoch 40, batch 50, loss[loss=0.1964, simple_loss=0.2809, pruned_loss=0.05597, over 19695.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2889, pruned_loss=0.06393, over 851679.67 frames. ], batch size: 110, lr: 8.31e-03, grad_scale: 32.0 2023-06-15 12:31:01,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=138820.0, ans=0.0 2023-06-15 12:31:04,751 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 12:31:05,859 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.755e+02 2.069e+02 2.333e+02 3.346e+02, threshold=4.138e+02, percent-clipped=0.0 2023-06-15 12:31:30,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff2.min_abs, batch_count=138886.66666666666, ans=0.1 2023-06-15 12:31:32,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=138886.66666666666, ans=0.1 2023-06-15 12:32:12,092 INFO [train.py:988] (2/4) Epoch 40, batch 100, loss[loss=0.2263, simple_loss=0.3193, pruned_loss=0.06667, over 18300.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2876, pruned_loss=0.06317, over 1502735.30 frames. ], batch size: 72, lr: 8.30e-03, grad_scale: 32.0 2023-06-15 12:32:19,847 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.88 vs. 
limit=15.0 2023-06-15 12:32:25,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=139086.66666666666, ans=0.05 2023-06-15 12:32:37,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=139153.33333333334, ans=10.0 2023-06-15 12:32:47,405 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.52 vs. limit=15.0 2023-06-15 12:32:52,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=139220.0, ans=0.1 2023-06-15 12:32:59,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=139220.0, ans=0.125 2023-06-15 12:33:38,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=139353.33333333334, ans=0.2 2023-06-15 12:33:41,081 INFO [train.py:988] (2/4) Epoch 40, batch 150, loss[loss=0.2145, simple_loss=0.2995, pruned_loss=0.06474, over 16363.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2901, pruned_loss=0.06449, over 1998759.22 frames. ], batch size: 52, lr: 8.29e-03, grad_scale: 32.0 2023-06-15 12:33:51,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=139420.0, ans=0.1 2023-06-15 12:33:58,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=139486.66666666666, ans=0.5 2023-06-15 12:33:58,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=139486.66666666666, ans=0.0 2023-06-15 12:33:59,323 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.89 vs. limit=10.0 2023-06-15 12:34:03,176 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.849e+02 1.991e+02 2.229e+02 4.188e+02, threshold=3.982e+02, percent-clipped=1.0 2023-06-15 12:34:19,296 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.15 vs. limit=5.0 2023-06-15 12:34:22,354 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.20 vs. limit=15.0 2023-06-15 12:34:39,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=139620.0, ans=0.125 2023-06-15 12:34:40,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=139620.0, ans=0.0 2023-06-15 12:34:51,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=139686.66666666666, ans=0.1 2023-06-15 12:35:09,522 INFO [train.py:988] (2/4) Epoch 40, batch 200, loss[loss=0.1933, simple_loss=0.2822, pruned_loss=0.0522, over 18288.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2897, pruned_loss=0.06433, over 2406605.58 frames. 
], batch size: 74, lr: 8.28e-03, grad_scale: 32.0 2023-06-15 12:35:17,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=139753.33333333334, ans=0.0 2023-06-15 12:35:19,573 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.40 vs. limit=22.5 2023-06-15 12:35:24,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=139753.33333333334, ans=0.1 2023-06-15 12:35:29,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=139820.0, ans=0.1 2023-06-15 12:36:19,548 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.23 vs. limit=15.0 2023-06-15 12:36:36,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=140086.66666666666, ans=0.2 2023-06-15 12:36:37,806 INFO [train.py:988] (2/4) Epoch 40, batch 250, loss[loss=0.2096, simple_loss=0.2899, pruned_loss=0.06462, over 18614.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2887, pruned_loss=0.06479, over 2704730.08 frames. ], batch size: 80, lr: 8.27e-03, grad_scale: 32.0 2023-06-15 12:36:47,582 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=15.0 2023-06-15 12:37:01,106 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.820e+02 2.078e+02 2.421e+02 4.152e+02, threshold=4.155e+02, percent-clipped=1.0 2023-06-15 12:37:01,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=140153.33333333334, ans=0.1 2023-06-15 12:37:02,385 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.19 vs. limit=15.0 2023-06-15 12:37:12,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=140220.0, ans=0.1 2023-06-15 12:37:31,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=140286.66666666666, ans=0.125 2023-06-15 12:37:33,692 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 12:37:46,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=140286.66666666666, ans=0.0 2023-06-15 12:38:08,117 INFO [train.py:988] (2/4) Epoch 40, batch 300, loss[loss=0.1957, simple_loss=0.2772, pruned_loss=0.0571, over 19517.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2877, pruned_loss=0.06417, over 2958610.57 frames. 
], batch size: 102, lr: 8.26e-03, grad_scale: 32.0 2023-06-15 12:38:13,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=140420.0, ans=0.125 2023-06-15 12:38:24,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=140486.66666666666, ans=0.2 2023-06-15 12:38:49,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=140553.33333333334, ans=0.1 2023-06-15 12:39:38,324 INFO [train.py:988] (2/4) Epoch 40, batch 350, loss[loss=0.2121, simple_loss=0.3065, pruned_loss=0.0589, over 15586.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2868, pruned_loss=0.06391, over 3147040.05 frames. ], batch size: 44, lr: 8.25e-03, grad_scale: 32.0 2023-06-15 12:40:01,673 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.757e+02 1.917e+02 2.241e+02 2.935e+02, threshold=3.834e+02, percent-clipped=0.0 2023-06-15 12:40:14,936 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.29 vs. limit=22.5 2023-06-15 12:40:21,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=140886.66666666666, ans=0.125 2023-06-15 12:40:34,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=140953.33333333334, ans=0.125 2023-06-15 12:41:08,546 INFO [train.py:988] (2/4) Epoch 40, batch 400, loss[loss=0.2133, simple_loss=0.3047, pruned_loss=0.06094, over 16957.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2872, pruned_loss=0.06356, over 3301724.82 frames. ], batch size: 60, lr: 8.24e-03, grad_scale: 32.0 2023-06-15 12:41:08,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=141086.66666666666, ans=0.2 2023-06-15 12:41:10,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=141086.66666666666, ans=0.125 2023-06-15 12:41:14,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=141086.66666666666, ans=0.125 2023-06-15 12:41:28,896 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.94 vs. limit=15.0 2023-06-15 12:41:31,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=141153.33333333334, ans=0.2 2023-06-15 12:41:43,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=141220.0, ans=0.0 2023-06-15 12:41:48,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=141220.0, ans=0.1 2023-06-15 12:42:30,874 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.92 vs. limit=15.0 2023-06-15 12:42:36,477 INFO [train.py:988] (2/4) Epoch 40, batch 450, loss[loss=0.207, simple_loss=0.2825, pruned_loss=0.06578, over 19938.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2879, pruned_loss=0.06393, over 3403580.89 frames. 
], batch size: 126, lr: 8.24e-03, grad_scale: 32.0 2023-06-15 12:42:57,204 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-15 12:42:59,287 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.768e+02 1.885e+02 2.206e+02 3.327e+02, threshold=3.770e+02, percent-clipped=0.0 2023-06-15 12:43:09,310 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.26 vs. limit=12.0 2023-06-15 12:43:19,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=141553.33333333334, ans=10.0 2023-06-15 12:43:21,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=141553.33333333334, ans=0.0 2023-06-15 12:43:36,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=141620.0, ans=0.1 2023-06-15 12:43:53,058 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.01 vs. limit=10.0 2023-06-15 12:44:03,580 INFO [train.py:988] (2/4) Epoch 40, batch 500, loss[loss=0.2132, simple_loss=0.2722, pruned_loss=0.07709, over 19925.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2881, pruned_loss=0.06388, over 3490759.55 frames. ], batch size: 293, lr: 8.23e-03, grad_scale: 32.0 2023-06-15 12:44:33,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=141820.0, ans=0.125 2023-06-15 12:44:35,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=141886.66666666666, ans=0.125 2023-06-15 12:44:38,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=141886.66666666666, ans=0.125 2023-06-15 12:45:20,978 INFO [train.py:988] (2/4) Epoch 41, batch 0, loss[loss=0.218, simple_loss=0.3031, pruned_loss=0.06642, over 16401.00 frames. ], tot_loss[loss=0.218, simple_loss=0.3031, pruned_loss=0.06642, over 16401.00 frames. ], batch size: 52, lr: 8.12e-03, grad_scale: 32.0 2023-06-15 12:45:20,979 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 12:45:28,024 INFO [train.py:1020] (2/4) Epoch 41, validation: loss=0.2002, simple_loss=0.2999, pruned_loss=0.05026, over 143649.00 frames. 
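Note on the recurring [optim.py:471] records above: each one reports the quartiles of recent gradient norms together with a clipping threshold and the fraction of clipped steps, and in these records the threshold matches Clipping_scale times the median quartile (for example 2.0 * 2.065e+02 = 4.130e+02). The following is only a minimal Python sketch of that bookkeeping under the assumption that the threshold is the scaled median of a sliding window of norms; the class and method names are illustrative and are not the actual optim.py implementation.

    from collections import deque
    import numpy as np

    class GradNormClipStats:
        """Illustrative tracker: quartiles of recent grad norms and a
        threshold = clipping_scale * median, as printed in the log."""

        def __init__(self, clipping_scale=2.0, window=128):
            self.clipping_scale = clipping_scale
            self.norms = deque(maxlen=window)   # recent per-step grad norms
            self.num_steps = 0
            self.num_clipped = 0

        def update(self, grad_norm: float) -> float:
            """Record one step's grad norm; return the current threshold."""
            self.norms.append(grad_norm)
            quartiles = np.percentile(self.norms, [0, 25, 50, 75, 100])
            threshold = self.clipping_scale * quartiles[2]  # scale * median
            self.num_steps += 1
            if grad_norm > threshold:
                self.num_clipped += 1
            return threshold

        def summary(self) -> str:
            """Format a line in the same spirit as the optim.py log records."""
            q = np.percentile(self.norms, [0, 25, 50, 75, 100])
            pct = 100.0 * self.num_clipped / max(1, self.num_steps)
            return ("grad-norm quartiles "
                    + " ".join(f"{v:.3e}" for v in q)
                    + f", threshold={self.clipping_scale * q[2]:.3e}"
                    + f", percent-clipped={pct}")

Under this reading, percent-clipped=0.0 in the log simply means no step in the reporting interval exceeded the scaled-median threshold.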
2023-06-15 12:45:28,025 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 12:45:49,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=142040.0, ans=0.125 2023-06-15 12:46:18,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=142106.66666666666, ans=0.2 2023-06-15 12:46:21,185 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.355e+02 1.817e+02 2.110e+02 2.443e+02 3.477e+02, threshold=4.219e+02, percent-clipped=0.0 2023-06-15 12:46:23,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=142173.33333333334, ans=0.125 2023-06-15 12:46:44,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=142240.0, ans=0.1 2023-06-15 12:46:50,538 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.81 vs. limit=15.0 2023-06-15 12:46:54,851 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.54 vs. limit=12.0 2023-06-15 12:46:57,300 INFO [train.py:988] (2/4) Epoch 41, batch 50, loss[loss=0.1978, simple_loss=0.2832, pruned_loss=0.05623, over 19470.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2866, pruned_loss=0.06273, over 859703.42 frames. ], batch size: 105, lr: 8.11e-03, grad_scale: 32.0 2023-06-15 12:47:21,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=142373.33333333334, ans=0.125 2023-06-15 12:47:24,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=142373.33333333334, ans=0.125 2023-06-15 12:47:43,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=142440.0, ans=0.125 2023-06-15 12:47:52,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=142506.66666666666, ans=0.125 2023-06-15 12:48:08,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=142573.33333333334, ans=0.125 2023-06-15 12:48:15,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=142573.33333333334, ans=0.125 2023-06-15 12:48:25,568 INFO [train.py:988] (2/4) Epoch 41, batch 100, loss[loss=0.209, simple_loss=0.2854, pruned_loss=0.06626, over 19943.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2865, pruned_loss=0.06227, over 1516003.58 frames. 
], batch size: 126, lr: 8.10e-03, grad_scale: 32.0 2023-06-15 12:48:46,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=142706.66666666666, ans=0.125 2023-06-15 12:48:47,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=142706.66666666666, ans=0.125 2023-06-15 12:48:54,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=142706.66666666666, ans=0.125 2023-06-15 12:48:59,155 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=15.0 2023-06-15 12:49:07,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=142773.33333333334, ans=0.125 2023-06-15 12:49:07,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=142773.33333333334, ans=0.1 2023-06-15 12:49:08,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=142773.33333333334, ans=0.125 2023-06-15 12:49:14,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=142773.33333333334, ans=0.2 2023-06-15 12:49:18,949 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.448e+02 1.848e+02 2.100e+02 2.504e+02 3.647e+02, threshold=4.200e+02, percent-clipped=0.0 2023-06-15 12:49:54,449 INFO [train.py:988] (2/4) Epoch 41, batch 150, loss[loss=0.1877, simple_loss=0.2729, pruned_loss=0.05121, over 19066.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2874, pruned_loss=0.0623, over 2022197.48 frames. ], batch size: 89, lr: 8.09e-03, grad_scale: 32.0 2023-06-15 12:50:36,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=143106.66666666666, ans=0.0 2023-06-15 12:50:37,262 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.35 vs. limit=6.0 2023-06-15 12:50:47,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=143173.33333333334, ans=0.1 2023-06-15 12:50:50,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=143173.33333333334, ans=0.125 2023-06-15 12:51:18,840 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.59 vs. limit=22.5 2023-06-15 12:51:24,292 INFO [train.py:988] (2/4) Epoch 41, batch 200, loss[loss=0.1966, simple_loss=0.2739, pruned_loss=0.05972, over 20571.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2874, pruned_loss=0.06269, over 2393384.80 frames. 
], batch size: 189, lr: 8.09e-03, grad_scale: 32.0 2023-06-15 12:51:58,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=143440.0, ans=0.1 2023-06-15 12:52:18,426 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.761e+02 1.969e+02 2.341e+02 3.526e+02, threshold=3.938e+02, percent-clipped=0.0 2023-06-15 12:52:54,202 INFO [train.py:988] (2/4) Epoch 41, batch 250, loss[loss=0.2039, simple_loss=0.2896, pruned_loss=0.05911, over 19102.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2874, pruned_loss=0.06211, over 2702083.19 frames. ], batch size: 89, lr: 8.08e-03, grad_scale: 32.0 2023-06-15 12:52:54,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=143640.0, ans=0.0 2023-06-15 12:53:25,161 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 12:53:56,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=143840.0, ans=0.0 2023-06-15 12:54:24,725 INFO [train.py:988] (2/4) Epoch 41, batch 300, loss[loss=0.2087, simple_loss=0.2842, pruned_loss=0.06665, over 20284.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2864, pruned_loss=0.06303, over 2951255.96 frames. ], batch size: 141, lr: 8.07e-03, grad_scale: 32.0 2023-06-15 12:55:19,258 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.842e+02 2.030e+02 2.348e+02 3.359e+02, threshold=4.059e+02, percent-clipped=0.0 2023-06-15 12:55:38,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=144240.0, ans=0.125 2023-06-15 12:55:41,313 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.81 vs. limit=15.0 2023-06-15 12:55:54,961 INFO [train.py:988] (2/4) Epoch 41, batch 350, loss[loss=0.238, simple_loss=0.3252, pruned_loss=0.07541, over 16369.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2866, pruned_loss=0.06257, over 3146356.09 frames. 
], batch size: 52, lr: 8.06e-03, grad_scale: 32.0 2023-06-15 12:56:32,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=144440.0, ans=0.2 2023-06-15 12:56:43,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=144440.0, ans=0.125 2023-06-15 12:56:52,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=144506.66666666666, ans=0.0 2023-06-15 12:57:10,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=144573.33333333334, ans=0.0 2023-06-15 12:57:18,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=144573.33333333334, ans=0.125 2023-06-15 12:57:20,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=144573.33333333334, ans=0.0 2023-06-15 12:57:20,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=144573.33333333334, ans=0.125 2023-06-15 12:57:25,281 INFO [train.py:988] (2/4) Epoch 41, batch 400, loss[loss=0.2261, simple_loss=0.3, pruned_loss=0.0761, over 20257.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2864, pruned_loss=0.06268, over 3279117.68 frames. ], batch size: 141, lr: 8.05e-03, grad_scale: 32.0 2023-06-15 12:57:25,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=144640.0, ans=0.125 2023-06-15 12:57:35,490 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.86 vs. limit=10.0 2023-06-15 12:57:57,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=144706.66666666666, ans=0.125 2023-06-15 12:58:03,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=144773.33333333334, ans=0.07 2023-06-15 12:58:13,903 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.37 vs. limit=15.0 2023-06-15 12:58:17,902 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.778e+02 1.933e+02 2.211e+02 3.033e+02, threshold=3.866e+02, percent-clipped=0.0 2023-06-15 12:58:34,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=144906.66666666666, ans=0.125 2023-06-15 12:58:44,862 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.96 vs. limit=15.0 2023-06-15 12:58:53,415 INFO [train.py:988] (2/4) Epoch 41, batch 450, loss[loss=0.2226, simple_loss=0.304, pruned_loss=0.07058, over 18509.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2862, pruned_loss=0.06238, over 3393090.26 frames. ], batch size: 77, lr: 8.04e-03, grad_scale: 32.0 2023-06-15 13:00:17,862 INFO [train.py:988] (2/4) Epoch 41, batch 500, loss[loss=0.2085, simple_loss=0.2986, pruned_loss=0.05918, over 18292.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.288, pruned_loss=0.06261, over 3473641.25 frames. 
], batch size: 74, lr: 8.04e-03, grad_scale: 32.0 2023-06-15 13:00:19,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=145306.66666666666, ans=0.125 2023-06-15 13:00:21,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=145306.66666666666, ans=0.0 2023-06-15 13:00:43,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=145373.33333333334, ans=0.125 2023-06-15 13:01:01,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=145440.0, ans=0.125 2023-06-15 13:01:07,587 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.500e+02 1.783e+02 1.958e+02 2.209e+02 2.904e+02, threshold=3.915e+02, percent-clipped=0.0 2023-06-15 13:01:34,524 INFO [train.py:988] (2/4) Epoch 42, batch 0, loss[loss=0.1948, simple_loss=0.2744, pruned_loss=0.05755, over 20285.00 frames. ], tot_loss[loss=0.1948, simple_loss=0.2744, pruned_loss=0.05755, over 20285.00 frames. ], batch size: 149, lr: 7.93e-03, grad_scale: 32.0 2023-06-15 13:01:34,525 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 13:01:40,651 INFO [train.py:1020] (2/4) Epoch 42, validation: loss=0.1999, simple_loss=0.2992, pruned_loss=0.05028, over 143649.00 frames. 2023-06-15 13:01:40,652 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 13:01:53,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=145520.0, ans=0.125 2023-06-15 13:01:59,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=145586.66666666666, ans=0.125 2023-06-15 13:02:09,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=145586.66666666666, ans=0.125 2023-06-15 13:02:19,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=145653.33333333334, ans=0.125 2023-06-15 13:02:19,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=145653.33333333334, ans=0.0 2023-06-15 13:03:01,663 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.84 vs. limit=22.5 2023-06-15 13:03:06,300 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.69 vs. limit=6.0 2023-06-15 13:03:10,744 INFO [train.py:988] (2/4) Epoch 42, batch 50, loss[loss=0.2004, simple_loss=0.2849, pruned_loss=0.05799, over 19225.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2858, pruned_loss=0.06259, over 849021.14 frames. 
], batch size: 92, lr: 7.93e-03, grad_scale: 32.0 2023-06-15 13:03:19,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=145853.33333333334, ans=0.0 2023-06-15 13:03:31,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=145920.0, ans=0.0 2023-06-15 13:04:08,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=146053.33333333334, ans=0.0 2023-06-15 13:04:12,943 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.46 vs. limit=10.0 2023-06-15 13:04:14,265 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.82 vs. limit=22.5 2023-06-15 13:04:17,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=146053.33333333334, ans=0.125 2023-06-15 13:04:35,846 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.815e+02 2.002e+02 2.267e+02 3.167e+02, threshold=4.003e+02, percent-clipped=0.0 2023-06-15 13:04:39,182 INFO [train.py:988] (2/4) Epoch 42, batch 100, loss[loss=0.2013, simple_loss=0.2888, pruned_loss=0.05693, over 19339.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2851, pruned_loss=0.06113, over 1514276.52 frames. ], batch size: 98, lr: 7.92e-03, grad_scale: 32.0 2023-06-15 13:05:30,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=146386.66666666666, ans=0.0 2023-06-15 13:05:56,683 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.18 vs. limit=15.0 2023-06-15 13:06:08,114 INFO [train.py:988] (2/4) Epoch 42, batch 150, loss[loss=0.1813, simple_loss=0.2676, pruned_loss=0.04753, over 18934.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2853, pruned_loss=0.06094, over 2024373.92 frames. ], batch size: 86, lr: 7.91e-03, grad_scale: 32.0 2023-06-15 13:06:33,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=146586.66666666666, ans=0.1 2023-06-15 13:06:38,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=146586.66666666666, ans=0.125 2023-06-15 13:06:50,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=146653.33333333334, ans=0.2 2023-06-15 13:07:05,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=146720.0, ans=0.0 2023-06-15 13:07:10,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=146720.0, ans=0.0 2023-06-15 13:07:33,766 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.04 vs. 
limit=22.5 2023-06-15 13:07:34,326 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.782e+02 1.960e+02 2.224e+02 3.500e+02, threshold=3.921e+02, percent-clipped=0.0 2023-06-15 13:07:37,741 INFO [train.py:988] (2/4) Epoch 42, batch 200, loss[loss=0.2101, simple_loss=0.2864, pruned_loss=0.06684, over 18628.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2856, pruned_loss=0.06145, over 2424211.32 frames. ], batch size: 80, lr: 7.90e-03, grad_scale: 32.0 2023-06-15 13:08:22,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=146986.66666666666, ans=0.07 2023-06-15 13:08:23,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=146986.66666666666, ans=15.0 2023-06-15 13:09:04,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=147120.0, ans=0.1 2023-06-15 13:09:07,773 INFO [train.py:988] (2/4) Epoch 42, batch 250, loss[loss=0.1945, simple_loss=0.2746, pruned_loss=0.05718, over 20557.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2852, pruned_loss=0.06169, over 2734043.31 frames. ], batch size: 173, lr: 7.89e-03, grad_scale: 32.0 2023-06-15 13:09:13,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=147186.66666666666, ans=0.125 2023-06-15 13:09:28,947 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=15.0 2023-06-15 13:09:33,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=147253.33333333334, ans=0.125 2023-06-15 13:09:42,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=147320.0, ans=0.1 2023-06-15 13:10:02,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=147386.66666666666, ans=0.2 2023-06-15 13:10:04,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=147386.66666666666, ans=0.125 2023-06-15 13:10:11,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=147386.66666666666, ans=0.05 2023-06-15 13:10:24,839 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=12.0 2023-06-15 13:10:32,060 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.737e+02 1.907e+02 2.072e+02 2.821e+02, threshold=3.814e+02, percent-clipped=0.0 2023-06-15 13:10:33,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=147453.33333333334, ans=0.125 2023-06-15 13:10:36,353 INFO [train.py:988] (2/4) Epoch 42, batch 300, loss[loss=0.1939, simple_loss=0.2635, pruned_loss=0.06216, over 20297.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2857, pruned_loss=0.06161, over 2960968.45 frames. 
], batch size: 239, lr: 7.88e-03, grad_scale: 32.0 2023-06-15 13:10:37,196 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.73 vs. limit=22.5 2023-06-15 13:10:54,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=147586.66666666666, ans=0.125 2023-06-15 13:11:03,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=147586.66666666666, ans=0.05 2023-06-15 13:11:09,862 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.70 vs. limit=15.0 2023-06-15 13:12:05,697 INFO [train.py:988] (2/4) Epoch 42, batch 350, loss[loss=0.2269, simple_loss=0.2716, pruned_loss=0.09114, over 17030.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2856, pruned_loss=0.06205, over 3144997.65 frames. ], batch size: 391, lr: 7.88e-03, grad_scale: 32.0 2023-06-15 13:12:20,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=147853.33333333334, ans=0.5 2023-06-15 13:12:26,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=147920.0, ans=0.125 2023-06-15 13:12:54,715 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.03 vs. limit=15.0 2023-06-15 13:13:30,812 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.764e+02 1.957e+02 2.256e+02 2.981e+02, threshold=3.914e+02, percent-clipped=0.0 2023-06-15 13:13:32,072 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.22 vs. limit=10.0 2023-06-15 13:13:34,291 INFO [train.py:988] (2/4) Epoch 42, batch 400, loss[loss=0.1965, simple_loss=0.2721, pruned_loss=0.06044, over 20311.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2861, pruned_loss=0.06167, over 3277649.51 frames. ], batch size: 149, lr: 7.87e-03, grad_scale: 32.0 2023-06-15 13:13:43,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=148186.66666666666, ans=0.1 2023-06-15 13:13:43,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=148186.66666666666, ans=0.125 2023-06-15 13:13:49,579 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.37 vs. limit=15.0 2023-06-15 13:14:40,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=148386.66666666666, ans=0.125 2023-06-15 13:14:50,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=148453.33333333334, ans=0.2 2023-06-15 13:15:03,039 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.53 vs. limit=12.0 2023-06-15 13:15:03,359 INFO [train.py:988] (2/4) Epoch 42, batch 450, loss[loss=0.2044, simple_loss=0.2825, pruned_loss=0.0631, over 20274.00 frames. 
], tot_loss[loss=0.2051, simple_loss=0.2867, pruned_loss=0.06175, over 3391804.87 frames. ], batch size: 141, lr: 7.86e-03, grad_scale: 32.0 2023-06-15 13:15:11,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=148520.0, ans=0.0 2023-06-15 13:15:29,193 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.22 vs. limit=6.0 2023-06-15 13:15:45,992 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-15 13:15:50,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=148653.33333333334, ans=0.04949747468305833 2023-06-15 13:15:54,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=148720.0, ans=0.125 2023-06-15 13:16:26,116 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.894e+02 2.076e+02 2.350e+02 3.042e+02, threshold=4.151e+02, percent-clipped=0.0 2023-06-15 13:16:29,395 INFO [train.py:988] (2/4) Epoch 42, batch 500, loss[loss=0.2152, simple_loss=0.3021, pruned_loss=0.0642, over 16336.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2868, pruned_loss=0.06239, over 3492302.42 frames. ], batch size: 52, lr: 7.85e-03, grad_scale: 32.0 2023-06-15 13:16:59,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=148920.0, ans=6.0 2023-06-15 13:17:13,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=148986.66666666666, ans=0.2 2023-06-15 13:17:14,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=148986.66666666666, ans=0.1 2023-06-15 13:17:16,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=148986.66666666666, ans=0.125 2023-06-15 13:17:19,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=149053.33333333334, ans=0.125 2023-06-15 13:17:51,439 INFO [train.py:988] (2/4) Epoch 43, batch 0, loss[loss=0.217, simple_loss=0.2646, pruned_loss=0.08468, over 16769.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2646, pruned_loss=0.08468, over 16769.00 frames. ], batch size: 391, lr: 7.76e-03, grad_scale: 32.0 2023-06-15 13:17:51,439 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 13:17:57,701 INFO [train.py:1020] (2/4) Epoch 43, validation: loss=0.2014, simple_loss=0.3004, pruned_loss=0.05115, over 143649.00 frames. 
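Note on the [scaling.py:182] ScheduledFloat records: each reports the current value (ans) of a scheduled hyperparameter (skip rates, dropout probabilities, balancer probabilities, scale minima) at a given batch_count. A minimal sketch of such a schedule, assuming piecewise-linear interpolation between (batch_count, value) breakpoints, is given below; the breakpoints in the example are made up for illustration and are not taken from the recipe or from scaling.py.

    import bisect

    class PiecewiseScheduledFloat:
        """Illustrative scheduled float: value is piecewise-linear in
        batch_count, clamped to the end values outside the breakpoints."""

        def __init__(self, *points):
            # points: (batch_count, value) pairs, sorted by batch_count
            self.xs = [p[0] for p in points]
            self.ys = [p[1] for p in points]

        def value(self, batch_count: float) -> float:
            if batch_count <= self.xs[0]:
                return self.ys[0]
            if batch_count >= self.xs[-1]:
                return self.ys[-1]
            i = bisect.bisect_right(self.xs, batch_count)
            x0, x1 = self.xs[i - 1], self.xs[i]
            y0, y1 = self.ys[i - 1], self.ys[i]
            t = (batch_count - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)

    # e.g. a skip rate that decays from 0.5 to 0.0 over the first 20k batches
    # (breakpoints chosen only for illustration):
    conv_skip_rate = PiecewiseScheduledFloat((0.0, 0.5), (20000.0, 0.0))
    print(conv_skip_rate.value(137340.0))  # 0.0, as in the late-training records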
2023-06-15 13:17:57,702 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 13:17:59,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=149073.33333333334, ans=0.2 2023-06-15 13:18:15,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=149140.0, ans=0.025 2023-06-15 13:18:18,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=149140.0, ans=0.0 2023-06-15 13:18:42,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=149206.66666666666, ans=0.125 2023-06-15 13:18:48,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=149206.66666666666, ans=0.1 2023-06-15 13:19:05,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=149273.33333333334, ans=0.125 2023-06-15 13:19:08,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=149340.0, ans=0.125 2023-06-15 13:19:26,758 INFO [train.py:988] (2/4) Epoch 43, batch 50, loss[loss=0.2084, simple_loss=0.2527, pruned_loss=0.08208, over 16972.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2868, pruned_loss=0.0613, over 848777.72 frames. ], batch size: 391, lr: 7.75e-03, grad_scale: 32.0 2023-06-15 13:19:41,506 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 13:19:53,432 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.770e+02 1.939e+02 2.280e+02 3.061e+02, threshold=3.878e+02, percent-clipped=0.0 2023-06-15 13:20:11,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=149540.0, ans=0.5 2023-06-15 13:20:15,206 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.47 vs. limit=22.5 2023-06-15 13:20:22,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=149606.66666666666, ans=0.0 2023-06-15 13:20:33,515 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.65 vs. limit=12.0 2023-06-15 13:20:55,037 INFO [train.py:988] (2/4) Epoch 43, batch 100, loss[loss=0.2002, simple_loss=0.2728, pruned_loss=0.06379, over 20705.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2856, pruned_loss=0.06207, over 1498597.18 frames. 
], batch size: 211, lr: 7.74e-03, grad_scale: 32.0 2023-06-15 13:21:30,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=149873.33333333334, ans=0.125 2023-06-15 13:21:45,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=149873.33333333334, ans=0.2 2023-06-15 13:22:11,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=150006.66666666666, ans=0.0 2023-06-15 13:22:15,584 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.90 vs. limit=15.0 2023-06-15 13:22:17,790 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.59 vs. limit=15.0 2023-06-15 13:22:23,161 INFO [train.py:988] (2/4) Epoch 43, batch 150, loss[loss=0.1949, simple_loss=0.2816, pruned_loss=0.05404, over 19076.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2857, pruned_loss=0.0615, over 2006763.94 frames. ], batch size: 89, lr: 7.73e-03, grad_scale: 32.0 2023-06-15 13:22:32,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=150073.33333333334, ans=0.125 2023-06-15 13:22:43,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=150140.0, ans=0.2 2023-06-15 13:22:50,681 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 1.767e+02 1.914e+02 2.114e+02 3.326e+02, threshold=3.828e+02, percent-clipped=0.0 2023-06-15 13:23:15,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=150273.33333333334, ans=10.0 2023-06-15 13:23:51,955 INFO [train.py:988] (2/4) Epoch 43, batch 200, loss[loss=0.199, simple_loss=0.2996, pruned_loss=0.04921, over 15500.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2863, pruned_loss=0.06155, over 2404505.69 frames. ], batch size: 44, lr: 7.72e-03, grad_scale: 32.0 2023-06-15 13:24:05,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=150406.66666666666, ans=0.125 2023-06-15 13:24:07,573 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.28 vs. limit=12.0 2023-06-15 13:24:08,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=150473.33333333334, ans=0.2 2023-06-15 13:24:24,106 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.48 vs. 
limit=6.0 2023-06-15 13:24:45,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=150606.66666666666, ans=0.0 2023-06-15 13:24:51,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=150606.66666666666, ans=0.1 2023-06-15 13:25:03,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=150673.33333333334, ans=0.1 2023-06-15 13:25:21,386 INFO [train.py:988] (2/4) Epoch 43, batch 250, loss[loss=0.1917, simple_loss=0.2812, pruned_loss=0.05112, over 18644.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2855, pruned_loss=0.06105, over 2727102.49 frames. ], batch size: 80, lr: 7.72e-03, grad_scale: 32.0 2023-06-15 13:25:39,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=150806.66666666666, ans=0.0 2023-06-15 13:25:43,708 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-15 13:25:47,977 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.795e+02 2.042e+02 2.211e+02 3.400e+02, threshold=4.084e+02, percent-clipped=0.0 2023-06-15 13:25:50,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=150806.66666666666, ans=0.0 2023-06-15 13:26:29,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=150940.0, ans=0.125 2023-06-15 13:26:31,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=151006.66666666666, ans=0.125 2023-06-15 13:26:45,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=151006.66666666666, ans=0.125 2023-06-15 13:26:49,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=151073.33333333334, ans=0.2 2023-06-15 13:26:50,344 INFO [train.py:988] (2/4) Epoch 43, batch 300, loss[loss=0.1938, simple_loss=0.2763, pruned_loss=0.05567, over 18616.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2859, pruned_loss=0.06144, over 2961985.65 frames. ], batch size: 80, lr: 7.71e-03, grad_scale: 32.0 2023-06-15 13:26:57,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=151073.33333333334, ans=0.125 2023-06-15 13:26:59,418 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.78 vs. 
limit=15.0 2023-06-15 13:27:27,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=151206.66666666666, ans=0.1 2023-06-15 13:27:31,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=151206.66666666666, ans=0.0 2023-06-15 13:28:02,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=151340.0, ans=0.0 2023-06-15 13:28:02,977 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.14 vs. limit=15.0 2023-06-15 13:28:17,960 INFO [train.py:988] (2/4) Epoch 43, batch 350, loss[loss=0.2086, simple_loss=0.2871, pruned_loss=0.06499, over 20525.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2861, pruned_loss=0.06079, over 3151496.55 frames. ], batch size: 160, lr: 7.70e-03, grad_scale: 64.0 2023-06-15 13:28:20,883 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=12.0 2023-06-15 13:28:44,428 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.378e+02 1.733e+02 1.920e+02 2.080e+02 2.736e+02, threshold=3.841e+02, percent-clipped=0.0 2023-06-15 13:28:59,637 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 13:29:27,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=151673.33333333334, ans=0.0 2023-06-15 13:29:41,831 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.37 vs. limit=5.0 2023-06-15 13:29:44,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=151673.33333333334, ans=0.125 2023-06-15 13:29:47,164 INFO [train.py:988] (2/4) Epoch 43, batch 400, loss[loss=0.205, simple_loss=0.2931, pruned_loss=0.05843, over 19253.00 frames. ], tot_loss[loss=0.204, simple_loss=0.286, pruned_loss=0.06096, over 3294018.77 frames. ], batch size: 92, lr: 7.69e-03, grad_scale: 32.0 2023-06-15 13:30:00,325 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.99 vs. limit=12.0 2023-06-15 13:30:42,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=151940.0, ans=0.125 2023-06-15 13:30:42,979 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.57 vs. limit=15.0 2023-06-15 13:30:45,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=151940.0, ans=10.0 2023-06-15 13:31:16,498 INFO [train.py:988] (2/4) Epoch 43, batch 450, loss[loss=0.2074, simple_loss=0.2966, pruned_loss=0.05908, over 13460.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2858, pruned_loss=0.06079, over 3403997.46 frames. 
], batch size: 38, lr: 7.69e-03, grad_scale: 32.0 2023-06-15 13:31:28,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=152073.33333333334, ans=0.1 2023-06-15 13:31:43,607 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.899e+02 2.101e+02 2.447e+02 3.803e+02, threshold=4.202e+02, percent-clipped=0.0 2023-06-15 13:31:59,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=152206.66666666666, ans=0.125 2023-06-15 13:32:18,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=152273.33333333334, ans=0.1 2023-06-15 13:32:30,474 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0 2023-06-15 13:32:32,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=152340.0, ans=0.125 2023-06-15 13:32:38,742 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0 2023-06-15 13:32:42,703 INFO [train.py:988] (2/4) Epoch 43, batch 500, loss[loss=0.2198, simple_loss=0.2614, pruned_loss=0.08907, over 17055.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2864, pruned_loss=0.06091, over 3478799.57 frames. ], batch size: 392, lr: 7.68e-03, grad_scale: 32.0 2023-06-15 13:33:07,384 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.27 vs. limit=22.5 2023-06-15 13:33:15,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=152540.0, ans=0.0 2023-06-15 13:33:21,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=152540.0, ans=0.125 2023-06-15 13:33:31,902 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.49 vs. limit=22.5 2023-06-15 13:34:02,655 INFO [train.py:988] (2/4) Epoch 44, batch 0, loss[loss=0.1951, simple_loss=0.2813, pruned_loss=0.05449, over 18649.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2813, pruned_loss=0.05449, over 18649.00 frames. ], batch size: 80, lr: 7.58e-03, grad_scale: 32.0 2023-06-15 13:34:02,656 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 13:34:08,924 INFO [train.py:1020] (2/4) Epoch 44, validation: loss=0.204, simple_loss=0.3011, pruned_loss=0.05343, over 143649.00 frames. 
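Note on the [train.py:988] batch records: each shows the current batch's loss over that batch's frames and tot_loss over an accumulated frame count. Within an epoch the accumulated frame count grows sub-linearly (e.g. roughly 0.85M at batch 50 but only about 3.5M at batch 500), which is consistent with an exponentially decayed, frame-weighted running sum. The sketch below is written under that assumption only; the decay constant and class name are illustrative, not the train.py implementation.

    class RunningLoss:
        """Illustrative frame-weighted running loss with exponential decay,
        mimicking the tot_loss[...] / 'over N frames' fields in the log."""

        def __init__(self, decay: float = 1.0 - 1.0 / 200):
            self.decay = decay      # weight applied to past statistics each batch
            self.loss_sum = 0.0     # decayed sum of (loss * frames)
            self.frames = 0.0       # decayed sum of frames

        def update(self, batch_loss: float, batch_frames: float) -> None:
            self.loss_sum = self.loss_sum * self.decay + batch_loss * batch_frames
            self.frames = self.frames * self.decay + batch_frames

        @property
        def tot_loss(self) -> float:
            return self.loss_sum / max(self.frames, 1.0)

    # usage: feed per-batch (loss, frames) pairs as they appear in the log
    tracker = RunningLoss()
    tracker.update(0.2085, 18292.0)
    print(f"tot_loss={tracker.tot_loss:.4f}, over {tracker.frames:.2f} frames")

With such a scheme, tot_loss is a smoothed estimate weighted toward recent batches, while the per-batch loss[...] field is the raw value for a single batch.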
2023-06-15 13:34:08,925 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 13:34:21,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=152626.66666666666, ans=0.125 2023-06-15 13:34:59,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=152826.66666666666, ans=0.125 2023-06-15 13:35:06,268 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.836e+02 2.115e+02 2.307e+02 4.215e+02, threshold=4.230e+02, percent-clipped=1.0 2023-06-15 13:35:32,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=152893.33333333334, ans=0.125 2023-06-15 13:35:35,461 INFO [train.py:988] (2/4) Epoch 44, batch 50, loss[loss=0.2006, simple_loss=0.2868, pruned_loss=0.05715, over 19302.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2869, pruned_loss=0.06217, over 866535.67 frames. ], batch size: 98, lr: 7.58e-03, grad_scale: 32.0 2023-06-15 13:35:40,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=152960.0, ans=0.1 2023-06-15 13:35:47,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=152960.0, ans=0.0 2023-06-15 13:35:57,943 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=22.5 2023-06-15 13:36:07,859 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-15 13:36:30,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=153160.0, ans=0.0 2023-06-15 13:36:40,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=153160.0, ans=0.5 2023-06-15 13:36:42,401 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.22 vs. limit=15.0 2023-06-15 13:36:43,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=153160.0, ans=0.125 2023-06-15 13:36:51,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=153226.66666666666, ans=0.2 2023-06-15 13:37:02,513 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.36 vs. limit=22.5 2023-06-15 13:37:03,144 INFO [train.py:988] (2/4) Epoch 44, batch 100, loss[loss=0.2355, simple_loss=0.3233, pruned_loss=0.07383, over 18314.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2886, pruned_loss=0.06195, over 1521265.55 frames. 
], batch size: 72, lr: 7.57e-03, grad_scale: 32.0 2023-06-15 13:37:03,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=153293.33333333334, ans=0.0 2023-06-15 13:37:03,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=153293.33333333334, ans=0.0 2023-06-15 13:37:21,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=153360.0, ans=0.125 2023-06-15 13:37:32,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=153360.0, ans=0.2 2023-06-15 13:37:42,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=153426.66666666666, ans=0.125 2023-06-15 13:38:00,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=153493.33333333334, ans=0.125 2023-06-15 13:38:03,096 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.855e+02 2.123e+02 2.461e+02 3.591e+02, threshold=4.246e+02, percent-clipped=0.0 2023-06-15 13:38:03,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=153493.33333333334, ans=0.125 2023-06-15 13:38:23,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=153560.0, ans=0.07 2023-06-15 13:38:26,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=153560.0, ans=0.2 2023-06-15 13:38:30,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=153560.0, ans=0.125 2023-06-15 13:38:32,992 INFO [train.py:988] (2/4) Epoch 44, batch 150, loss[loss=0.1964, simple_loss=0.2851, pruned_loss=0.05383, over 18637.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2856, pruned_loss=0.06161, over 2029557.15 frames. ], batch size: 80, lr: 7.56e-03, grad_scale: 32.0 2023-06-15 13:38:37,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=153626.66666666666, ans=0.0 2023-06-15 13:38:45,893 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.98 vs. limit=15.0 2023-06-15 13:38:59,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=153693.33333333334, ans=0.125 2023-06-15 13:39:19,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=153760.0, ans=0.0 2023-06-15 13:40:01,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=153960.0, ans=0.1 2023-06-15 13:40:02,707 INFO [train.py:988] (2/4) Epoch 44, batch 200, loss[loss=0.2036, simple_loss=0.2808, pruned_loss=0.06314, over 20518.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.285, pruned_loss=0.06067, over 2423101.87 frames. 
], batch size: 173, lr: 7.56e-03, grad_scale: 32.0 2023-06-15 13:40:12,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=153960.0, ans=0.125 2023-06-15 13:40:16,671 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.04 vs. limit=15.0 2023-06-15 13:40:20,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=154026.66666666666, ans=0.125 2023-06-15 13:41:00,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=154160.0, ans=0.1 2023-06-15 13:41:02,105 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.737e+02 1.890e+02 2.056e+02 2.866e+02, threshold=3.780e+02, percent-clipped=0.0 2023-06-15 13:41:18,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=154226.66666666666, ans=0.2 2023-06-15 13:41:22,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=154226.66666666666, ans=0.0 2023-06-15 13:41:28,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=154226.66666666666, ans=0.125 2023-06-15 13:41:32,045 INFO [train.py:988] (2/4) Epoch 44, batch 250, loss[loss=0.2116, simple_loss=0.2813, pruned_loss=0.07096, over 20263.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2853, pruned_loss=0.06041, over 2717016.95 frames. ], batch size: 239, lr: 7.55e-03, grad_scale: 32.0 2023-06-15 13:41:37,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=154293.33333333334, ans=0.2 2023-06-15 13:41:46,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=154293.33333333334, ans=0.125 2023-06-15 13:42:06,833 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=4.68 vs. limit=15.0 2023-06-15 13:42:19,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=154426.66666666666, ans=0.0 2023-06-15 13:43:00,345 INFO [train.py:988] (2/4) Epoch 44, batch 300, loss[loss=0.2093, simple_loss=0.2832, pruned_loss=0.0677, over 20663.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2855, pruned_loss=0.06034, over 2946126.24 frames. ], batch size: 211, lr: 7.54e-03, grad_scale: 32.0 2023-06-15 13:43:22,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=154693.33333333334, ans=0.125 2023-06-15 13:43:36,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=154760.0, ans=0.125 2023-06-15 13:44:00,066 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.826e+02 2.061e+02 2.441e+02 3.264e+02, threshold=4.122e+02, percent-clipped=0.0 2023-06-15 13:44:06,472 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.54 vs. 
limit=15.0 2023-06-15 13:44:30,147 INFO [train.py:988] (2/4) Epoch 44, batch 350, loss[loss=0.1947, simple_loss=0.2814, pruned_loss=0.05399, over 19812.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2852, pruned_loss=0.06006, over 3137568.27 frames. ], batch size: 115, lr: 7.53e-03, grad_scale: 32.0 2023-06-15 13:44:41,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=154960.0, ans=0.1 2023-06-15 13:44:45,119 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.97 vs. limit=12.0 2023-06-15 13:45:10,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=155093.33333333334, ans=0.125 2023-06-15 13:45:15,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=155093.33333333334, ans=0.125 2023-06-15 13:45:17,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=155093.33333333334, ans=0.125 2023-06-15 13:45:24,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=155160.0, ans=0.09899494936611666 2023-06-15 13:45:32,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=155160.0, ans=0.2 2023-06-15 13:45:59,548 INFO [train.py:988] (2/4) Epoch 44, batch 400, loss[loss=0.1728, simple_loss=0.2585, pruned_loss=0.04355, over 19639.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2849, pruned_loss=0.05947, over 3280946.78 frames. ], batch size: 110, lr: 7.53e-03, grad_scale: 32.0 2023-06-15 13:46:28,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=155360.0, ans=0.1 2023-06-15 13:46:32,388 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.79 vs. limit=6.0 2023-06-15 13:46:50,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=155493.33333333334, ans=0.125 2023-06-15 13:46:57,030 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.791e+02 1.946e+02 2.236e+02 4.124e+02, threshold=3.892e+02, percent-clipped=1.0 2023-06-15 13:46:59,486 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.08 vs. limit=15.0 2023-06-15 13:47:03,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=155493.33333333334, ans=6.0 2023-06-15 13:47:26,935 INFO [train.py:988] (2/4) Epoch 44, batch 450, loss[loss=0.1886, simple_loss=0.275, pruned_loss=0.05107, over 19539.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2853, pruned_loss=0.05991, over 3388585.39 frames. 
], batch size: 102, lr: 7.52e-03, grad_scale: 32.0 2023-06-15 13:47:39,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=155626.66666666666, ans=0.125 2023-06-15 13:47:57,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=155693.33333333334, ans=0.0 2023-06-15 13:47:59,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=155693.33333333334, ans=0.125 2023-06-15 13:48:32,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=155826.66666666666, ans=0.125 2023-06-15 13:48:46,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=155893.33333333334, ans=0.125 2023-06-15 13:48:52,707 INFO [train.py:988] (2/4) Epoch 44, batch 500, loss[loss=0.2031, simple_loss=0.2816, pruned_loss=0.06231, over 20083.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.2847, pruned_loss=0.05998, over 3478519.57 frames. ], batch size: 133, lr: 7.51e-03, grad_scale: 32.0 2023-06-15 13:49:23,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=156026.66666666666, ans=0.1 2023-06-15 13:49:28,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=156093.33333333334, ans=0.125 2023-06-15 13:49:43,493 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.01 vs. limit=15.0 2023-06-15 13:50:05,168 INFO [train.py:988] (2/4) Epoch 45, batch 0, loss[loss=0.1855, simple_loss=0.2758, pruned_loss=0.04757, over 19106.00 frames. ], tot_loss[loss=0.1855, simple_loss=0.2758, pruned_loss=0.04757, over 19106.00 frames. ], batch size: 94, lr: 7.42e-03, grad_scale: 32.0 2023-06-15 13:50:05,169 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 13:50:12,412 INFO [train.py:1020] (2/4) Epoch 45, validation: loss=0.2006, simple_loss=0.2992, pruned_loss=0.05105, over 143649.00 frames. 2023-06-15 13:50:12,413 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 13:50:14,108 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.838e+02 2.044e+02 2.323e+02 3.630e+02, threshold=4.088e+02, percent-clipped=0.0 2023-06-15 13:50:26,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=156173.33333333334, ans=0.1 2023-06-15 13:50:52,641 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.62 vs. 
limit=6.0 2023-06-15 13:51:02,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=156306.66666666666, ans=0.015 2023-06-15 13:51:06,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=156373.33333333334, ans=0.0 2023-06-15 13:51:11,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=156373.33333333334, ans=0.0 2023-06-15 13:51:18,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=156373.33333333334, ans=0.025 2023-06-15 13:51:41,661 INFO [train.py:988] (2/4) Epoch 45, batch 50, loss[loss=0.1941, simple_loss=0.2834, pruned_loss=0.05237, over 19466.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2848, pruned_loss=0.05968, over 860765.87 frames. ], batch size: 105, lr: 7.41e-03, grad_scale: 32.0 2023-06-15 13:51:50,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=156506.66666666666, ans=0.125 2023-06-15 13:51:55,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=156506.66666666666, ans=0.0 2023-06-15 13:52:15,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=156640.0, ans=0.125 2023-06-15 13:52:24,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=156640.0, ans=0.125 2023-06-15 13:52:24,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=156640.0, ans=0.1 2023-06-15 13:52:32,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=156706.66666666666, ans=0.125 2023-06-15 13:52:51,857 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.15 vs. limit=15.0 2023-06-15 13:53:06,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=156773.33333333334, ans=0.0 2023-06-15 13:53:10,647 INFO [train.py:988] (2/4) Epoch 45, batch 100, loss[loss=0.1874, simple_loss=0.2742, pruned_loss=0.05031, over 19490.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2842, pruned_loss=0.05888, over 1514492.16 frames. ], batch size: 102, lr: 7.41e-03, grad_scale: 32.0 2023-06-15 13:53:12,189 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.891e+02 2.087e+02 2.341e+02 3.228e+02, threshold=4.175e+02, percent-clipped=0.0 2023-06-15 13:53:44,356 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.62 vs. 
limit=12.0 2023-06-15 13:53:49,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=156973.33333333334, ans=0.125 2023-06-15 13:54:05,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=157040.0, ans=0.125 2023-06-15 13:54:05,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=157040.0, ans=0.125 2023-06-15 13:54:38,935 INFO [train.py:988] (2/4) Epoch 45, batch 150, loss[loss=0.1901, simple_loss=0.2814, pruned_loss=0.04945, over 19072.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2841, pruned_loss=0.05824, over 2020271.51 frames. ], batch size: 89, lr: 7.40e-03, grad_scale: 32.0 2023-06-15 13:54:46,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=157173.33333333334, ans=0.125 2023-06-15 13:55:14,445 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.65 vs. limit=15.0 2023-06-15 13:55:39,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=157373.33333333334, ans=0.2 2023-06-15 13:55:48,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=157440.0, ans=0.0 2023-06-15 13:56:00,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=157440.0, ans=0.125 2023-06-15 13:56:07,427 INFO [train.py:988] (2/4) Epoch 45, batch 200, loss[loss=0.2181, simple_loss=0.3162, pruned_loss=0.06004, over 17609.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2847, pruned_loss=0.05892, over 2416810.09 frames. ], batch size: 67, lr: 7.39e-03, grad_scale: 32.0 2023-06-15 13:56:09,111 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.822e+02 1.971e+02 2.216e+02 3.677e+02, threshold=3.943e+02, percent-clipped=0.0 2023-06-15 13:56:19,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=157506.66666666666, ans=0.125 2023-06-15 13:56:53,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=157640.0, ans=0.1 2023-06-15 13:56:53,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=157640.0, ans=0.125 2023-06-15 13:57:09,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=157706.66666666666, ans=0.0 2023-06-15 13:57:27,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=157773.33333333334, ans=0.125 2023-06-15 13:57:35,488 INFO [train.py:988] (2/4) Epoch 45, batch 250, loss[loss=0.1932, simple_loss=0.2828, pruned_loss=0.05185, over 19098.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2845, pruned_loss=0.05936, over 2713592.36 frames. 
], batch size: 94, lr: 7.39e-03, grad_scale: 32.0 2023-06-15 13:57:36,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=157840.0, ans=0.0 2023-06-15 13:57:48,217 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.00 vs. limit=22.5 2023-06-15 13:57:55,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=157906.66666666666, ans=0.0 2023-06-15 13:58:08,112 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.02 vs. limit=12.0 2023-06-15 13:58:30,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=158040.0, ans=0.07 2023-06-15 13:58:52,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=158106.66666666666, ans=0.125 2023-06-15 13:59:02,823 INFO [train.py:988] (2/4) Epoch 45, batch 300, loss[loss=0.188, simple_loss=0.2768, pruned_loss=0.04966, over 19477.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2845, pruned_loss=0.05954, over 2950380.08 frames. ], batch size: 105, lr: 7.38e-03, grad_scale: 32.0 2023-06-15 13:59:04,375 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.818e+02 2.016e+02 2.367e+02 3.194e+02, threshold=4.032e+02, percent-clipped=0.0 2023-06-15 13:59:47,995 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-06-15 13:59:50,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=158306.66666666666, ans=0.5 2023-06-15 14:00:31,875 INFO [train.py:988] (2/4) Epoch 45, batch 350, loss[loss=0.2031, simple_loss=0.2814, pruned_loss=0.06245, over 19964.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2841, pruned_loss=0.05939, over 3140090.18 frames. 
], batch size: 126, lr: 7.37e-03, grad_scale: 32.0 2023-06-15 14:00:32,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=158506.66666666666, ans=0.125 2023-06-15 14:00:44,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=158506.66666666666, ans=0.125 2023-06-15 14:00:48,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=158573.33333333334, ans=0.125 2023-06-15 14:00:52,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=158573.33333333334, ans=0.2 2023-06-15 14:01:15,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=158640.0, ans=0.125 2023-06-15 14:01:40,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=158706.66666666666, ans=0.0 2023-06-15 14:01:49,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=158773.33333333334, ans=0.125 2023-06-15 14:01:49,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=158773.33333333334, ans=0.125 2023-06-15 14:02:00,909 INFO [train.py:988] (2/4) Epoch 45, batch 400, loss[loss=0.2117, simple_loss=0.2922, pruned_loss=0.06563, over 17002.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2839, pruned_loss=0.05943, over 3282251.03 frames. ], batch size: 60, lr: 7.36e-03, grad_scale: 32.0 2023-06-15 14:02:02,524 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.377e+02 1.797e+02 1.967e+02 2.284e+02 3.128e+02, threshold=3.934e+02, percent-clipped=0.0 2023-06-15 14:02:22,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=158906.66666666666, ans=0.0 2023-06-15 14:02:46,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=158973.33333333334, ans=0.125 2023-06-15 14:03:20,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=159106.66666666666, ans=0.0 2023-06-15 14:03:28,296 INFO [train.py:988] (2/4) Epoch 45, batch 450, loss[loss=0.2213, simple_loss=0.315, pruned_loss=0.06375, over 18344.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2845, pruned_loss=0.05962, over 3394939.81 frames. ], batch size: 72, lr: 7.36e-03, grad_scale: 16.0 2023-06-15 14:03:28,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=159173.33333333334, ans=0.0 2023-06-15 14:03:32,996 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.38 vs. 
limit=22.5 2023-06-15 14:04:04,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=159306.66666666666, ans=0.125 2023-06-15 14:04:46,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=159440.0, ans=0.125 2023-06-15 14:04:52,861 INFO [train.py:988] (2/4) Epoch 45, batch 500, loss[loss=0.2025, simple_loss=0.2836, pruned_loss=0.06071, over 20248.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2844, pruned_loss=0.05991, over 3490879.46 frames. ], batch size: 141, lr: 7.35e-03, grad_scale: 16.0 2023-06-15 14:04:56,042 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.833e+02 2.042e+02 2.435e+02 3.752e+02, threshold=4.085e+02, percent-clipped=0.0 2023-06-15 14:05:34,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=159640.0, ans=0.1 2023-06-15 14:06:05,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=159726.66666666666, ans=0.0 2023-06-15 14:06:16,352 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.20 vs. limit=15.0 2023-06-15 14:06:17,163 INFO [train.py:988] (2/4) Epoch 46, batch 0, loss[loss=0.1803, simple_loss=0.2689, pruned_loss=0.04589, over 19548.00 frames. ], tot_loss[loss=0.1803, simple_loss=0.2689, pruned_loss=0.04589, over 19548.00 frames. ], batch size: 102, lr: 7.27e-03, grad_scale: 32.0 2023-06-15 14:06:17,163 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 14:06:23,568 INFO [train.py:1020] (2/4) Epoch 46, validation: loss=0.2018, simple_loss=0.3001, pruned_loss=0.05177, over 143649.00 frames. 2023-06-15 14:06:23,569 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 14:06:35,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=159726.66666666666, ans=0.125 2023-06-15 14:06:40,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=159793.33333333334, ans=0.1 2023-06-15 14:06:45,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=159793.33333333334, ans=0.0 2023-06-15 14:06:47,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=159793.33333333334, ans=0.125 2023-06-15 14:07:39,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=159993.33333333334, ans=0.0 2023-06-15 14:07:53,068 INFO [train.py:988] (2/4) Epoch 46, batch 50, loss[loss=0.1908, simple_loss=0.2667, pruned_loss=0.05742, over 20131.00 frames. ], tot_loss[loss=0.1982, simple_loss=0.2827, pruned_loss=0.05682, over 836495.31 frames. ], batch size: 239, lr: 7.26e-03, grad_scale: 32.0 2023-06-15 14:08:25,947 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.799e+02 2.021e+02 2.409e+02 3.297e+02, threshold=4.042e+02, percent-clipped=0.0 2023-06-15 14:09:01,693 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.83 vs. 
limit=6.0 2023-06-15 14:09:12,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=160326.66666666666, ans=0.125 2023-06-15 14:09:19,624 INFO [train.py:988] (2/4) Epoch 46, batch 100, loss[loss=0.1976, simple_loss=0.2784, pruned_loss=0.0584, over 20302.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2823, pruned_loss=0.05897, over 1497613.79 frames. ], batch size: 149, lr: 7.25e-03, grad_scale: 16.0 2023-06-15 14:09:40,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=160460.0, ans=0.125 2023-06-15 14:10:06,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=160526.66666666666, ans=0.0 2023-06-15 14:10:19,132 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.48 vs. limit=15.0 2023-06-15 14:10:21,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=160593.33333333334, ans=0.1 2023-06-15 14:10:45,280 INFO [train.py:988] (2/4) Epoch 46, batch 150, loss[loss=0.1974, simple_loss=0.2449, pruned_loss=0.07499, over 16988.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2831, pruned_loss=0.05912, over 2003538.79 frames. ], batch size: 391, lr: 7.24e-03, grad_scale: 16.0 2023-06-15 14:10:52,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=160726.66666666666, ans=0.125 2023-06-15 14:11:03,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=160793.33333333334, ans=0.125 2023-06-15 14:11:19,617 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.724e+02 1.880e+02 2.096e+02 2.711e+02, threshold=3.761e+02, percent-clipped=0.0 2023-06-15 14:11:39,819 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.84 vs. limit=12.0 2023-06-15 14:12:11,680 INFO [train.py:988] (2/4) Epoch 46, batch 200, loss[loss=0.1985, simple_loss=0.2854, pruned_loss=0.05584, over 19075.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2826, pruned_loss=0.05906, over 2411806.79 frames. ], batch size: 89, lr: 7.24e-03, grad_scale: 16.0 2023-06-15 14:12:12,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=161060.0, ans=0.1 2023-06-15 14:12:26,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=161060.0, ans=0.125 2023-06-15 14:12:38,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=161126.66666666666, ans=0.125 2023-06-15 14:12:59,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=161193.33333333334, ans=0.2 2023-06-15 14:13:32,545 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.25 vs. 
limit=15.0 2023-06-15 14:13:33,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=161326.66666666666, ans=0.2 2023-06-15 14:13:40,034 INFO [train.py:988] (2/4) Epoch 46, batch 250, loss[loss=0.1924, simple_loss=0.2724, pruned_loss=0.05623, over 20299.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2825, pruned_loss=0.05962, over 2723088.31 frames. ], batch size: 141, lr: 7.23e-03, grad_scale: 16.0 2023-06-15 14:13:46,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=161393.33333333334, ans=0.125 2023-06-15 14:14:15,414 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.802e+02 2.040e+02 2.477e+02 3.551e+02, threshold=4.080e+02, percent-clipped=0.0 2023-06-15 14:14:39,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=161593.33333333334, ans=0.125 2023-06-15 14:14:57,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=161660.0, ans=0.0 2023-06-15 14:15:07,647 INFO [train.py:988] (2/4) Epoch 46, batch 300, loss[loss=0.2095, simple_loss=0.2847, pruned_loss=0.06718, over 20312.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2837, pruned_loss=0.05976, over 2959404.16 frames. ], batch size: 149, lr: 7.22e-03, grad_scale: 16.0 2023-06-15 14:15:07,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=161726.66666666666, ans=0.125 2023-06-15 14:15:08,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=161726.66666666666, ans=0.95 2023-06-15 14:15:25,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=161793.33333333334, ans=0.2 2023-06-15 14:15:31,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=161793.33333333334, ans=0.125 2023-06-15 14:15:38,127 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.63 vs. limit=15.0 2023-06-15 14:15:40,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=161860.0, ans=0.0 2023-06-15 14:16:02,103 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.41 vs. limit=15.0 2023-06-15 14:16:19,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=161993.33333333334, ans=0.125 2023-06-15 14:16:35,656 INFO [train.py:988] (2/4) Epoch 46, batch 350, loss[loss=0.1798, simple_loss=0.2648, pruned_loss=0.04741, over 19892.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.284, pruned_loss=0.05948, over 3131915.28 frames. 
], batch size: 120, lr: 7.22e-03, grad_scale: 16.0 2023-06-15 14:16:35,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=162060.0, ans=0.125 2023-06-15 14:16:45,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=162060.0, ans=0.125 2023-06-15 14:17:10,331 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.494e+02 1.820e+02 1.979e+02 2.259e+02 3.140e+02, threshold=3.959e+02, percent-clipped=0.0 2023-06-15 14:17:14,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=162193.33333333334, ans=0.0 2023-06-15 14:18:04,275 INFO [train.py:988] (2/4) Epoch 46, batch 400, loss[loss=0.2285, simple_loss=0.3131, pruned_loss=0.07191, over 18276.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2842, pruned_loss=0.05964, over 3278565.98 frames. ], batch size: 74, lr: 7.21e-03, grad_scale: 32.0 2023-06-15 14:18:41,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=162526.66666666666, ans=0.0 2023-06-15 14:19:07,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=162593.33333333334, ans=0.0 2023-06-15 14:19:25,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=162660.0, ans=0.05 2023-06-15 14:19:33,234 INFO [train.py:988] (2/4) Epoch 46, batch 450, loss[loss=0.2236, simple_loss=0.3136, pruned_loss=0.06677, over 18320.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2837, pruned_loss=0.05906, over 3411872.74 frames. ], batch size: 72, lr: 7.20e-03, grad_scale: 32.0 2023-06-15 14:19:47,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=162726.66666666666, ans=0.0 2023-06-15 14:19:47,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=162726.66666666666, ans=0.125 2023-06-15 14:20:07,070 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.857e+02 2.090e+02 2.387e+02 3.299e+02, threshold=4.180e+02, percent-clipped=0.0 2023-06-15 14:20:44,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=162993.33333333334, ans=0.125 2023-06-15 14:20:57,270 INFO [train.py:988] (2/4) Epoch 46, batch 500, loss[loss=0.1969, simple_loss=0.2723, pruned_loss=0.0607, over 20709.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2839, pruned_loss=0.05877, over 3502644.47 frames. ], batch size: 211, lr: 7.20e-03, grad_scale: 32.0 2023-06-15 14:21:07,756 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.47 vs. limit=15.0 2023-06-15 14:22:12,874 INFO [train.py:988] (2/4) Epoch 47, batch 0, loss[loss=0.2098, simple_loss=0.3019, pruned_loss=0.05887, over 19792.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.3019, pruned_loss=0.05887, over 19792.00 frames. ], batch size: 115, lr: 7.11e-03, grad_scale: 32.0 2023-06-15 14:22:12,875 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 14:22:19,327 INFO [train.py:1020] (2/4) Epoch 47, validation: loss=0.2046, simple_loss=0.3006, pruned_loss=0.05427, over 143649.00 frames. 
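The recurring optim.py:471 lines appear to report the spread of recently observed gradient norms (five quartile points from minimum to maximum) plus the clipping threshold in force; in every such entry here the threshold equals Clipping_scale times the logged median (in the entry above, 2.0 * 2.090e+02 = 4.180e+02), and percent-clipped gives how often the current norm exceeded it. The snippet below is a simplified illustration of that behaviour built from standard PyTorch utilities; the function name and interface are made up for the example and are not the optimizer code used in this run.

    # Hedged sketch: clip gradients at clipping_scale times the median of recently
    # observed gradient norms, and expose the same quartile/threshold numbers that
    # these log lines print.
    import torch

    def clip_to_scaled_median(parameters, recent_norms, clipping_scale=2.0):
        norms = torch.tensor(recent_norms)
        quartiles = torch.quantile(norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * quartiles[2].item()  # 2x the median norm
        total_norm = torch.nn.utils.clip_grad_norm_(parameters, max_norm=threshold)
        return quartiles, threshold, float(total_norm) > threshold

    # Example with a single dummy parameter and a made-up norm history:
    p = torch.nn.Parameter(torch.randn(10))
    p.grad = torch.randn(10)
    print(clip_to_scaled_median([p], recent_norms=[140.0, 180.0, 209.0, 230.0, 330.0]))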
2023-06-15 14:22:19,328 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 14:22:24,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=163280.0, ans=0.1 2023-06-15 14:22:27,477 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.00 vs. limit=22.5 2023-06-15 14:22:35,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=163346.66666666666, ans=0.0 2023-06-15 14:23:20,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=163480.0, ans=0.125 2023-06-15 14:23:23,393 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.765e+02 1.978e+02 2.218e+02 3.606e+02, threshold=3.956e+02, percent-clipped=0.0 2023-06-15 14:23:46,076 INFO [train.py:988] (2/4) Epoch 47, batch 50, loss[loss=0.2178, simple_loss=0.2665, pruned_loss=0.08458, over 17014.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2848, pruned_loss=0.0615, over 839983.27 frames. ], batch size: 392, lr: 7.11e-03, grad_scale: 32.0 2023-06-15 14:23:51,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=163613.33333333334, ans=0.125 2023-06-15 14:24:23,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=163746.66666666666, ans=0.0 2023-06-15 14:24:27,152 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.63 vs. limit=6.0 2023-06-15 14:24:37,675 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.79 vs. limit=15.0 2023-06-15 14:24:45,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=163813.33333333334, ans=0.125 2023-06-15 14:24:53,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=163813.33333333334, ans=0.125 2023-06-15 14:25:13,278 INFO [train.py:988] (2/4) Epoch 47, batch 100, loss[loss=0.2225, simple_loss=0.2989, pruned_loss=0.07304, over 19983.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2834, pruned_loss=0.05945, over 1488427.26 frames. ], batch size: 126, lr: 7.10e-03, grad_scale: 32.0 2023-06-15 14:25:26,734 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.26 vs. 
limit=15.0 2023-06-15 14:25:59,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=164080.0, ans=0.0 2023-06-15 14:26:05,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=164146.66666666666, ans=0.125 2023-06-15 14:26:12,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=164146.66666666666, ans=0.0 2023-06-15 14:26:16,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=164146.66666666666, ans=0.125 2023-06-15 14:26:17,535 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.736e+02 1.928e+02 2.284e+02 3.416e+02, threshold=3.856e+02, percent-clipped=0.0 2023-06-15 14:26:24,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=164213.33333333334, ans=0.2 2023-06-15 14:26:39,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=164280.0, ans=0.125 2023-06-15 14:26:40,748 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.39 vs. limit=22.5 2023-06-15 14:26:41,347 INFO [train.py:988] (2/4) Epoch 47, batch 150, loss[loss=0.2269, simple_loss=0.3203, pruned_loss=0.06675, over 16393.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2831, pruned_loss=0.05848, over 2015294.29 frames. ], batch size: 52, lr: 7.10e-03, grad_scale: 32.0 2023-06-15 14:26:45,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=164280.0, ans=0.2 2023-06-15 14:28:08,852 INFO [train.py:988] (2/4) Epoch 47, batch 200, loss[loss=0.1856, simple_loss=0.2649, pruned_loss=0.05312, over 20570.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2825, pruned_loss=0.05836, over 2411082.66 frames. ], batch size: 189, lr: 7.09e-03, grad_scale: 32.0 2023-06-15 14:28:51,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=164746.66666666666, ans=0.125 2023-06-15 14:29:14,107 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.808e+02 2.037e+02 2.330e+02 4.045e+02, threshold=4.073e+02, percent-clipped=1.0 2023-06-15 14:29:17,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=164880.0, ans=0.1 2023-06-15 14:29:36,286 INFO [train.py:988] (2/4) Epoch 47, batch 250, loss[loss=0.1831, simple_loss=0.2709, pruned_loss=0.04764, over 19695.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2817, pruned_loss=0.05783, over 2715565.34 frames. ], batch size: 110, lr: 7.08e-03, grad_scale: 32.0 2023-06-15 14:29:37,353 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.33 vs. 
limit=15.0 2023-06-15 14:29:46,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=164946.66666666666, ans=0.125 2023-06-15 14:29:48,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=164946.66666666666, ans=0.0 2023-06-15 14:29:50,186 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.68 vs. limit=15.0 2023-06-15 14:29:53,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=165013.33333333334, ans=0.1 2023-06-15 14:30:20,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=165080.0, ans=0.125 2023-06-15 14:30:31,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=165146.66666666666, ans=0.2 2023-06-15 14:30:45,015 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.50 vs. limit=22.5 2023-06-15 14:30:45,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=165213.33333333334, ans=0.125 2023-06-15 14:30:59,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=165213.33333333334, ans=0.125 2023-06-15 14:31:04,899 INFO [train.py:988] (2/4) Epoch 47, batch 300, loss[loss=0.1911, simple_loss=0.2726, pruned_loss=0.05478, over 19209.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2818, pruned_loss=0.058, over 2970676.25 frames. ], batch size: 92, lr: 7.08e-03, grad_scale: 32.0 2023-06-15 14:31:16,010 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 14:31:23,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=165346.66666666666, ans=0.125 2023-06-15 14:31:25,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=165346.66666666666, ans=0.2 2023-06-15 14:31:43,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=165413.33333333334, ans=0.1 2023-06-15 14:32:11,335 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.863e+02 2.079e+02 2.322e+02 4.081e+02, threshold=4.157e+02, percent-clipped=1.0 2023-06-15 14:32:25,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=165546.66666666666, ans=0.125 2023-06-15 14:32:33,703 INFO [train.py:988] (2/4) Epoch 47, batch 350, loss[loss=0.1929, simple_loss=0.2742, pruned_loss=0.05583, over 19076.00 frames. ], tot_loss[loss=0.1991, simple_loss=0.282, pruned_loss=0.05811, over 3145142.62 frames. ], batch size: 89, lr: 7.07e-03, grad_scale: 32.0 2023-06-15 14:33:15,893 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.25 vs. 
limit=15.0 2023-06-15 14:33:31,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=165813.33333333334, ans=0.1 2023-06-15 14:33:38,707 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.61 vs. limit=6.0 2023-06-15 14:33:39,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=165813.33333333334, ans=0.1 2023-06-15 14:34:00,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=165880.0, ans=0.125 2023-06-15 14:34:01,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=165946.66666666666, ans=0.125 2023-06-15 14:34:03,143 INFO [train.py:988] (2/4) Epoch 47, batch 400, loss[loss=0.1758, simple_loss=0.2692, pruned_loss=0.04126, over 19836.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2818, pruned_loss=0.05836, over 3292482.97 frames. ], batch size: 115, lr: 7.06e-03, grad_scale: 32.0 2023-06-15 14:34:12,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=165946.66666666666, ans=0.125 2023-06-15 14:34:12,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=165946.66666666666, ans=0.125 2023-06-15 14:34:28,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=166013.33333333334, ans=0.0 2023-06-15 14:34:30,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=166013.33333333334, ans=0.0 2023-06-15 14:35:08,759 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.617e+02 1.781e+02 2.023e+02 2.326e+02 3.205e+02, threshold=4.047e+02, percent-clipped=0.0 2023-06-15 14:35:22,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=166213.33333333334, ans=0.1 2023-06-15 14:35:32,252 INFO [train.py:988] (2/4) Epoch 47, batch 450, loss[loss=0.2012, simple_loss=0.2793, pruned_loss=0.06152, over 20586.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2822, pruned_loss=0.059, over 3391731.47 frames. ], batch size: 189, lr: 7.06e-03, grad_scale: 32.0 2023-06-15 14:35:34,992 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.52 vs. limit=15.0 2023-06-15 14:36:08,617 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.55 vs. 
limit=22.5 2023-06-15 14:36:40,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=166546.66666666666, ans=0.125 2023-06-15 14:36:46,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=166546.66666666666, ans=0.0 2023-06-15 14:36:47,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=166546.66666666666, ans=0.125 2023-06-15 14:36:49,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=166546.66666666666, ans=0.125 2023-06-15 14:36:53,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=166546.66666666666, ans=0.0 2023-06-15 14:36:56,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=166613.33333333334, ans=0.0 2023-06-15 14:36:58,243 INFO [train.py:988] (2/4) Epoch 47, batch 500, loss[loss=0.1969, simple_loss=0.2803, pruned_loss=0.05675, over 20294.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2822, pruned_loss=0.0589, over 3492617.99 frames. ], batch size: 141, lr: 7.05e-03, grad_scale: 32.0 2023-06-15 14:38:08,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=166826.66666666666, ans=0.125 2023-06-15 14:38:12,616 INFO [train.py:988] (2/4) Epoch 48, batch 0, loss[loss=0.1903, simple_loss=0.2795, pruned_loss=0.05058, over 19707.00 frames. ], tot_loss[loss=0.1903, simple_loss=0.2795, pruned_loss=0.05058, over 19707.00 frames. ], batch size: 110, lr: 6.97e-03, grad_scale: 32.0 2023-06-15 14:38:12,616 INFO [train.py:1011] (2/4) Computing validation loss 2023-06-15 14:38:18,657 INFO [train.py:1020] (2/4) Epoch 48, validation: loss=0.1998, simple_loss=0.298, pruned_loss=0.05082, over 143649.00 frames. 2023-06-15 14:38:18,658 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB 2023-06-15 14:38:21,386 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.30 vs. limit=6.0 2023-06-15 14:38:26,896 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.746e+02 1.946e+02 2.285e+02 3.541e+02, threshold=3.892e+02, percent-clipped=0.0 2023-06-15 14:38:48,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=166893.33333333334, ans=0.0 2023-06-15 14:39:08,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=166960.0, ans=0.125 2023-06-15 14:39:47,298 INFO [train.py:988] (2/4) Epoch 48, batch 50, loss[loss=0.1833, simple_loss=0.2732, pruned_loss=0.04669, over 18948.00 frames. ], tot_loss[loss=0.1969, simple_loss=0.2796, pruned_loss=0.05709, over 854403.50 frames. 
], batch size: 86, lr: 6.96e-03, grad_scale: 32.0
2023-06-15 14:39:58,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=167160.0, ans=0.125
2023-06-15 14:40:28,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=167293.33333333334, ans=0.0
2023-06-15 14:40:53,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=167360.0, ans=0.125
2023-06-15 14:41:01,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=167426.66666666666, ans=0.0
2023-06-15 14:41:15,611 INFO [train.py:988] (2/4) Epoch 48, batch 100, loss[loss=0.2014, simple_loss=0.2768, pruned_loss=0.06301, over 20602.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2822, pruned_loss=0.05723, over 1504622.72 frames. ], batch size: 189, lr: 6.96e-03, grad_scale: 32.0
2023-06-15 14:41:25,084 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.839e+02 2.012e+02 2.249e+02 3.194e+02, threshold=4.023e+02, percent-clipped=0.0
2023-06-15 14:41:38,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=167560.0, ans=0.0
2023-06-15 14:41:44,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=167560.0, ans=0.1
2023-06-15 14:42:42,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=167826.66666666666, ans=0.125
2023-06-15 14:42:43,942 INFO [train.py:988] (2/4) Epoch 48, batch 150, loss[loss=0.184, simple_loss=0.267, pruned_loss=0.05055, over 19094.00 frames. ], tot_loss[loss=0.199, simple_loss=0.2826, pruned_loss=0.05773, over 1991024.11 frames. ], batch size: 89, lr: 6.95e-03, grad_scale: 32.0
2023-06-15 14:42:55,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=167826.66666666666, ans=0.125
2023-06-15 14:42:59,506 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.32 vs. limit=22.5
2023-06-15 14:43:22,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=167960.0, ans=0.125
2023-06-15 14:43:28,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=167960.0, ans=0.125
2023-06-15 14:43:34,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=167960.0, ans=0.125
2023-06-15 14:43:41,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=168026.66666666666, ans=0.1
2023-06-15 14:43:52,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=168026.66666666666, ans=0.125
2023-06-15 14:44:12,985 INFO [train.py:988] (2/4) Epoch 48, batch 200, loss[loss=0.1703, simple_loss=0.2621, pruned_loss=0.03925, over 19436.00 frames. ], tot_loss[loss=0.1979, simple_loss=0.2811, pruned_loss=0.05732, over 2401588.74 frames. ], batch size: 105, lr: 6.95e-03, grad_scale: 32.0
2023-06-15 14:44:21,894 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.752e+02 1.958e+02 2.186e+02 2.989e+02, threshold=3.915e+02, percent-clipped=0.0
2023-06-15 14:44:33,505 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.40 vs. limit=15.0
2023-06-15 14:44:40,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=168226.66666666666, ans=0.125
2023-06-15 14:45:04,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=168360.0, ans=0.125
2023-06-15 14:45:08,768 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.10 vs. limit=12.0
2023-06-15 14:45:41,719 INFO [train.py:988] (2/4) Epoch 48, batch 250, loss[loss=0.2065, simple_loss=0.2752, pruned_loss=0.06889, over 20228.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2814, pruned_loss=0.05688, over 2719743.84 frames. ], batch size: 239, lr: 6.94e-03, grad_scale: 32.0
2023-06-15 14:45:53,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=168493.33333333334, ans=0.125
2023-06-15 14:46:09,143 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.06 vs. limit=22.5
2023-06-15 14:46:17,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=168626.66666666666, ans=0.125
2023-06-15 14:46:17,746 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-06-15 14:47:10,242 INFO [train.py:988] (2/4) Epoch 48, batch 300, loss[loss=0.2043, simple_loss=0.2873, pruned_loss=0.06072, over 19830.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2819, pruned_loss=0.05734, over 2969144.14 frames. ], batch size: 115, lr: 6.93e-03, grad_scale: 32.0
2023-06-15 14:47:19,384 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.766e+02 2.086e+02 2.478e+02 4.078e+02, threshold=4.173e+02, percent-clipped=1.0
2023-06-15 14:47:23,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=168826.66666666666, ans=0.125
2023-06-15 14:47:54,020 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.87 vs. limit=10.0
2023-06-15 14:47:58,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=168960.0, ans=0.2
2023-06-15 14:48:05,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=169026.66666666666, ans=0.125
2023-06-15 14:48:21,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=169093.33333333334, ans=0.2
2023-06-15 14:48:38,575 INFO [train.py:988] (2/4) Epoch 48, batch 350, loss[loss=0.2006, simple_loss=0.2901, pruned_loss=0.05553, over 19449.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2817, pruned_loss=0.05745, over 3155947.29 frames. ], batch size: 105, lr: 6.93e-03, grad_scale: 32.0
2023-06-15 14:48:52,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=169160.0, ans=0.125
2023-06-15 14:49:53,946 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-15 14:50:05,733 INFO [train.py:988] (2/4) Epoch 48, batch 400, loss[loss=0.1848, simple_loss=0.2756, pruned_loss=0.04704, over 19212.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2824, pruned_loss=0.05741, over 3287543.70 frames. ], batch size: 92, lr: 6.92e-03, grad_scale: 32.0
2023-06-15 14:50:13,967 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.792e+02 1.982e+02 2.238e+02 3.664e+02, threshold=3.964e+02, percent-clipped=0.0
2023-06-15 14:50:34,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=169560.0, ans=0.0
2023-06-15 14:51:09,299 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.23 vs. limit=15.0
2023-06-15 14:51:12,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=169693.33333333334, ans=0.125
2023-06-15 14:51:32,759 INFO [train.py:988] (2/4) Epoch 48, batch 450, loss[loss=0.1972, simple_loss=0.2841, pruned_loss=0.05512, over 19051.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.2817, pruned_loss=0.05721, over 3414639.15 frames. ], batch size: 89, lr: 6.91e-03, grad_scale: 32.0
2023-06-15 14:51:33,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=169826.66666666666, ans=0.125
2023-06-15 14:51:55,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=169893.33333333334, ans=0.125
2023-06-15 14:52:20,531 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.60 vs. limit=22.5
2023-06-15 14:52:26,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=170026.66666666666, ans=0.0
2023-06-15 14:52:36,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=170026.66666666666, ans=0.125
2023-06-15 14:52:37,029 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.59 vs. limit=22.5
2023-06-15 14:52:41,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=170093.33333333334, ans=0.125
2023-06-15 14:52:56,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=170160.0, ans=0.0
2023-06-15 14:52:57,834 INFO [train.py:988] (2/4) Epoch 48, batch 500, loss[loss=0.1851, simple_loss=0.2616, pruned_loss=0.05426, over 20675.00 frames. ], tot_loss[loss=0.1977, simple_loss=0.2816, pruned_loss=0.05695, over 3501157.75 frames. ], batch size: 211, lr: 6.91e-03, grad_scale: 32.0
2023-06-15 14:53:06,273 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.838e+02 2.045e+02 2.451e+02 3.381e+02, threshold=4.090e+02, percent-clipped=0.0
2023-06-15 14:53:07,289 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.96 vs. limit=22.5
2023-06-15 14:53:18,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=170226.66666666666, ans=0.1
2023-06-15 14:53:26,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=170226.66666666666, ans=0.125
2023-06-15 14:53:42,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=170293.33333333334, ans=0.0
2023-06-15 14:54:12,517 INFO [train.py:988] (2/4) Epoch 49, batch 0, loss[loss=0.1873, simple_loss=0.2659, pruned_loss=0.05434, over 20543.00 frames. ], tot_loss[loss=0.1873, simple_loss=0.2659, pruned_loss=0.05434, over 20543.00 frames. ], batch size: 189, lr: 6.83e-03, grad_scale: 32.0
2023-06-15 14:54:12,517 INFO [train.py:1011] (2/4) Computing validation loss
2023-06-15 14:54:19,025 INFO [train.py:1020] (2/4) Epoch 49, validation: loss=0.2025, simple_loss=0.2999, pruned_loss=0.05253, over 143649.00 frames.
2023-06-15 14:54:19,026 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB
2023-06-15 14:54:34,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=170373.33333333334, ans=0.2
2023-06-15 14:54:42,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=170440.0, ans=0.125
2023-06-15 14:54:49,980 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.70 vs. limit=15.0
2023-06-15 14:55:14,089 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.89 vs. limit=15.0
2023-06-15 14:55:17,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=170573.33333333334, ans=0.2
2023-06-15 14:55:28,441 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.13 vs. limit=22.5
2023-06-15 14:55:44,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=170640.0, ans=0.0
2023-06-15 14:55:47,876 INFO [train.py:988] (2/4) Epoch 49, batch 50, loss[loss=0.1907, simple_loss=0.2829, pruned_loss=0.04923, over 18312.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2778, pruned_loss=0.05823, over 869648.83 frames. ], batch size: 74, lr: 6.83e-03, grad_scale: 32.0
2023-06-15 14:56:31,046 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.705e+02 1.886e+02 2.163e+02 3.210e+02, threshold=3.772e+02, percent-clipped=0.0
2023-06-15 14:56:58,646 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.18 vs. limit=22.5
2023-06-15 14:57:16,262 INFO [train.py:988] (2/4) Epoch 49, batch 100, loss[loss=0.2111, simple_loss=0.2984, pruned_loss=0.06194, over 17036.00 frames. ], tot_loss[loss=0.1982, simple_loss=0.2809, pruned_loss=0.05778, over 1509065.61 frames. ], batch size: 60, lr: 6.82e-03, grad_scale: 32.0
2023-06-15 14:57:33,627 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.78 vs. limit=22.5
2023-06-15 14:57:34,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=171106.66666666666, ans=0.1
2023-06-15 14:57:41,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=171106.66666666666, ans=0.0
2023-06-15 14:58:31,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=171306.66666666666, ans=0.125
2023-06-15 14:58:44,439 INFO [train.py:988] (2/4) Epoch 49, batch 150, loss[loss=0.2065, simple_loss=0.2956, pruned_loss=0.05869, over 14853.00 frames. ], tot_loss[loss=0.197, simple_loss=0.2797, pruned_loss=0.05711, over 2001804.40 frames. ], batch size: 42, lr: 6.81e-03, grad_scale: 32.0
2023-06-15 14:58:48,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=171373.33333333334, ans=0.0
2023-06-15 14:58:51,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=171373.33333333334, ans=0.0
2023-06-15 14:59:14,898 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.06 vs. limit=22.5
2023-06-15 14:59:27,442 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.782e+02 1.923e+02 2.190e+02 3.146e+02, threshold=3.845e+02, percent-clipped=0.0
2023-06-15 14:59:29,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=171506.66666666666, ans=0.0
2023-06-15 15:00:13,681 INFO [train.py:988] (2/4) Epoch 49, batch 200, loss[loss=0.1855, simple_loss=0.2706, pruned_loss=0.05015, over 19725.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.2791, pruned_loss=0.05723, over 2397361.48 frames. ], batch size: 110, lr: 6.81e-03, grad_scale: 32.0
2023-06-15 15:00:23,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=171706.66666666666, ans=0.2
2023-06-15 15:00:25,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=171706.66666666666, ans=0.0
2023-06-15 15:00:37,385 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.04 vs. limit=22.5
2023-06-15 15:01:04,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=171906.66666666666, ans=0.125
2023-06-15 15:01:08,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=171906.66666666666, ans=0.07
2023-06-15 15:01:38,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=171973.33333333334, ans=0.0
2023-06-15 15:01:41,446 INFO [train.py:988] (2/4) Epoch 49, batch 250, loss[loss=0.2009, simple_loss=0.2862, pruned_loss=0.0578, over 18431.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2798, pruned_loss=0.05716, over 2712408.95 frames. ], batch size: 77, lr: 6.80e-03, grad_scale: 32.0
2023-06-15 15:02:08,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=172106.66666666666, ans=0.0
2023-06-15 15:02:22,898 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.787e+02 2.024e+02 2.612e+02 4.231e+02, threshold=4.048e+02, percent-clipped=3.0
2023-06-15 15:02:42,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=172240.0, ans=0.1
2023-06-15 15:03:09,742 INFO [train.py:988] (2/4) Epoch 49, batch 300, loss[loss=0.2108, simple_loss=0.2814, pruned_loss=0.07008, over 20299.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2803, pruned_loss=0.05737, over 2936044.06 frames. ], batch size: 149, lr: 6.80e-03, grad_scale: 32.0
2023-06-15 15:03:47,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=172506.66666666666, ans=0.125
2023-06-15 15:04:38,792 INFO [train.py:988] (2/4) Epoch 49, batch 350, loss[loss=0.1892, simple_loss=0.277, pruned_loss=0.05072, over 19230.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.2802, pruned_loss=0.0567, over 3128170.38 frames. ], batch size: 92, lr: 6.79e-03, grad_scale: 32.0
2023-06-15 15:04:41,668 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0
2023-06-15 15:04:48,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=172706.66666666666, ans=0.125
2023-06-15 15:04:59,349 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.09 vs. limit=15.0
2023-06-15 15:05:22,050 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.759e+02 1.916e+02 2.182e+02 3.623e+02, threshold=3.831e+02, percent-clipped=0.0
2023-06-15 15:05:30,269 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.52 vs. limit=10.0
2023-06-15 15:06:04,182 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.79 vs. limit=6.0
2023-06-15 15:06:06,534 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-15 15:06:07,881 INFO [train.py:988] (2/4) Epoch 49, batch 400, loss[loss=0.1786, simple_loss=0.2667, pruned_loss=0.04523, over 18477.00 frames. ], tot_loss[loss=0.1963, simple_loss=0.2797, pruned_loss=0.05643, over 3282163.61 frames. ], batch size: 77, lr: 6.78e-03, grad_scale: 32.0
2023-06-15 15:06:08,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=173040.0, ans=0.0
2023-06-15 15:06:46,764 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.53 vs. limit=15.0
2023-06-15 15:06:57,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=173173.33333333334, ans=0.0
2023-06-15 15:07:22,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=173306.66666666666, ans=0.0
2023-06-15 15:07:22,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=173306.66666666666, ans=0.125
2023-06-15 15:07:37,478 INFO [train.py:988] (2/4) Epoch 49, batch 450, loss[loss=0.1938, simple_loss=0.2783, pruned_loss=0.05471, over 19876.00 frames. ], tot_loss[loss=0.1962, simple_loss=0.2802, pruned_loss=0.05603, over 3382758.79 frames. ], batch size: 120, lr: 6.78e-03, grad_scale: 32.0
2023-06-15 15:08:00,747 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.95 vs. limit=10.0
2023-06-15 15:08:13,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=173506.66666666666, ans=0.125
2023-06-15 15:08:20,123 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.766e+02 1.952e+02 2.304e+02 4.249e+02, threshold=3.904e+02, percent-clipped=1.0
2023-06-15 15:08:21,027 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0
2023-06-15 15:08:47,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=173640.0, ans=0.0
2023-06-15 15:08:57,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=173640.0, ans=0.2
2023-06-15 15:08:58,499 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.69 vs. limit=15.0
2023-06-15 15:09:01,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=173640.0, ans=0.0
2023-06-15 15:09:02,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=173706.66666666666, ans=0.125
2023-06-15 15:09:04,000 INFO [train.py:988] (2/4) Epoch 49, batch 500, loss[loss=0.2077, simple_loss=0.2853, pruned_loss=0.06503, over 20513.00 frames. ], tot_loss[loss=0.196, simple_loss=0.2795, pruned_loss=0.05628, over 3471574.67 frames. ], batch size: 160, lr: 6.77e-03, grad_scale: 32.0
2023-06-15 15:09:06,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=173706.66666666666, ans=0.125
2023-06-15 15:09:49,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=173840.0, ans=0.0
2023-06-15 15:09:49,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=173840.0, ans=0.0
2023-06-15 15:10:17,335 INFO [train.py:988] (2/4) Epoch 50, batch 0, loss[loss=0.1975, simple_loss=0.2866, pruned_loss=0.05419, over 19677.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2866, pruned_loss=0.05419, over 19677.00 frames. ], batch size: 110, lr: 6.70e-03, grad_scale: 32.0
2023-06-15 15:10:17,335 INFO [train.py:1011] (2/4) Computing validation loss
2023-06-15 15:10:23,503 INFO [train.py:1020] (2/4) Epoch 50, validation: loss=0.202, simple_loss=0.299, pruned_loss=0.05252, over 143649.00 frames.
2023-06-15 15:10:23,504 INFO [train.py:1021] (2/4) Maximum memory allocated so far is 13702MB
2023-06-15 15:11:20,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=174126.66666666666, ans=0.0
2023-06-15 15:11:32,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=174193.33333333334, ans=0.0
2023-06-15 15:11:33,960 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.754e+02 1.986e+02 2.353e+02 3.317e+02, threshold=3.972e+02, percent-clipped=0.0
2023-06-15 15:11:34,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=174193.33333333334, ans=0.125
2023-06-15 15:11:50,942 INFO [train.py:988] (2/4) Epoch 50, batch 50, loss[loss=0.2026, simple_loss=0.278, pruned_loss=0.06356, over 20119.00 frames. ], tot_loss[loss=0.198, simple_loss=0.283, pruned_loss=0.05646, over 862491.83 frames. ], batch size: 133, lr: 6.69e-03, grad_scale: 32.0
2023-06-15 15:11:53,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=174260.0, ans=0.04949747468305833
2023-06-15 15:12:05,788 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.84 vs. limit=6.0
2023-06-15 15:12:17,584 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.74 vs. limit=15.0
2023-06-15 15:12:48,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=174460.0, ans=0.125
2023-06-15 15:12:51,067 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.84 vs. limit=10.0
2023-06-15 15:13:11,372 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.61 vs. limit=6.0
2023-06-15 15:13:17,577 INFO [train.py:988] (2/4) Epoch 50, batch 100, loss[loss=0.1836, simple_loss=0.2763, pruned_loss=0.0454, over 19869.00 frames. ], tot_loss[loss=0.197, simple_loss=0.2814, pruned_loss=0.05632, over 1507340.79 frames. ], batch size: 120, lr: 6.69e-03, grad_scale: 32.0
2023-06-15 15:13:58,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=174726.66666666666, ans=0.125
2023-06-15 15:13:58,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=174726.66666666666, ans=0.1
2023-06-15 15:13:58,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=174726.66666666666, ans=0.0
2023-06-15 15:13:59,029 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.02 vs. limit=15.0
2023-06-15 15:14:28,215 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.801e+02 1.997e+02 2.286e+02 3.614e+02, threshold=3.994e+02, percent-clipped=0.0
2023-06-15 15:14:28,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=174860.0, ans=0.0
2023-06-15 15:14:31,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=174860.0, ans=0.0
2023-06-15 15:14:35,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=174860.0, ans=0.1
2023-06-15 15:14:43,481 INFO [train.py:988] (2/4) Epoch 50, batch 150, loss[loss=0.2153, simple_loss=0.3061, pruned_loss=0.06223, over 16688.00 frames. ], tot_loss[loss=0.1964, simple_loss=0.2816, pruned_loss=0.05562, over 2010501.93 frames. ], batch size: 59, lr: 6.68e-03, grad_scale: 32.0
2023-06-15 15:14:44,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=174926.66666666666, ans=0.5
2023-06-15 15:14:54,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=174926.66666666666, ans=0.0
2023-06-15 15:14:58,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=174926.66666666666, ans=0.0
2023-06-15 15:15:01,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=174993.33333333334, ans=0.125
2023-06-15 15:15:07,357 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.57 vs. limit=22.5
2023-06-15 15:16:09,941 INFO [train.py:988] (2/4) Epoch 50, batch 200, loss[loss=0.1981, simple_loss=0.2802, pruned_loss=0.05799, over 18599.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.2811, pruned_loss=0.05623, over 2398853.01 frames. ], batch size: 80, lr: 6.68e-03, grad_scale: 32.0
2023-06-15 15:16:30,634 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-15 15:16:32,753 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.73 vs. limit=10.0
2023-06-15 15:16:35,381 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0
2023-06-15 15:16:53,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=175393.33333333334, ans=0.125
2023-06-15 15:17:11,876 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.84 vs. limit=15.0
2023-06-15 15:17:23,283 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.770e+02 1.950e+02 2.283e+02 3.292e+02, threshold=3.901e+02, percent-clipped=0.0
2023-06-15 15:17:30,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=175526.66666666666, ans=0.125
2023-06-15 15:17:38,333 INFO [train.py:988] (2/4) Epoch 50, batch 250, loss[loss=0.197, simple_loss=0.2847, pruned_loss=0.05463, over 18444.00 frames. ], tot_loss[loss=0.197, simple_loss=0.2819, pruned_loss=0.05603, over 2698047.20 frames. ], batch size: 77, lr: 6.67e-03, grad_scale: 32.0
2023-06-15 15:18:00,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=175660.0, ans=0.0
2023-06-15 15:18:06,305 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.28 vs. limit=15.0
2023-06-15 15:18:10,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=175660.0, ans=0.2
2023-06-15 15:18:31,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=175793.33333333334, ans=0.2
2023-06-15 15:18:31,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=175793.33333333334, ans=0.125
2023-06-15 15:18:32,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=175793.33333333334, ans=0.125
2023-06-15 15:18:56,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=175860.0, ans=0.125
2023-06-15 15:19:04,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=175926.66666666666, ans=0.125
2023-06-15 15:19:06,265 INFO [train.py:988] (2/4) Epoch 50, batch 300, loss[loss=0.1982, simple_loss=0.2728, pruned_loss=0.06176, over 20679.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2817, pruned_loss=0.05663, over 2932655.60 frames. ], batch size: 211, lr: 6.66e-03, grad_scale: 16.0
2023-06-15 15:19:14,175 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-06-15 15:19:18,232 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.79 vs. limit=12.0
2023-06-15 15:19:28,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=175993.33333333334, ans=0.0
2023-06-15 15:19:40,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=176060.0, ans=0.2
2023-06-15 15:19:52,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=176060.0, ans=0.125
2023-06-15 15:20:14,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=176126.66666666666, ans=0.125
2023-06-15 15:20:21,164 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.741e+02 2.011e+02 2.290e+02 4.002e+02, threshold=4.022e+02, percent-clipped=1.0
2023-06-15 15:20:34,810 INFO [train.py:988] (2/4) Epoch 50, batch 350, loss[loss=0.1923, simple_loss=0.2844, pruned_loss=0.05012, over 19530.00 frames. ], tot_loss[loss=0.1969, simple_loss=0.2807, pruned_loss=0.0566, over 3101199.39 frames. ], batch size: 102, lr: 6.66e-03, grad_scale: 16.0
2023-06-15 15:20:35,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=176260.0, ans=0.1
2023-06-15 15:20:45,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=176260.0, ans=0.125
2023-06-15 15:20:54,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=176326.66666666666, ans=0.125
2023-06-15 15:21:06,862 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.35 vs. limit=15.0
2023-06-15 15:21:21,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=176393.33333333334, ans=0.125
2023-06-15 15:21:44,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=176526.66666666666, ans=0.2
2023-06-15 15:22:02,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=176593.33333333334, ans=0.0
2023-06-15 15:22:03,799 INFO [train.py:988] (2/4) Epoch 50, batch 400, loss[loss=0.1964, simple_loss=0.2711, pruned_loss=0.06082, over 20648.00 frames. ], tot_loss[loss=0.1965, simple_loss=0.28, pruned_loss=0.05643, over 3272410.86 frames. ], batch size: 173, lr: 6.65e-03, grad_scale: 32.0
2023-06-15 15:22:07,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=176593.33333333334, ans=0.1
2023-06-15 15:22:25,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=176660.0, ans=0.2
2023-06-15 15:22:59,811 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-06-15 15:23:14,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=176860.0, ans=0.125
2023-06-15 15:23:17,293 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.749e+02 1.909e+02 2.121e+02 2.900e+02, threshold=3.818e+02, percent-clipped=0.0
2023-06-15 15:23:32,314 INFO [train.py:988] (2/4) Epoch 50, batch 450, loss[loss=0.1961, simple_loss=0.2731, pruned_loss=0.05953, over 19971.00 frames. ], tot_loss[loss=0.1962, simple_loss=0.28, pruned_loss=0.05615, over 3394734.77 frames. ], batch size: 126, lr: 6.65e-03, grad_scale: 32.0
2023-06-15 15:23:39,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=176926.66666666666, ans=0.0
2023-06-15 15:23:39,893 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.38 vs. limit=15.0
2023-06-15 15:24:06,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=177060.0, ans=0.0
2023-06-15 15:24:11,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=177060.0, ans=0.125
2023-06-15 15:24:20,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=177060.0, ans=0.0
2023-06-15 15:24:28,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=177126.66666666666, ans=0.0
2023-06-15 15:24:57,651 INFO [train.py:988] (2/4) Epoch 50, batch 500, loss[loss=0.2029, simple_loss=0.2922, pruned_loss=0.05679, over 16313.00 frames. ], tot_loss[loss=0.1964, simple_loss=0.2804, pruned_loss=0.05618, over 3479927.76 frames. ], batch size: 52, lr: 6.64e-03, grad_scale: 32.0
2023-06-15 15:25:36,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=177393.33333333334, ans=0.125
2023-06-15 15:25:51,680 INFO [train.py:1201] (2/4) Done!