2023-06-15 01:56:32,531 INFO [train.py:1056] (1/4) Training started 2023-06-15 01:56:32,532 INFO [train.py:1066] (1/4) Device: cuda:1 2023-06-15 01:56:32,536 INFO [train.py:1075] (1/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Debug', 'k2-with-cuda': True, 'k2-git-sha1': '38211604d6a24b15f320578a1a38f6c12d7a711c', 'k2-git-date': 'Mon Jun 12 10:59:44 2023', 'lhotse-version': '1.15.0.dev+git.f1fd23d.clean', 'torch-version': '2.0.0+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.8', 'icefall-git-branch': 'ted/zipformer', 'icefall-git-sha1': '323a299-dirty', 'icefall-git-date': 'Tue Jun 13 04:47:15 2023', 'icefall-path': '/exp/draj/jsalt2023/icefall', 'k2-path': '/exp/draj/jsalt2023/k2/k2/python/k2/__init__.py', 'lhotse-path': '/exp/draj/jsalt2023/lhotse/lhotse/__init__.py', 'hostname': 'r2n01', 'IP address': '10.1.2.1'}, 'world_size': 4, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 50, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp/v5'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'base_lr': 0.04, 'lr_batches': 7500, 'lr_epochs': 5.0, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 1, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/manifests'), 'max_duration': 1000, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'blank_id': 0, 'vocab_size': 500} 2023-06-15 01:56:32,536 INFO [train.py:1077] (1/4) About to create model 2023-06-15 01:56:33,465 INFO [train.py:1081] (1/4) Number of model parameters: 65549011 2023-06-15 01:56:46,301 INFO [train.py:1096] (1/4) Using DDP 2023-06-15 01:56:46,643 INFO [asr_datamodule.py:356] (1/4) About to get train cuts 2023-06-15 01:56:46,700 INFO [asr_datamodule.py:185] (1/4) Enable SpecAugment 2023-06-15 01:56:46,701 INFO [asr_datamodule.py:186] (1/4) Time warp factor: 80 2023-06-15 01:56:46,701 INFO [asr_datamodule.py:202] (1/4) About to get Musan cuts 2023-06-15 01:56:46,701 INFO [asr_datamodule.py:205] (1/4) Enable MUSAN 2023-06-15 01:56:48,486 INFO [asr_datamodule.py:227] (1/4) About to create train dataset 2023-06-15 01:56:48,487 INFO [asr_datamodule.py:253] (1/4) Using DynamicBucketingSampler. 
2023-06-15 01:56:50,674 INFO [asr_datamodule.py:274] (1/4) About to create train dataloader 2023-06-15 01:56:50,674 INFO [asr_datamodule.py:361] (1/4) About to get dev cuts 2023-06-15 01:56:50,676 INFO [asr_datamodule.py:295] (1/4) About to create dev dataset 2023-06-15 01:56:50,696 INFO [asr_datamodule.py:314] (1/4) About to create dev dataloader 2023-06-15 01:56:50,696 INFO [train.py:1249] (1/4) Sanity check -- see if any of the batches in epoch 1 would cause OOM. 2023-06-15 01:57:37,207 INFO [scaling.py:962] (1/4) Whitening: name=None, num_groups=4, num_channels=128, metric=13.66 vs. limit=3.0 2023-06-15 01:57:37,529 INFO [scaling.py:962] (1/4) Whitening: name=None, num_groups=1, num_channels=256, metric=45.72 vs. limit=5.0 2023-06-15 01:57:37,877 INFO [train.py:1277] (1/4) Maximum memory allocated so far is 8701MB 2023-06-15 01:57:40,148 INFO [train.py:1277] (1/4) Maximum memory allocated so far is 8824MB 2023-06-15 01:57:51,999 INFO [train.py:1277] (1/4) Maximum memory allocated so far is 11517MB 2023-06-15 01:57:58,260 INFO [train.py:1277] (1/4) Maximum memory allocated so far is 11850MB 2023-06-15 01:58:17,634 INFO [train.py:1277] (1/4) Maximum memory allocated so far is 11850MB 2023-06-15 01:58:27,471 INFO [train.py:1277] (1/4) Maximum memory allocated so far is 11982MB 2023-06-15 01:58:50,930 INFO [train.py:988] (1/4) Epoch 1, batch 0, loss[loss=7.741, simple_loss=7.05, pruned_loss=6.89, over 19778.00 frames. ], tot_loss[loss=7.741, simple_loss=7.05, pruned_loss=6.89, over 19778.00 frames. ], batch size: 115, lr: 2.00e-02, grad_scale: 1.0 2023-06-15 01:58:50,931 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 01:58:57,878 INFO [train.py:1020] (1/4) Epoch 1, validation: loss=7.824, simple_loss=7.131, pruned_loss=6.914, over 143649.00 frames. 2023-06-15 01:58:57,879 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 11982MB 2023-06-15 01:59:08,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=0.0, ans=0.5 2023-06-15 01:59:18,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=0.0, ans=0.3 2023-06-15 01:59:23,551 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.65 vs. limit=5.033333333333333 2023-06-15 01:59:28,713 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.42 vs. limit=7.525 2023-06-15 01:59:43,060 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.07 vs. limit=5.033333333333333 2023-06-15 01:59:44,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=66.66666666666667, ans=0.496875 2023-06-15 01:59:55,121 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=39.81 vs. limit=4.026666666666666 2023-06-15 02:00:18,369 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=316.11 vs. limit=7.55 2023-06-15 02:00:46,615 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=294.49 vs. 
limit=7.575 2023-06-15 02:01:12,511 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.46 vs. limit=4.1066666666666665 2023-06-15 02:01:16,034 INFO [train.py:988] (1/4) Epoch 1, batch 50, loss[loss=1.48, simple_loss=1.32, pruned_loss=1.435, over 19062.00 frames. ], tot_loss[loss=3.406, simple_loss=3.136, pruned_loss=2.64, over 863355.97 frames. ], batch size: 89, lr: 2.20e-02, grad_scale: 0.25 2023-06-15 02:01:20,496 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=322.64 vs. limit=7.625 2023-06-15 02:01:30,187 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=135.85 vs. limit=5.166666666666667 2023-06-15 02:01:36,272 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=41.93 vs. limit=7.625 2023-06-15 02:01:39,936 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=91.82 vs. limit=7.8 2023-06-15 02:01:43,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=400.0, ans=0.1963 2023-06-15 02:01:43,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=400.0, ans=0.0975 2023-06-15 02:02:04,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=466.6666666666667, ans=0.0895 2023-06-15 02:02:08,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=466.6666666666667, ans=0.1825 2023-06-15 02:02:11,757 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=327.75 vs. limit=7.675 2023-06-15 02:02:19,909 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=27.31 vs. limit=5.233333333333333 2023-06-15 02:02:25,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=533.3333333333334, ans=0.29466666666666663 2023-06-15 02:02:28,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=533.3333333333334, ans=0.8813333333333333 2023-06-15 02:02:29,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=533.3333333333334, ans=0.475 2023-06-15 02:02:47,470 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.56 vs. limit=7.95 2023-06-15 02:02:52,238 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=600.0, ans=7.725 2023-06-15 02:02:55,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=600.0, ans=0.08650000000000001 2023-06-15 02:02:56,057 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=134.13 vs. 
limit=7.725 2023-06-15 02:03:04,815 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=30.76 vs. limit=7.725 2023-06-15 02:03:07,773 INFO [train.py:988] (1/4) Epoch 1, batch 100, loss[loss=1.393, simple_loss=1.203, pruned_loss=1.516, over 18317.00 frames. ], tot_loss[loss=2.248, simple_loss=2.04, pruned_loss=1.918, over 1514018.25 frames. ], batch size: 72, lr: 2.40e-02, grad_scale: 0.5 2023-06-15 02:03:08,895 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.60 vs. limit=5.166666666666667 2023-06-15 02:03:14,165 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.068e+02 1.535e+02 6.068e+02 3.361e+03 1.967e+04, threshold=1.214e+03, percent-clipped=0.0 2023-06-15 02:03:17,407 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=228.20 vs. limit=7.75 2023-06-15 02:03:19,989 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=14.63 vs. limit=7.75 2023-06-15 02:03:26,424 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.25 vs. limit=5.333333333333333 2023-06-15 02:03:31,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=733.3333333333334, ans=0.465625 2023-06-15 02:03:51,035 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=54.08 vs. limit=7.8 2023-06-15 02:03:57,665 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=22.52 vs. limit=7.8 2023-06-15 02:04:01,877 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.46 vs. limit=7.8 2023-06-15 02:04:03,739 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=53.23 vs. limit=7.8 2023-06-15 02:04:24,660 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.80 vs. limit=5.216666666666667 2023-06-15 02:04:37,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=933.3333333333334, ans=0.45625 2023-06-15 02:04:56,168 INFO [train.py:988] (1/4) Epoch 1, batch 150, loss[loss=1.177, simple_loss=1.003, pruned_loss=1.265, over 16711.00 frames. ], tot_loss[loss=1.795, simple_loss=1.606, pruned_loss=1.632, over 2002865.29 frames. ], batch size: 59, lr: 2.60e-02, grad_scale: 0.5 2023-06-15 02:04:56,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1000.0, ans=0.453125 2023-06-15 02:04:59,730 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=13.81 vs. 
limit=7.875 2023-06-15 02:05:04,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1000.0, ans=0.453125 2023-06-15 02:05:05,092 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 02:05:24,288 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.60 vs. limit=8.3 2023-06-15 02:05:36,129 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.49 vs. limit=8.3 2023-06-15 02:05:49,010 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=72.79 vs. limit=7.925 2023-06-15 02:05:54,763 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=12.80 vs. limit=5.283333333333333 2023-06-15 02:05:56,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1133.3333333333333, ans=0.35833333333333334 2023-06-15 02:06:05,010 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=12.69 vs. limit=5.3 2023-06-15 02:06:05,499 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=184.20 vs. limit=7.95 2023-06-15 02:06:09,191 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=49.06 vs. limit=8.4 2023-06-15 02:06:20,233 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=60.98 vs. limit=7.975 2023-06-15 02:06:21,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1266.6666666666667, ans=0.07150000000000001 2023-06-15 02:06:25,721 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.99 vs. limit=8.45 2023-06-15 02:06:33,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1266.6666666666667, ans=5.791666666666667 2023-06-15 02:06:39,418 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=167.76 vs. limit=7.975 2023-06-15 02:06:41,149 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=25.06 vs. limit=5.633333333333334 2023-06-15 02:06:44,812 INFO [train.py:988] (1/4) Epoch 1, batch 200, loss[loss=0.9546, simple_loss=0.8142, pruned_loss=0.9517, over 20090.00 frames. ], tot_loss[loss=1.528, simple_loss=1.353, pruned_loss=1.434, over 2393446.56 frames. ], batch size: 133, lr: 2.80e-02, grad_scale: 1.0 2023-06-15 02:06:45,911 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=249.06 vs. 
limit=8.0 2023-06-15 02:06:50,996 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.982e+01 9.369e+01 1.058e+02 1.160e+02 1.378e+03, threshold=2.115e+02, percent-clipped=1.0 2023-06-15 02:06:53,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1333.3333333333333, ans=0.15 2023-06-15 02:06:55,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1333.3333333333333, ans=0.33333333333333337 2023-06-15 02:07:22,238 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=56.50 vs. limit=8.025 2023-06-15 02:07:22,763 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.71 vs. limit=8.55 2023-06-15 02:07:30,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1466.6666666666667, ans=0.145 2023-06-15 02:07:34,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1466.6666666666667, ans=0.31666666666666665 2023-06-15 02:07:36,638 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=45.75 vs. limit=8.05 2023-06-15 02:07:38,649 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.00 vs. limit=5.366666666666667 2023-06-15 02:07:59,198 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.15 vs. limit=5.383333333333333 2023-06-15 02:08:15,084 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.64 vs. limit=5.4 2023-06-15 02:08:21,133 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=24.57 vs. limit=8.1 2023-06-15 02:08:21,810 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=37.65 vs. limit=8.1 2023-06-15 02:08:23,110 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.56 vs. limit=5.8 2023-06-15 02:08:25,212 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=24.53 vs. limit=8.7 2023-06-15 02:08:30,664 INFO [train.py:988] (1/4) Epoch 1, batch 250, loss[loss=1.061, simple_loss=0.8945, pruned_loss=1.047, over 18323.00 frames. ], tot_loss[loss=1.357, simple_loss=1.19, pruned_loss=1.288, over 2707396.18 frames. ], batch size: 72, lr: 3.00e-02, grad_scale: 1.0 2023-06-15 02:08:33,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1666.6666666666667, ans=0.421875 2023-06-15 02:08:48,233 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.53 vs. 
limit=5.416666666666667 2023-06-15 02:08:48,484 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.97 vs. limit=8.125 2023-06-15 02:08:50,411 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=38.19 vs. limit=8.15 2023-06-15 02:08:57,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1733.3333333333333, ans=0.7673333333333333 2023-06-15 02:09:16,525 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=24.26 vs. limit=8.175 2023-06-15 02:09:26,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1800.0, ans=0.415625 2023-06-15 02:09:28,646 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=36.51 vs. limit=8.175 2023-06-15 02:09:30,646 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=20.35 vs. limit=8.175 2023-06-15 02:09:32,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1866.6666666666667, ans=0.13 2023-06-15 02:09:34,598 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=27.61 vs. limit=8.2 2023-06-15 02:09:37,646 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=2.38 vs. limit=4.746666666666667 2023-06-15 02:10:07,386 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=21.74 vs. limit=8.225 2023-06-15 02:10:14,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2000.0, ans=0.04949747468305833 2023-06-15 02:10:16,481 INFO [train.py:988] (1/4) Epoch 1, batch 300, loss[loss=0.8844, simple_loss=0.747, pruned_loss=0.8222, over 20544.00 frames. ], tot_loss[loss=1.244, simple_loss=1.082, pruned_loss=1.183, over 2941102.18 frames. ], batch size: 189, lr: 3.20e-02, grad_scale: 2.0 2023-06-15 02:10:16,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2000.0, ans=0.27999999999999997 2023-06-15 02:10:22,489 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.963e+01 1.052e+02 1.248e+02 1.628e+02 2.864e+02, threshold=2.496e+02, percent-clipped=3.0 2023-06-15 02:10:25,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2000.0, ans=0.125 2023-06-15 02:10:31,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2000.0, ans=0.043750000000000004 2023-06-15 02:10:32,165 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=46.69 vs. 
limit=8.25 2023-06-15 02:10:47,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=2066.6666666666665, ans=0.22933333333333333 2023-06-15 02:10:58,550 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.46 vs. limit=5.533333333333333 2023-06-15 02:11:08,160 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=28.06 vs. limit=8.3 2023-06-15 02:11:18,688 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.74 vs. limit=9.15 2023-06-15 02:11:19,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=2200.0, ans=0.050499999999999996 2023-06-15 02:11:24,052 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-15 02:11:24,976 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=24.69 vs. limit=9.15 2023-06-15 02:11:42,265 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=16.25 vs. limit=8.35 2023-06-15 02:12:02,179 INFO [train.py:988] (1/4) Epoch 1, batch 350, loss[loss=0.8942, simple_loss=0.747, pruned_loss=0.8248, over 20554.00 frames. ], tot_loss[loss=1.162, simple_loss=1.003, pruned_loss=1.1, over 3126909.69 frames. ], batch size: 160, lr: 3.40e-02, grad_scale: 2.0 2023-06-15 02:12:03,028 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.26 vs. limit=9.25 2023-06-15 02:12:03,478 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.10 vs. limit=9.25 2023-06-15 02:12:05,561 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=27.77 vs. limit=9.25 2023-06-15 02:12:08,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2333.3333333333335, ans=0.1125 2023-06-15 02:12:16,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2333.3333333333335, ans=0.390625 2023-06-15 02:12:22,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=2400.0, ans=0.11 2023-06-15 02:12:23,395 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.43 vs. limit=9.3 2023-06-15 02:12:23,740 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=21.20 vs. limit=8.4 2023-06-15 02:12:25,356 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.47 vs. 
limit=8.4 2023-06-15 02:12:28,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=2400.0, ans=0.27599999999999997 2023-06-15 02:12:32,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2400.0, ans=0.27599999999999997 2023-06-15 02:13:00,354 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=28.63 vs. limit=8.425 2023-06-15 02:13:06,079 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=27.48 vs. limit=8.45 2023-06-15 02:13:21,535 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.54 vs. limit=8.45 2023-06-15 02:13:28,023 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.82 vs. limit=9.45 2023-06-15 02:13:39,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=2600.0, ans=6.3 2023-06-15 02:13:42,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=2600.0, ans=0.23900000000000002 2023-06-15 02:13:46,997 INFO [train.py:988] (1/4) Epoch 1, batch 400, loss[loss=0.8913, simple_loss=0.7453, pruned_loss=0.787, over 19444.00 frames. ], tot_loss[loss=1.105, simple_loss=0.9468, pruned_loss=1.035, over 3268977.90 frames. ], batch size: 105, lr: 3.60e-02, grad_scale: 4.0 2023-06-15 02:13:52,984 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 9.424e+01 1.333e+02 1.553e+02 2.129e+02 3.991e+02, threshold=3.107e+02, percent-clipped=15.0 2023-06-15 02:14:04,418 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.23 vs. limit=9.5 2023-06-15 02:14:13,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2733.3333333333335, ans=0.0385 2023-06-15 02:14:30,385 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.92 vs. limit=9.6 2023-06-15 02:15:13,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2933.3333333333335, ans=0.7973333333333333 2023-06-15 02:15:13,589 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.77 vs. limit=8.6 2023-06-15 02:15:13,825 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.95 vs. limit=9.7 2023-06-15 02:15:33,102 INFO [train.py:988] (1/4) Epoch 1, batch 450, loss[loss=0.8961, simple_loss=0.7513, pruned_loss=0.7576, over 19665.00 frames. ], tot_loss[loss=1.058, simple_loss=0.902, pruned_loss=0.9746, over 3365722.28 frames. 
], batch size: 110, lr: 3.80e-02, grad_scale: 4.0 2023-06-15 02:15:33,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=3000.0, ans=0.0875 2023-06-15 02:15:46,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=3000.0, ans=8.625 2023-06-15 02:16:15,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=3133.3333333333335, ans=6.958333333333334 2023-06-15 02:16:18,586 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.30 vs. limit=6.566666666666666 2023-06-15 02:16:53,051 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.61 vs. limit=8.725 2023-06-15 02:17:01,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=3266.6666666666665, ans=0.02650000000000001 2023-06-15 02:17:12,964 INFO [train.py:988] (1/4) Epoch 1, batch 500, loss[loss=0.8042, simple_loss=0.6825, pruned_loss=0.6365, over 20288.00 frames. ], tot_loss[loss=1.017, simple_loss=0.865, pruned_loss=0.9149, over 3459353.06 frames. ], batch size: 149, lr: 4.00e-02, grad_scale: 8.0 2023-06-15 02:17:13,765 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.09 vs. limit=10.0 2023-06-15 02:17:17,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=3333.3333333333335, ans=0.34375 2023-06-15 02:17:18,742 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 9.868e+01 1.531e+02 1.902e+02 2.695e+02 6.993e+02, threshold=3.804e+02, percent-clipped=16.0 2023-06-15 02:17:26,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=3333.3333333333335, ans=0.34375 2023-06-15 02:17:33,998 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.22 vs. limit=8.775 2023-06-15 02:17:34,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=3400.0, ans=0.340625 2023-06-15 02:17:44,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=3400.0, ans=0.266 2023-06-15 02:17:45,422 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=8.775 2023-06-15 02:17:56,349 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.36 vs. limit=8.8 2023-06-15 02:17:56,376 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.69 vs. 
limit=8.8 2023-06-15 02:17:59,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=3466.6666666666665, ans=0.03916666666666667 2023-06-15 02:18:02,056 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.80 vs. limit=10.1 2023-06-15 02:18:03,569 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.35 vs. limit=8.8 2023-06-15 02:18:03,917 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.71 vs. limit=8.8 2023-06-15 02:18:09,030 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.82 vs. limit=10.15 2023-06-15 02:18:43,118 INFO [train.py:988] (1/4) Epoch 2, batch 0, loss[loss=0.8033, simple_loss=0.6867, pruned_loss=0.6121, over 19187.00 frames. ], tot_loss[loss=0.8033, simple_loss=0.6867, pruned_loss=0.6121, over 19187.00 frames. ], batch size: 92, lr: 3.96e-02, grad_scale: 16.0 2023-06-15 02:18:43,119 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 02:18:49,229 INFO [train.py:1020] (1/4) Epoch 2, validation: loss=0.7911, simple_loss=0.6884, pruned_loss=0.5718, over 143649.00 frames. 2023-06-15 02:18:49,230 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 02:18:49,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=3553.3333333333335, ans=0.05583333333333329 2023-06-15 02:19:13,245 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.31 vs. limit=8.8575 2023-06-15 02:19:29,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=3686.6666666666665, ans=0.32718749999999996 2023-06-15 02:19:31,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=3686.6666666666665, ans=0.32718749999999996 2023-06-15 02:19:38,683 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.59 vs. limit=5.474666666666667 2023-06-15 02:19:39,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=3686.6666666666665, ans=0.7709666666666667 2023-06-15 02:19:50,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=3753.3333333333335, ans=0.32406250000000003 2023-06-15 02:20:04,581 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=10.59 vs. limit=10.315 2023-06-15 02:20:15,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=3820.0, ans=0.056749999999999995 2023-06-15 02:20:22,177 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.80 vs. 
limit=8.932500000000001 2023-06-15 02:20:28,466 INFO [train.py:988] (1/4) Epoch 2, batch 50, loss[loss=0.7088, simple_loss=0.6106, pruned_loss=0.5163, over 20648.00 frames. ], tot_loss[loss=0.7795, simple_loss=0.6693, pruned_loss=0.5795, over 856118.09 frames. ], batch size: 211, lr: 3.95e-02, grad_scale: 16.0 2023-06-15 02:20:32,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=3886.6666666666665, ans=0.3178125 2023-06-15 02:20:38,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=3886.6666666666665, ans=0.3178125 2023-06-15 02:20:59,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.whiten.whitening_limit, batch_count=3953.3333333333335, ans=5.581333333333333 2023-06-15 02:21:07,479 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 2.379e+02 3.485e+02 5.608e+02 1.271e+03, threshold=6.971e+02, percent-clipped=46.0 2023-06-15 02:21:14,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=4020.0, ans=0.04991666666666667 2023-06-15 02:21:18,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=4020.0, ans=0.31156249999999996 2023-06-15 02:21:37,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=4086.6666666666665, ans=0.7569666666666667 2023-06-15 02:21:40,800 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=5.59 vs. limit=5.634666666666667 2023-06-15 02:21:44,176 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=10.66 vs. limit=10.565 2023-06-15 02:21:58,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=4153.333333333333, ans=0.3053125 2023-06-15 02:22:00,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=4153.333333333333, ans=0.04936111111111111 2023-06-15 02:22:06,299 INFO [train.py:988] (1/4) Epoch 2, batch 100, loss[loss=0.707, simple_loss=0.6192, pruned_loss=0.4826, over 18627.00 frames. ], tot_loss[loss=0.745, simple_loss=0.6433, pruned_loss=0.5392, over 1522814.28 frames. ], batch size: 80, lr: 3.95e-02, grad_scale: 8.0 2023-06-15 02:22:13,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=4220.0, ans=0.2633 2023-06-15 02:22:16,353 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.05 vs. 
limit=10.665 2023-06-15 02:22:24,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=4286.666666666667, ans=0.03660416666666667 2023-06-15 02:22:26,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=4286.666666666667, ans=0.2990625 2023-06-15 02:22:48,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=4353.333333333333, ans=0.29593749999999996 2023-06-15 02:23:03,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=4353.333333333333, ans=0.2564666666666667 2023-06-15 02:23:36,380 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.39 vs. limit=6.121666666666667 2023-06-15 02:23:37,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=4486.666666666667, ans=0.2896875 2023-06-15 02:23:45,682 INFO [train.py:988] (1/4) Epoch 2, batch 150, loss[loss=0.628, simple_loss=0.5567, pruned_loss=0.4083, over 19822.00 frames. ], tot_loss[loss=0.7163, simple_loss=0.6222, pruned_loss=0.5047, over 2034110.17 frames. ], batch size: 115, lr: 3.95e-02, grad_scale: 8.0 2023-06-15 02:23:46,851 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.74 vs. limit=7.276666666666666 2023-06-15 02:24:11,676 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.31 vs. limit=10.965 2023-06-15 02:24:15,581 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=6.32 vs. limit=5.848 2023-06-15 02:24:25,604 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.747e+02 2.805e+02 4.216e+02 6.424e+02 2.276e+03, threshold=8.432e+02, percent-clipped=19.0 2023-06-15 02:24:30,047 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.28 vs. limit=6.171666666666667 2023-06-15 02:24:37,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=4686.666666666667, ans=0.28031249999999996 2023-06-15 02:24:52,516 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.64 vs. limit=6.1883333333333335 2023-06-15 02:25:04,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=4820.0, ans=0.7313000000000001 2023-06-15 02:25:09,146 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.52 vs. limit=11.115 2023-06-15 02:25:20,903 INFO [train.py:988] (1/4) Epoch 2, batch 200, loss[loss=0.6097, simple_loss=0.5422, pruned_loss=0.3881, over 19490.00 frames. ], tot_loss[loss=0.6928, simple_loss=0.6052, pruned_loss=0.4763, over 2422128.29 frames. 
], batch size: 105, lr: 3.95e-02, grad_scale: 8.0 2023-06-15 02:25:52,648 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.31 vs. limit=9.3575 2023-06-15 02:26:36,699 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.58 vs. limit=11.365 2023-06-15 02:26:56,711 INFO [train.py:988] (1/4) Epoch 2, batch 250, loss[loss=0.6135, simple_loss=0.5503, pruned_loss=0.3779, over 19313.00 frames. ], tot_loss[loss=0.6719, simple_loss=0.5904, pruned_loss=0.4506, over 2720076.79 frames. ], batch size: 98, lr: 3.95e-02, grad_scale: 8.0 2023-06-15 02:27:13,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=5220.0, ans=0.2478 2023-06-15 02:27:18,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=5286.666666666667, ans=0.2521875 2023-06-15 02:27:20,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=5286.666666666667, ans=0.2521875 2023-06-15 02:27:26,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=5286.666666666667, ans=0.04463888888888889 2023-06-15 02:27:30,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=5286.666666666667, ans=0.2521875 2023-06-15 02:27:37,764 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.603e+02 2.822e+02 4.791e+02 8.752e+02 2.397e+03, threshold=9.582e+02, percent-clipped=28.0 2023-06-15 02:27:46,004 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.97 vs. limit=11.515 2023-06-15 02:27:55,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=5420.0, ans=0.24593749999999998 2023-06-15 02:28:13,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=5486.666666666667, ans=0.009676811594202899 2023-06-15 02:28:29,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=5486.666666666667, ans=0.2823 2023-06-15 02:28:32,183 INFO [train.py:988] (1/4) Epoch 2, batch 300, loss[loss=0.5905, simple_loss=0.5375, pruned_loss=0.3473, over 19091.00 frames. ], tot_loss[loss=0.6507, simple_loss=0.5751, pruned_loss=0.4262, over 2956150.09 frames. ], batch size: 94, lr: 3.95e-02, grad_scale: 8.0 2023-06-15 02:28:34,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=5553.333333333333, ans=0.7056333333333333 2023-06-15 02:28:50,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=5620.0, ans=0.2365625 2023-06-15 02:29:46,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=5820.0, ans=0.04241666666666667 2023-06-15 02:30:06,991 INFO [train.py:988] (1/4) Epoch 2, batch 350, loss[loss=0.5617, simple_loss=0.5099, pruned_loss=0.3305, over 19317.00 frames. 
], tot_loss[loss=0.6312, simple_loss=0.5614, pruned_loss=0.4037, over 3137379.47 frames. ], batch size: 98, lr: 3.95e-02, grad_scale: 8.0 2023-06-15 02:30:15,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=5886.666666666667, ans=0.2411333333333333 2023-06-15 02:30:39,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=5953.333333333333, ans=0.04186111111111111 2023-06-15 02:30:46,023 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.693e+02 3.241e+02 6.004e+02 9.437e+02 1.791e+03, threshold=1.201e+03, percent-clipped=24.0 2023-06-15 02:31:18,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=6086.666666666667, ans=0.21468749999999998 2023-06-15 02:31:29,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=6153.333333333333, ans=0.009531884057971014 2023-06-15 02:31:34,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=6153.333333333333, ans=0.04949747468305833 2023-06-15 02:31:40,064 INFO [train.py:988] (1/4) Epoch 2, batch 400, loss[loss=0.5644, simple_loss=0.5028, pruned_loss=0.3442, over 20114.00 frames. ], tot_loss[loss=0.6143, simple_loss=0.5494, pruned_loss=0.3846, over 3290732.08 frames. ], batch size: 239, lr: 3.95e-02, grad_scale: 16.0 2023-06-15 02:31:58,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=6286.666666666667, ans=0.6799666666666666 2023-06-15 02:32:05,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_na.min_abs, batch_count=6286.666666666667, ans=0.02 2023-06-15 02:32:10,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=6286.666666666667, ans=12.215 2023-06-15 02:32:35,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=6420.0, ans=0.009473913043478261 2023-06-15 02:32:55,256 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.21 vs. limit=9.932500000000001 2023-06-15 02:33:09,386 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.81 vs. limit=8.243333333333334 2023-06-15 02:33:12,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=6553.333333333333, ans=0.23446666666666666 2023-06-15 02:33:13,465 INFO [train.py:988] (1/4) Epoch 2, batch 450, loss[loss=0.5838, simple_loss=0.5458, pruned_loss=0.3175, over 18316.00 frames. ], tot_loss[loss=0.5989, simple_loss=0.5389, pruned_loss=0.3671, over 3399397.22 frames. ], batch size: 72, lr: 3.94e-02, grad_scale: 8.0 2023-06-15 02:33:14,308 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.82 vs. 
limit=12.415 2023-06-15 02:33:30,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=6620.0, ans=0.1896875 2023-06-15 02:33:30,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=6620.0, ans=0.1896875 2023-06-15 02:33:54,730 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.791e+02 3.705e+02 5.386e+02 8.050e+02 1.837e+03, threshold=1.077e+03, percent-clipped=8.0 2023-06-15 02:34:00,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=6686.666666666667, ans=0.18656250000000002 2023-06-15 02:34:12,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=6753.333333333333, ans=0.03852777777777778 2023-06-15 02:34:43,323 INFO [train.py:988] (1/4) Epoch 2, batch 500, loss[loss=0.5204, simple_loss=0.4849, pruned_loss=0.2844, over 19470.00 frames. ], tot_loss[loss=0.5826, simple_loss=0.5274, pruned_loss=0.3499, over 3489734.37 frames. ], batch size: 105, lr: 3.94e-02, grad_scale: 8.0 2023-06-15 02:34:54,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=6886.666666666667, ans=0.0 2023-06-15 02:35:00,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=6953.333333333333, ans=0.04949747468305833 2023-06-15 02:35:01,975 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=10.1075 2023-06-15 02:35:17,640 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.51 vs. limit=6.808 2023-06-15 02:35:35,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=7086.666666666667, ans=0.16781249999999998 2023-06-15 02:35:56,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=7100.0, ans=0.1671875 2023-06-15 02:36:00,643 INFO [train.py:988] (1/4) Epoch 3, batch 0, loss[loss=0.5533, simple_loss=0.5179, pruned_loss=0.2991, over 16710.00 frames. ], tot_loss[loss=0.5533, simple_loss=0.5179, pruned_loss=0.2991, over 16710.00 frames. ], batch size: 59, lr: 3.84e-02, grad_scale: 16.0 2023-06-15 02:36:00,643 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 02:36:06,803 INFO [train.py:1020] (1/4) Epoch 3, validation: loss=0.4219, simple_loss=0.4383, pruned_loss=0.1731, over 143649.00 frames. 
2023-06-15 02:36:06,804 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 02:36:15,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=7100.0, ans=0.1671875 2023-06-15 02:36:37,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=7166.666666666667, ans=0.009311594202898552 2023-06-15 02:36:37,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=7166.666666666667, ans=0.1640625 2023-06-15 02:36:38,786 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-15 02:36:43,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=7233.333333333333, ans=12.925 2023-06-15 02:36:47,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=7233.333333333333, ans=0.0 2023-06-15 02:37:08,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=7300.0, ans=0.22699999999999998 2023-06-15 02:37:13,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=7300.0, ans=0.15781250000000002 2023-06-15 02:37:17,508 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.758e+02 3.334e+02 5.511e+02 7.843e+02 1.620e+03, threshold=1.102e+03, percent-clipped=11.0 2023-06-15 02:37:36,276 INFO [train.py:988] (1/4) Epoch 3, batch 50, loss[loss=0.4961, simple_loss=0.4643, pruned_loss=0.268, over 18600.00 frames. ], tot_loss[loss=0.5175, simple_loss=0.482, pruned_loss=0.2825, over 839287.11 frames. ], batch size: 80, lr: 3.83e-02, grad_scale: 8.0 2023-06-15 02:37:36,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=7433.333333333333, ans=0.6398333333333334 2023-06-15 02:37:40,658 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.39 vs. limit=4.115 2023-06-15 02:37:50,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=7433.333333333333, ans=0.6398333333333334 2023-06-15 02:38:05,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=7500.0, ans=0.00923913043478261 2023-06-15 02:38:11,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=7566.666666666667, ans=0.1453125 2023-06-15 02:38:36,188 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.63 vs. limit=13.225 2023-06-15 02:39:06,068 INFO [train.py:988] (1/4) Epoch 3, batch 100, loss[loss=0.5135, simple_loss=0.4794, pruned_loss=0.2782, over 20294.00 frames. ], tot_loss[loss=0.5109, simple_loss=0.4774, pruned_loss=0.2769, over 1495939.35 frames. 
], batch size: 149, lr: 3.83e-02, grad_scale: 8.0 2023-06-15 02:39:40,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=7900.0, ans=0.1296875 2023-06-15 02:39:54,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=7900.0, ans=0.0 2023-06-15 02:40:00,301 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.62 vs. limit=13.475 2023-06-15 02:40:19,695 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.660e+02 2.598e+02 5.024e+02 8.933e+02 2.174e+03, threshold=1.005e+03, percent-clipped=11.0 2023-06-15 02:40:27,388 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.99 vs. limit=13.525 2023-06-15 02:40:35,844 INFO [train.py:988] (1/4) Epoch 3, batch 150, loss[loss=0.4879, simple_loss=0.4669, pruned_loss=0.2517, over 19831.00 frames. ], tot_loss[loss=0.5043, simple_loss=0.4732, pruned_loss=0.2707, over 1985001.14 frames. ], batch size: 120, lr: 3.83e-02, grad_scale: 8.0 2023-06-15 02:40:58,143 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 02:41:21,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=8233.333333333334, ans=10.0 2023-06-15 02:41:29,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=8300.0, ans=0.03208333333333334 2023-06-15 02:41:33,379 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.20 vs. limit=7.075 2023-06-15 02:42:04,815 INFO [train.py:988] (1/4) Epoch 3, batch 200, loss[loss=0.4656, simple_loss=0.4522, pruned_loss=0.2337, over 18311.00 frames. ], tot_loss[loss=0.5, simple_loss=0.4711, pruned_loss=0.2662, over 2392627.70 frames. ], batch size: 74, lr: 3.83e-02, grad_scale: 8.0 2023-06-15 02:42:11,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=8433.333333333334, ans=0.125 2023-06-15 02:42:26,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=8500.0, ans=0.009021739130434782 2023-06-15 02:42:30,149 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.01 vs. 
limit=7.125 2023-06-15 02:43:13,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=8700.0, ans=0.125 2023-06-15 02:43:15,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=8700.0, ans=0.125 2023-06-15 02:43:16,972 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 3.306e+02 5.233e+02 8.261e+02 1.948e+03, threshold=1.047e+03, percent-clipped=15.0 2023-06-15 02:43:29,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten.whitening_limit, batch_count=8700.0, ans=10.7625 2023-06-15 02:43:33,619 INFO [train.py:988] (1/4) Epoch 3, batch 250, loss[loss=0.4866, simple_loss=0.4733, pruned_loss=0.2441, over 18310.00 frames. ], tot_loss[loss=0.4959, simple_loss=0.4691, pruned_loss=0.2621, over 2707902.76 frames. ], batch size: 74, lr: 3.83e-02, grad_scale: 8.0 2023-06-15 02:43:56,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=8833.333333333334, ans=0.125 2023-06-15 02:44:05,833 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.86 vs. limit=4.325 2023-06-15 02:44:29,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=8966.666666666666, ans=0.125 2023-06-15 02:44:29,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=8966.666666666666, ans=0.5861666666666667 2023-06-15 02:45:03,152 INFO [train.py:988] (1/4) Epoch 3, batch 300, loss[loss=0.4552, simple_loss=0.4245, pruned_loss=0.2458, over 19889.00 frames. ], tot_loss[loss=0.4889, simple_loss=0.4644, pruned_loss=0.2564, over 2950851.79 frames. ], batch size: 294, lr: 3.82e-02, grad_scale: 8.0 2023-06-15 02:45:05,352 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 02:45:15,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=9100.0, ans=0.008891304347826087 2023-06-15 02:45:16,525 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.23 vs. limit=10.9125 2023-06-15 02:46:11,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=9300.0, ans=0.125 2023-06-15 02:46:11,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=9300.0, ans=0.02791666666666667 2023-06-15 02:46:16,345 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.842e+02 2.844e+02 4.784e+02 7.568e+02 1.827e+03, threshold=9.568e+02, percent-clipped=19.0 2023-06-15 02:46:25,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=9366.666666666666, ans=0.20633333333333334 2023-06-15 02:46:32,397 INFO [train.py:988] (1/4) Epoch 3, batch 350, loss[loss=0.4623, simple_loss=0.4546, pruned_loss=0.2287, over 19509.00 frames. ], tot_loss[loss=0.4836, simple_loss=0.4613, pruned_loss=0.2519, over 3137010.22 frames. 
], batch size: 105, lr: 3.82e-02, grad_scale: 8.0 2023-06-15 02:46:32,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=9433.333333333334, ans=0.20566666666666666 2023-06-15 02:46:35,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=9433.333333333334, ans=0.20566666666666666 2023-06-15 02:47:13,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=9566.666666666666, ans=0.20433333333333334 2023-06-15 02:47:22,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=9566.666666666666, ans=0.125 2023-06-15 02:47:31,850 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.55 vs. limit=14.725 2023-06-15 02:47:38,043 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.909e-02 2023-06-15 02:47:55,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=9700.0, ans=0.203 2023-06-15 02:48:02,104 INFO [train.py:988] (1/4) Epoch 3, batch 400, loss[loss=0.4454, simple_loss=0.4367, pruned_loss=0.2222, over 19834.00 frames. ], tot_loss[loss=0.4782, simple_loss=0.4582, pruned_loss=0.2474, over 3268909.84 frames. ], batch size: 115, lr: 3.82e-02, grad_scale: 16.0 2023-06-15 02:48:08,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=9766.666666666666, ans=0.125 2023-06-15 02:48:29,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=9833.333333333334, ans=0.20166666666666666 2023-06-15 02:49:07,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=9966.666666666666, ans=0.20033333333333334 2023-06-15 02:49:15,718 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.932e+02 2.793e+02 4.517e+02 6.574e+02 1.219e+03, threshold=9.033e+02, percent-clipped=7.0 2023-06-15 02:49:16,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=10033.333333333334, ans=0.125 2023-06-15 02:49:24,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=10033.333333333334, ans=0.125 2023-06-15 02:49:25,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=10033.333333333334, ans=0.19966666666666666 2023-06-15 02:49:31,992 INFO [train.py:988] (1/4) Epoch 3, batch 450, loss[loss=0.4384, simple_loss=0.4423, pruned_loss=0.2093, over 18604.00 frames. ], tot_loss[loss=0.4724, simple_loss=0.455, pruned_loss=0.2424, over 3386415.35 frames. ], batch size: 80, lr: 3.82e-02, grad_scale: 16.0 2023-06-15 02:49:36,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=10100.0, ans=0.125 2023-06-15 02:49:42,401 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.41 vs. 
limit=15.075 2023-06-15 02:49:56,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=10166.666666666666, ans=0.02430555555555556 2023-06-15 02:49:59,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=10166.666666666666, ans=0.125 2023-06-15 02:50:18,971 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.23 vs. limit=15.175 2023-06-15 02:50:20,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=10233.333333333334, ans=0.125 2023-06-15 02:50:22,651 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.45 vs. limit=11.3375 2023-06-15 02:50:25,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=10300.0, ans=0.05 2023-06-15 02:50:30,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=10300.0, ans=0.5395000000000001 2023-06-15 02:50:37,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=10300.0, ans=0.5395000000000001 2023-06-15 02:50:46,732 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.57 vs. limit=15.275 2023-06-15 02:50:54,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=10366.666666666666, ans=0.125 2023-06-15 02:50:59,009 INFO [train.py:988] (1/4) Epoch 3, batch 500, loss[loss=0.4279, simple_loss=0.4322, pruned_loss=0.2049, over 19876.00 frames. ], tot_loss[loss=0.4663, simple_loss=0.4515, pruned_loss=0.2375, over 3475614.48 frames. ], batch size: 120, lr: 3.81e-02, grad_scale: 16.0 2023-06-15 02:52:16,081 INFO [train.py:988] (1/4) Epoch 4, batch 0, loss[loss=0.4251, simple_loss=0.4282, pruned_loss=0.2051, over 19225.00 frames. ], tot_loss[loss=0.4251, simple_loss=0.4282, pruned_loss=0.2051, over 19225.00 frames. ], batch size: 92, lr: 3.66e-02, grad_scale: 32.0 2023-06-15 02:52:16,082 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 02:52:22,228 INFO [train.py:1020] (1/4) Epoch 4, validation: loss=0.3338, simple_loss=0.3946, pruned_loss=0.1182, over 143649.00 frames. 2023-06-15 02:52:22,228 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 02:52:38,301 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 2.821e+02 4.565e+02 6.318e+02 1.774e+03, threshold=9.130e+02, percent-clipped=10.0 2023-06-15 02:52:39,538 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.13 vs. 
limit=15.535 2023-06-15 02:52:44,450 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 02:53:11,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=10780.0, ans=0.125 2023-06-15 02:53:40,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=10913.333333333334, ans=0.125 2023-06-15 02:53:52,050 INFO [train.py:988] (1/4) Epoch 4, batch 50, loss[loss=0.4774, simple_loss=0.4847, pruned_loss=0.2287, over 17638.00 frames. ], tot_loss[loss=0.4368, simple_loss=0.4368, pruned_loss=0.2135, over 848960.90 frames. ], batch size: 67, lr: 3.66e-02, grad_scale: 16.0 2023-06-15 02:54:14,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=11046.666666666666, ans=0.020638888888888894 2023-06-15 02:54:30,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=11113.333333333334, ans=0.125 2023-06-15 02:54:32,831 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.27 vs. limit=15.835 2023-06-15 02:54:34,809 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.78 vs. limit=11.6675 2023-06-15 02:54:41,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=11113.333333333334, ans=0.008453623188405797 2023-06-15 02:54:53,911 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.70 vs. limit=15.885 2023-06-15 02:55:09,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=11246.666666666666, ans=0.07 2023-06-15 02:55:20,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=11246.666666666666, ans=0.008424637681159421 2023-06-15 02:55:23,239 INFO [train.py:988] (1/4) Epoch 4, batch 100, loss[loss=0.444, simple_loss=0.4483, pruned_loss=0.2153, over 18289.00 frames. ], tot_loss[loss=0.4351, simple_loss=0.4354, pruned_loss=0.2128, over 1499404.09 frames. ], batch size: 74, lr: 3.66e-02, grad_scale: 16.0 2023-06-15 02:55:27,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=11313.333333333334, ans=0.008410144927536231 2023-06-15 02:55:32,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=11313.333333333334, ans=0.008410144927536231 2023-06-15 02:55:37,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=11313.333333333334, ans=0.019527777777777776 2023-06-15 02:55:42,249 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 3.113e+02 4.609e+02 7.608e+02 1.612e+03, threshold=9.219e+02, percent-clipped=13.0 2023-06-15 02:55:48,476 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.29 vs. 
limit=11.7675 2023-06-15 02:56:15,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=11513.333333333334, ans=0.37270000000000003 2023-06-15 02:56:39,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=11580.0, ans=0.18419999999999997 2023-06-15 02:56:47,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=11580.0, ans=0.125 2023-06-15 02:56:53,605 INFO [train.py:988] (1/4) Epoch 4, batch 150, loss[loss=0.4458, simple_loss=0.454, pruned_loss=0.2145, over 16322.00 frames. ], tot_loss[loss=0.4307, simple_loss=0.4328, pruned_loss=0.2099, over 2014017.73 frames. ], batch size: 52, lr: 3.66e-02, grad_scale: 16.0 2023-06-15 02:57:00,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=11646.666666666666, ans=0.125 2023-06-15 02:57:19,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=11713.333333333334, ans=0.008323188405797101 2023-06-15 02:57:23,662 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.65 vs. limit=4.757 2023-06-15 02:57:42,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=11780.0, ans=0.4877 2023-06-15 02:57:45,878 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=11.942499999999999 2023-06-15 02:57:48,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=11846.666666666666, ans=16.384999999999998 2023-06-15 02:58:07,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=11913.333333333334, ans=0.4830333333333333 2023-06-15 02:58:08,463 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.59 vs. limit=16.435000000000002 2023-06-15 02:58:23,370 INFO [train.py:988] (1/4) Epoch 4, batch 200, loss[loss=0.3844, simple_loss=0.4028, pruned_loss=0.1789, over 19218.00 frames. ], tot_loss[loss=0.4267, simple_loss=0.4305, pruned_loss=0.2075, over 2408853.40 frames. 
], batch size: 92, lr: 3.65e-02, grad_scale: 16.0 2023-06-15 02:58:30,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=11980.0, ans=0.125 2023-06-15 02:58:37,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=11980.0, ans=0.008265217391304348 2023-06-15 02:58:37,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=11980.0, ans=0.05 2023-06-15 02:58:40,770 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.791e+02 4.180e+02 6.641e+02 1.358e+03, threshold=8.360e+02, percent-clipped=6.0 2023-06-15 02:58:50,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=12046.666666666666, ans=0.008250724637681159 2023-06-15 02:59:07,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=12113.333333333334, ans=0.008236231884057971 2023-06-15 02:59:19,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=12180.0, ans=0.17819999999999997 2023-06-15 02:59:38,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=12246.666666666666, ans=0.125 2023-06-15 02:59:55,811 INFO [train.py:988] (1/4) Epoch 4, batch 250, loss[loss=0.409, simple_loss=0.4137, pruned_loss=0.2001, over 20598.00 frames. ], tot_loss[loss=0.4238, simple_loss=0.428, pruned_loss=0.2063, over 2710934.62 frames. ], batch size: 189, lr: 3.65e-02, grad_scale: 16.0 2023-06-15 03:00:08,319 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=8.925333333333334 2023-06-15 03:00:31,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=12446.666666666666, ans=0.014805555555555558 2023-06-15 03:00:33,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=12446.666666666666, ans=0.125 2023-06-15 03:00:35,504 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.72 vs. limit=16.835 2023-06-15 03:00:42,773 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.72 vs. limit=6.489333333333333 2023-06-15 03:00:59,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=12513.333333333334, ans=0.17486666666666667 2023-06-15 03:01:25,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=12646.666666666666, ans=0.013972222222222226 2023-06-15 03:01:26,972 INFO [train.py:988] (1/4) Epoch 4, batch 300, loss[loss=0.3956, simple_loss=0.3875, pruned_loss=0.2013, over 19809.00 frames. ], tot_loss[loss=0.42, simple_loss=0.4267, pruned_loss=0.2035, over 2947871.29 frames. ], batch size: 293, lr: 3.65e-02, grad_scale: 16.0 2023-06-15 03:01:36,712 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.41 vs. 
limit=16.985 2023-06-15 03:01:39,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=12646.666666666666, ans=0.4573666666666667 2023-06-15 03:01:44,409 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.894e+02 2.980e+02 4.845e+02 6.504e+02 1.050e+03, threshold=9.691e+02, percent-clipped=10.0 2023-06-15 03:02:06,217 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.08 vs. limit=12.2925 2023-06-15 03:02:16,776 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=12.2925 2023-06-15 03:02:19,883 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.25 vs. limit=12.2925 2023-06-15 03:02:29,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=12846.666666666666, ans=0.125 2023-06-15 03:02:59,267 INFO [train.py:988] (1/4) Epoch 4, batch 350, loss[loss=0.3813, simple_loss=0.4054, pruned_loss=0.1774, over 19667.00 frames. ], tot_loss[loss=0.4165, simple_loss=0.4255, pruned_loss=0.2011, over 3135488.92 frames. ], batch size: 110, lr: 3.64e-02, grad_scale: 16.0 2023-06-15 03:03:04,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=12980.0, ans=0.1702 2023-06-15 03:03:08,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=12980.0, ans=0.125 2023-06-15 03:03:24,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=13046.666666666666, ans=0.44336666666666674 2023-06-15 03:03:46,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=13113.333333333334, ans=0.44103333333333333 2023-06-15 03:04:05,768 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=9.20 vs. limit=11.59 2023-06-15 03:04:10,350 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.18 vs. limit=8.311666666666667 2023-06-15 03:04:19,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=13246.666666666666, ans=0.011472222222222224 2023-06-15 03:04:29,348 INFO [train.py:988] (1/4) Epoch 4, batch 400, loss[loss=0.4026, simple_loss=0.4147, pruned_loss=0.1952, over 19957.00 frames. ], tot_loss[loss=0.4122, simple_loss=0.4232, pruned_loss=0.1985, over 3291666.81 frames. ], batch size: 126, lr: 3.64e-02, grad_scale: 32.0 2023-06-15 03:04:46,858 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 3.163e+02 4.789e+02 6.291e+02 1.274e+03, threshold=9.578e+02, percent-clipped=4.0 2023-06-15 03:05:12,878 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=9.378666666666668 2023-06-15 03:05:54,166 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.14 vs. 
limit=12.592500000000001 2023-06-15 03:05:59,055 INFO [train.py:988] (1/4) Epoch 4, batch 450, loss[loss=0.3542, simple_loss=0.3899, pruned_loss=0.1592, over 19468.00 frames. ], tot_loss[loss=0.4084, simple_loss=0.4214, pruned_loss=0.1962, over 3411245.01 frames. ], batch size: 105, lr: 3.64e-02, grad_scale: 16.0 2023-06-15 03:06:02,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=13646.666666666666, ans=0.007902898550724638 2023-06-15 03:06:13,676 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.25 vs. limit=11.823333333333334 2023-06-15 03:06:25,132 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=5.55 vs. limit=9.485333333333333 2023-06-15 03:06:43,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=13780.0, ans=10.0 2023-06-15 03:06:54,457 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.29 vs. limit=12.692499999999999 2023-06-15 03:07:24,282 INFO [train.py:988] (1/4) Epoch 4, batch 500, loss[loss=0.3971, simple_loss=0.419, pruned_loss=0.1876, over 19084.00 frames. ], tot_loss[loss=0.4043, simple_loss=0.4203, pruned_loss=0.193, over 3494577.65 frames. ], batch size: 89, lr: 3.63e-02, grad_scale: 16.0 2023-06-15 03:07:29,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=13980.0, ans=0.125 2023-06-15 03:07:34,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=13980.0, ans=0.00841666666666667 2023-06-15 03:07:42,934 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.873e+02 4.186e+02 6.544e+02 1.200e+03, threshold=8.372e+02, percent-clipped=10.0 2023-06-15 03:08:03,819 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 03:08:07,799 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.63 vs. limit=12.056666666666668 2023-06-15 03:08:08,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=14113.333333333334, ans=0.00786111111111111 2023-06-15 03:08:43,614 INFO [train.py:988] (1/4) Epoch 5, batch 0, loss[loss=0.3902, simple_loss=0.4153, pruned_loss=0.1826, over 20118.00 frames. ], tot_loss[loss=0.3902, simple_loss=0.4153, pruned_loss=0.1826, over 20118.00 frames. ], batch size: 133, lr: 3.47e-02, grad_scale: 32.0 2023-06-15 03:08:43,615 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 03:08:49,787 INFO [train.py:1020] (1/4) Epoch 5, validation: loss=0.2868, simple_loss=0.3756, pruned_loss=0.09898, over 143649.00 frames. 2023-06-15 03:08:49,787 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 03:09:16,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=14260.0, ans=0.125 2023-06-15 03:09:35,005 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.45 vs. 
limit=12.163333333333334 2023-06-15 03:09:49,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=14393.333333333334, ans=0.125 2023-06-15 03:10:18,929 INFO [train.py:988] (1/4) Epoch 5, batch 50, loss[loss=0.3831, simple_loss=0.403, pruned_loss=0.1816, over 20560.00 frames. ], tot_loss[loss=0.3846, simple_loss=0.4081, pruned_loss=0.1805, over 868083.59 frames. ], batch size: 189, lr: 3.46e-02, grad_scale: 32.0 2023-06-15 03:10:22,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=14526.666666666666, ans=0.006138888888888888 2023-06-15 03:10:49,086 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 03:11:10,141 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.888e+02 2.824e+02 3.862e+02 4.906e+02 1.527e+03, threshold=7.724e+02, percent-clipped=12.0 2023-06-15 03:11:27,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=14726.666666666666, ans=0.0 2023-06-15 03:11:31,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=14793.333333333334, ans=0.3822333333333333 2023-06-15 03:11:37,224 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.06 vs. limit=13.0475 2023-06-15 03:11:48,660 INFO [train.py:988] (1/4) Epoch 5, batch 100, loss[loss=0.3932, simple_loss=0.3978, pruned_loss=0.1943, over 20270.00 frames. ], tot_loss[loss=0.3809, simple_loss=0.407, pruned_loss=0.1774, over 1524242.29 frames. ], batch size: 239, lr: 3.46e-02, grad_scale: 32.0 2023-06-15 03:12:11,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=14926.666666666666, ans=0.3775666666666667 2023-06-15 03:12:19,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=14926.666666666666, ans=0.05 2023-06-15 03:12:24,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=14993.333333333334, ans=0.4249 2023-06-15 03:12:56,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=15060.0, ans=0.1494 2023-06-15 03:13:17,972 INFO [train.py:988] (1/4) Epoch 5, batch 150, loss[loss=0.3569, simple_loss=0.3945, pruned_loss=0.1597, over 19109.00 frames. ], tot_loss[loss=0.3822, simple_loss=0.4086, pruned_loss=0.1779, over 2017128.69 frames. 
], batch size: 94, lr: 3.46e-02, grad_scale: 32.0 2023-06-15 03:13:28,656 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=15193.333333333334, ans=0.125 2023-06-15 03:14:10,268 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.930e+02 4.326e+02 6.625e+02 9.040e+02, threshold=8.653e+02, percent-clipped=9.0 2023-06-15 03:14:14,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=15393.333333333334, ans=0.007523188405797102 2023-06-15 03:14:18,529 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.34 vs. limit=13.2725 2023-06-15 03:14:19,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=15393.333333333334, ans=0.125 2023-06-15 03:14:26,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=15393.333333333334, ans=0.125 2023-06-15 03:14:34,400 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=5.09 vs. limit=10.184000000000001 2023-06-15 03:14:47,999 INFO [train.py:988] (1/4) Epoch 5, batch 200, loss[loss=0.393, simple_loss=0.4208, pruned_loss=0.1826, over 19318.00 frames. ], tot_loss[loss=0.3813, simple_loss=0.4083, pruned_loss=0.1772, over 2403929.70 frames. ], batch size: 98, lr: 3.45e-02, grad_scale: 32.0 2023-06-15 03:15:05,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=15593.333333333334, ans=0.14406666666666668 2023-06-15 03:15:24,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=15660.0, ans=0.125 2023-06-15 03:15:34,514 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.12 vs. limit=13.372499999999999 2023-06-15 03:15:40,642 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 03:16:06,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=15793.333333333334, ans=0.14206666666666667 2023-06-15 03:16:18,869 INFO [train.py:988] (1/4) Epoch 5, batch 250, loss[loss=0.3959, simple_loss=0.4116, pruned_loss=0.1901, over 20515.00 frames. ], tot_loss[loss=0.3794, simple_loss=0.4066, pruned_loss=0.1761, over 2722641.58 frames. ], batch size: 189, lr: 3.45e-02, grad_scale: 32.0 2023-06-15 03:17:11,395 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 2.885e+02 4.184e+02 6.160e+02 1.201e+03, threshold=8.369e+02, percent-clipped=9.0 2023-06-15 03:17:33,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=16126.666666666666, ans=0.0 2023-06-15 03:17:49,369 INFO [train.py:988] (1/4) Epoch 5, batch 300, loss[loss=0.3613, simple_loss=0.4028, pruned_loss=0.1599, over 19874.00 frames. ], tot_loss[loss=0.3777, simple_loss=0.406, pruned_loss=0.1747, over 2955957.19 frames. 
], batch size: 120, lr: 3.45e-02, grad_scale: 32.0 2023-06-15 03:18:23,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=16326.666666666666, ans=0.3285666666666667 2023-06-15 03:19:19,001 INFO [train.py:988] (1/4) Epoch 5, batch 350, loss[loss=0.3796, simple_loss=0.4042, pruned_loss=0.1775, over 20339.00 frames. ], tot_loss[loss=0.3768, simple_loss=0.4056, pruned_loss=0.174, over 3130739.73 frames. ], batch size: 149, lr: 3.44e-02, grad_scale: 32.0 2023-06-15 03:19:51,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=16593.333333333332, ans=0.125 2023-06-15 03:20:09,917 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.094e+02 2.834e+02 3.544e+02 5.201e+02 9.187e+02, threshold=7.089e+02, percent-clipped=1.0 2023-06-15 03:20:24,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=16726.666666666668, ans=0.007233333333333333 2023-06-15 03:20:47,506 INFO [train.py:988] (1/4) Epoch 5, batch 400, loss[loss=0.3759, simple_loss=0.4065, pruned_loss=0.1726, over 19773.00 frames. ], tot_loss[loss=0.3743, simple_loss=0.4032, pruned_loss=0.1727, over 3284798.95 frames. ], batch size: 115, lr: 3.44e-02, grad_scale: 32.0 2023-06-15 03:20:57,193 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 03:21:26,988 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 03:21:31,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=16993.333333333332, ans=0.125 2023-06-15 03:21:33,559 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.57 vs. limit=20.244999999999997 2023-06-15 03:22:12,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=17126.666666666668, ans=0.007146376811594203 2023-06-15 03:22:17,704 INFO [train.py:988] (1/4) Epoch 5, batch 450, loss[loss=0.3425, simple_loss=0.3885, pruned_loss=0.1482, over 18799.00 frames. ], tot_loss[loss=0.3725, simple_loss=0.4028, pruned_loss=0.1711, over 3397780.69 frames. ], batch size: 83, lr: 3.44e-02, grad_scale: 32.0 2023-06-15 03:22:37,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=17260.0, ans=0.29590000000000005 2023-06-15 03:23:08,865 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 3.161e+02 3.951e+02 6.319e+02 1.120e+03, threshold=7.903e+02, percent-clipped=17.0 2023-06-15 03:23:14,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=17393.333333333332, ans=0.125 2023-06-15 03:23:20,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=17393.333333333332, ans=0.07 2023-06-15 03:23:25,491 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.28 vs. limit=20.544999999999998 2023-06-15 03:23:45,490 INFO [train.py:988] (1/4) Epoch 5, batch 500, loss[loss=0.3491, simple_loss=0.3867, pruned_loss=0.1558, over 19330.00 frames. 
], tot_loss[loss=0.3705, simple_loss=0.4017, pruned_loss=0.1696, over 3494690.19 frames. ], batch size: 98, lr: 3.43e-02, grad_scale: 32.0 2023-06-15 03:23:50,820 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 03:24:00,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=17593.333333333332, ans=0.07 2023-06-15 03:24:09,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=17593.333333333332, ans=0.0 2023-06-15 03:24:20,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=17660.0, ans=0.12340000000000001 2023-06-15 03:25:04,643 INFO [train.py:988] (1/4) Epoch 6, batch 0, loss[loss=0.3514, simple_loss=0.3865, pruned_loss=0.1582, over 20304.00 frames. ], tot_loss[loss=0.3514, simple_loss=0.3865, pruned_loss=0.1582, over 20304.00 frames. ], batch size: 141, lr: 3.27e-02, grad_scale: 32.0 2023-06-15 03:25:04,643 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 03:25:10,743 INFO [train.py:1020] (1/4) Epoch 6, validation: loss=0.268, simple_loss=0.365, pruned_loss=0.08554, over 143649.00 frames. 2023-06-15 03:25:10,744 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 03:25:32,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=17813.333333333332, ans=0.006997101449275362 2023-06-15 03:25:40,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=17813.333333333332, ans=0.0 2023-06-15 03:25:56,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=17880.0, ans=0.125 2023-06-15 03:26:07,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=17946.666666666668, ans=0.12053333333333333 2023-06-15 03:26:25,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=18013.333333333332, ans=0.11986666666666668 2023-06-15 03:26:27,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=18013.333333333332, ans=0.125 2023-06-15 03:26:30,210 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.867e+02 2.488e+02 3.063e+02 4.314e+02 9.185e+02, threshold=6.126e+02, percent-clipped=4.0 2023-06-15 03:26:37,319 INFO [train.py:988] (1/4) Epoch 6, batch 50, loss[loss=0.3422, simple_loss=0.3817, pruned_loss=0.1513, over 20279.00 frames. ], tot_loss[loss=0.3587, simple_loss=0.3951, pruned_loss=0.1612, over 864048.44 frames. 
], batch size: 141, lr: 3.26e-02, grad_scale: 32.0 2023-06-15 03:26:39,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=18080.0, ans=0.0 2023-06-15 03:27:25,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=18213.333333333332, ans=0.006910144927536232 2023-06-15 03:27:32,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=18280.0, ans=0.125 2023-06-15 03:27:44,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=18346.666666666668, ans=0.125 2023-06-15 03:27:56,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=18346.666666666668, ans=0.0 2023-06-15 03:28:02,259 INFO [train.py:988] (1/4) Epoch 6, batch 100, loss[loss=0.3526, simple_loss=0.4002, pruned_loss=0.1525, over 19218.00 frames. ], tot_loss[loss=0.3601, simple_loss=0.3966, pruned_loss=0.1618, over 1500917.49 frames. ], batch size: 92, lr: 3.26e-02, grad_scale: 16.0 2023-06-15 03:28:05,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=18413.333333333332, ans=0.0 2023-06-15 03:28:33,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=18480.0, ans=0.0 2023-06-15 03:29:06,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=18613.333333333332, ans=0.125 2023-06-15 03:29:09,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=18680.0, ans=0.006808695652173913 2023-06-15 03:29:22,868 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.846e+02 2.570e+02 3.710e+02 4.834e+02 1.052e+03, threshold=7.420e+02, percent-clipped=12.0 2023-06-15 03:29:27,805 INFO [train.py:988] (1/4) Epoch 6, batch 150, loss[loss=0.3586, simple_loss=0.4035, pruned_loss=0.1568, over 18257.00 frames. ], tot_loss[loss=0.3571, simple_loss=0.3943, pruned_loss=0.16, over 2000802.33 frames. ], batch size: 74, lr: 3.25e-02, grad_scale: 16.0 2023-06-15 03:29:28,219 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=18746.666666666668, ans=0.125 2023-06-15 03:29:28,783 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.66 vs. limit=14.530000000000001 2023-06-15 03:29:29,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=18746.666666666668, ans=0.11253333333333332 2023-06-15 03:29:39,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=18746.666666666668, ans=0.006794202898550724 2023-06-15 03:30:33,422 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 03:30:54,543 INFO [train.py:988] (1/4) Epoch 6, batch 200, loss[loss=0.3485, simple_loss=0.3811, pruned_loss=0.158, over 20567.00 frames. ], tot_loss[loss=0.3552, simple_loss=0.393, pruned_loss=0.1587, over 2395228.65 frames. 
], batch size: 173, lr: 3.25e-02, grad_scale: 16.0 2023-06-15 03:31:08,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=19080.0, ans=0.0 2023-06-15 03:31:38,390 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.11 vs. limit=5.882 2023-06-15 03:31:39,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=19213.333333333332, ans=0.006692753623188406 2023-06-15 03:31:44,683 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.18 vs. limit=21.96 2023-06-15 03:32:02,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=19346.666666666668, ans=0.125 2023-06-15 03:32:15,025 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.84 vs. limit=14.754999999999999 2023-06-15 03:32:15,537 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.887e+02 2.813e+02 3.502e+02 4.540e+02 9.537e+02, threshold=7.003e+02, percent-clipped=3.0 2023-06-15 03:32:20,492 INFO [train.py:988] (1/4) Epoch 6, batch 250, loss[loss=0.3462, simple_loss=0.388, pruned_loss=0.1522, over 19656.00 frames. ], tot_loss[loss=0.3548, simple_loss=0.3924, pruned_loss=0.1586, over 2698093.90 frames. ], batch size: 110, lr: 3.25e-02, grad_scale: 16.0 2023-06-15 03:32:33,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=19413.333333333332, ans=0.025 2023-06-15 03:32:54,566 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=14.83 2023-06-15 03:33:02,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=19546.666666666668, ans=0.006620289855072464 2023-06-15 03:33:10,886 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.73 vs. limit=14.806666666666665 2023-06-15 03:33:11,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=19613.333333333332, ans=0.125 2023-06-15 03:33:18,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=19613.333333333332, ans=0.07 2023-06-15 03:33:23,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=19613.333333333332, ans=0.0 2023-06-15 03:33:46,375 INFO [train.py:988] (1/4) Epoch 6, batch 300, loss[loss=0.3252, simple_loss=0.3682, pruned_loss=0.1411, over 19125.00 frames. ], tot_loss[loss=0.354, simple_loss=0.392, pruned_loss=0.158, over 2920432.72 frames. 
], batch size: 94, lr: 3.24e-02, grad_scale: 16.0 2023-06-15 03:34:11,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=19813.333333333332, ans=0.0 2023-06-15 03:34:47,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=19946.666666666668, ans=0.0 2023-06-15 03:34:57,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=20013.333333333332, ans=0.125 2023-06-15 03:34:58,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=20013.333333333332, ans=0.125 2023-06-15 03:35:08,279 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.795e+02 3.555e+02 5.115e+02 9.485e+02, threshold=7.110e+02, percent-clipped=8.0 2023-06-15 03:35:10,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=20013.333333333332, ans=0.2 2023-06-15 03:35:13,425 INFO [train.py:988] (1/4) Epoch 6, batch 350, loss[loss=0.3556, simple_loss=0.398, pruned_loss=0.1565, over 19058.00 frames. ], tot_loss[loss=0.3531, simple_loss=0.3912, pruned_loss=0.1575, over 3120321.46 frames. ], batch size: 89, lr: 3.24e-02, grad_scale: 16.0 2023-06-15 03:35:15,896 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.21 vs. limit=10.0 2023-06-15 03:35:46,632 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.65 vs. limit=15.0 2023-06-15 03:35:51,435 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-15 03:36:11,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=20280.0, ans=22.5 2023-06-15 03:36:12,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=20280.0, ans=0.0 2023-06-15 03:36:22,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=20346.666666666668, ans=0.125 2023-06-15 03:36:29,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=20346.666666666668, ans=0.125 2023-06-15 03:36:40,616 INFO [train.py:988] (1/4) Epoch 6, batch 400, loss[loss=0.3235, simple_loss=0.3765, pruned_loss=0.1353, over 19829.00 frames. ], tot_loss[loss=0.3517, simple_loss=0.3911, pruned_loss=0.1562, over 3275019.40 frames. 
], batch size: 115, lr: 3.24e-02, grad_scale: 32.0 2023-06-15 03:37:10,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=20480.0, ans=0.006417391304347826 2023-06-15 03:37:32,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=20613.333333333332, ans=0.1 2023-06-15 03:37:39,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=20613.333333333332, ans=0.2 2023-06-15 03:38:01,338 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.966e+02 2.580e+02 3.237e+02 4.539e+02 1.447e+03, threshold=6.474e+02, percent-clipped=10.0 2023-06-15 03:38:06,902 INFO [train.py:988] (1/4) Epoch 6, batch 450, loss[loss=0.3314, simple_loss=0.3747, pruned_loss=0.144, over 18804.00 frames. ], tot_loss[loss=0.3506, simple_loss=0.3902, pruned_loss=0.1555, over 3370802.32 frames. ], batch size: 83, lr: 3.23e-02, grad_scale: 32.0 2023-06-15 03:38:28,107 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.76 vs. limit=15.0 2023-06-15 03:38:58,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=20946.666666666668, ans=0.125 2023-06-15 03:39:18,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=21013.333333333332, ans=0.0 2023-06-15 03:39:31,071 INFO [train.py:988] (1/4) Epoch 6, batch 500, loss[loss=0.383, simple_loss=0.4292, pruned_loss=0.1684, over 16097.00 frames. ], tot_loss[loss=0.3501, simple_loss=0.3898, pruned_loss=0.1552, over 3464126.09 frames. ], batch size: 51, lr: 3.23e-02, grad_scale: 32.0 2023-06-15 03:39:34,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=21080.0, ans=0.125 2023-06-15 03:39:49,394 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=21146.666666666668, ans=0.125 2023-06-15 03:39:52,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=21146.666666666668, ans=0.1 2023-06-15 03:40:14,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=21213.333333333332, ans=0.125 2023-06-15 03:40:21,050 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.74 vs. limit=6.0 2023-06-15 03:40:50,455 INFO [train.py:988] (1/4) Epoch 7, batch 0, loss[loss=0.3615, simple_loss=0.3978, pruned_loss=0.1626, over 19206.00 frames. ], tot_loss[loss=0.3615, simple_loss=0.3978, pruned_loss=0.1626, over 19206.00 frames. ], batch size: 92, lr: 3.07e-02, grad_scale: 32.0 2023-06-15 03:40:50,455 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 03:40:56,361 INFO [train.py:1020] (1/4) Epoch 7, validation: loss=0.2561, simple_loss=0.3562, pruned_loss=0.07803, over 143649.00 frames. 
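Note on the logged loss fields: the `loss`, `simple_loss`, and `pruned_loss` values reported by train.py appear consistent with the total being `0.5 * simple_loss + pruned_loss` (e.g., 0.5 * 0.3562 + 0.07803 ≈ 0.2561 for the Epoch 7 validation entry above), and `tot_loss ... over N frames` looks like a frame-weighted running average over the batches seen so far. The snippet below is only an illustrative sketch of that bookkeeping, not icefall's implementation; the helper names and the fixed 0.5 scale are assumptions.

```python
# Illustrative sketch (assumption): how the logged loss fields could be combined
# and accumulated. Names and the 0.5 simple-loss scale are hypothetical, not
# taken from icefall's train.py.

def combine_losses(simple_loss: float, pruned_loss: float,
                   simple_scale: float = 0.5) -> float:
    """Total loss as a weighted sum of the simple and pruned transducer losses."""
    return simple_scale * simple_loss + pruned_loss


class RunningLoss:
    """Frame-weighted running average, matching the 'tot_loss ... over N frames' style."""

    def __init__(self) -> None:
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss: float, batch_frames: float) -> None:
        # Weight each batch's loss by the number of frames it covers.
        self.loss_sum += batch_loss * batch_frames
        self.frames += batch_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)


# Example: reproduce the validation line above (values copied from the log).
val_loss = combine_losses(simple_loss=0.3562, pruned_loss=0.07803)
print(f"validation: loss={val_loss:.4f}")  # ~0.2561
```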
2023-06-15 03:40:56,362 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 03:41:01,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=21300.0, ans=0.125 2023-06-15 03:41:19,474 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.803e+02 2.547e+02 3.083e+02 4.575e+02 1.238e+03, threshold=6.165e+02, percent-clipped=14.0 2023-06-15 03:41:56,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=21500.0, ans=0.2 2023-06-15 03:41:56,994 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.10 vs. limit=22.5 2023-06-15 03:42:00,511 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.89 vs. limit=6.0 2023-06-15 03:42:20,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=21633.333333333332, ans=0.125 2023-06-15 03:42:21,400 INFO [train.py:988] (1/4) Epoch 7, batch 50, loss[loss=0.334, simple_loss=0.3783, pruned_loss=0.1448, over 18448.00 frames. ], tot_loss[loss=0.3364, simple_loss=0.3834, pruned_loss=0.1447, over 850573.47 frames. ], batch size: 77, lr: 3.07e-02, grad_scale: 32.0 2023-06-15 03:42:28,120 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=21633.333333333332, ans=0.125 2023-06-15 03:42:57,119 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=5.58 vs. limit=12.0 2023-06-15 03:43:28,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=21833.333333333332, ans=0.125 2023-06-15 03:43:35,627 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.28 vs. limit=15.0 2023-06-15 03:43:49,171 INFO [train.py:988] (1/4) Epoch 7, batch 100, loss[loss=0.318, simple_loss=0.3694, pruned_loss=0.1333, over 18621.00 frames. ], tot_loss[loss=0.3368, simple_loss=0.3793, pruned_loss=0.1471, over 1493929.36 frames. ], batch size: 80, lr: 3.06e-02, grad_scale: 32.0 2023-06-15 03:43:59,335 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.90 vs. 
limit=15.0 2023-06-15 03:44:06,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=22033.333333333332, ans=0.125 2023-06-15 03:44:13,040 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.700e+02 2.359e+02 2.777e+02 3.733e+02 1.026e+03, threshold=5.554e+02, percent-clipped=6.0 2023-06-15 03:44:18,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=22033.333333333332, ans=0.1 2023-06-15 03:44:24,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=22100.0, ans=0.0 2023-06-15 03:45:13,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=22233.333333333332, ans=0.015 2023-06-15 03:45:16,406 INFO [train.py:988] (1/4) Epoch 7, batch 150, loss[loss=0.317, simple_loss=0.3636, pruned_loss=0.1352, over 10878.00 frames. ], tot_loss[loss=0.3381, simple_loss=0.3804, pruned_loss=0.1479, over 1992199.61 frames. ], batch size: 30, lr: 3.06e-02, grad_scale: 32.0 2023-06-15 03:45:23,455 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.19 vs. limit=22.5 2023-06-15 03:45:30,956 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.44 vs. limit=15.0 2023-06-15 03:45:33,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=22366.666666666668, ans=0.006007246376811594 2023-06-15 03:46:21,412 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 03:46:23,979 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.50 vs. limit=15.0 2023-06-15 03:46:44,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=22633.333333333332, ans=0.005949275362318841 2023-06-15 03:46:45,792 INFO [train.py:988] (1/4) Epoch 7, batch 200, loss[loss=0.355, simple_loss=0.4083, pruned_loss=0.1509, over 16413.00 frames. ], tot_loss[loss=0.3381, simple_loss=0.3815, pruned_loss=0.1473, over 2387775.62 frames. ], batch size: 52, lr: 3.05e-02, grad_scale: 32.0 2023-06-15 03:47:07,871 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.39 vs. limit=6.0 2023-06-15 03:47:08,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=22700.0, ans=0.1 2023-06-15 03:47:11,648 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.670e+02 3.468e+02 4.313e+02 7.598e+02, threshold=6.936e+02, percent-clipped=8.0 2023-06-15 03:47:19,359 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.99 vs. 
limit=22.5 2023-06-15 03:47:20,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=22766.666666666668, ans=0.07 2023-06-15 03:47:51,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=22833.333333333332, ans=0.2 2023-06-15 03:47:58,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=22900.0, ans=0.2 2023-06-15 03:48:06,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=22900.0, ans=0.2 2023-06-15 03:48:12,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=22900.0, ans=0.005891304347826087 2023-06-15 03:48:15,528 INFO [train.py:988] (1/4) Epoch 7, batch 250, loss[loss=0.3206, simple_loss=0.3706, pruned_loss=0.1353, over 19111.00 frames. ], tot_loss[loss=0.3385, simple_loss=0.381, pruned_loss=0.148, over 2708295.83 frames. ], batch size: 94, lr: 3.05e-02, grad_scale: 32.0 2023-06-15 03:48:44,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=23033.333333333332, ans=0.2 2023-06-15 03:48:49,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=23100.0, ans=0.125 2023-06-15 03:49:07,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=23166.666666666668, ans=0.125 2023-06-15 03:49:32,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=23233.333333333332, ans=0.125 2023-06-15 03:49:38,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=23233.333333333332, ans=0.1 2023-06-15 03:49:44,559 INFO [train.py:988] (1/4) Epoch 7, batch 300, loss[loss=0.3432, simple_loss=0.3839, pruned_loss=0.1513, over 20298.00 frames. ], tot_loss[loss=0.3394, simple_loss=0.3818, pruned_loss=0.1484, over 2938626.13 frames. ], batch size: 141, lr: 3.05e-02, grad_scale: 32.0 2023-06-15 03:50:00,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=23366.666666666668, ans=0.125 2023-06-15 03:50:03,291 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.48 vs. limit=12.0 2023-06-15 03:50:08,800 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.792e+02 2.398e+02 2.879e+02 3.819e+02 6.544e+02, threshold=5.757e+02, percent-clipped=0.0 2023-06-15 03:50:11,342 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.39 vs. 
limit=12.0 2023-06-15 03:50:14,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=23366.666666666668, ans=0.005789855072463768 2023-06-15 03:50:21,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=23433.333333333332, ans=0.125 2023-06-15 03:50:39,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=23500.0, ans=0.04949747468305833 2023-06-15 03:51:12,756 INFO [train.py:988] (1/4) Epoch 7, batch 350, loss[loss=0.3347, simple_loss=0.3636, pruned_loss=0.1529, over 20288.00 frames. ], tot_loss[loss=0.3372, simple_loss=0.3795, pruned_loss=0.1475, over 3122003.88 frames. ], batch size: 239, lr: 3.04e-02, grad_scale: 32.0 2023-06-15 03:51:23,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=23633.333333333332, ans=0.125 2023-06-15 03:51:42,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=23700.0, ans=0.125 2023-06-15 03:51:43,034 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 03:51:43,858 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.95 vs. limit=15.0 2023-06-15 03:52:41,269 INFO [train.py:988] (1/4) Epoch 7, batch 400, loss[loss=0.3066, simple_loss=0.3656, pruned_loss=0.1238, over 18309.00 frames. ], tot_loss[loss=0.3374, simple_loss=0.3799, pruned_loss=0.1474, over 3272747.90 frames. ], batch size: 74, lr: 3.04e-02, grad_scale: 32.0 2023-06-15 03:52:44,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=23966.666666666668, ans=0.0 2023-06-15 03:53:01,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=24033.333333333332, ans=0.0 2023-06-15 03:53:06,052 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.768e+02 3.077e+02 3.837e+02 5.079e+02 9.527e+02, threshold=7.674e+02, percent-clipped=15.0 2023-06-15 03:53:31,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=24100.0, ans=0.125 2023-06-15 03:53:34,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=24166.666666666668, ans=0.125 2023-06-15 03:53:41,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=24166.666666666668, ans=0.125 2023-06-15 03:53:43,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=24166.666666666668, ans=0.005615942028985507 2023-06-15 03:53:59,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=24233.333333333332, ans=10.0 2023-06-15 03:54:02,978 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.86 vs. limit=15.0 2023-06-15 03:54:09,648 INFO [train.py:988] (1/4) Epoch 7, batch 450, loss[loss=0.3349, simple_loss=0.3816, pruned_loss=0.1441, over 18910.00 frames. 
], tot_loss[loss=0.3372, simple_loss=0.381, pruned_loss=0.1467, over 3375877.00 frames. ], batch size: 86, lr: 3.04e-02, grad_scale: 32.0 2023-06-15 03:54:13,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=24300.0, ans=0.125 2023-06-15 03:54:29,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=24366.666666666668, ans=0.125 2023-06-15 03:54:35,317 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 03:54:37,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=24366.666666666668, ans=0.0 2023-06-15 03:54:58,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=24433.333333333332, ans=0.125 2023-06-15 03:55:35,137 INFO [train.py:988] (1/4) Epoch 7, batch 500, loss[loss=0.3539, simple_loss=0.386, pruned_loss=0.1609, over 20082.00 frames. ], tot_loss[loss=0.3352, simple_loss=0.379, pruned_loss=0.1457, over 3470415.14 frames. ], batch size: 133, lr: 3.03e-02, grad_scale: 32.0 2023-06-15 03:55:58,186 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.744e+02 3.251e+02 4.322e+02 7.093e+02, threshold=6.501e+02, percent-clipped=0.0 2023-06-15 03:56:02,543 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.18 vs. limit=6.0 2023-06-15 03:56:03,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=24700.0, ans=0.2 2023-06-15 03:56:13,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=24766.666666666668, ans=0.035 2023-06-15 03:56:21,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=24766.666666666668, ans=10.0 2023-06-15 03:56:53,610 INFO [train.py:988] (1/4) Epoch 8, batch 0, loss[loss=0.3227, simple_loss=0.3632, pruned_loss=0.1411, over 20475.00 frames. ], tot_loss[loss=0.3227, simple_loss=0.3632, pruned_loss=0.1411, over 20475.00 frames. ], batch size: 189, lr: 2.89e-02, grad_scale: 32.0 2023-06-15 03:56:53,610 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 03:56:59,696 INFO [train.py:1020] (1/4) Epoch 8, validation: loss=0.2482, simple_loss=0.3483, pruned_loss=0.0741, over 143649.00 frames. 2023-06-15 03:56:59,697 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 03:57:31,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=24913.333333333332, ans=0.125 2023-06-15 03:57:31,923 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.73 vs. limit=22.5 2023-06-15 03:57:33,911 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.13 vs. 
limit=15.0 2023-06-15 03:57:52,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=25046.666666666668, ans=0.0 2023-06-15 03:58:12,353 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.69 vs. limit=15.0 2023-06-15 03:58:26,351 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 03:58:28,211 INFO [train.py:988] (1/4) Epoch 8, batch 50, loss[loss=0.3288, simple_loss=0.3748, pruned_loss=0.1414, over 19485.00 frames. ], tot_loss[loss=0.3253, simple_loss=0.3735, pruned_loss=0.1385, over 864848.42 frames. ], batch size: 105, lr: 2.88e-02, grad_scale: 32.0 2023-06-15 03:58:40,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=25180.0, ans=0.125 2023-06-15 03:59:25,151 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.798e+02 2.660e+02 2.942e+02 3.457e+02 5.575e+02, threshold=5.885e+02, percent-clipped=0.0 2023-06-15 03:59:30,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=25380.0, ans=0.0053521739130434785 2023-06-15 03:59:36,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=25380.0, ans=0.125 2023-06-15 03:59:57,951 INFO [train.py:988] (1/4) Epoch 8, batch 100, loss[loss=0.3342, simple_loss=0.3956, pruned_loss=0.1364, over 16524.00 frames. ], tot_loss[loss=0.3269, simple_loss=0.3729, pruned_loss=0.1405, over 1508596.49 frames. ], batch size: 52, lr: 2.88e-02, grad_scale: 32.0 2023-06-15 04:01:13,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=25780.0, ans=0.005265217391304347 2023-06-15 04:01:16,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=25780.0, ans=0.2 2023-06-15 04:01:27,366 INFO [train.py:988] (1/4) Epoch 8, batch 150, loss[loss=0.3318, simple_loss=0.3661, pruned_loss=0.1487, over 20678.00 frames. ], tot_loss[loss=0.3269, simple_loss=0.3734, pruned_loss=0.1402, over 2029627.12 frames. ], batch size: 211, lr: 2.87e-02, grad_scale: 32.0 2023-06-15 04:01:42,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=25913.333333333332, ans=0.125 2023-06-15 04:01:55,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=25913.333333333332, ans=0.125 2023-06-15 04:02:10,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=25980.0, ans=0.125 2023-06-15 04:02:12,594 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.60 vs. limit=15.0 2023-06-15 04:02:23,610 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.658e+02 2.617e+02 3.387e+02 4.495e+02 9.103e+02, threshold=6.774e+02, percent-clipped=7.0 2023-06-15 04:02:27,813 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.20 vs. 
limit=15.0 2023-06-15 04:02:42,100 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.37 vs. limit=15.0 2023-06-15 04:02:46,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=26113.333333333332, ans=0.0 2023-06-15 04:02:55,972 INFO [train.py:988] (1/4) Epoch 8, batch 200, loss[loss=0.331, simple_loss=0.3798, pruned_loss=0.1411, over 20119.00 frames. ], tot_loss[loss=0.3261, simple_loss=0.3729, pruned_loss=0.1397, over 2426601.11 frames. ], batch size: 133, lr: 2.87e-02, grad_scale: 32.0 2023-06-15 04:03:27,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=26246.666666666668, ans=0.0 2023-06-15 04:03:44,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=26313.333333333332, ans=0.005149275362318841 2023-06-15 04:03:49,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=26380.0, ans=0.0 2023-06-15 04:03:55,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=26380.0, ans=0.125 2023-06-15 04:04:05,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=26446.666666666668, ans=0.025 2023-06-15 04:04:25,054 INFO [train.py:988] (1/4) Epoch 8, batch 250, loss[loss=0.3331, simple_loss=0.393, pruned_loss=0.1365, over 18333.00 frames. ], tot_loss[loss=0.3257, simple_loss=0.3726, pruned_loss=0.1394, over 2710457.55 frames. ], batch size: 72, lr: 2.87e-02, grad_scale: 32.0 2023-06-15 04:04:33,339 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.73 vs. limit=22.5 2023-06-15 04:05:25,744 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 2.237e+02 2.856e+02 3.826e+02 6.923e+02, threshold=5.713e+02, percent-clipped=1.0 2023-06-15 04:05:29,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=26713.333333333332, ans=0.125 2023-06-15 04:05:44,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=26780.0, ans=15.0 2023-06-15 04:05:45,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=26780.0, ans=0.0 2023-06-15 04:05:52,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=26780.0, ans=0.2 2023-06-15 04:05:57,932 INFO [train.py:988] (1/4) Epoch 8, batch 300, loss[loss=0.3397, simple_loss=0.3771, pruned_loss=0.1512, over 20004.00 frames. ], tot_loss[loss=0.324, simple_loss=0.3716, pruned_loss=0.1382, over 2958084.97 frames. 
], batch size: 126, lr: 2.86e-02, grad_scale: 32.0 2023-06-15 04:05:58,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=26846.666666666668, ans=0.1 2023-06-15 04:06:09,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=26846.666666666668, ans=0.025 2023-06-15 04:06:56,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=27046.666666666668, ans=0.2 2023-06-15 04:07:04,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=27046.666666666668, ans=0.1 2023-06-15 04:07:26,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=27180.0, ans=0.125 2023-06-15 04:07:27,554 INFO [train.py:988] (1/4) Epoch 8, batch 350, loss[loss=0.3371, simple_loss=0.3912, pruned_loss=0.1415, over 18332.00 frames. ], tot_loss[loss=0.3231, simple_loss=0.3709, pruned_loss=0.1377, over 3147150.59 frames. ], batch size: 72, lr: 2.86e-02, grad_scale: 32.0 2023-06-15 04:08:09,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=27313.333333333332, ans=0.1 2023-06-15 04:08:09,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=27313.333333333332, ans=0.2 2023-06-15 04:08:20,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=27380.0, ans=0.0 2023-06-15 04:08:23,470 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 04:08:24,603 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.830e+02 2.602e+02 3.097e+02 4.121e+02 7.485e+02, threshold=6.195e+02, percent-clipped=4.0 2023-06-15 04:08:56,013 INFO [train.py:988] (1/4) Epoch 8, batch 400, loss[loss=0.3481, simple_loss=0.3668, pruned_loss=0.1647, over 19905.00 frames. ], tot_loss[loss=0.323, simple_loss=0.3708, pruned_loss=0.1376, over 3297970.30 frames. ], batch size: 293, lr: 2.85e-02, grad_scale: 32.0 2023-06-15 04:08:59,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=27513.333333333332, ans=0.0 2023-06-15 04:09:00,996 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.38 vs. limit=10.0 2023-06-15 04:09:12,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=27513.333333333332, ans=0.0 2023-06-15 04:09:24,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=27580.0, ans=0.125 2023-06-15 04:09:29,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=27580.0, ans=0.0 2023-06-15 04:09:35,390 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.58 vs. 
limit=22.5 2023-06-15 04:09:51,856 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.73 vs. limit=15.0 2023-06-15 04:09:53,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=27713.333333333332, ans=0.07 2023-06-15 04:10:07,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=27780.0, ans=0.025 2023-06-15 04:10:19,511 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.49 vs. limit=15.0 2023-06-15 04:10:24,368 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.31 vs. limit=15.0 2023-06-15 04:10:26,997 INFO [train.py:988] (1/4) Epoch 8, batch 450, loss[loss=0.3036, simple_loss=0.3608, pruned_loss=0.1232, over 19868.00 frames. ], tot_loss[loss=0.3231, simple_loss=0.371, pruned_loss=0.1376, over 3400765.26 frames. ], batch size: 120, lr: 2.85e-02, grad_scale: 32.0 2023-06-15 04:10:39,209 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.70 vs. limit=10.0 2023-06-15 04:10:43,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=27913.333333333332, ans=0.004801449275362319 2023-06-15 04:10:49,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=27913.333333333332, ans=0.125 2023-06-15 04:10:56,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=27913.333333333332, ans=0.0 2023-06-15 04:10:58,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=27913.333333333332, ans=0.125 2023-06-15 04:11:08,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=27980.0, ans=0.04949747468305833 2023-06-15 04:11:20,251 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=28046.666666666668, ans=0.125 2023-06-15 04:11:23,227 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.630e+02 2.599e+02 3.032e+02 3.753e+02 5.594e+02, threshold=6.064e+02, percent-clipped=0.0 2023-06-15 04:11:38,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=28113.333333333332, ans=0.125 2023-06-15 04:11:45,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=28113.333333333332, ans=0.125 2023-06-15 04:11:53,601 INFO [train.py:988] (1/4) Epoch 8, batch 500, loss[loss=0.32, simple_loss=0.3579, pruned_loss=0.1411, over 20478.00 frames. ], tot_loss[loss=0.3222, simple_loss=0.3708, pruned_loss=0.1368, over 3475933.34 frames. ], batch size: 160, lr: 2.85e-02, grad_scale: 32.0 2023-06-15 04:12:01,045 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.35 vs. 
limit=10.0 2023-06-15 04:12:31,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=28313.333333333332, ans=0.0 2023-06-15 04:12:32,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=28313.333333333332, ans=0.1 2023-06-15 04:12:34,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=28313.333333333332, ans=0.0 2023-06-15 04:12:42,431 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 04:13:14,344 INFO [train.py:988] (1/4) Epoch 9, batch 0, loss[loss=0.3065, simple_loss=0.3598, pruned_loss=0.1266, over 19850.00 frames. ], tot_loss[loss=0.3065, simple_loss=0.3598, pruned_loss=0.1266, over 19850.00 frames. ], batch size: 120, lr: 2.72e-02, grad_scale: 32.0 2023-06-15 04:13:14,344 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 04:13:20,343 INFO [train.py:1020] (1/4) Epoch 9, validation: loss=0.2394, simple_loss=0.343, pruned_loss=0.06786, over 143649.00 frames. 2023-06-15 04:13:20,344 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 04:13:26,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=28393.333333333332, ans=0.125 2023-06-15 04:13:35,618 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.93 vs. limit=10.0 2023-06-15 04:13:39,003 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.36 vs. limit=6.0 2023-06-15 04:13:49,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=28460.0, ans=0.5 2023-06-15 04:14:15,162 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.13 vs. limit=22.5 2023-06-15 04:14:33,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=28660.0, ans=0.004639130434782609 2023-06-15 04:14:38,914 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.62 vs. limit=6.0 2023-06-15 04:14:47,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=28726.666666666668, ans=0.2 2023-06-15 04:14:49,778 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.336e+02 2.823e+02 3.585e+02 6.203e+02, threshold=5.645e+02, percent-clipped=2.0 2023-06-15 04:14:49,825 INFO [train.py:988] (1/4) Epoch 9, batch 50, loss[loss=0.3294, simple_loss=0.3629, pruned_loss=0.1479, over 20366.00 frames. ], tot_loss[loss=0.318, simple_loss=0.3645, pruned_loss=0.1357, over 863863.24 frames. 
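], batch size: 239, lr: 2.71e-02, grad_scale: 32.0

(Note on the [optim.py:471] lines, such as the one a few records above: they periodically report the quartiles of recent gradient norms, the clipping threshold derived from them, and the percentage of batches that were clipped. Throughout this log the threshold is very close to Clipping_scale (2.0) times the logged median, e.g. 2.0 * 2.823e+02 ≈ 5.645e+02 above. The sketch below only illustrates such a quartile-based rule; the class and method names are made up for the example, and this is not the implementation in optim.py.)

    import torch

    class GradNormClipper:
        """Illustrative only: track recent gradient norms, report their
        quartiles, and clip to clipping_scale * median, mirroring the
        quantities printed by optim.py in this log."""

        def __init__(self, clipping_scale=2.0, window=128):
            self.clipping_scale = clipping_scale
            self.window = window
            self.norms = []

        def step(self, model):
            # Total gradient norm over all parameters for this batch.
            grads = [p.grad for p in model.parameters() if p.grad is not None]
            total_norm = torch.norm(torch.stack([g.detach().norm() for g in grads]))
            self.norms = (self.norms + [float(total_norm)])[-self.window:]

            s = sorted(self.norms)
            n = len(s)
            quartiles = [s[0], s[n // 4], s[n // 2], s[(3 * n) // 4], s[-1]]
            threshold = self.clipping_scale * quartiles[2]  # 2.0 * median

            clipped = float(total_norm) > threshold
            if clipped:
                for g in grads:  # scale every gradient down to the threshold
                    g.mul_(threshold / float(total_norm))
            return quartiles, threshold, clipped
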
2023-06-15 04:15:02,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=28726.666666666668, ans=0.0 2023-06-15 04:15:10,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=28793.333333333332, ans=0.1 2023-06-15 04:15:24,989 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 04:15:33,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=28860.0, ans=0.125 2023-06-15 04:15:55,137 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.04 vs. limit=22.5 2023-06-15 04:16:14,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=28993.333333333332, ans=0.125 2023-06-15 04:16:17,508 INFO [train.py:988] (1/4) Epoch 9, batch 100, loss[loss=0.3187, simple_loss=0.3697, pruned_loss=0.1339, over 19067.00 frames. ], tot_loss[loss=0.3174, simple_loss=0.366, pruned_loss=0.1343, over 1501363.04 frames. ], batch size: 89, lr: 2.71e-02, grad_scale: 32.0 2023-06-15 04:16:38,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=29126.666666666668, ans=0.0 2023-06-15 04:17:04,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=29193.333333333332, ans=0.2 2023-06-15 04:17:06,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=29193.333333333332, ans=0.125 2023-06-15 04:17:11,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=29260.0, ans=0.125 2023-06-15 04:17:14,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=29260.0, ans=0.125 2023-06-15 04:17:16,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=29260.0, ans=0.2 2023-06-15 04:17:32,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=29326.666666666668, ans=0.004494202898550724 2023-06-15 04:17:44,303 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.757e+02 2.451e+02 3.023e+02 4.117e+02 8.643e+02, threshold=6.045e+02, percent-clipped=4.0 2023-06-15 04:17:44,349 INFO [train.py:988] (1/4) Epoch 9, batch 150, loss[loss=0.336, simple_loss=0.3721, pruned_loss=0.15, over 20694.00 frames. ], tot_loss[loss=0.316, simple_loss=0.3663, pruned_loss=0.1329, over 1995033.70 frames. ], batch size: 211, lr: 2.70e-02, grad_scale: 32.0 2023-06-15 04:17:48,839 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.26 vs. limit=15.0 2023-06-15 04:18:13,584 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.26 vs.
limit=22.5 2023-06-15 04:18:26,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=29526.666666666668, ans=10.0 2023-06-15 04:19:06,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=29660.0, ans=0.1 2023-06-15 04:19:10,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=29726.666666666668, ans=0.2 2023-06-15 04:19:12,154 INFO [train.py:988] (1/4) Epoch 9, batch 200, loss[loss=0.336, simple_loss=0.3613, pruned_loss=0.1553, over 20177.00 frames. ], tot_loss[loss=0.3146, simple_loss=0.3656, pruned_loss=0.1318, over 2391695.64 frames. ], batch size: 239, lr: 2.70e-02, grad_scale: 32.0 2023-06-15 04:19:26,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=29726.666666666668, ans=0.09899494936611666 2023-06-15 04:19:30,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=29793.333333333332, ans=0.125 2023-06-15 04:19:41,992 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.82 vs. limit=15.0 2023-06-15 04:19:53,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=29860.0, ans=0.2 2023-06-15 04:20:20,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=29926.666666666668, ans=0.004363768115942028 2023-06-15 04:20:23,811 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 04:20:32,843 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=15.0 2023-06-15 04:20:41,730 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.780e+02 2.377e+02 2.781e+02 3.474e+02 5.000e+02, threshold=5.562e+02, percent-clipped=0.0 2023-06-15 04:20:41,777 INFO [train.py:988] (1/4) Epoch 9, batch 250, loss[loss=0.2845, simple_loss=0.3455, pruned_loss=0.1117, over 18913.00 frames. ], tot_loss[loss=0.3138, simple_loss=0.3652, pruned_loss=0.1312, over 2698498.68 frames. ], batch size: 86, lr: 2.70e-02, grad_scale: 32.0 2023-06-15 04:21:11,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=30126.666666666668, ans=0.2 2023-06-15 04:21:31,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=30193.333333333332, ans=0.004305797101449275 2023-06-15 04:21:56,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=30326.666666666668, ans=0.1 2023-06-15 04:22:01,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=30326.666666666668, ans=0.2 2023-06-15 04:22:09,958 INFO [train.py:988] (1/4) Epoch 9, batch 300, loss[loss=0.3132, simple_loss=0.3787, pruned_loss=0.1239, over 15401.00 frames. ], tot_loss[loss=0.3141, simple_loss=0.3652, pruned_loss=0.1315, over 2926075.25 frames. 
], batch size: 44, lr: 2.69e-02, grad_scale: 32.0 2023-06-15 04:22:14,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=30393.333333333332, ans=0.125 2023-06-15 04:22:17,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=30393.333333333332, ans=0.07 2023-06-15 04:22:31,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=30460.0, ans=0.0042478260869565215 2023-06-15 04:22:46,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=30526.666666666668, ans=0.07 2023-06-15 04:23:07,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=30593.333333333332, ans=0.125 2023-06-15 04:23:19,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=30660.0, ans=0.1 2023-06-15 04:23:24,901 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.73 vs. limit=15.0 2023-06-15 04:23:31,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=30660.0, ans=0.004204347826086956 2023-06-15 04:23:40,474 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.682e+02 3.189e+02 4.179e+02 7.690e+02, threshold=6.378e+02, percent-clipped=10.0 2023-06-15 04:23:40,525 INFO [train.py:988] (1/4) Epoch 9, batch 350, loss[loss=0.2839, simple_loss=0.3414, pruned_loss=0.1132, over 19816.00 frames. ], tot_loss[loss=0.3123, simple_loss=0.3646, pruned_loss=0.13, over 3114795.84 frames. ], batch size: 120, lr: 2.69e-02, grad_scale: 32.0 2023-06-15 04:23:40,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=30726.666666666668, ans=0.125 2023-06-15 04:23:43,432 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-15 04:23:44,551 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.77 vs. limit=22.5 2023-06-15 04:24:19,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=30860.0, ans=0.1 2023-06-15 04:24:30,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=30860.0, ans=0.125 2023-06-15 04:25:05,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=30993.333333333332, ans=0.2 2023-06-15 04:25:09,354 INFO [train.py:988] (1/4) Epoch 9, batch 400, loss[loss=0.3056, simple_loss=0.3533, pruned_loss=0.129, over 20559.00 frames. ], tot_loss[loss=0.3119, simple_loss=0.3639, pruned_loss=0.13, over 3264364.89 frames. 
], batch size: 173, lr: 2.68e-02, grad_scale: 32.0 2023-06-15 04:25:24,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=31126.666666666668, ans=0.0 2023-06-15 04:25:37,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=31126.666666666668, ans=0.125 2023-06-15 04:25:41,005 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=31126.666666666668, ans=0.125 2023-06-15 04:25:46,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=31193.333333333332, ans=0.2 2023-06-15 04:25:58,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=31193.333333333332, ans=0.2 2023-06-15 04:26:24,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=31326.666666666668, ans=0.004059420289855072 2023-06-15 04:26:36,140 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.846e+02 2.307e+02 2.949e+02 3.902e+02 6.879e+02, threshold=5.899e+02, percent-clipped=2.0 2023-06-15 04:26:36,188 INFO [train.py:988] (1/4) Epoch 9, batch 450, loss[loss=0.3426, simple_loss=0.3949, pruned_loss=0.1451, over 18929.00 frames. ], tot_loss[loss=0.3117, simple_loss=0.3641, pruned_loss=0.1297, over 3384944.43 frames. ], batch size: 86, lr: 2.68e-02, grad_scale: 32.0 2023-06-15 04:26:46,475 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.67 vs. limit=6.0 2023-06-15 04:26:48,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=31393.333333333332, ans=0.125 2023-06-15 04:27:04,703 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.78 vs. limit=22.5 2023-06-15 04:27:13,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=31526.666666666668, ans=0.1 2023-06-15 04:27:14,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=31526.666666666668, ans=0.0040159420289855065 2023-06-15 04:27:38,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=31593.333333333332, ans=0.125 2023-06-15 04:27:45,767 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.19 vs. limit=10.0 2023-06-15 04:28:01,912 INFO [train.py:988] (1/4) Epoch 9, batch 500, loss[loss=0.3211, simple_loss=0.3642, pruned_loss=0.139, over 20558.00 frames. ], tot_loss[loss=0.3111, simple_loss=0.3633, pruned_loss=0.1295, over 3480217.53 frames. ], batch size: 189, lr: 2.68e-02, grad_scale: 64.0 2023-06-15 04:28:02,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=31726.666666666668, ans=0.003972463768115941 2023-06-15 04:28:02,984 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.04 vs. 
limit=15.0 2023-06-15 04:28:16,109 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=25.87 vs. limit=22.5 2023-06-15 04:28:17,274 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=31793.333333333332, ans=0.125 2023-06-15 04:28:20,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=31793.333333333332, ans=0.125 2023-06-15 04:28:49,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=31926.666666666668, ans=0.09899494936611666 2023-06-15 04:29:21,621 INFO [train.py:988] (1/4) Epoch 10, batch 0, loss[loss=0.293, simple_loss=0.3552, pruned_loss=0.1154, over 19078.00 frames. ], tot_loss[loss=0.293, simple_loss=0.3552, pruned_loss=0.1154, over 19078.00 frames. ], batch size: 89, lr: 2.56e-02, grad_scale: 64.0 2023-06-15 04:29:21,622 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 04:29:28,496 INFO [train.py:1020] (1/4) Epoch 10, validation: loss=0.2327, simple_loss=0.3375, pruned_loss=0.06395, over 143649.00 frames. 2023-06-15 04:29:28,497 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 04:30:00,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=32006.666666666668, ans=0.125 2023-06-15 04:30:00,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=32006.666666666668, ans=0.1 2023-06-15 04:30:01,791 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.729e+02 2.277e+02 2.643e+02 3.234e+02 5.475e+02, threshold=5.286e+02, percent-clipped=0.0 2023-06-15 04:30:21,031 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.27 vs. limit=5.0 2023-06-15 04:30:46,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=32206.666666666668, ans=0.125 2023-06-15 04:30:59,513 INFO [train.py:988] (1/4) Epoch 10, batch 50, loss[loss=0.2768, simple_loss=0.339, pruned_loss=0.1073, over 20320.00 frames. ], tot_loss[loss=0.3054, simple_loss=0.3601, pruned_loss=0.1254, over 857995.51 frames. ], batch size: 149, lr: 2.56e-02, grad_scale: 64.0 2023-06-15 04:31:01,813 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.89 vs. 
limit=15.0 2023-06-15 04:31:20,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=32340.0, ans=0.07 2023-06-15 04:31:31,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=32340.0, ans=0.0 2023-06-15 04:31:36,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=32406.666666666668, ans=0.07 2023-06-15 04:31:36,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=32406.666666666668, ans=0.125 2023-06-15 04:32:07,535 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.95 vs. limit=22.5 2023-06-15 04:32:28,864 INFO [train.py:988] (1/4) Epoch 10, batch 100, loss[loss=0.3059, simple_loss=0.371, pruned_loss=0.1204, over 16841.00 frames. ], tot_loss[loss=0.305, simple_loss=0.3587, pruned_loss=0.1256, over 1510651.47 frames. ], batch size: 59, lr: 2.55e-02, grad_scale: 64.0 2023-06-15 04:32:42,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=32606.666666666668, ans=0.125 2023-06-15 04:32:51,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=32673.333333333332, ans=0.125 2023-06-15 04:32:57,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=32673.333333333332, ans=0.125 2023-06-15 04:33:02,083 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.681e+02 2.450e+02 2.873e+02 3.278e+02 7.765e+02, threshold=5.745e+02, percent-clipped=3.0 2023-06-15 04:33:40,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=32873.333333333336, ans=0.95 2023-06-15 04:33:44,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=32873.333333333336, ans=0.0 2023-06-15 04:33:59,940 INFO [train.py:988] (1/4) Epoch 10, batch 150, loss[loss=0.3043, simple_loss=0.356, pruned_loss=0.1263, over 20304.00 frames. ], tot_loss[loss=0.3049, simple_loss=0.3595, pruned_loss=0.1251, over 2008838.81 frames. ], batch size: 141, lr: 2.55e-02, grad_scale: 64.0 2023-06-15 04:34:05,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=32940.0, ans=0.09899494936611666 2023-06-15 04:34:17,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=33006.666666666664, ans=0.125 2023-06-15 04:34:47,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=33073.333333333336, ans=0.1 2023-06-15 04:35:30,189 INFO [train.py:988] (1/4) Epoch 10, batch 200, loss[loss=0.2968, simple_loss=0.3493, pruned_loss=0.1222, over 20148.00 frames. ], tot_loss[loss=0.3035, simple_loss=0.3581, pruned_loss=0.1245, over 2416134.18 frames. 
], batch size: 133, lr: 2.54e-02, grad_scale: 64.0 2023-06-15 04:35:37,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=33273.333333333336, ans=0.0 2023-06-15 04:35:37,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=33273.333333333336, ans=0.125 2023-06-15 04:35:54,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=33340.0, ans=0.025 2023-06-15 04:36:02,396 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 2.312e+02 2.745e+02 3.367e+02 5.641e+02, threshold=5.490e+02, percent-clipped=0.0 2023-06-15 04:36:02,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=33340.0, ans=0.2 2023-06-15 04:36:02,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=33340.0, ans=0.125 2023-06-15 04:36:17,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=33406.666666666664, ans=0.1 2023-06-15 04:36:22,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=33473.333333333336, ans=0.2 2023-06-15 04:36:25,742 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=33473.333333333336, ans=0.0 2023-06-15 04:36:36,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=33473.333333333336, ans=0.05 2023-06-15 04:36:58,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=33606.666666666664, ans=0.125 2023-06-15 04:36:59,956 INFO [train.py:988] (1/4) Epoch 10, batch 250, loss[loss=0.3081, simple_loss=0.3645, pruned_loss=0.1258, over 18611.00 frames. ], tot_loss[loss=0.3048, simple_loss=0.359, pruned_loss=0.1253, over 2720587.36 frames. ], batch size: 80, lr: 2.54e-02, grad_scale: 64.0 2023-06-15 04:37:12,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=33606.666666666664, ans=0.0 2023-06-15 04:37:19,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=33673.333333333336, ans=0.07 2023-06-15 04:37:21,407 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=33673.333333333336, ans=0.125 2023-06-15 04:37:24,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=33673.333333333336, ans=0.00354927536231884 2023-06-15 04:37:25,503 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.66 vs. 
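limit=15.0

(Note on the [scaling.py:182] ScheduledFloat lines: each records the current value, ans, of a scheduled hyper-parameter (dropout probabilities, skip rates, balancer probs, bypass scale_min values and so on) at the given batch_count; several of them, for example the ff*_skip_rate entries, decay steadily as training progresses. Below is a minimal sketch of a piecewise-linear schedule of that kind; the function name and the schedule points are hypothetical and are not taken from scaling.py.)

    def scheduled_float(batch_count, points):
        """Piecewise-linear schedule over batch_count.  `points` is a list of
        (batch_count, value) pairs in increasing batch_count order; the value
        is clamped outside that range.  Illustrative only."""
        if batch_count <= points[0][0]:
            return points[0][1]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if batch_count <= x1:
                frac = (batch_count - x0) / (x1 - x0)
                return y0 + frac * (y1 - y0)
        return points[-1][1]

    # Hypothetical skip-rate that anneals from 0.5 to 0.0 over the first
    # 20000 batch_count units (numbers chosen only to show the shape):
    for bc in (0.0, 5000.0, 20000.0, 33740.0):
        print(bc, scheduled_float(bc, [(0.0, 0.5), (20000.0, 0.0)]))
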
2023-06-15 04:37:37,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=33740.0, ans=0.125 2023-06-15 04:37:42,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=33740.0, ans=0.0 2023-06-15 04:37:51,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=33806.666666666664, ans=0.125 2023-06-15 04:38:00,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=33806.666666666664, ans=0.1 2023-06-15 04:38:08,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=33806.666666666664, ans=0.0 2023-06-15 04:38:10,750 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.54 vs. limit=6.0 2023-06-15 04:38:22,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=33873.333333333336, ans=0.003505797101449275 2023-06-15 04:38:23,285 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.39 vs. limit=22.5 2023-06-15 04:38:29,332 INFO [train.py:988] (1/4) Epoch 10, batch 300, loss[loss=0.2886, simple_loss=0.3486, pruned_loss=0.1143, over 18620.00 frames. ], tot_loss[loss=0.305, simple_loss=0.3597, pruned_loss=0.1252, over 2944101.67 frames. ], batch size: 80, lr: 2.54e-02, grad_scale: 64.0 2023-06-15 04:38:34,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=33940.0, ans=0.2 2023-06-15 04:39:01,816 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.638e+02 2.483e+02 2.953e+02 3.724e+02 5.914e+02, threshold=5.906e+02, percent-clipped=1.0 2023-06-15 04:39:08,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=34073.333333333336, ans=0.00346231884057971 2023-06-15 04:39:19,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=34073.333333333336, ans=0.125 2023-06-15 04:39:35,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=34140.0, ans=0.125 2023-06-15 04:39:59,127 INFO [train.py:988] (1/4) Epoch 10, batch 350, loss[loss=0.2928, simple_loss=0.3501, pruned_loss=0.1178, over 18467.00 frames. ], tot_loss[loss=0.3037, simple_loss=0.3585, pruned_loss=0.1245, over 3132739.27 frames. ], batch size: 77, lr: 2.53e-02, grad_scale: 64.0 2023-06-15 04:40:08,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=34273.333333333336, ans=0.0 2023-06-15 04:41:19,590 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.05 vs. limit=15.0 2023-06-15 04:41:29,043 INFO [train.py:988] (1/4) Epoch 10, batch 400, loss[loss=0.2684, simple_loss=0.331, pruned_loss=0.1029, over 19327.00 frames. ], tot_loss[loss=0.3032, simple_loss=0.3575, pruned_loss=0.1245, over 3274971.56 frames.
], batch size: 98, lr: 2.53e-02, grad_scale: 32.0 2023-06-15 04:41:53,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=34673.333333333336, ans=0.1 2023-06-15 04:42:02,406 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 2.389e+02 2.906e+02 3.855e+02 6.206e+02, threshold=5.812e+02, percent-clipped=1.0 2023-06-15 04:42:02,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=34740.0, ans=0.125 2023-06-15 04:42:10,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=34740.0, ans=0.0 2023-06-15 04:42:12,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=34740.0, ans=0.125 2023-06-15 04:42:13,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=34740.0, ans=0.003317391304347826 2023-06-15 04:42:13,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=34740.0, ans=0.125 2023-06-15 04:42:19,501 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-15 04:42:23,608 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.00 vs. limit=15.0 2023-06-15 04:42:28,471 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.09 vs. limit=15.0 2023-06-15 04:42:47,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=34873.333333333336, ans=0.0 2023-06-15 04:42:53,590 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.66 vs. limit=22.5 2023-06-15 04:42:58,376 INFO [train.py:988] (1/4) Epoch 10, batch 450, loss[loss=0.3146, simple_loss=0.3584, pruned_loss=0.1354, over 20663.00 frames. ], tot_loss[loss=0.303, simple_loss=0.3576, pruned_loss=0.1242, over 3372260.49 frames. ], batch size: 211, lr: 2.52e-02, grad_scale: 32.0 2023-06-15 04:43:06,182 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 04:43:55,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=35140.0, ans=0.125 2023-06-15 04:44:06,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=35206.666666666664, ans=0.1 2023-06-15 04:44:08,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=35206.666666666664, ans=0.95 2023-06-15 04:44:24,820 INFO [train.py:988] (1/4) Epoch 10, batch 500, loss[loss=0.3306, simple_loss=0.3962, pruned_loss=0.1324, over 18316.00 frames. ], tot_loss[loss=0.3034, simple_loss=0.3579, pruned_loss=0.1244, over 3463374.28 frames. 
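], batch size: 72, lr: 2.52e-02, grad_scale: 32.0

(Note on the [train.py:988] loss lines: each reports, for the current batch and as a running total, the auxiliary simple loss, the pruned-transducer loss, and the combined training loss. The combined values in this log are consistent with the simple loss being weighted by 0.5, e.g. 0.5 * 0.3579 + 0.1244 ≈ 0.3034 for the running totals just above. The snippet below only reproduces that arithmetic; the actual combination in train.py, including any warm-up applied to the two scales, may differ.)

    def combined_loss(simple_loss, pruned_loss,
                      simple_loss_scale=0.5, pruned_loss_scale=1.0):
        # Assumed weighting that reproduces the 'loss' column in this log.
        return simple_loss_scale * simple_loss + pruned_loss_scale * pruned_loss

    # Running totals from the 'Epoch 10, batch 500' record above:
    print(combined_loss(0.3579, 0.1244))  # ≈ 0.3034, matching tot_loss[loss=0.3034, ...]
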
2023-06-15 04:44:28,807 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-15 04:44:33,559 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-15 04:44:39,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=35340.0, ans=0.0 2023-06-15 04:44:56,325 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.694e+02 2.429e+02 2.839e+02 3.294e+02 4.521e+02, threshold=5.678e+02, percent-clipped=0.0 2023-06-15 04:45:00,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=35406.666666666664, ans=0.1 2023-06-15 04:45:11,596 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=35406.666666666664, ans=0.0 2023-06-15 04:45:44,025 INFO [train.py:988] (1/4) Epoch 11, batch 0, loss[loss=0.3229, simple_loss=0.3534, pruned_loss=0.1462, over 19930.00 frames. ], tot_loss[loss=0.3229, simple_loss=0.3534, pruned_loss=0.1462, over 19930.00 frames. ], batch size: 293, lr: 2.42e-02, grad_scale: 32.0 2023-06-15 04:45:44,025 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 04:45:50,100 INFO [train.py:1020] (1/4) Epoch 11, validation: loss=0.2306, simple_loss=0.3357, pruned_loss=0.06271, over 143649.00 frames. 2023-06-15 04:45:50,101 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 04:45:59,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=35493.333333333336, ans=0.95 2023-06-15 04:46:19,699 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.98 vs. limit=22.5 2023-06-15 04:46:31,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=35626.666666666664, ans=0.125 2023-06-15 04:46:54,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=35693.333333333336, ans=0.125 2023-06-15 04:46:55,522 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.27 vs. limit=15.0 2023-06-15 04:47:05,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=35760.0, ans=0.025 2023-06-15 04:47:19,230 INFO [train.py:988] (1/4) Epoch 11, batch 50, loss[loss=0.3312, simple_loss=0.3913, pruned_loss=0.1355, over 17600.00 frames.
], batch size: 67, lr: 2.41e-02, grad_scale: 32.0 2023-06-15 04:47:41,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=35893.333333333336, ans=0.0 2023-06-15 04:47:53,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=35960.0, ans=0.1 2023-06-15 04:48:22,828 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.697e+02 2.367e+02 2.815e+02 3.714e+02 5.103e+02, threshold=5.629e+02, percent-clipped=0.0 2023-06-15 04:48:33,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=36093.333333333336, ans=0.1 2023-06-15 04:48:47,420 INFO [train.py:988] (1/4) Epoch 11, batch 100, loss[loss=0.3207, simple_loss=0.3554, pruned_loss=0.143, over 20347.00 frames. ], tot_loss[loss=0.2998, simple_loss=0.357, pruned_loss=0.1213, over 1498495.52 frames. ], batch size: 239, lr: 2.41e-02, grad_scale: 32.0 2023-06-15 04:48:57,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=36160.0, ans=0.125 2023-06-15 04:49:25,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=36293.333333333336, ans=0.125 2023-06-15 04:49:58,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=36426.666666666664, ans=0.2 2023-06-15 04:50:10,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=36426.666666666664, ans=0.1 2023-06-15 04:50:18,773 INFO [train.py:988] (1/4) Epoch 11, batch 150, loss[loss=0.3168, simple_loss=0.3768, pruned_loss=0.1284, over 18283.00 frames. ], tot_loss[loss=0.3004, simple_loss=0.3558, pruned_loss=0.1225, over 2008329.36 frames. ], batch size: 74, lr: 2.40e-02, grad_scale: 32.0 2023-06-15 04:50:34,786 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.80 vs. limit=10.0 2023-06-15 04:50:50,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=36560.0, ans=0.125 2023-06-15 04:51:16,131 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.80 vs. limit=6.0 2023-06-15 04:51:22,610 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.663e+02 2.254e+02 2.489e+02 3.022e+02 4.758e+02, threshold=4.979e+02, percent-clipped=0.0 2023-06-15 04:51:26,835 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.74 vs. limit=15.0 2023-06-15 04:51:31,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=36760.0, ans=0.0 2023-06-15 04:51:47,722 INFO [train.py:988] (1/4) Epoch 11, batch 200, loss[loss=0.2788, simple_loss=0.3329, pruned_loss=0.1123, over 20510.00 frames. ], tot_loss[loss=0.2996, simple_loss=0.355, pruned_loss=0.1221, over 2379955.37 frames. 
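Each "Clipping_scale=2.0, grad-norm quartiles ... threshold=... percent-clipped=..." line summarises recently observed gradient norms (min/25%/median/75%/max). The printed threshold is consistent with clipping_scale times the median of those recent norms (2 x 2.815e+02 ~ 5.63e+02 for the line above), and percent-clipped reads as the share of recent steps whose norm exceeded that threshold. A rough, illustrative clipper along those lines follows; class and method names are invented for the example and this is not the actual optim.py implementation.

```python
# Illustrative only: clip gradients to clipping_scale * median of a window of
# recently observed gradient norms.
from collections import deque
import torch

class MedianGradClipper:
    def __init__(self, clipping_scale: float = 2.0, history: int = 128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=history)

    def clip_(self, parameters) -> float:
        grads = [p.grad for p in parameters if p.grad is not None]
        norm = torch.norm(torch.stack([g.norm() for g in grads])).item()
        self.norms.append(norm)
        threshold = self.clipping_scale * float(
            torch.median(torch.tensor(list(self.norms))))
        if norm > threshold:           # such steps count toward "percent-clipped"
            for g in grads:
                g.mul_(threshold / norm)
        return threshold
```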
], batch size: 173, lr: 2.40e-02, grad_scale: 32.0 2023-06-15 04:51:48,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=36826.666666666664, ans=0.1 2023-06-15 04:51:51,274 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.35 vs. limit=22.5 2023-06-15 04:51:52,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=36826.666666666664, ans=0.0 2023-06-15 04:52:43,174 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.74 vs. limit=22.5 2023-06-15 04:52:45,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=37026.666666666664, ans=0.125 2023-06-15 04:52:49,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=37026.666666666664, ans=0.0 2023-06-15 04:52:50,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=37026.666666666664, ans=0.0028202898550724644 2023-06-15 04:53:17,714 INFO [train.py:988] (1/4) Epoch 11, batch 250, loss[loss=0.3501, simple_loss=0.4157, pruned_loss=0.1422, over 18327.00 frames. ], tot_loss[loss=0.2994, simple_loss=0.3553, pruned_loss=0.1218, over 2676074.22 frames. ], batch size: 72, lr: 2.40e-02, grad_scale: 32.0 2023-06-15 04:53:39,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=37226.666666666664, ans=0.0027768115942028992 2023-06-15 04:54:01,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=37293.333333333336, ans=0.0 2023-06-15 04:54:02,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=37293.333333333336, ans=0.0027623188405797106 2023-06-15 04:54:04,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=37293.333333333336, ans=0.125 2023-06-15 04:54:10,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=37360.0, ans=0.125 2023-06-15 04:54:15,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=37360.0, ans=0.125 2023-06-15 04:54:22,489 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.637e+02 2.173e+02 2.592e+02 3.214e+02 4.591e+02, threshold=5.183e+02, percent-clipped=0.0 2023-06-15 04:54:22,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=37360.0, ans=0.2 2023-06-15 04:54:46,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=37493.333333333336, ans=0.125 2023-06-15 04:54:48,179 INFO [train.py:988] (1/4) Epoch 11, batch 300, loss[loss=0.2742, simple_loss=0.338, pruned_loss=0.1052, over 19110.00 frames. ], tot_loss[loss=0.2986, simple_loss=0.3546, pruned_loss=0.1213, over 2925880.16 frames. 
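The ScheduledFloat lines print, for a named hyper-parameter inside the model, the value (ans) that a piecewise-linear schedule over batch_count currently yields. For the ff2_skip_rate entries the logged values are reproduced by a schedule with breakpoints (0, 0.1), (4000, 0.01), (50000, 0.0); those breakpoints are an assumption chosen to match the numbers here, and other parameters (dropout_p, conv_skip_rate, balancer probs, ...) follow their own schedules.

```python
# Sketch of a ScheduledFloat-style piecewise-linear schedule over batch_count.
# The breakpoints are assumed; they happen to reproduce the ff2_skip_rate values.
def scheduled_float(batch_count: float,
                    points=((0.0, 0.1), (4000.0, 0.01), (50000.0, 0.0))) -> float:
    if batch_count <= points[0][0]:
        return points[0][1]
    if batch_count >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= batch_count <= x1:
            t = (batch_count - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)

print(scheduled_float(37026.666666666664))  # 0.00282028985..., matching the line above
```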
], batch size: 89, lr: 2.39e-02, grad_scale: 32.0 2023-06-15 04:54:59,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=37493.333333333336, ans=0.07 2023-06-15 04:56:18,515 INFO [train.py:988] (1/4) Epoch 11, batch 350, loss[loss=0.3289, simple_loss=0.3711, pruned_loss=0.1434, over 20528.00 frames. ], tot_loss[loss=0.2985, simple_loss=0.3539, pruned_loss=0.1215, over 3128318.71 frames. ], batch size: 160, lr: 2.39e-02, grad_scale: 32.0 2023-06-15 04:57:12,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=38026.666666666664, ans=0.125 2023-06-15 04:57:23,397 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.398e+02 2.846e+02 3.298e+02 5.496e+02, threshold=5.692e+02, percent-clipped=3.0 2023-06-15 04:57:48,733 INFO [train.py:988] (1/4) Epoch 11, batch 400, loss[loss=0.3103, simple_loss=0.3619, pruned_loss=0.1293, over 19101.00 frames. ], tot_loss[loss=0.2961, simple_loss=0.3522, pruned_loss=0.12, over 3289244.43 frames. ], batch size: 89, lr: 2.38e-02, grad_scale: 32.0 2023-06-15 04:58:03,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=38160.0, ans=0.0025739130434782606 2023-06-15 04:58:12,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=38226.666666666664, ans=0.125 2023-06-15 04:58:17,912 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.18 vs. limit=22.5 2023-06-15 04:58:26,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=38293.333333333336, ans=0.125 2023-06-15 04:58:52,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=38360.0, ans=0.1 2023-06-15 04:59:12,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=38426.666666666664, ans=0.0025159420289855077 2023-06-15 04:59:18,295 INFO [train.py:988] (1/4) Epoch 11, batch 450, loss[loss=0.3366, simple_loss=0.4071, pruned_loss=0.133, over 15490.00 frames. ], tot_loss[loss=0.2958, simple_loss=0.3523, pruned_loss=0.1196, over 3402671.29 frames. ], batch size: 44, lr: 2.38e-02, grad_scale: 32.0 2023-06-15 04:59:20,946 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.40 vs. limit=22.5 2023-06-15 04:59:53,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=38626.666666666664, ans=0.125 2023-06-15 05:00:11,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=38693.333333333336, ans=0.2 2023-06-15 05:00:21,916 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.767e+02 2.134e+02 2.582e+02 3.273e+02 5.590e+02, threshold=5.163e+02, percent-clipped=0.0 2023-06-15 05:00:40,949 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.43 vs. limit=22.5 2023-06-15 05:00:45,639 INFO [train.py:988] (1/4) Epoch 11, batch 500, loss[loss=0.3162, simple_loss=0.376, pruned_loss=0.1282, over 17089.00 frames. 
], tot_loss[loss=0.2952, simple_loss=0.3521, pruned_loss=0.1191, over 3485351.74 frames. ], batch size: 60, lr: 2.38e-02, grad_scale: 32.0 2023-06-15 05:01:04,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=38893.333333333336, ans=0.0 2023-06-15 05:02:04,595 INFO [train.py:988] (1/4) Epoch 12, batch 0, loss[loss=0.2804, simple_loss=0.3511, pruned_loss=0.1049, over 16943.00 frames. ], tot_loss[loss=0.2804, simple_loss=0.3511, pruned_loss=0.1049, over 16943.00 frames. ], batch size: 60, lr: 2.28e-02, grad_scale: 32.0 2023-06-15 05:02:04,596 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 05:02:10,671 INFO [train.py:1020] (1/4) Epoch 12, validation: loss=0.2286, simple_loss=0.3321, pruned_loss=0.06259, over 143649.00 frames. 2023-06-15 05:02:10,672 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 05:02:15,189 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.21 vs. limit=15.0 2023-06-15 05:02:32,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=39106.666666666664, ans=0.125 2023-06-15 05:02:53,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=39173.333333333336, ans=0.125 2023-06-15 05:03:28,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=39306.666666666664, ans=0.125 2023-06-15 05:03:39,958 INFO [train.py:988] (1/4) Epoch 12, batch 50, loss[loss=0.2718, simple_loss=0.3537, pruned_loss=0.09499, over 17598.00 frames. ], tot_loss[loss=0.2913, simple_loss=0.3473, pruned_loss=0.1176, over 850446.64 frames. ], batch size: 67, lr: 2.28e-02, grad_scale: 32.0 2023-06-15 05:03:43,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=39373.333333333336, ans=0.1 2023-06-15 05:03:47,428 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.648e+02 2.229e+02 2.614e+02 3.246e+02 5.755e+02, threshold=5.228e+02, percent-clipped=1.0 2023-06-15 05:03:48,454 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.76 vs. limit=12.0 2023-06-15 05:03:53,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=39373.333333333336, ans=0.1 2023-06-15 05:03:56,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=39440.0, ans=0.125 2023-06-15 05:05:06,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=39640.0, ans=0.05 2023-06-15 05:05:09,987 INFO [train.py:988] (1/4) Epoch 12, batch 100, loss[loss=0.3085, simple_loss=0.3711, pruned_loss=0.123, over 17035.00 frames. ], tot_loss[loss=0.2897, simple_loss=0.346, pruned_loss=0.1166, over 1513054.89 frames. ], batch size: 60, lr: 2.28e-02, grad_scale: 32.0 2023-06-15 05:05:12,371 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.11 vs. 
limit=15.0 2023-06-15 05:05:13,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=39706.666666666664, ans=0.07 2023-06-15 05:05:21,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=39706.666666666664, ans=0.125 2023-06-15 05:05:30,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=39773.333333333336, ans=0.2 2023-06-15 05:06:29,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=39973.333333333336, ans=0.2 2023-06-15 05:06:40,219 INFO [train.py:988] (1/4) Epoch 12, batch 150, loss[loss=0.2814, simple_loss=0.3333, pruned_loss=0.1148, over 20566.00 frames. ], tot_loss[loss=0.2888, simple_loss=0.346, pruned_loss=0.1158, over 2040883.27 frames. ], batch size: 189, lr: 2.27e-02, grad_scale: 32.0 2023-06-15 05:06:46,889 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.645e+02 2.288e+02 2.648e+02 3.128e+02 5.617e+02, threshold=5.296e+02, percent-clipped=1.0 2023-06-15 05:06:54,215 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.30 vs. limit=15.0 2023-06-15 05:07:43,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=40240.0, ans=0.125 2023-06-15 05:07:52,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=40306.666666666664, ans=0.002107246376811594 2023-06-15 05:07:55,860 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=15.0 2023-06-15 05:08:08,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=40373.333333333336, ans=0.07 2023-06-15 05:08:09,685 INFO [train.py:988] (1/4) Epoch 12, batch 200, loss[loss=0.273, simple_loss=0.3522, pruned_loss=0.09694, over 16967.00 frames. ], tot_loss[loss=0.2874, simple_loss=0.3454, pruned_loss=0.1147, over 2435969.12 frames. ], batch size: 60, lr: 2.27e-02, grad_scale: 32.0 2023-06-15 05:08:28,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=40440.0, ans=0.2 2023-06-15 05:08:46,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=40506.666666666664, ans=0.0 2023-06-15 05:09:23,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=40640.0, ans=0.2 2023-06-15 05:09:39,259 INFO [train.py:988] (1/4) Epoch 12, batch 250, loss[loss=0.3242, simple_loss=0.3872, pruned_loss=0.1306, over 16180.00 frames. ], tot_loss[loss=0.2881, simple_loss=0.3466, pruned_loss=0.1148, over 2732730.19 frames. 
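The Whitening lines compare a per-module statistic (metric) of the feature covariance against a scheduled limit; a value of 1.0 corresponds to a perfectly "white" (identity-like) covariance within each channel group, and larger values mean a more uneven eigenvalue spectrum. The sketch below uses one scale-invariant formula with that fixed point, d * trace(C@C) / trace(C)^2 averaged over groups; the exact expression in scaling.py may differ in detail.

```python
# Hedged sketch of a whitening metric: ~1.0 when the per-group covariance is a
# multiple of the identity, larger when channels are correlated or unevenly scaled.
import torch

def whitening_metric(x: torch.Tensor, num_groups: int) -> float:
    # x: (num_frames, num_channels), channels split into equal groups
    num_channels = x.shape[-1]
    cpg = num_channels // num_groups                      # channels per group
    xg = x.reshape(-1, num_groups, cpg).transpose(0, 1)   # (groups, frames, cpg)
    xg = xg - xg.mean(dim=1, keepdim=True)                # centre per group
    cov = xg.transpose(1, 2) @ xg                         # (groups, cpg, cpg)
    num = (cov ** 2).sum(dim=(1, 2))                      # trace(C @ C), C symmetric
    den = torch.diagonal(cov, dim1=1, dim2=2).sum(dim=1) ** 2   # trace(C)^2
    return float((cpg * num / den).mean())

# Near 1 for roughly white features; grows well past the logged limits when
# channels become strongly correlated.
print(whitening_metric(torch.randn(2000, 256), num_groups=1))
```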
], batch size: 52, lr: 2.27e-02, grad_scale: 32.0 2023-06-15 05:09:46,513 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 2.177e+02 2.544e+02 3.093e+02 5.809e+02, threshold=5.088e+02, percent-clipped=2.0 2023-06-15 05:09:48,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=40706.666666666664, ans=0.1 2023-06-15 05:10:54,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=40973.333333333336, ans=0.2 2023-06-15 05:11:09,751 INFO [train.py:988] (1/4) Epoch 12, batch 300, loss[loss=0.2974, simple_loss=0.364, pruned_loss=0.1154, over 18317.00 frames. ], tot_loss[loss=0.2879, simple_loss=0.3469, pruned_loss=0.1145, over 2970034.58 frames. ], batch size: 74, lr: 2.26e-02, grad_scale: 32.0 2023-06-15 05:11:12,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=41040.0, ans=0.1 2023-06-15 05:11:13,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=41040.0, ans=0.125 2023-06-15 05:11:19,393 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 05:11:34,774 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.27 vs. limit=15.0 2023-06-15 05:11:35,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=41106.666666666664, ans=0.2 2023-06-15 05:11:44,303 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.26 vs. limit=5.0 2023-06-15 05:11:59,103 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.58 vs. limit=10.0 2023-06-15 05:12:21,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=41306.666666666664, ans=0.5 2023-06-15 05:12:37,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=41306.666666666664, ans=0.125 2023-06-15 05:12:40,673 INFO [train.py:988] (1/4) Epoch 12, batch 350, loss[loss=0.3109, simple_loss=0.3801, pruned_loss=0.1208, over 18300.00 frames. ], tot_loss[loss=0.288, simple_loss=0.3466, pruned_loss=0.1147, over 3172405.01 frames. ], batch size: 72, lr: 2.26e-02, grad_scale: 32.0 2023-06-15 05:12:47,499 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.640e+02 2.134e+02 2.417e+02 3.017e+02 4.561e+02, threshold=4.834e+02, percent-clipped=0.0 2023-06-15 05:13:13,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=41440.0, ans=0.2 2023-06-15 05:13:23,180 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.94 vs. limit=15.0 2023-06-15 05:14:10,572 INFO [train.py:988] (1/4) Epoch 12, batch 400, loss[loss=0.281, simple_loss=0.3472, pruned_loss=0.1074, over 19330.00 frames. ], tot_loss[loss=0.2881, simple_loss=0.3468, pruned_loss=0.1147, over 3312137.01 frames. 
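A side note on the batch_count values themselves: they advance in steps of one third (...333333 / ...666666), which is what one gets if the per-module counter is the global training batch index rescaled to a reference duration, i.e. batch_idx_train * (max_duration * world_size) / ref_duration with an assumed 1000 s/batch x 4 GPUs / 600 s = 20/3 per step. This rescaling is an assumption that merely matches the granularity seen here.

```python
# Assumed rescaling that reproduces the .333/.666 batch_count granularity.
def adjusted_batch_count(batch_idx_train: int,
                         max_duration: float = 1000.0,  # seconds per batch (assumed)
                         world_size: int = 4,           # number of GPUs (assumed)
                         ref_duration: float = 600.0    # reference duration (assumed)
                         ) -> float:
    return batch_idx_train * (max_duration * world_size) / ref_duration

print(adjusted_batch_count(6106))  # 40706.666..., same granularity as the lines above
```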
], batch size: 98, lr: 2.25e-02, grad_scale: 32.0 2023-06-15 05:14:17,999 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 05:15:37,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=41973.333333333336, ans=0.125 2023-06-15 05:15:40,746 INFO [train.py:988] (1/4) Epoch 12, batch 450, loss[loss=0.2932, simple_loss=0.3503, pruned_loss=0.118, over 19662.00 frames. ], tot_loss[loss=0.2879, simple_loss=0.3461, pruned_loss=0.1148, over 3414948.01 frames. ], batch size: 110, lr: 2.25e-02, grad_scale: 32.0 2023-06-15 05:15:48,080 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.759e+02 2.312e+02 2.678e+02 3.302e+02 6.342e+02, threshold=5.355e+02, percent-clipped=8.0 2023-06-15 05:15:50,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=42040.0, ans=0.2 2023-06-15 05:17:08,284 INFO [train.py:988] (1/4) Epoch 12, batch 500, loss[loss=0.272, simple_loss=0.3388, pruned_loss=0.1026, over 19353.00 frames. ], tot_loss[loss=0.2879, simple_loss=0.3456, pruned_loss=0.1151, over 3496487.71 frames. ], batch size: 98, lr: 2.25e-02, grad_scale: 32.0 2023-06-15 05:17:42,449 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=42506.666666666664, ans=0.125 2023-06-15 05:17:42,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=42506.666666666664, ans=0.0 2023-06-15 05:17:45,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=42506.666666666664, ans=0.1 2023-06-15 05:17:51,564 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=15.0 2023-06-15 05:17:53,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=42506.666666666664, ans=0.05 2023-06-15 05:18:28,983 INFO [train.py:988] (1/4) Epoch 13, batch 0, loss[loss=0.2919, simple_loss=0.3433, pruned_loss=0.1203, over 20266.00 frames. ], tot_loss[loss=0.2919, simple_loss=0.3433, pruned_loss=0.1203, over 20266.00 frames. ], batch size: 141, lr: 2.16e-02, grad_scale: 32.0 2023-06-15 05:18:28,984 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 05:18:35,096 INFO [train.py:1020] (1/4) Epoch 13, validation: loss=0.2246, simple_loss=0.3282, pruned_loss=0.06053, over 143649.00 frames. 2023-06-15 05:18:35,097 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 05:18:45,734 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.35 vs. limit=15.0 2023-06-15 05:18:50,711 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.30 vs. 
limit=10.0 2023-06-15 05:18:58,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=42660.0, ans=6.0 2023-06-15 05:19:13,300 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.726e+02 2.235e+02 2.660e+02 3.477e+02 4.514e+02, threshold=5.320e+02, percent-clipped=0.0 2023-06-15 05:19:26,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=42726.666666666664, ans=0.5 2023-06-15 05:20:04,774 INFO [train.py:988] (1/4) Epoch 13, batch 50, loss[loss=0.2735, simple_loss=0.34, pruned_loss=0.1034, over 19721.00 frames. ], tot_loss[loss=0.2861, simple_loss=0.3451, pruned_loss=0.1136, over 855627.15 frames. ], batch size: 110, lr: 2.16e-02, grad_scale: 32.0 2023-06-15 05:20:12,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=42926.666666666664, ans=0.0015376811594202903 2023-06-15 05:20:19,357 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.91 vs. limit=15.0 2023-06-15 05:21:07,984 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.80 vs. limit=15.0 2023-06-15 05:21:32,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=43260.0, ans=0.2 2023-06-15 05:21:33,299 INFO [train.py:988] (1/4) Epoch 13, batch 100, loss[loss=0.2838, simple_loss=0.3472, pruned_loss=0.1102, over 18764.00 frames. ], tot_loss[loss=0.2832, simple_loss=0.3423, pruned_loss=0.1121, over 1506835.38 frames. ], batch size: 83, lr: 2.16e-02, grad_scale: 32.0 2023-06-15 05:21:43,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=43260.0, ans=0.95 2023-06-15 05:21:48,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=43260.0, ans=0.95 2023-06-15 05:21:51,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=43326.666666666664, ans=0.125 2023-06-15 05:21:57,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=43326.666666666664, ans=0.0 2023-06-15 05:22:07,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=43393.333333333336, ans=0.0 2023-06-15 05:22:10,402 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.673e+02 1.949e+02 2.269e+02 2.644e+02 4.836e+02, threshold=4.538e+02, percent-clipped=0.0 2023-06-15 05:22:11,447 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.64 vs. 
limit=12.0 2023-06-15 05:22:14,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=43393.333333333336, ans=0.1 2023-06-15 05:22:35,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=43460.0, ans=0.125 2023-06-15 05:22:44,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=43526.666666666664, ans=0.001407246376811595 2023-06-15 05:22:51,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=43526.666666666664, ans=0.2 2023-06-15 05:22:56,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=43526.666666666664, ans=0.125 2023-06-15 05:23:00,744 INFO [train.py:988] (1/4) Epoch 13, batch 150, loss[loss=0.284, simple_loss=0.3344, pruned_loss=0.1168, over 20555.00 frames. ], tot_loss[loss=0.2824, simple_loss=0.3425, pruned_loss=0.1112, over 2007301.50 frames. ], batch size: 189, lr: 2.15e-02, grad_scale: 32.0 2023-06-15 05:23:20,064 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=43660.0, ans=0.125 2023-06-15 05:23:23,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=43660.0, ans=0.1 2023-06-15 05:24:02,754 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.39 vs. limit=6.0 2023-06-15 05:24:09,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=43860.0, ans=0.125 2023-06-15 05:24:28,503 INFO [train.py:988] (1/4) Epoch 13, batch 200, loss[loss=0.2657, simple_loss=0.3276, pruned_loss=0.1019, over 19870.00 frames. ], tot_loss[loss=0.2842, simple_loss=0.3439, pruned_loss=0.1123, over 2403591.74 frames. ], batch size: 120, lr: 2.15e-02, grad_scale: 32.0 2023-06-15 05:24:42,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=43926.666666666664, ans=0.0 2023-06-15 05:24:53,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=43993.333333333336, ans=0.125 2023-06-15 05:25:05,898 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.630e+02 2.172e+02 2.424e+02 2.924e+02 5.184e+02, threshold=4.848e+02, percent-clipped=5.0 2023-06-15 05:25:14,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=44060.0, ans=0.5 2023-06-15 05:25:19,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=44126.666666666664, ans=0.0 2023-06-15 05:25:50,797 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.09 vs. limit=15.0 2023-06-15 05:25:56,670 INFO [train.py:988] (1/4) Epoch 13, batch 250, loss[loss=0.2495, simple_loss=0.3216, pruned_loss=0.08868, over 19697.00 frames. ], tot_loss[loss=0.2836, simple_loss=0.3438, pruned_loss=0.1117, over 2713851.36 frames. 
], batch size: 110, lr: 2.15e-02, grad_scale: 16.0 2023-06-15 05:26:25,901 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.75 vs. limit=15.0 2023-06-15 05:26:29,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=44326.666666666664, ans=0.1 2023-06-15 05:26:42,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=44393.333333333336, ans=0.1 2023-06-15 05:26:53,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=44460.0, ans=0.125 2023-06-15 05:27:08,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=44526.666666666664, ans=0.125 2023-06-15 05:27:23,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=44593.333333333336, ans=0.1 2023-06-15 05:27:24,263 INFO [train.py:988] (1/4) Epoch 13, batch 300, loss[loss=0.2706, simple_loss=0.3435, pruned_loss=0.09881, over 11146.00 frames. ], tot_loss[loss=0.2837, simple_loss=0.3434, pruned_loss=0.112, over 2945412.69 frames. ], batch size: 31, lr: 2.14e-02, grad_scale: 16.0 2023-06-15 05:27:40,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=44660.0, ans=0.0 2023-06-15 05:27:54,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=44660.0, ans=0.125 2023-06-15 05:28:03,422 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.717e+02 2.137e+02 2.490e+02 3.192e+02 5.767e+02, threshold=4.980e+02, percent-clipped=3.0 2023-06-15 05:28:38,027 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.16 vs. limit=22.5 2023-06-15 05:28:52,651 INFO [train.py:988] (1/4) Epoch 13, batch 350, loss[loss=0.2926, simple_loss=0.3505, pruned_loss=0.1173, over 19691.00 frames. ], tot_loss[loss=0.2829, simple_loss=0.342, pruned_loss=0.1119, over 3138783.01 frames. ], batch size: 110, lr: 2.14e-02, grad_scale: 16.0 2023-06-15 05:29:33,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=45060.0, ans=0.125 2023-06-15 05:29:40,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=45060.0, ans=0.1 2023-06-15 05:29:48,961 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.39 vs. limit=22.5 2023-06-15 05:30:19,939 INFO [train.py:988] (1/4) Epoch 13, batch 400, loss[loss=0.2847, simple_loss=0.3557, pruned_loss=0.1069, over 17648.00 frames. ], tot_loss[loss=0.2818, simple_loss=0.342, pruned_loss=0.1108, over 3289914.15 frames. 
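The grad_scale field in the per-batch lines is the automatic-mixed-precision loss scale: it sits at 32.0 for most of the log, drops to 16.0 for the entries here, and returns to 32.0 shortly after. That pattern matches a GradScaler that halves its scale after a step with inf/NaN gradients and grows it back later. A generic fp16 training step along those lines is sketched below; the init/growth settings are illustrative, not the recipe's exact values.

```python
# Generic fp16 step with torch.cuda.amp; scaler.get_scale() is the kind of
# value reported as grad_scale. The constructor settings are illustrative.
import torch

scaler = torch.cuda.amp.GradScaler(init_scale=32.0, growth_factor=2.0,
                                    backoff_factor=0.5)

def training_step(model, optimizer, loss_fn, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = loss_fn(model, batch)   # forward under autocast
    scaler.scale(loss).backward()      # backward on the scaled loss
    scaler.step(optimizer)             # skipped if gradients overflowed
    scaler.update()                    # halve on overflow, grow periodically
    return scaler.get_scale()          # -> e.g. 32.0 or 16.0
```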
], batch size: 67, lr: 2.14e-02, grad_scale: 32.0 2023-06-15 05:30:20,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=45260.0, ans=0.125 2023-06-15 05:30:53,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=45393.333333333336, ans=0.0010014492753623178 2023-06-15 05:30:59,585 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.082e+02 2.371e+02 2.770e+02 5.646e+02, threshold=4.742e+02, percent-clipped=0.0 2023-06-15 05:31:03,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=45393.333333333336, ans=0.09899494936611666 2023-06-15 05:31:10,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=45393.333333333336, ans=0.2 2023-06-15 05:31:14,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=45460.0, ans=0.125 2023-06-15 05:31:45,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=45526.666666666664, ans=10.0 2023-06-15 05:31:47,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=45593.333333333336, ans=0.1 2023-06-15 05:31:47,838 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.17 vs. limit=22.5 2023-06-15 05:31:48,733 INFO [train.py:988] (1/4) Epoch 13, batch 450, loss[loss=0.2607, simple_loss=0.3411, pruned_loss=0.09018, over 17627.00 frames. ], tot_loss[loss=0.2816, simple_loss=0.3415, pruned_loss=0.1109, over 3406702.17 frames. ], batch size: 67, lr: 2.13e-02, grad_scale: 32.0 2023-06-15 05:31:56,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=45593.333333333336, ans=0.125 2023-06-15 05:31:57,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=45593.333333333336, ans=0.125 2023-06-15 05:32:27,430 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-15 05:32:45,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=45793.333333333336, ans=0.1 2023-06-15 05:32:48,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=45793.333333333336, ans=0.0 2023-06-15 05:33:01,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=45860.0, ans=0.04949747468305833 2023-06-15 05:33:07,761 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.34 vs. limit=6.0 2023-06-15 05:33:13,477 INFO [train.py:988] (1/4) Epoch 13, batch 500, loss[loss=0.292, simple_loss=0.3482, pruned_loss=0.1179, over 20526.00 frames. ], tot_loss[loss=0.2826, simple_loss=0.3424, pruned_loss=0.1114, over 3489096.56 frames. 
], batch size: 189, lr: 2.13e-02, grad_scale: 32.0 2023-06-15 05:33:49,317 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.752e+02 2.112e+02 2.450e+02 3.177e+02 4.704e+02, threshold=4.901e+02, percent-clipped=1.0 2023-06-15 05:33:57,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=46060.0, ans=0.125 2023-06-15 05:34:30,998 INFO [train.py:988] (1/4) Epoch 14, batch 0, loss[loss=0.2907, simple_loss=0.3407, pruned_loss=0.1203, over 20094.00 frames. ], tot_loss[loss=0.2907, simple_loss=0.3407, pruned_loss=0.1203, over 20094.00 frames. ], batch size: 133, lr: 2.05e-02, grad_scale: 32.0 2023-06-15 05:34:30,998 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 05:34:37,023 INFO [train.py:1020] (1/4) Epoch 14, validation: loss=0.2205, simple_loss=0.3248, pruned_loss=0.05804, over 143649.00 frames. 2023-06-15 05:34:37,024 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 05:35:28,998 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 05:35:38,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=46340.0, ans=0.0 2023-06-15 05:36:00,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=46406.666666666664, ans=0.125 2023-06-15 05:36:03,640 INFO [train.py:988] (1/4) Epoch 14, batch 50, loss[loss=0.2788, simple_loss=0.3347, pruned_loss=0.1115, over 20354.00 frames. ], tot_loss[loss=0.2797, simple_loss=0.3421, pruned_loss=0.1087, over 851135.73 frames. ], batch size: 149, lr: 2.05e-02, grad_scale: 32.0 2023-06-15 05:36:06,191 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.54 vs. limit=22.5 2023-06-15 05:36:14,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=46473.333333333336, ans=0.2 2023-06-15 05:36:21,301 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=2.96 vs. limit=15.0 2023-06-15 05:36:25,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=46540.0, ans=0.125 2023-06-15 05:36:36,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=46540.0, ans=0.125 2023-06-15 05:36:55,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=46673.333333333336, ans=0.125 2023-06-15 05:37:11,022 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=12.0 2023-06-15 05:37:14,890 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.706e+02 2.122e+02 2.332e+02 2.601e+02 5.252e+02, threshold=4.663e+02, percent-clipped=1.0 2023-06-15 05:37:19,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=46740.0, ans=0.1 2023-06-15 05:37:32,792 INFO [train.py:988] (1/4) Epoch 14, batch 100, loss[loss=0.2514, simple_loss=0.3219, pruned_loss=0.09049, over 19878.00 frames. 
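Every "Computing validation loss" block pauses training, runs the dev set once, and reports a frame-weighted average, which is why each validation line reads "over 143649.00 frames" regardless of epoch. A hedged sketch of that loop follows; loss_fn is a caller-supplied stand-in for the recipe's loss computation, assumed to return the summed loss and the number of (subsampled) frames in the batch.

```python
# Hedged sketch of the periodic validation pass; loss_fn(model, batch, device)
# is assumed to return (summed_loss, num_frames) for one dev batch.
import torch

def compute_validation_loss(model, dev_loader, device, loss_fn) -> float:
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in dev_loader:
            loss, num_frames = loss_fn(model, batch, device)
            tot_loss += float(loss)
            tot_frames += num_frames
    model.train()
    return tot_loss / tot_frames   # frame-normalised, e.g. the 0.2205 logged for epoch 14
```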
], tot_loss[loss=0.2791, simple_loss=0.34, pruned_loss=0.1091, over 1491556.44 frames. ], batch size: 120, lr: 2.05e-02, grad_scale: 32.0 2023-06-15 05:37:46,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=46806.666666666664, ans=0.0 2023-06-15 05:38:04,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=46873.333333333336, ans=0.0006797101449275353 2023-06-15 05:38:57,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=47073.333333333336, ans=0.125 2023-06-15 05:39:00,474 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.36 vs. limit=15.0 2023-06-15 05:39:01,366 INFO [train.py:988] (1/4) Epoch 14, batch 150, loss[loss=0.2802, simple_loss=0.348, pruned_loss=0.1062, over 19222.00 frames. ], tot_loss[loss=0.2797, simple_loss=0.3397, pruned_loss=0.1099, over 2024895.99 frames. ], batch size: 92, lr: 2.04e-02, grad_scale: 32.0 2023-06-15 05:39:34,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=47273.333333333336, ans=0.0 2023-06-15 05:40:07,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=47340.0, ans=0.125 2023-06-15 05:40:11,167 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.584e+02 2.106e+02 2.352e+02 2.722e+02 4.842e+02, threshold=4.704e+02, percent-clipped=2.0 2023-06-15 05:40:16,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=47406.666666666664, ans=0.125 2023-06-15 05:40:23,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=47406.666666666664, ans=0.09899494936611666 2023-06-15 05:40:28,797 INFO [train.py:988] (1/4) Epoch 14, batch 200, loss[loss=0.2743, simple_loss=0.334, pruned_loss=0.1073, over 20272.00 frames. ], tot_loss[loss=0.2796, simple_loss=0.3405, pruned_loss=0.1094, over 2412290.79 frames. ], batch size: 149, lr: 2.04e-02, grad_scale: 32.0 2023-06-15 05:40:32,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=47473.333333333336, ans=0.00054927536231884 2023-06-15 05:40:51,900 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=47540.0, ans=0.1 2023-06-15 05:40:56,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=47540.0, ans=0.125 2023-06-15 05:40:57,530 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.77 vs. limit=15.0 2023-06-15 05:41:01,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=47606.666666666664, ans=0.0 2023-06-15 05:41:02,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=47606.666666666664, ans=0.125 2023-06-15 05:41:56,679 INFO [train.py:988] (1/4) Epoch 14, batch 250, loss[loss=0.2772, simple_loss=0.3365, pruned_loss=0.1089, over 20510.00 frames. 
], tot_loss[loss=0.2787, simple_loss=0.3398, pruned_loss=0.1088, over 2723145.18 frames. ], batch size: 160, lr: 2.04e-02, grad_scale: 32.0 2023-06-15 05:42:02,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=47806.666666666664, ans=0.125 2023-06-15 05:42:10,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=47806.666666666664, ans=0.125 2023-06-15 05:43:06,525 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 2.140e+02 2.401e+02 2.980e+02 6.123e+02, threshold=4.801e+02, percent-clipped=4.0 2023-06-15 05:43:06,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=48073.333333333336, ans=0.125 2023-06-15 05:43:15,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=48073.333333333336, ans=0.0 2023-06-15 05:43:23,496 INFO [train.py:988] (1/4) Epoch 14, batch 300, loss[loss=0.2813, simple_loss=0.3477, pruned_loss=0.1074, over 18627.00 frames. ], tot_loss[loss=0.2785, simple_loss=0.3397, pruned_loss=0.1087, over 2961215.04 frames. ], batch size: 80, lr: 2.03e-02, grad_scale: 32.0 2023-06-15 05:43:35,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=48140.0, ans=0.0 2023-06-15 05:43:37,753 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=10.44 vs. limit=15.0 2023-06-15 05:43:45,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=48206.666666666664, ans=0.1 2023-06-15 05:44:44,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=48406.666666666664, ans=0.0 2023-06-15 05:44:50,834 INFO [train.py:988] (1/4) Epoch 14, batch 350, loss[loss=0.2575, simple_loss=0.3252, pruned_loss=0.09494, over 18622.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.3385, pruned_loss=0.1076, over 3161815.85 frames. ], batch size: 80, lr: 2.03e-02, grad_scale: 32.0 2023-06-15 05:44:51,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=48473.333333333336, ans=0.125 2023-06-15 05:45:05,974 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.39 vs. limit=6.0 2023-06-15 05:45:20,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=48540.0, ans=0.95 2023-06-15 05:45:23,033 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.62 vs. limit=15.0 2023-06-15 05:45:24,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=48606.666666666664, ans=0.00030289855072463713 2023-06-15 05:45:31,199 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.94 vs. 
limit=6.0 2023-06-15 05:45:38,277 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.23 vs. limit=22.5 2023-06-15 05:45:58,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=48740.0, ans=0.0 2023-06-15 05:45:59,720 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 2.238e+02 2.722e+02 3.535e+02 5.292e+02, threshold=5.444e+02, percent-clipped=1.0 2023-06-15 05:46:03,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=48740.0, ans=0.125 2023-06-15 05:46:05,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=48740.0, ans=0.125 2023-06-15 05:46:16,363 INFO [train.py:988] (1/4) Epoch 14, batch 400, loss[loss=0.2641, simple_loss=0.3361, pruned_loss=0.09601, over 19234.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.3376, pruned_loss=0.1071, over 3317311.71 frames. ], batch size: 92, lr: 2.03e-02, grad_scale: 32.0 2023-06-15 05:47:05,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=48940.0, ans=0.125 2023-06-15 05:47:19,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=49006.666666666664, ans=0.125 2023-06-15 05:47:19,312 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 05:47:38,564 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.60 vs. limit=6.0 2023-06-15 05:47:42,468 INFO [train.py:988] (1/4) Epoch 14, batch 450, loss[loss=0.2409, simple_loss=0.3166, pruned_loss=0.08262, over 19199.00 frames. ], tot_loss[loss=0.2758, simple_loss=0.3375, pruned_loss=0.1071, over 3431078.44 frames. ], batch size: 92, lr: 2.02e-02, grad_scale: 32.0 2023-06-15 05:47:44,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=49140.0, ans=0.0 2023-06-15 05:48:00,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=49206.666666666664, ans=0.0 2023-06-15 05:48:09,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=49206.666666666664, ans=0.2 2023-06-15 05:48:50,092 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.664e+02 2.110e+02 2.476e+02 3.051e+02 4.850e+02, threshold=4.953e+02, percent-clipped=0.0 2023-06-15 05:49:06,430 INFO [train.py:988] (1/4) Epoch 14, batch 500, loss[loss=0.2516, simple_loss=0.3258, pruned_loss=0.08867, over 19335.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3378, pruned_loss=0.1065, over 3512355.30 frames. 
], batch size: 98, lr: 2.02e-02, grad_scale: 32.0 2023-06-15 05:49:11,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=49473.333333333336, ans=0.125 2023-06-15 05:49:16,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=49473.333333333336, ans=0.95 2023-06-15 05:49:31,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=49540.0, ans=15.0 2023-06-15 05:49:35,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=49540.0, ans=0.95 2023-06-15 05:49:47,039 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.02 vs. limit=22.5 2023-06-15 05:49:57,585 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.35 vs. limit=15.0 2023-06-15 05:50:25,281 INFO [train.py:988] (1/4) Epoch 15, batch 0, loss[loss=0.2757, simple_loss=0.3386, pruned_loss=0.1064, over 19075.00 frames. ], tot_loss[loss=0.2757, simple_loss=0.3386, pruned_loss=0.1064, over 19075.00 frames. ], batch size: 89, lr: 1.95e-02, grad_scale: 32.0 2023-06-15 05:50:25,281 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 05:50:31,410 INFO [train.py:1020] (1/4) Epoch 15, validation: loss=0.2189, simple_loss=0.3232, pruned_loss=0.05727, over 143649.00 frames. 2023-06-15 05:50:31,410 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 05:50:36,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=49693.333333333336, ans=0.125 2023-06-15 05:50:44,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=49693.333333333336, ans=0.125 2023-06-15 05:50:46,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=49760.0, ans=0.0 2023-06-15 05:51:07,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=49826.666666666664, ans=0.125 2023-06-15 05:51:46,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=49960.0, ans=0.125 2023-06-15 05:51:55,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=49960.0, ans=0.0 2023-06-15 05:51:58,712 INFO [train.py:988] (1/4) Epoch 15, batch 50, loss[loss=0.2727, simple_loss=0.323, pruned_loss=0.1111, over 20115.00 frames. ], tot_loss[loss=0.2755, simple_loss=0.3373, pruned_loss=0.1069, over 859621.87 frames. 
], batch size: 239, lr: 1.95e-02, grad_scale: 32.0 2023-06-15 05:52:10,519 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.697e+02 2.143e+02 2.454e+02 2.855e+02 6.420e+02, threshold=4.907e+02, percent-clipped=3.0 2023-06-15 05:52:10,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=50026.666666666664, ans=0.0 2023-06-15 05:52:17,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=50093.333333333336, ans=0.0 2023-06-15 05:53:00,471 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.75 vs. limit=15.0 2023-06-15 05:53:10,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=50293.333333333336, ans=0.0 2023-06-15 05:53:26,702 INFO [train.py:988] (1/4) Epoch 15, batch 100, loss[loss=0.2638, simple_loss=0.3205, pruned_loss=0.1035, over 20516.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3364, pruned_loss=0.1057, over 1509464.03 frames. ], batch size: 173, lr: 1.95e-02, grad_scale: 32.0 2023-06-15 05:53:30,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=50360.0, ans=0.125 2023-06-15 05:53:42,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=50426.666666666664, ans=0.09899494936611666 2023-06-15 05:53:43,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=50426.666666666664, ans=0.125 2023-06-15 05:54:10,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=50493.333333333336, ans=0.1 2023-06-15 05:54:38,916 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.61 vs. limit=22.5 2023-06-15 05:54:46,262 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.01 vs. limit=22.5 2023-06-15 05:54:54,018 INFO [train.py:988] (1/4) Epoch 15, batch 150, loss[loss=0.2906, simple_loss=0.3518, pruned_loss=0.1147, over 19870.00 frames. ], tot_loss[loss=0.2742, simple_loss=0.3371, pruned_loss=0.1057, over 2020084.19 frames. ], batch size: 120, lr: 1.94e-02, grad_scale: 32.0 2023-06-15 05:55:06,225 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.747e+02 2.090e+02 2.403e+02 2.891e+02 4.165e+02, threshold=4.806e+02, percent-clipped=0.0 2023-06-15 05:55:12,378 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.64 vs. limit=6.0 2023-06-15 05:55:35,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=50826.666666666664, ans=0.07 2023-06-15 05:55:49,887 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.43 vs. 
limit=15.0 2023-06-15 05:55:50,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=50893.333333333336, ans=0.0 2023-06-15 05:55:55,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=50893.333333333336, ans=0.125 2023-06-15 05:56:00,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=50893.333333333336, ans=0.2 2023-06-15 05:56:22,228 INFO [train.py:988] (1/4) Epoch 15, batch 200, loss[loss=0.2612, simple_loss=0.3272, pruned_loss=0.09757, over 19071.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.3361, pruned_loss=0.1063, over 2406198.79 frames. ], batch size: 94, lr: 1.94e-02, grad_scale: 32.0 2023-06-15 05:56:30,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=51026.666666666664, ans=0.125 2023-06-15 05:56:46,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=51093.333333333336, ans=0.125 2023-06-15 05:57:11,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=51160.0, ans=0.0 2023-06-15 05:57:34,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=51293.333333333336, ans=15.0 2023-06-15 05:57:50,257 INFO [train.py:988] (1/4) Epoch 15, batch 250, loss[loss=0.2822, simple_loss=0.3509, pruned_loss=0.1067, over 19122.00 frames. ], tot_loss[loss=0.2722, simple_loss=0.3348, pruned_loss=0.1048, over 2713234.50 frames. ], batch size: 94, lr: 1.94e-02, grad_scale: 32.0 2023-06-15 05:58:03,083 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 1.983e+02 2.255e+02 2.708e+02 4.170e+02, threshold=4.510e+02, percent-clipped=0.0 2023-06-15 05:58:13,273 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.79 vs. limit=22.5 2023-06-15 05:58:35,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=51493.333333333336, ans=0.0 2023-06-15 05:59:00,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=51626.666666666664, ans=0.125 2023-06-15 05:59:01,347 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=15.0 2023-06-15 05:59:14,331 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.96 vs. limit=22.5 2023-06-15 05:59:19,770 INFO [train.py:988] (1/4) Epoch 15, batch 300, loss[loss=0.2693, simple_loss=0.3439, pruned_loss=0.09731, over 18316.00 frames. ], tot_loss[loss=0.2724, simple_loss=0.3345, pruned_loss=0.1052, over 2947014.64 frames. 
], batch size: 74, lr: 1.93e-02, grad_scale: 32.0 2023-06-15 05:59:20,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=51693.333333333336, ans=0.0 2023-06-15 05:59:24,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=51693.333333333336, ans=15.0 2023-06-15 05:59:37,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=51760.0, ans=0.05 2023-06-15 05:59:41,914 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.48 vs. limit=22.5 2023-06-15 05:59:53,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=51826.666666666664, ans=0.125 2023-06-15 06:00:02,452 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=5.64 vs. limit=15.0 2023-06-15 06:00:47,517 INFO [train.py:988] (1/4) Epoch 15, batch 350, loss[loss=0.279, simple_loss=0.3661, pruned_loss=0.09595, over 17642.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3347, pruned_loss=0.1042, over 3132620.73 frames. ], batch size: 67, lr: 1.93e-02, grad_scale: 32.0 2023-06-15 06:00:51,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=52026.666666666664, ans=0.2 2023-06-15 06:00:55,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=52026.666666666664, ans=0.125 2023-06-15 06:01:00,532 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.637e+02 2.062e+02 2.429e+02 2.907e+02 4.781e+02, threshold=4.857e+02, percent-clipped=2.0 2023-06-15 06:01:53,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=52226.666666666664, ans=0.2 2023-06-15 06:02:01,324 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=15.0 2023-06-15 06:02:03,992 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-15 06:02:06,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=52293.333333333336, ans=0.125 2023-06-15 06:02:07,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=52293.333333333336, ans=0.1 2023-06-15 06:02:13,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=52293.333333333336, ans=0.125 2023-06-15 06:02:16,536 INFO [train.py:988] (1/4) Epoch 15, batch 400, loss[loss=0.2546, simple_loss=0.3271, pruned_loss=0.09099, over 19518.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3349, pruned_loss=0.1041, over 3270584.23 frames. ], batch size: 102, lr: 1.93e-02, grad_scale: 32.0 2023-06-15 06:02:24,681 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.23 vs. 
limit=15.0 2023-06-15 06:02:47,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=52426.666666666664, ans=0.0 2023-06-15 06:03:05,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=52493.333333333336, ans=0.1 2023-06-15 06:03:24,009 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.36 vs. limit=10.0 2023-06-15 06:03:43,716 INFO [train.py:988] (1/4) Epoch 15, batch 450, loss[loss=0.2732, simple_loss=0.3386, pruned_loss=0.1038, over 19068.00 frames. ], tot_loss[loss=0.2715, simple_loss=0.3352, pruned_loss=0.1039, over 3380445.98 frames. ], batch size: 94, lr: 1.92e-02, grad_scale: 32.0 2023-06-15 06:03:48,257 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 06:03:56,856 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.715e+02 2.130e+02 2.421e+02 3.094e+02 4.907e+02, threshold=4.841e+02, percent-clipped=1.0 2023-06-15 06:04:16,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=52760.0, ans=0.125 2023-06-15 06:04:37,112 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.39 vs. limit=15.0 2023-06-15 06:05:00,330 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 06:05:04,223 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.66 vs. limit=12.0 2023-06-15 06:05:05,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=52960.0, ans=0.1 2023-06-15 06:05:10,601 INFO [train.py:988] (1/4) Epoch 15, batch 500, loss[loss=0.2657, simple_loss=0.3328, pruned_loss=0.09932, over 19233.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.3346, pruned_loss=0.1036, over 3488028.61 frames. ], batch size: 92, lr: 1.92e-02, grad_scale: 32.0 2023-06-15 06:05:52,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=53160.0, ans=0.125 2023-06-15 06:06:28,448 INFO [train.py:988] (1/4) Epoch 16, batch 0, loss[loss=0.2945, simple_loss=0.3098, pruned_loss=0.1396, over 16940.00 frames. ], tot_loss[loss=0.2945, simple_loss=0.3098, pruned_loss=0.1396, over 16940.00 frames. ], batch size: 391, lr: 1.86e-02, grad_scale: 32.0 2023-06-15 06:06:28,449 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 06:06:34,515 INFO [train.py:1020] (1/4) Epoch 16, validation: loss=0.2134, simple_loss=0.3194, pruned_loss=0.05367, over 143649.00 frames. 2023-06-15 06:06:34,516 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 06:06:41,975 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.88 vs. 
limit=15.0 2023-06-15 06:06:44,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=53240.0, ans=0.07 2023-06-15 06:06:50,087 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=53306.666666666664, ans=0.125 2023-06-15 06:06:54,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=53306.666666666664, ans=0.125 2023-06-15 06:07:19,234 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.54 vs. limit=15.0 2023-06-15 06:07:19,736 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.212e+02 2.676e+02 3.191e+02 5.269e+02, threshold=5.353e+02, percent-clipped=1.0 2023-06-15 06:07:41,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=53440.0, ans=0.125 2023-06-15 06:08:01,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=53573.333333333336, ans=0.125 2023-06-15 06:08:03,129 INFO [train.py:988] (1/4) Epoch 16, batch 50, loss[loss=0.2991, simple_loss=0.3537, pruned_loss=0.1222, over 20308.00 frames. ], tot_loss[loss=0.2698, simple_loss=0.3319, pruned_loss=0.1038, over 853084.44 frames. ], batch size: 149, lr: 1.86e-02, grad_scale: 32.0 2023-06-15 06:08:55,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=53773.333333333336, ans=0.0 2023-06-15 06:09:21,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=53840.0, ans=0.025 2023-06-15 06:09:29,418 INFO [train.py:988] (1/4) Epoch 16, batch 100, loss[loss=0.2855, simple_loss=0.3403, pruned_loss=0.1154, over 20102.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3322, pruned_loss=0.1028, over 1518433.39 frames. ], batch size: 133, lr: 1.85e-02, grad_scale: 32.0 2023-06-15 06:10:02,267 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.83 vs. limit=15.0 2023-06-15 06:10:12,493 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.622e+02 1.999e+02 2.214e+02 2.667e+02 3.874e+02, threshold=4.428e+02, percent-clipped=0.0 2023-06-15 06:10:13,864 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.66 vs. limit=15.0 2023-06-15 06:10:24,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=54106.666666666664, ans=0.0 2023-06-15 06:10:55,320 INFO [train.py:988] (1/4) Epoch 16, batch 150, loss[loss=0.3001, simple_loss=0.3782, pruned_loss=0.111, over 18322.00 frames. ], tot_loss[loss=0.2693, simple_loss=0.3327, pruned_loss=0.103, over 2032927.23 frames. ], batch size: 72, lr: 1.85e-02, grad_scale: 32.0 2023-06-15 06:10:56,359 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.68 vs. 
limit=6.0 2023-06-15 06:12:16,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=54506.666666666664, ans=0.125 2023-06-15 06:12:22,655 INFO [train.py:988] (1/4) Epoch 16, batch 200, loss[loss=0.2854, simple_loss=0.3568, pruned_loss=0.107, over 18288.00 frames. ], tot_loss[loss=0.2693, simple_loss=0.3333, pruned_loss=0.1027, over 2420733.94 frames. ], batch size: 74, lr: 1.85e-02, grad_scale: 32.0 2023-06-15 06:12:29,301 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.29 vs. limit=15.0 2023-06-15 06:12:29,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=54573.333333333336, ans=0.125 2023-06-15 06:12:43,443 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.26 vs. limit=12.0 2023-06-15 06:13:05,778 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.741e+02 2.168e+02 2.479e+02 3.039e+02 4.350e+02, threshold=4.958e+02, percent-clipped=0.0 2023-06-15 06:13:21,061 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-15 06:13:26,141 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=14.87 vs. limit=15.0 2023-06-15 06:13:42,245 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.74 vs. limit=22.5 2023-06-15 06:13:50,157 INFO [train.py:988] (1/4) Epoch 16, batch 250, loss[loss=0.2568, simple_loss=0.3249, pruned_loss=0.09433, over 19827.00 frames. ], tot_loss[loss=0.2687, simple_loss=0.3328, pruned_loss=0.1023, over 2715609.97 frames. ], batch size: 115, lr: 1.85e-02, grad_scale: 32.0 2023-06-15 06:14:17,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=54973.333333333336, ans=0.125 2023-06-15 06:14:25,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=55040.0, ans=0.0 2023-06-15 06:14:59,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=55173.333333333336, ans=15.0 2023-06-15 06:15:04,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=55173.333333333336, ans=0.125 2023-06-15 06:15:11,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=55173.333333333336, ans=0.125 2023-06-15 06:15:15,832 INFO [train.py:988] (1/4) Epoch 16, batch 300, loss[loss=0.2704, simple_loss=0.3511, pruned_loss=0.09487, over 15339.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3326, pruned_loss=0.102, over 2947872.87 frames. 
], batch size: 44, lr: 1.84e-02, grad_scale: 32.0 2023-06-15 06:15:16,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=55240.0, ans=0.0 2023-06-15 06:15:18,338 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.39 vs. limit=15.0 2023-06-15 06:15:23,345 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=55240.0, ans=0.125 2023-06-15 06:15:23,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=55240.0, ans=0.07 2023-06-15 06:15:39,455 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.95 vs. limit=12.0 2023-06-15 06:15:48,120 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.47 vs. limit=15.0 2023-06-15 06:15:59,334 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.606e+02 2.170e+02 2.637e+02 3.215e+02 4.848e+02, threshold=5.274e+02, percent-clipped=0.0 2023-06-15 06:16:43,041 INFO [train.py:988] (1/4) Epoch 16, batch 350, loss[loss=0.2563, simple_loss=0.3328, pruned_loss=0.08991, over 19813.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3316, pruned_loss=0.1009, over 3139011.46 frames. ], batch size: 115, lr: 1.84e-02, grad_scale: 32.0 2023-06-15 06:16:43,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=55573.333333333336, ans=0.5 2023-06-15 06:16:53,191 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0 2023-06-15 06:17:23,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=55706.666666666664, ans=0.1 2023-06-15 06:17:34,129 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.51 vs. limit=22.5 2023-06-15 06:18:10,023 INFO [train.py:988] (1/4) Epoch 16, batch 400, loss[loss=0.2652, simple_loss=0.34, pruned_loss=0.09521, over 18931.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3311, pruned_loss=0.1008, over 3285090.40 frames. ], batch size: 86, lr: 1.84e-02, grad_scale: 32.0 2023-06-15 06:18:32,983 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=55973.333333333336, ans=0.07 2023-06-15 06:18:53,437 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.986e+02 2.238e+02 2.515e+02 3.872e+02, threshold=4.476e+02, percent-clipped=0.0 2023-06-15 06:18:57,660 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.62 vs. limit=15.0 2023-06-15 06:19:19,078 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.51 vs. 
limit=22.5 2023-06-15 06:19:24,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=56173.333333333336, ans=0.125 2023-06-15 06:19:28,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=56173.333333333336, ans=0.2 2023-06-15 06:19:36,774 INFO [train.py:988] (1/4) Epoch 16, batch 450, loss[loss=0.249, simple_loss=0.3233, pruned_loss=0.08738, over 19666.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3308, pruned_loss=0.1015, over 3394643.73 frames. ], batch size: 110, lr: 1.83e-02, grad_scale: 32.0 2023-06-15 06:19:51,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=56240.0, ans=0.125 2023-06-15 06:19:51,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=56240.0, ans=0.2 2023-06-15 06:19:59,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=56306.666666666664, ans=0.125 2023-06-15 06:20:50,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=56506.666666666664, ans=0.1 2023-06-15 06:20:59,729 INFO [train.py:988] (1/4) Epoch 16, batch 500, loss[loss=0.2546, simple_loss=0.324, pruned_loss=0.09259, over 19558.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3299, pruned_loss=0.1008, over 3477263.55 frames. ], batch size: 102, lr: 1.83e-02, grad_scale: 32.0 2023-06-15 06:21:40,071 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 2.020e+02 2.317e+02 2.823e+02 4.617e+02, threshold=4.634e+02, percent-clipped=2.0 2023-06-15 06:22:11,784 INFO [train.py:988] (1/4) Epoch 17, batch 0, loss[loss=0.2768, simple_loss=0.3443, pruned_loss=0.1046, over 19553.00 frames. ], tot_loss[loss=0.2768, simple_loss=0.3443, pruned_loss=0.1046, over 19553.00 frames. ], batch size: 102, lr: 1.78e-02, grad_scale: 32.0 2023-06-15 06:22:11,785 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 06:22:17,845 INFO [train.py:1020] (1/4) Epoch 17, validation: loss=0.2144, simple_loss=0.3175, pruned_loss=0.05564, over 143649.00 frames. 2023-06-15 06:22:17,846 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 06:22:38,392 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.55 vs. limit=15.0 2023-06-15 06:22:52,565 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.47 vs. 
limit=15.0 2023-06-15 06:22:53,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=56920.0, ans=0.0 2023-06-15 06:22:55,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=56920.0, ans=0.125 2023-06-15 06:22:55,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=56920.0, ans=0.125 2023-06-15 06:22:57,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=56920.0, ans=0.2 2023-06-15 06:23:45,531 INFO [train.py:988] (1/4) Epoch 17, batch 50, loss[loss=0.2546, simple_loss=0.326, pruned_loss=0.09163, over 18776.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3311, pruned_loss=0.09969, over 863742.02 frames. ], batch size: 83, lr: 1.77e-02, grad_scale: 32.0 2023-06-15 06:24:10,540 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 06:24:15,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=57186.666666666664, ans=0.0 2023-06-15 06:24:33,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=57253.333333333336, ans=0.1 2023-06-15 06:24:50,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=57320.0, ans=0.125 2023-06-15 06:25:01,587 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 2.063e+02 2.327e+02 2.647e+02 3.796e+02, threshold=4.655e+02, percent-clipped=0.0 2023-06-15 06:25:03,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=57386.666666666664, ans=0.125 2023-06-15 06:25:14,205 INFO [train.py:988] (1/4) Epoch 17, batch 100, loss[loss=0.259, simple_loss=0.3263, pruned_loss=0.09588, over 18461.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3309, pruned_loss=0.09891, over 1512554.14 frames. ], batch size: 77, lr: 1.77e-02, grad_scale: 32.0 2023-06-15 06:25:37,980 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=57520.0, ans=0.0 2023-06-15 06:26:16,070 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.38 vs. limit=15.0 2023-06-15 06:26:42,435 INFO [train.py:988] (1/4) Epoch 17, batch 150, loss[loss=0.2528, simple_loss=0.3057, pruned_loss=0.1, over 20013.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.329, pruned_loss=0.09794, over 2026861.81 frames. 
], batch size: 294, lr: 1.77e-02, grad_scale: 64.0 2023-06-15 06:27:31,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=57920.0, ans=0.1 2023-06-15 06:27:46,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=57986.666666666664, ans=0.1 2023-06-15 06:27:51,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=58053.333333333336, ans=0.2 2023-06-15 06:27:57,486 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.716e+02 2.306e+02 2.737e+02 3.252e+02 5.355e+02, threshold=5.474e+02, percent-clipped=3.0 2023-06-15 06:27:59,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=58053.333333333336, ans=0.125 2023-06-15 06:28:08,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=58120.0, ans=0.125 2023-06-15 06:28:09,789 INFO [train.py:988] (1/4) Epoch 17, batch 200, loss[loss=0.2394, simple_loss=0.3092, pruned_loss=0.0848, over 19849.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3297, pruned_loss=0.09833, over 2419940.30 frames. ], batch size: 120, lr: 1.76e-02, grad_scale: 64.0 2023-06-15 06:28:18,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=58120.0, ans=0.0 2023-06-15 06:28:23,136 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.62 vs. limit=6.0 2023-06-15 06:28:54,340 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.74 vs. limit=15.0 2023-06-15 06:29:12,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=58320.0, ans=0.0 2023-06-15 06:29:15,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=58320.0, ans=0.125 2023-06-15 06:29:32,142 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.69 vs. limit=22.5 2023-06-15 06:29:38,286 INFO [train.py:988] (1/4) Epoch 17, batch 250, loss[loss=0.2767, simple_loss=0.3031, pruned_loss=0.1251, over 16829.00 frames. ], tot_loss[loss=0.2627, simple_loss=0.3291, pruned_loss=0.09822, over 2725472.88 frames. ], batch size: 391, lr: 1.76e-02, grad_scale: 64.0 2023-06-15 06:29:38,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=58453.333333333336, ans=0.125 2023-06-15 06:29:47,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=58453.333333333336, ans=0.0 2023-06-15 06:29:49,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=58453.333333333336, ans=0.0 2023-06-15 06:30:03,100 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.56 vs. 
limit=15.0 2023-06-15 06:30:05,996 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.07 vs. limit=15.0 2023-06-15 06:30:19,796 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.74 vs. limit=15.0 2023-06-15 06:30:27,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=58586.666666666664, ans=0.125 2023-06-15 06:30:53,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=58720.0, ans=0.2 2023-06-15 06:30:54,254 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.584e+02 1.968e+02 2.205e+02 2.465e+02 3.814e+02, threshold=4.411e+02, percent-clipped=0.0 2023-06-15 06:31:06,215 INFO [train.py:988] (1/4) Epoch 17, batch 300, loss[loss=0.2591, simple_loss=0.3318, pruned_loss=0.09315, over 18278.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3296, pruned_loss=0.09853, over 2967104.66 frames. ], batch size: 74, lr: 1.76e-02, grad_scale: 64.0 2023-06-15 06:31:10,894 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.89 vs. limit=10.0 2023-06-15 06:31:21,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=58853.333333333336, ans=0.0 2023-06-15 06:32:16,091 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.29 vs. limit=15.0 2023-06-15 06:32:24,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=59053.333333333336, ans=0.0 2023-06-15 06:32:33,731 INFO [train.py:988] (1/4) Epoch 17, batch 350, loss[loss=0.3102, simple_loss=0.372, pruned_loss=0.1242, over 17049.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3292, pruned_loss=0.09802, over 3137978.11 frames. 
], batch size: 60, lr: 1.76e-02, grad_scale: 64.0 2023-06-15 06:32:54,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=59186.666666666664, ans=0.125 2023-06-15 06:32:58,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=59186.666666666664, ans=0.2 2023-06-15 06:33:09,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=59253.333333333336, ans=0.125 2023-06-15 06:33:21,238 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=59253.333333333336, ans=0.2 2023-06-15 06:33:49,930 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.577e+02 2.042e+02 2.346e+02 2.711e+02 3.857e+02, threshold=4.693e+02, percent-clipped=0.0 2023-06-15 06:33:50,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=59386.666666666664, ans=0.1 2023-06-15 06:33:57,225 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 06:34:00,468 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=59453.333333333336, ans=0.125 2023-06-15 06:34:01,689 INFO [train.py:988] (1/4) Epoch 17, batch 400, loss[loss=0.234, simple_loss=0.3139, pruned_loss=0.07708, over 18779.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3285, pruned_loss=0.09745, over 3296369.98 frames. ], batch size: 83, lr: 1.75e-02, grad_scale: 64.0 2023-06-15 06:34:01,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=59453.333333333336, ans=0.125 2023-06-15 06:34:10,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=59453.333333333336, ans=0.1 2023-06-15 06:34:12,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=59453.333333333336, ans=0.0 2023-06-15 06:34:25,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=59520.0, ans=0.1 2023-06-15 06:34:33,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=59520.0, ans=0.0 2023-06-15 06:34:37,319 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.59 vs. limit=15.0 2023-06-15 06:35:26,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=59786.666666666664, ans=0.125 2023-06-15 06:35:27,600 INFO [train.py:988] (1/4) Epoch 17, batch 450, loss[loss=0.28, simple_loss=0.3521, pruned_loss=0.1039, over 18273.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.3276, pruned_loss=0.09739, over 3388085.05 frames. 
], batch size: 74, lr: 1.75e-02, grad_scale: 64.0 2023-06-15 06:35:43,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=59853.333333333336, ans=0.125 2023-06-15 06:35:48,455 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 06:35:58,484 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.89 vs. limit=22.5 2023-06-15 06:36:05,144 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.70 vs. limit=12.0 2023-06-15 06:36:41,059 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.612e+02 2.138e+02 2.622e+02 3.155e+02 6.039e+02, threshold=5.245e+02, percent-clipped=6.0 2023-06-15 06:36:52,652 INFO [train.py:988] (1/4) Epoch 17, batch 500, loss[loss=0.284, simple_loss=0.3284, pruned_loss=0.1198, over 19838.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3274, pruned_loss=0.09726, over 3484426.60 frames. ], batch size: 293, lr: 1.75e-02, grad_scale: 64.0 2023-06-15 06:37:03,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=60120.0, ans=0.125 2023-06-15 06:37:08,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=60186.666666666664, ans=0.0 2023-06-15 06:37:23,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=60186.666666666664, ans=0.05 2023-06-15 06:37:23,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=60186.666666666664, ans=0.2 2023-06-15 06:37:24,826 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 06:37:28,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=60253.333333333336, ans=0.125 2023-06-15 06:38:08,683 INFO [train.py:988] (1/4) Epoch 18, batch 0, loss[loss=0.2581, simple_loss=0.3296, pruned_loss=0.09324, over 19695.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3296, pruned_loss=0.09324, over 19695.00 frames. ], batch size: 110, lr: 1.70e-02, grad_scale: 64.0 2023-06-15 06:38:08,683 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 06:38:14,740 INFO [train.py:1020] (1/4) Epoch 18, validation: loss=0.2126, simple_loss=0.3161, pruned_loss=0.05459, over 143649.00 frames. 2023-06-15 06:38:14,741 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 06:39:04,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=60466.666666666664, ans=0.125 2023-06-15 06:39:19,151 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.53 vs. 
limit=22.5 2023-06-15 06:39:32,238 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=60600.0, ans=0.125 2023-06-15 06:39:35,351 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=60600.0, ans=0.1 2023-06-15 06:39:38,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=60600.0, ans=0.0 2023-06-15 06:39:42,216 INFO [train.py:988] (1/4) Epoch 18, batch 50, loss[loss=0.2794, simple_loss=0.3427, pruned_loss=0.1081, over 19106.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3281, pruned_loss=0.09635, over 856169.31 frames. ], batch size: 94, lr: 1.69e-02, grad_scale: 64.0 2023-06-15 06:40:00,853 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.965e+02 2.316e+02 2.715e+02 4.312e+02, threshold=4.632e+02, percent-clipped=0.0 2023-06-15 06:40:08,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=60733.333333333336, ans=0.1 2023-06-15 06:40:17,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=60800.0, ans=0.0 2023-06-15 06:40:27,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=60800.0, ans=0.125 2023-06-15 06:41:01,182 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=60933.333333333336, ans=0.125 2023-06-15 06:41:03,049 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 06:41:10,128 INFO [train.py:988] (1/4) Epoch 18, batch 100, loss[loss=0.2701, simple_loss=0.3356, pruned_loss=0.1023, over 19099.00 frames. ], tot_loss[loss=0.258, simple_loss=0.326, pruned_loss=0.09498, over 1525206.32 frames. ], batch size: 94, lr: 1.69e-02, grad_scale: 64.0 2023-06-15 06:41:17,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=61000.0, ans=0.125 2023-06-15 06:41:17,475 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.19 vs. limit=22.5 2023-06-15 06:41:43,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=61133.333333333336, ans=0.2 2023-06-15 06:41:57,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=61133.333333333336, ans=0.0 2023-06-15 06:42:37,688 INFO [train.py:988] (1/4) Epoch 18, batch 150, loss[loss=0.328, simple_loss=0.3859, pruned_loss=0.135, over 17662.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3264, pruned_loss=0.09632, over 2039097.88 frames. 
], batch size: 67, lr: 1.69e-02, grad_scale: 64.0 2023-06-15 06:42:45,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=61333.333333333336, ans=0.1 2023-06-15 06:42:57,616 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.664e+02 2.047e+02 2.339e+02 2.700e+02 3.981e+02, threshold=4.677e+02, percent-clipped=0.0 2023-06-15 06:43:04,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=61400.0, ans=0.1 2023-06-15 06:43:10,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=61400.0, ans=0.0 2023-06-15 06:43:29,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=61533.333333333336, ans=0.125 2023-06-15 06:43:36,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=61533.333333333336, ans=0.0 2023-06-15 06:43:50,439 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.88 vs. limit=22.5 2023-06-15 06:44:04,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=61666.666666666664, ans=0.125 2023-06-15 06:44:06,113 INFO [train.py:988] (1/4) Epoch 18, batch 200, loss[loss=0.2657, simple_loss=0.3506, pruned_loss=0.09042, over 18342.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3267, pruned_loss=0.09569, over 2436367.63 frames. ], batch size: 72, lr: 1.69e-02, grad_scale: 64.0 2023-06-15 06:44:35,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=61733.333333333336, ans=0.125 2023-06-15 06:45:08,170 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.34 vs. limit=15.0 2023-06-15 06:45:11,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=61866.666666666664, ans=0.0 2023-06-15 06:45:23,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=61933.333333333336, ans=0.0 2023-06-15 06:45:33,961 INFO [train.py:988] (1/4) Epoch 18, batch 250, loss[loss=0.2634, simple_loss=0.3228, pruned_loss=0.102, over 20250.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3262, pruned_loss=0.09544, over 2731628.86 frames. 
], batch size: 239, lr: 1.68e-02, grad_scale: 64.0 2023-06-15 06:45:53,599 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 2.076e+02 2.248e+02 2.609e+02 3.858e+02, threshold=4.496e+02, percent-clipped=0.0 2023-06-15 06:46:06,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=62066.666666666664, ans=0.0 2023-06-15 06:46:11,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=62133.333333333336, ans=0.2 2023-06-15 06:46:29,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=62200.0, ans=0.035 2023-06-15 06:46:53,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=62266.666666666664, ans=0.0 2023-06-15 06:46:54,132 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.62 vs. limit=22.5 2023-06-15 06:46:57,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=62266.666666666664, ans=0.125 2023-06-15 06:47:02,685 INFO [train.py:988] (1/4) Epoch 18, batch 300, loss[loss=0.2827, simple_loss=0.3589, pruned_loss=0.1033, over 17578.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3254, pruned_loss=0.09435, over 2982303.29 frames. ], batch size: 67, lr: 1.68e-02, grad_scale: 64.0 2023-06-15 06:47:41,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=62466.666666666664, ans=0.125 2023-06-15 06:48:24,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=62600.0, ans=0.125 2023-06-15 06:48:30,200 INFO [train.py:988] (1/4) Epoch 18, batch 350, loss[loss=0.254, simple_loss=0.3185, pruned_loss=0.09472, over 20304.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3243, pruned_loss=0.09411, over 3173045.29 frames. ], batch size: 141, lr: 1.68e-02, grad_scale: 32.0 2023-06-15 06:48:42,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=62666.666666666664, ans=0.125 2023-06-15 06:48:51,036 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.624e+02 2.098e+02 2.384e+02 2.713e+02 4.621e+02, threshold=4.767e+02, percent-clipped=2.0 2023-06-15 06:49:54,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=62933.333333333336, ans=0.125 2023-06-15 06:49:57,690 INFO [train.py:988] (1/4) Epoch 18, batch 400, loss[loss=0.2794, simple_loss=0.3055, pruned_loss=0.1267, over 16761.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3239, pruned_loss=0.09451, over 3313252.26 frames. ], batch size: 392, lr: 1.68e-02, grad_scale: 32.0 2023-06-15 06:49:59,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=63000.0, ans=0.0 2023-06-15 06:50:38,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=63133.333333333336, ans=0.2 2023-06-15 06:50:50,438 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=25.67 vs. 
limit=15.0 2023-06-15 06:51:26,161 INFO [train.py:988] (1/4) Epoch 18, batch 450, loss[loss=0.259, simple_loss=0.3324, pruned_loss=0.09283, over 18617.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3234, pruned_loss=0.09404, over 3426350.64 frames. ], batch size: 80, lr: 1.67e-02, grad_scale: 32.0 2023-06-15 06:51:35,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=63333.333333333336, ans=0.1 2023-06-15 06:51:35,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=63333.333333333336, ans=0.2 2023-06-15 06:51:47,896 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 2.045e+02 2.298e+02 2.779e+02 4.422e+02, threshold=4.596e+02, percent-clipped=0.0 2023-06-15 06:52:31,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=63533.333333333336, ans=10.0 2023-06-15 06:52:51,114 INFO [train.py:988] (1/4) Epoch 18, batch 500, loss[loss=0.2871, simple_loss=0.3223, pruned_loss=0.1259, over 19714.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3235, pruned_loss=0.09481, over 3525046.41 frames. ], batch size: 293, lr: 1.67e-02, grad_scale: 32.0 2023-06-15 06:52:51,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=63666.666666666664, ans=0.2 2023-06-15 06:52:52,077 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.36 vs. limit=15.0 2023-06-15 06:52:56,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=63666.666666666664, ans=0.125 2023-06-15 06:53:29,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=63800.0, ans=0.125 2023-06-15 06:54:08,041 INFO [train.py:988] (1/4) Epoch 19, batch 0, loss[loss=0.2585, simple_loss=0.3204, pruned_loss=0.09828, over 20721.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3204, pruned_loss=0.09828, over 20721.00 frames. ], batch size: 211, lr: 1.62e-02, grad_scale: 32.0 2023-06-15 06:54:08,042 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 06:54:14,164 INFO [train.py:1020] (1/4) Epoch 19, validation: loss=0.2113, simple_loss=0.3157, pruned_loss=0.05349, over 143649.00 frames. 2023-06-15 06:54:14,165 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 06:54:17,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=63880.0, ans=0.1 2023-06-15 06:54:42,168 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.56 vs. limit=6.0 2023-06-15 06:54:43,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=63946.666666666664, ans=0.95 2023-06-15 06:54:45,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=63946.666666666664, ans=0.125 2023-06-15 06:54:51,532 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.49 vs. 
limit=10.0 2023-06-15 06:55:05,671 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.944e+02 2.133e+02 2.428e+02 3.266e+02, threshold=4.266e+02, percent-clipped=0.0 2023-06-15 06:55:19,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=64080.0, ans=0.04949747468305833 2023-06-15 06:55:28,133 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 06:55:36,232 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.83 vs. limit=10.0 2023-06-15 06:55:40,299 INFO [train.py:988] (1/4) Epoch 19, batch 50, loss[loss=0.2365, simple_loss=0.3099, pruned_loss=0.08157, over 19110.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3252, pruned_loss=0.09272, over 840424.45 frames. ], batch size: 94, lr: 1.62e-02, grad_scale: 32.0 2023-06-15 06:55:40,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=64213.333333333336, ans=0.125 2023-06-15 06:55:56,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=64280.0, ans=0.0 2023-06-15 06:56:09,211 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.45 vs. limit=22.5 2023-06-15 06:56:10,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=64280.0, ans=0.125 2023-06-15 06:56:13,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=64346.666666666664, ans=0.1 2023-06-15 06:56:26,079 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-15 06:56:35,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=64413.333333333336, ans=0.1 2023-06-15 06:56:41,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=64413.333333333336, ans=0.0 2023-06-15 06:57:08,446 INFO [train.py:988] (1/4) Epoch 19, batch 100, loss[loss=0.2639, simple_loss=0.3373, pruned_loss=0.0953, over 18449.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3232, pruned_loss=0.09445, over 1484955.25 frames. ], batch size: 77, lr: 1.62e-02, grad_scale: 32.0 2023-06-15 06:57:12,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=64546.666666666664, ans=0.0 2023-06-15 06:57:22,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=64546.666666666664, ans=0.125 2023-06-15 06:57:34,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=64613.333333333336, ans=0.05 2023-06-15 06:57:38,308 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.94 vs. 
limit=15.0 2023-06-15 06:57:39,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=64613.333333333336, ans=0.125 2023-06-15 06:57:52,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=64680.0, ans=0.125 2023-06-15 06:58:00,779 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.583e+02 2.073e+02 2.297e+02 2.619e+02 4.375e+02, threshold=4.594e+02, percent-clipped=1.0 2023-06-15 06:58:11,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=64746.666666666664, ans=0.1 2023-06-15 06:58:27,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=64813.333333333336, ans=0.1 2023-06-15 06:58:36,701 INFO [train.py:988] (1/4) Epoch 19, batch 150, loss[loss=0.2448, simple_loss=0.3146, pruned_loss=0.0875, over 19094.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3247, pruned_loss=0.09368, over 1980454.31 frames. ], batch size: 89, lr: 1.62e-02, grad_scale: 32.0 2023-06-15 06:58:59,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=64946.666666666664, ans=0.0 2023-06-15 06:59:20,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=65013.333333333336, ans=0.125 2023-06-15 06:59:33,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=65080.0, ans=0.1 2023-06-15 06:59:33,496 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.80 vs. limit=15.0 2023-06-15 06:59:41,624 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=12.0 2023-06-15 06:59:46,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=65146.666666666664, ans=0.1 2023-06-15 07:00:04,049 INFO [train.py:988] (1/4) Epoch 19, batch 200, loss[loss=0.2799, simple_loss=0.3375, pruned_loss=0.1112, over 20133.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3245, pruned_loss=0.09423, over 2404588.39 frames. 
], batch size: 133, lr: 1.61e-02, grad_scale: 32.0 2023-06-15 07:00:11,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=65213.333333333336, ans=0.0 2023-06-15 07:00:13,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=65213.333333333336, ans=0.1 2023-06-15 07:00:17,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=65213.333333333336, ans=0.125 2023-06-15 07:00:25,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=65280.0, ans=0.0 2023-06-15 07:00:51,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=65346.666666666664, ans=0.125 2023-06-15 07:00:53,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=65346.666666666664, ans=0.1 2023-06-15 07:00:56,931 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.647e+02 1.984e+02 2.245e+02 2.601e+02 3.971e+02, threshold=4.489e+02, percent-clipped=0.0 2023-06-15 07:01:13,494 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.20 vs. limit=10.0 2023-06-15 07:01:32,671 INFO [train.py:988] (1/4) Epoch 19, batch 250, loss[loss=0.2476, simple_loss=0.3054, pruned_loss=0.09488, over 20334.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3233, pruned_loss=0.09363, over 2707212.29 frames. ], batch size: 239, lr: 1.61e-02, grad_scale: 32.0 2023-06-15 07:01:34,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=65546.66666666667, ans=0.0 2023-06-15 07:01:59,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=65613.33333333333, ans=0.0 2023-06-15 07:02:08,607 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.97 vs. limit=15.0 2023-06-15 07:02:36,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=65746.66666666667, ans=0.2 2023-06-15 07:02:48,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=65813.33333333333, ans=0.125 2023-06-15 07:03:01,191 INFO [train.py:988] (1/4) Epoch 19, batch 300, loss[loss=0.2426, simple_loss=0.3071, pruned_loss=0.08905, over 20623.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3232, pruned_loss=0.09314, over 2954937.30 frames. 
], batch size: 173, lr: 1.61e-02, grad_scale: 32.0 2023-06-15 07:03:01,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=65880.0, ans=0.125 2023-06-15 07:03:03,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=65880.0, ans=0.125 2023-06-15 07:03:04,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=65880.0, ans=0.1 2023-06-15 07:03:16,572 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.71 vs. limit=15.0 2023-06-15 07:03:32,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=65946.66666666667, ans=0.0 2023-06-15 07:03:48,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=66013.33333333333, ans=0.125 2023-06-15 07:03:53,120 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.956e+02 2.157e+02 2.430e+02 3.269e+02, threshold=4.313e+02, percent-clipped=0.0 2023-06-15 07:04:03,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=66080.0, ans=0.0 2023-06-15 07:04:25,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=66146.66666666667, ans=0.125 2023-06-15 07:04:28,909 INFO [train.py:988] (1/4) Epoch 19, batch 350, loss[loss=0.255, simple_loss=0.3166, pruned_loss=0.09673, over 20587.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3225, pruned_loss=0.09283, over 3144697.04 frames. ], batch size: 189, lr: 1.61e-02, grad_scale: 32.0 2023-06-15 07:04:48,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=66280.0, ans=0.0 2023-06-15 07:05:24,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=66413.33333333333, ans=0.2 2023-06-15 07:05:35,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=66413.33333333333, ans=0.0 2023-06-15 07:05:48,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=66480.0, ans=0.125 2023-06-15 07:05:48,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=66480.0, ans=0.125 2023-06-15 07:05:54,466 INFO [train.py:988] (1/4) Epoch 19, batch 400, loss[loss=0.2602, simple_loss=0.3364, pruned_loss=0.092, over 18776.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3232, pruned_loss=0.09319, over 3261630.07 frames. 
], batch size: 83, lr: 1.60e-02, grad_scale: 32.0 2023-06-15 07:06:19,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=66613.33333333333, ans=0.0 2023-06-15 07:06:47,306 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.583e+02 1.968e+02 2.388e+02 2.889e+02 4.258e+02, threshold=4.776e+02, percent-clipped=0.0 2023-06-15 07:07:20,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=66880.0, ans=0.125 2023-06-15 07:07:22,220 INFO [train.py:988] (1/4) Epoch 19, batch 450, loss[loss=0.2505, simple_loss=0.3232, pruned_loss=0.08893, over 19111.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3231, pruned_loss=0.09296, over 3383871.16 frames. ], batch size: 94, lr: 1.60e-02, grad_scale: 32.0 2023-06-15 07:07:29,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=66880.0, ans=0.125 2023-06-15 07:07:34,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=66880.0, ans=0.0 2023-06-15 07:07:40,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=66946.66666666667, ans=0.125 2023-06-15 07:08:02,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=67013.33333333333, ans=0.0 2023-06-15 07:08:17,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=67080.0, ans=0.125 2023-06-15 07:08:20,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=67080.0, ans=0.1 2023-06-15 07:08:48,904 INFO [train.py:988] (1/4) Epoch 19, batch 500, loss[loss=0.2365, simple_loss=0.3077, pruned_loss=0.08266, over 19867.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3227, pruned_loss=0.09251, over 3476516.37 frames. ], batch size: 120, lr: 1.60e-02, grad_scale: 32.0 2023-06-15 07:08:52,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=67213.33333333333, ans=0.04949747468305833 2023-06-15 07:09:05,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=67280.0, ans=0.035 2023-06-15 07:09:12,583 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.72 vs. limit=15.0 2023-06-15 07:09:34,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=67346.66666666667, ans=0.125 2023-06-15 07:09:37,676 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.943e+02 2.120e+02 2.445e+02 3.405e+02, threshold=4.239e+02, percent-clipped=0.0 2023-06-15 07:09:38,929 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.29 vs. limit=15.0 2023-06-15 07:10:07,204 INFO [train.py:988] (1/4) Epoch 20, batch 0, loss[loss=0.2687, simple_loss=0.3124, pruned_loss=0.1125, over 19989.00 frames. ], tot_loss[loss=0.2687, simple_loss=0.3124, pruned_loss=0.1125, over 19989.00 frames. 
], batch size: 293, lr: 1.56e-02, grad_scale: 32.0 2023-06-15 07:10:07,205 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 07:10:13,276 INFO [train.py:1020] (1/4) Epoch 20, validation: loss=0.2092, simple_loss=0.3126, pruned_loss=0.05295, over 143649.00 frames. 2023-06-15 07:10:13,278 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 07:10:26,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=67433.33333333333, ans=0.2 2023-06-15 07:10:26,159 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:10:34,439 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.63 vs. limit=15.0 2023-06-15 07:10:35,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=67500.0, ans=0.0 2023-06-15 07:10:38,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=67500.0, ans=0.1 2023-06-15 07:11:31,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=67700.0, ans=0.125 2023-06-15 07:11:36,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=67700.0, ans=0.2 2023-06-15 07:11:41,408 INFO [train.py:988] (1/4) Epoch 20, batch 50, loss[loss=0.2256, simple_loss=0.3086, pruned_loss=0.07128, over 19523.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3221, pruned_loss=0.09268, over 857305.69 frames. ], batch size: 102, lr: 1.55e-02, grad_scale: 32.0 2023-06-15 07:11:56,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=67766.66666666667, ans=0.04949747468305833 2023-06-15 07:13:03,935 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.494e+02 2.103e+02 2.332e+02 2.749e+02 4.592e+02, threshold=4.664e+02, percent-clipped=1.0 2023-06-15 07:13:09,723 INFO [train.py:988] (1/4) Epoch 20, batch 100, loss[loss=0.2674, simple_loss=0.3273, pruned_loss=0.1038, over 19942.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3199, pruned_loss=0.09192, over 1517625.65 frames. ], batch size: 126, lr: 1.55e-02, grad_scale: 32.0 2023-06-15 07:13:15,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=68100.0, ans=0.125 2023-06-15 07:13:48,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=68233.33333333333, ans=0.125 2023-06-15 07:14:03,233 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.65 vs. 
limit=15.0 2023-06-15 07:14:13,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=68300.0, ans=0.125 2023-06-15 07:14:18,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=68366.66666666667, ans=0.1 2023-06-15 07:14:18,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=68366.66666666667, ans=0.125 2023-06-15 07:14:29,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=68366.66666666667, ans=0.125 2023-06-15 07:14:37,881 INFO [train.py:988] (1/4) Epoch 20, batch 150, loss[loss=0.247, simple_loss=0.3015, pruned_loss=0.09621, over 19997.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3189, pruned_loss=0.09158, over 2033243.26 frames. ], batch size: 293, lr: 1.55e-02, grad_scale: 32.0 2023-06-15 07:14:48,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=68433.33333333333, ans=0.0 2023-06-15 07:15:20,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=68566.66666666667, ans=0.1 2023-06-15 07:15:56,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=68700.0, ans=0.125 2023-06-15 07:15:59,994 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 1.981e+02 2.212e+02 2.517e+02 3.865e+02, threshold=4.424e+02, percent-clipped=0.0 2023-06-15 07:16:05,159 INFO [train.py:988] (1/4) Epoch 20, batch 200, loss[loss=0.3079, simple_loss=0.3703, pruned_loss=0.1228, over 16360.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3213, pruned_loss=0.09205, over 2418026.91 frames. ], batch size: 52, lr: 1.55e-02, grad_scale: 32.0 2023-06-15 07:16:34,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=68833.33333333333, ans=0.1 2023-06-15 07:17:11,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=68966.66666666667, ans=0.125 2023-06-15 07:17:11,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=68966.66666666667, ans=0.125 2023-06-15 07:17:16,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=69033.33333333333, ans=0.125 2023-06-15 07:17:28,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=69033.33333333333, ans=0.125 2023-06-15 07:17:33,536 INFO [train.py:988] (1/4) Epoch 20, batch 250, loss[loss=0.2443, simple_loss=0.3063, pruned_loss=0.09121, over 20570.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3208, pruned_loss=0.09172, over 2720876.42 frames. ], batch size: 189, lr: 1.54e-02, grad_scale: 32.0 2023-06-15 07:18:18,323 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.68 vs. 
limit=15.0 2023-06-15 07:18:21,370 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.54 vs. limit=6.0 2023-06-15 07:18:21,523 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.06 vs. limit=22.5 2023-06-15 07:18:27,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=69300.0, ans=0.0 2023-06-15 07:18:31,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=69300.0, ans=0.125 2023-06-15 07:18:35,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=69300.0, ans=0.0 2023-06-15 07:18:54,916 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.07 vs. limit=22.5 2023-06-15 07:18:55,595 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.907e+02 2.171e+02 2.570e+02 4.262e+02, threshold=4.342e+02, percent-clipped=0.0 2023-06-15 07:19:00,682 INFO [train.py:988] (1/4) Epoch 20, batch 300, loss[loss=0.2492, simple_loss=0.3121, pruned_loss=0.09316, over 20694.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3208, pruned_loss=0.09173, over 2958335.92 frames. ], batch size: 211, lr: 1.54e-02, grad_scale: 32.0 2023-06-15 07:19:11,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=69433.33333333333, ans=0.125 2023-06-15 07:19:42,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=69566.66666666667, ans=0.0 2023-06-15 07:20:22,555 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.84 vs. limit=15.0 2023-06-15 07:20:28,567 INFO [train.py:988] (1/4) Epoch 20, batch 350, loss[loss=0.2089, simple_loss=0.2824, pruned_loss=0.06765, over 19345.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3201, pruned_loss=0.091, over 3140299.19 frames. ], batch size: 98, lr: 1.54e-02, grad_scale: 32.0 2023-06-15 07:21:19,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=69966.66666666667, ans=0.125 2023-06-15 07:21:43,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=70033.33333333333, ans=0.125 2023-06-15 07:21:50,096 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 2.058e+02 2.277e+02 2.664e+02 3.684e+02, threshold=4.554e+02, percent-clipped=0.0 2023-06-15 07:21:55,233 INFO [train.py:988] (1/4) Epoch 20, batch 400, loss[loss=0.2415, simple_loss=0.3228, pruned_loss=0.08016, over 18808.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3207, pruned_loss=0.09127, over 3280373.02 frames. 
], batch size: 83, lr: 1.54e-02, grad_scale: 32.0 2023-06-15 07:22:34,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=70233.33333333333, ans=0.1 2023-06-15 07:22:36,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=70233.33333333333, ans=0.0 2023-06-15 07:22:54,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=70300.0, ans=0.125 2023-06-15 07:22:59,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=70300.0, ans=0.125 2023-06-15 07:23:23,992 INFO [train.py:988] (1/4) Epoch 20, batch 450, loss[loss=0.2571, simple_loss=0.3389, pruned_loss=0.08768, over 16337.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3202, pruned_loss=0.09088, over 3390792.58 frames. ], batch size: 52, lr: 1.54e-02, grad_scale: 32.0 2023-06-15 07:23:59,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=70566.66666666667, ans=0.0 2023-06-15 07:24:06,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=70566.66666666667, ans=0.2 2023-06-15 07:24:14,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=70633.33333333333, ans=0.2 2023-06-15 07:24:26,886 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.77 vs. limit=15.0 2023-06-15 07:24:36,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=70700.0, ans=0.1 2023-06-15 07:24:44,763 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.638e+02 2.031e+02 2.228e+02 2.537e+02 4.676e+02, threshold=4.456e+02, percent-clipped=1.0 2023-06-15 07:24:45,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=70700.0, ans=0.0 2023-06-15 07:24:49,647 INFO [train.py:988] (1/4) Epoch 20, batch 500, loss[loss=0.2311, simple_loss=0.3094, pruned_loss=0.07639, over 19102.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3194, pruned_loss=0.09119, over 3500031.64 frames. ], batch size: 94, lr: 1.53e-02, grad_scale: 32.0 2023-06-15 07:25:11,633 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:25:27,778 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:26:07,921 INFO [train.py:988] (1/4) Epoch 21, batch 0, loss[loss=0.2745, simple_loss=0.3389, pruned_loss=0.1051, over 18487.00 frames. ], tot_loss[loss=0.2745, simple_loss=0.3389, pruned_loss=0.1051, over 18487.00 frames. ], batch size: 77, lr: 1.49e-02, grad_scale: 32.0 2023-06-15 07:26:07,922 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 07:26:14,419 INFO [train.py:1020] (1/4) Epoch 21, validation: loss=0.209, simple_loss=0.3126, pruned_loss=0.05274, over 143649.00 frames. 
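The per-batch and validation entries above report three figures: loss, simple_loss, and pruned_loss. Throughout the entries shown here the total is consistent with a fixed 0.5 weight on simple_loss plus the full pruned_loss; for the Epoch 21 validation entry just above, 0.5 * 0.3126 + 0.05274 ≈ 0.209. The exact combination (including any warm-up-dependent weighting) is defined in train.py, which is not reproduced in this log, so the snippet below is only a sketch that matches the logged numbers; the function name and the 0.5 scale are inferred from the log, not taken from the code.

```python
# Illustrative only: reconstructs how the "loss" field in these entries
# appears to relate to "simple_loss" and "pruned_loss". The 0.5 weight is
# inferred from the logged values, not read from train.py.

def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_loss_scale: float = 0.5) -> float:
    """Weighted sum that reproduces the logged 'loss' values."""
    return simple_loss_scale * simple_loss + pruned_loss

# Epoch 21 validation entry above: loss=0.209, simple_loss=0.3126, pruned_loss=0.05274
assert abs(combined_loss(0.3126, 0.05274) - 0.209) < 1e-3

# Epoch 20, batch 0 entry above: loss=0.2687, simple_loss=0.3124, pruned_loss=0.1125
assert abs(combined_loss(0.3124, 0.1125) - 0.2687) < 1e-3
```

If the weighting were warm-up-dependent early in training, the later entries shown here would not reveal it; the 0.5 value only has to hold for the batches logged in this section.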
2023-06-15 07:26:14,420 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 07:26:15,436 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.72 vs. limit=15.0 2023-06-15 07:26:37,525 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.32 vs. limit=15.0 2023-06-15 07:27:41,939 INFO [train.py:988] (1/4) Epoch 21, batch 50, loss[loss=0.2671, simple_loss=0.2925, pruned_loss=0.1209, over 16885.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3144, pruned_loss=0.09032, over 852927.59 frames. ], batch size: 392, lr: 1.49e-02, grad_scale: 32.0 2023-06-15 07:27:47,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=71313.33333333333, ans=0.1 2023-06-15 07:27:50,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=71313.33333333333, ans=0.0 2023-06-15 07:27:58,454 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.93 vs. limit=22.5 2023-06-15 07:28:07,647 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.701e+02 2.043e+02 2.374e+02 2.878e+02 4.060e+02, threshold=4.748e+02, percent-clipped=0.0 2023-06-15 07:28:10,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=71380.0, ans=0.125 2023-06-15 07:28:29,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=71446.66666666667, ans=0.0 2023-06-15 07:28:36,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=71513.33333333333, ans=0.125 2023-06-15 07:28:39,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=71513.33333333333, ans=0.125 2023-06-15 07:29:09,313 INFO [train.py:988] (1/4) Epoch 21, batch 100, loss[loss=0.2304, simple_loss=0.3045, pruned_loss=0.07818, over 19806.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3179, pruned_loss=0.09033, over 1514779.54 frames. ], batch size: 115, lr: 1.49e-02, grad_scale: 32.0 2023-06-15 07:29:16,307 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:30:09,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=71846.66666666667, ans=0.125 2023-06-15 07:30:16,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=71913.33333333333, ans=0.0 2023-06-15 07:30:22,706 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.79 vs. limit=15.0 2023-06-15 07:30:35,478 INFO [train.py:988] (1/4) Epoch 21, batch 150, loss[loss=0.2566, simple_loss=0.3306, pruned_loss=0.09126, over 15346.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.318, pruned_loss=0.09014, over 2023663.78 frames. 
], batch size: 44, lr: 1.49e-02, grad_scale: 32.0 2023-06-15 07:30:43,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=71980.0, ans=0.125 2023-06-15 07:31:01,901 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.683e+02 2.007e+02 2.293e+02 2.740e+02 3.931e+02, threshold=4.586e+02, percent-clipped=0.0 2023-06-15 07:31:14,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=72113.33333333333, ans=0.0 2023-06-15 07:31:28,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=72180.0, ans=0.09899494936611666 2023-06-15 07:31:40,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=72180.0, ans=0.0 2023-06-15 07:31:56,852 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.04 vs. limit=6.0 2023-06-15 07:32:02,954 INFO [train.py:988] (1/4) Epoch 21, batch 200, loss[loss=0.2591, simple_loss=0.3321, pruned_loss=0.09304, over 17645.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3177, pruned_loss=0.08977, over 2415547.48 frames. ], batch size: 67, lr: 1.49e-02, grad_scale: 32.0 2023-06-15 07:32:13,850 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:32:42,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=72446.66666666667, ans=0.1 2023-06-15 07:32:55,663 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=72513.33333333333, ans=0.09899494936611666 2023-06-15 07:33:19,997 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.29 vs. limit=22.5 2023-06-15 07:33:29,795 INFO [train.py:988] (1/4) Epoch 21, batch 250, loss[loss=0.245, simple_loss=0.3069, pruned_loss=0.09155, over 20208.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3173, pruned_loss=0.08999, over 2735027.17 frames. ], batch size: 239, lr: 1.48e-02, grad_scale: 32.0 2023-06-15 07:33:29,995 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=72646.66666666667, ans=0.2 2023-06-15 07:33:52,513 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.81 vs. limit=12.0 2023-06-15 07:33:54,716 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.599e+02 1.979e+02 2.224e+02 2.730e+02 4.525e+02, threshold=4.448e+02, percent-clipped=0.0 2023-06-15 07:33:55,005 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=72713.33333333333, ans=0.125 2023-06-15 07:34:14,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=72780.0, ans=0.0 2023-06-15 07:34:55,960 INFO [train.py:988] (1/4) Epoch 21, batch 300, loss[loss=0.237, simple_loss=0.311, pruned_loss=0.08143, over 19122.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3179, pruned_loss=0.09008, over 2964102.24 frames. 
], batch size: 94, lr: 1.48e-02, grad_scale: 32.0 2023-06-15 07:35:35,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=73113.33333333333, ans=0.0 2023-06-15 07:35:44,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten.whitening_limit, batch_count=73113.33333333333, ans=22.5 2023-06-15 07:35:57,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=73180.0, ans=0.125 2023-06-15 07:36:22,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=73313.33333333333, ans=0.125 2023-06-15 07:36:23,391 INFO [train.py:988] (1/4) Epoch 21, batch 350, loss[loss=0.2316, simple_loss=0.3027, pruned_loss=0.0803, over 20560.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3171, pruned_loss=0.0894, over 3146489.07 frames. ], batch size: 189, lr: 1.48e-02, grad_scale: 32.0 2023-06-15 07:36:35,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=73313.33333333333, ans=0.2 2023-06-15 07:36:49,393 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.513e+02 2.024e+02 2.303e+02 3.042e+02 4.564e+02, threshold=4.607e+02, percent-clipped=2.0 2023-06-15 07:37:29,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=73513.33333333333, ans=0.125 2023-06-15 07:37:30,007 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.49 vs. limit=12.0 2023-06-15 07:37:41,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=73580.0, ans=0.1 2023-06-15 07:37:46,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=73580.0, ans=0.0 2023-06-15 07:37:49,162 INFO [train.py:988] (1/4) Epoch 21, batch 400, loss[loss=0.2502, simple_loss=0.3352, pruned_loss=0.08267, over 17626.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3169, pruned_loss=0.08863, over 3288041.15 frames. ], batch size: 67, lr: 1.48e-02, grad_scale: 32.0 2023-06-15 07:38:02,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=73646.66666666667, ans=0.0 2023-06-15 07:38:27,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=73780.0, ans=0.0 2023-06-15 07:38:30,219 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.34 vs. limit=15.0 2023-06-15 07:39:15,484 INFO [train.py:988] (1/4) Epoch 21, batch 450, loss[loss=0.2686, simple_loss=0.3435, pruned_loss=0.09691, over 16276.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3167, pruned_loss=0.08822, over 3399333.91 frames. 
], batch size: 52, lr: 1.47e-02, grad_scale: 32.0 2023-06-15 07:39:19,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=73980.0, ans=0.125 2023-06-15 07:39:39,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=74046.66666666667, ans=0.2 2023-06-15 07:39:42,140 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.623e+02 2.035e+02 2.412e+02 2.747e+02 5.192e+02, threshold=4.824e+02, percent-clipped=2.0 2023-06-15 07:40:11,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=74180.0, ans=0.0 2023-06-15 07:40:19,801 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.47 vs. limit=15.0 2023-06-15 07:40:40,645 INFO [train.py:988] (1/4) Epoch 21, batch 500, loss[loss=0.2609, simple_loss=0.3425, pruned_loss=0.0897, over 18284.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3159, pruned_loss=0.08767, over 3499608.15 frames. ], batch size: 74, lr: 1.47e-02, grad_scale: 32.0 2023-06-15 07:41:05,746 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.95 vs. limit=15.0 2023-06-15 07:41:15,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=74446.66666666667, ans=0.1 2023-06-15 07:42:00,891 INFO [train.py:988] (1/4) Epoch 22, batch 0, loss[loss=0.2209, simple_loss=0.3029, pruned_loss=0.0694, over 19643.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3029, pruned_loss=0.0694, over 19643.00 frames. ], batch size: 110, lr: 1.44e-02, grad_scale: 32.0 2023-06-15 07:42:00,892 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 07:42:07,048 INFO [train.py:1020] (1/4) Epoch 22, validation: loss=0.2075, simple_loss=0.3107, pruned_loss=0.05212, over 143649.00 frames. 2023-06-15 07:42:07,050 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 07:42:07,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=74533.33333333333, ans=0.2 2023-06-15 07:42:12,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=74533.33333333333, ans=0.125 2023-06-15 07:42:47,640 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:42:52,045 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.64 vs. limit=22.5 2023-06-15 07:43:03,104 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.996e+02 2.190e+02 2.519e+02 3.668e+02, threshold=4.380e+02, percent-clipped=0.0 2023-06-15 07:43:35,416 INFO [train.py:988] (1/4) Epoch 22, batch 50, loss[loss=0.2385, simple_loss=0.3058, pruned_loss=0.08561, over 20592.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3149, pruned_loss=0.08648, over 857235.75 frames. ], batch size: 173, lr: 1.43e-02, grad_scale: 32.0 2023-06-15 07:43:51,140 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.69 vs. 
limit=10.0 2023-06-15 07:43:53,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=74933.33333333333, ans=0.1 2023-06-15 07:44:40,806 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.83 vs. limit=22.5 2023-06-15 07:45:02,465 INFO [train.py:988] (1/4) Epoch 22, batch 100, loss[loss=0.2505, simple_loss=0.3312, pruned_loss=0.08494, over 17639.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3169, pruned_loss=0.0866, over 1508487.56 frames. ], batch size: 67, lr: 1.43e-02, grad_scale: 32.0 2023-06-15 07:45:25,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=75266.66666666667, ans=0.5 2023-06-15 07:45:27,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=75266.66666666667, ans=0.0 2023-06-15 07:45:29,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=75266.66666666667, ans=0.04949747468305833 2023-06-15 07:45:33,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=75266.66666666667, ans=0.07 2023-06-15 07:45:45,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=75333.33333333333, ans=0.125 2023-06-15 07:45:54,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=75400.0, ans=0.125 2023-06-15 07:45:56,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=75400.0, ans=0.125 2023-06-15 07:45:59,429 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.974e+02 2.218e+02 2.504e+02 3.922e+02, threshold=4.437e+02, percent-clipped=0.0 2023-06-15 07:46:05,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=75400.0, ans=0.125 2023-06-15 07:46:21,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=75466.66666666667, ans=0.125 2023-06-15 07:46:31,642 INFO [train.py:988] (1/4) Epoch 22, batch 150, loss[loss=0.2441, simple_loss=0.3203, pruned_loss=0.08395, over 18273.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3165, pruned_loss=0.08682, over 2015850.99 frames. 
], batch size: 74, lr: 1.43e-02, grad_scale: 32.0 2023-06-15 07:46:32,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=75533.33333333333, ans=0.2 2023-06-15 07:46:32,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=75533.33333333333, ans=0.125 2023-06-15 07:46:47,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=75600.0, ans=0.07 2023-06-15 07:47:51,512 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:47:58,666 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:48:00,065 INFO [train.py:988] (1/4) Epoch 22, batch 200, loss[loss=0.2551, simple_loss=0.337, pruned_loss=0.08664, over 17551.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3163, pruned_loss=0.08642, over 2415309.56 frames. ], batch size: 67, lr: 1.43e-02, grad_scale: 32.0 2023-06-15 07:48:27,786 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.64 vs. limit=15.0 2023-06-15 07:48:30,962 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.44 vs. limit=22.5 2023-06-15 07:48:34,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=76000.0, ans=0.125 2023-06-15 07:48:56,268 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.899e+02 2.203e+02 2.409e+02 3.907e+02, threshold=4.406e+02, percent-clipped=0.0 2023-06-15 07:48:58,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=76066.66666666667, ans=0.0 2023-06-15 07:49:03,063 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.83 vs. limit=15.0 2023-06-15 07:49:17,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=76133.33333333333, ans=0.2 2023-06-15 07:49:29,220 INFO [train.py:988] (1/4) Epoch 22, batch 250, loss[loss=0.2461, simple_loss=0.3074, pruned_loss=0.09241, over 20433.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3162, pruned_loss=0.08672, over 2728766.30 frames. ], batch size: 160, lr: 1.43e-02, grad_scale: 64.0 2023-06-15 07:49:34,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=76200.0, ans=0.0 2023-06-15 07:49:34,959 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.69 vs. limit=22.5 2023-06-15 07:49:54,041 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.76 vs. 
limit=10.0 2023-06-15 07:49:55,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=76266.66666666667, ans=0.1 2023-06-15 07:49:57,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=76266.66666666667, ans=0.125 2023-06-15 07:50:01,567 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.97 vs. limit=15.0 2023-06-15 07:50:05,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=76333.33333333333, ans=0.0 2023-06-15 07:50:37,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=76466.66666666667, ans=0.0 2023-06-15 07:50:57,237 INFO [train.py:988] (1/4) Epoch 22, batch 300, loss[loss=0.2292, simple_loss=0.2981, pruned_loss=0.08013, over 20721.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3155, pruned_loss=0.08605, over 2955323.71 frames. ], batch size: 211, lr: 1.42e-02, grad_scale: 64.0 2023-06-15 07:51:03,780 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.44 vs. limit=6.0 2023-06-15 07:51:23,899 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.51 vs. limit=22.5 2023-06-15 07:51:28,252 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=76600.0, ans=0.0 2023-06-15 07:51:35,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=76666.66666666667, ans=0.0 2023-06-15 07:51:53,166 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.638e+02 1.887e+02 2.089e+02 2.496e+02 3.491e+02, threshold=4.177e+02, percent-clipped=0.0 2023-06-15 07:52:09,894 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.34 vs. limit=12.0 2023-06-15 07:52:24,339 INFO [train.py:988] (1/4) Epoch 22, batch 350, loss[loss=0.2918, simple_loss=0.3644, pruned_loss=0.1096, over 18317.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3165, pruned_loss=0.08572, over 3136535.14 frames. 
], batch size: 72, lr: 1.42e-02, grad_scale: 64.0 2023-06-15 07:52:26,581 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=1.105e-02 2023-06-15 07:52:42,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=76933.33333333333, ans=0.125 2023-06-15 07:52:44,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=76933.33333333333, ans=0.0 2023-06-15 07:53:00,378 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 07:53:16,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=77066.66666666667, ans=0.125 2023-06-15 07:53:44,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_na.min_abs, batch_count=77133.33333333333, ans=0.02 2023-06-15 07:53:54,160 INFO [train.py:988] (1/4) Epoch 22, batch 400, loss[loss=0.2326, simple_loss=0.3095, pruned_loss=0.07782, over 19524.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.316, pruned_loss=0.08594, over 3273980.74 frames. ], batch size: 102, lr: 1.42e-02, grad_scale: 64.0 2023-06-15 07:53:56,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=77200.0, ans=0.1 2023-06-15 07:53:59,102 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.54 vs. limit=10.0 2023-06-15 07:54:40,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=77333.33333333333, ans=0.1 2023-06-15 07:54:44,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=77333.33333333333, ans=0.125 2023-06-15 07:54:47,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=77400.0, ans=0.0 2023-06-15 07:54:50,819 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.618e+02 1.936e+02 2.140e+02 2.519e+02 4.231e+02, threshold=4.281e+02, percent-clipped=1.0 2023-06-15 07:55:00,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=77400.0, ans=0.125 2023-06-15 07:55:21,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=77533.33333333333, ans=0.125 2023-06-15 07:55:22,549 INFO [train.py:988] (1/4) Epoch 22, batch 450, loss[loss=0.2309, simple_loss=0.3049, pruned_loss=0.07848, over 19201.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3147, pruned_loss=0.08518, over 3395305.03 frames. ], batch size: 92, lr: 1.42e-02, grad_scale: 64.0 2023-06-15 07:56:37,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=77800.0, ans=0.125 2023-06-15 07:56:44,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=77800.0, ans=0.0 2023-06-15 07:56:49,267 INFO [train.py:988] (1/4) Epoch 22, batch 500, loss[loss=0.2745, simple_loss=0.3561, pruned_loss=0.09643, over 16398.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3141, pruned_loss=0.0852, over 3490255.78 frames. 
], batch size: 52, lr: 1.42e-02, grad_scale: 64.0 2023-06-15 07:56:53,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=77866.66666666667, ans=0.125 2023-06-15 07:57:01,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=77866.66666666667, ans=0.125 2023-06-15 07:57:20,319 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.08 vs. limit=15.0 2023-06-15 07:57:37,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=78066.66666666667, ans=0.2 2023-06-15 07:57:37,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=78066.66666666667, ans=0.125 2023-06-15 07:58:03,426 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.53 vs. limit=15.0 2023-06-15 07:58:03,839 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.901e+02 2.129e+02 2.481e+02 3.635e+02, threshold=4.258e+02, percent-clipped=0.0 2023-06-15 07:58:03,888 INFO [train.py:988] (1/4) Epoch 23, batch 0, loss[loss=0.2513, simple_loss=0.3138, pruned_loss=0.0944, over 20555.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3138, pruned_loss=0.0944, over 20555.00 frames. ], batch size: 189, lr: 1.38e-02, grad_scale: 64.0 2023-06-15 07:58:03,888 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 07:58:10,162 INFO [train.py:1020] (1/4) Epoch 23, validation: loss=0.2051, simple_loss=0.3092, pruned_loss=0.05051, over 143649.00 frames. 2023-06-15 07:58:10,163 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 07:58:13,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=78080.0, ans=0.0 2023-06-15 07:58:43,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=78146.66666666667, ans=0.125 2023-06-15 07:59:04,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=78280.0, ans=0.0 2023-06-15 07:59:11,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=78280.0, ans=0.125 2023-06-15 07:59:39,927 INFO [train.py:988] (1/4) Epoch 23, batch 50, loss[loss=0.2705, simple_loss=0.3472, pruned_loss=0.09692, over 18295.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3129, pruned_loss=0.08539, over 862180.22 frames. ], batch size: 74, lr: 1.38e-02, grad_scale: 64.0 2023-06-15 07:59:45,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=78413.33333333333, ans=0.0 2023-06-15 07:59:57,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=78480.0, ans=0.0 2023-06-15 08:00:03,412 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.41 vs. 
limit=15.0 2023-06-15 08:00:04,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=78480.0, ans=0.2 2023-06-15 08:00:07,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=78480.0, ans=0.1 2023-06-15 08:00:57,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=78680.0, ans=0.0 2023-06-15 08:01:08,195 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.00 vs. limit=6.0 2023-06-15 08:01:10,775 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 1.902e+02 2.065e+02 2.419e+02 3.199e+02, threshold=4.129e+02, percent-clipped=0.0 2023-06-15 08:01:10,822 INFO [train.py:988] (1/4) Epoch 23, batch 100, loss[loss=0.2494, simple_loss=0.3287, pruned_loss=0.08503, over 18493.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3129, pruned_loss=0.08461, over 1525546.52 frames. ], batch size: 77, lr: 1.38e-02, grad_scale: 64.0 2023-06-15 08:01:14,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=78746.66666666667, ans=0.025 2023-06-15 08:01:16,309 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 08:01:28,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=78813.33333333333, ans=0.125 2023-06-15 08:02:40,322 INFO [train.py:988] (1/4) Epoch 23, batch 150, loss[loss=0.2595, simple_loss=0.291, pruned_loss=0.114, over 17059.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.313, pruned_loss=0.08517, over 2014408.65 frames. ], batch size: 391, lr: 1.38e-02, grad_scale: 64.0 2023-06-15 08:02:54,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=79080.0, ans=0.125 2023-06-15 08:02:56,588 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.11 vs. limit=15.0 2023-06-15 08:03:01,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=79146.66666666667, ans=0.5 2023-06-15 08:03:29,626 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.92 vs. limit=15.0 2023-06-15 08:03:44,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=79280.0, ans=0.125 2023-06-15 08:03:46,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=79280.0, ans=0.125 2023-06-15 08:04:09,184 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.913e+02 2.150e+02 2.448e+02 4.177e+02, threshold=4.300e+02, percent-clipped=1.0 2023-06-15 08:04:09,234 INFO [train.py:988] (1/4) Epoch 23, batch 200, loss[loss=0.2426, simple_loss=0.3023, pruned_loss=0.0915, over 20691.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3124, pruned_loss=0.08423, over 2414044.42 frames. 
], batch size: 211, lr: 1.37e-02, grad_scale: 64.0 2023-06-15 08:04:55,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=79546.66666666667, ans=0.125 2023-06-15 08:04:58,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=79546.66666666667, ans=0.125 2023-06-15 08:05:04,760 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.65 vs. limit=5.0 2023-06-15 08:05:11,344 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-06-15 08:05:23,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=79680.0, ans=0.0 2023-06-15 08:05:29,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=79680.0, ans=0.125 2023-06-15 08:05:37,770 INFO [train.py:988] (1/4) Epoch 23, batch 250, loss[loss=0.244, simple_loss=0.3188, pruned_loss=0.08459, over 19883.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3131, pruned_loss=0.08448, over 2716418.02 frames. ], batch size: 120, lr: 1.37e-02, grad_scale: 64.0 2023-06-15 08:05:46,626 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=79746.66666666667, ans=0.125 2023-06-15 08:05:53,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=79813.33333333333, ans=0.125 2023-06-15 08:06:32,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=79946.66666666667, ans=10.0 2023-06-15 08:06:37,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=79946.66666666667, ans=0.125 2023-06-15 08:06:42,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=79946.66666666667, ans=0.125 2023-06-15 08:06:54,682 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=80013.33333333333, ans=0.125 2023-06-15 08:07:10,026 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.585e+02 1.868e+02 2.154e+02 2.660e+02 5.559e+02, threshold=4.308e+02, percent-clipped=2.0 2023-06-15 08:07:10,074 INFO [train.py:988] (1/4) Epoch 23, batch 300, loss[loss=0.2324, simple_loss=0.3084, pruned_loss=0.07815, over 18461.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3134, pruned_loss=0.08479, over 2957422.04 frames. 
], batch size: 77, lr: 1.37e-02, grad_scale: 64.0 2023-06-15 08:07:27,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=80146.66666666667, ans=0.125 2023-06-15 08:08:05,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=80280.0, ans=0.1 2023-06-15 08:08:20,087 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 08:08:27,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=80346.66666666667, ans=0.125 2023-06-15 08:08:35,994 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.70 vs. limit=22.5 2023-06-15 08:08:38,456 INFO [train.py:988] (1/4) Epoch 23, batch 350, loss[loss=0.2589, simple_loss=0.3273, pruned_loss=0.09521, over 18643.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3138, pruned_loss=0.0855, over 3141196.77 frames. ], batch size: 80, lr: 1.37e-02, grad_scale: 64.0 2023-06-15 08:08:40,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=80413.33333333333, ans=0.0 2023-06-15 08:08:54,322 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.79 vs. limit=15.0 2023-06-15 08:09:01,129 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.88 vs. limit=10.0 2023-06-15 08:09:39,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=80613.33333333333, ans=0.125 2023-06-15 08:09:40,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=80613.33333333333, ans=0.0 2023-06-15 08:09:45,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=80680.0, ans=0.125 2023-06-15 08:10:05,630 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.945e+02 2.149e+02 2.617e+02 4.906e+02, threshold=4.297e+02, percent-clipped=1.0 2023-06-15 08:10:05,678 INFO [train.py:988] (1/4) Epoch 23, batch 400, loss[loss=0.2539, simple_loss=0.3314, pruned_loss=0.0882, over 19463.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3138, pruned_loss=0.08482, over 3301994.42 frames. ], batch size: 105, lr: 1.37e-02, grad_scale: 64.0 2023-06-15 08:10:15,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=80746.66666666667, ans=0.1 2023-06-15 08:10:22,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=80813.33333333333, ans=0.0 2023-06-15 08:10:26,091 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.51 vs. 
limit=15.0 2023-06-15 08:10:41,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=80880.0, ans=0.0 2023-06-15 08:10:43,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=80880.0, ans=0.0 2023-06-15 08:11:34,905 INFO [train.py:988] (1/4) Epoch 23, batch 450, loss[loss=0.2253, simple_loss=0.3063, pruned_loss=0.0721, over 19308.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3142, pruned_loss=0.08499, over 3417209.68 frames. ], batch size: 98, lr: 1.36e-02, grad_scale: 64.0 2023-06-15 08:12:29,133 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.60 vs. limit=22.5 2023-06-15 08:12:35,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=81280.0, ans=0.125 2023-06-15 08:12:35,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=81280.0, ans=0.2 2023-06-15 08:12:42,081 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.00 vs. limit=15.0 2023-06-15 08:12:47,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=81346.66666666667, ans=0.1 2023-06-15 08:12:47,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=81346.66666666667, ans=0.0 2023-06-15 08:12:56,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=81346.66666666667, ans=0.07 2023-06-15 08:13:01,004 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.601e+02 2.014e+02 2.230e+02 2.721e+02 4.611e+02, threshold=4.461e+02, percent-clipped=1.0 2023-06-15 08:13:01,052 INFO [train.py:988] (1/4) Epoch 23, batch 500, loss[loss=0.2545, simple_loss=0.3203, pruned_loss=0.09433, over 20091.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3123, pruned_loss=0.08464, over 3514521.17 frames. ], batch size: 133, lr: 1.36e-02, grad_scale: 64.0 2023-06-15 08:13:03,941 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.87 vs. limit=12.0 2023-06-15 08:13:33,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=81546.66666666667, ans=0.125 2023-06-15 08:13:46,875 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.53 vs. limit=15.0 2023-06-15 08:14:20,665 INFO [train.py:988] (1/4) Epoch 24, batch 0, loss[loss=0.2398, simple_loss=0.3241, pruned_loss=0.07781, over 18279.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3241, pruned_loss=0.07781, over 18279.00 frames. ], batch size: 74, lr: 1.33e-02, grad_scale: 64.0 2023-06-15 08:14:20,665 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 08:14:27,205 INFO [train.py:1020] (1/4) Epoch 24, validation: loss=0.2057, simple_loss=0.3089, pruned_loss=0.05123, over 143649.00 frames. 
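The optim.py entries above ("Clipping_scale=2.0, grad-norm quartiles ... threshold=... percent-clipped=...") summarize the distribution of recent gradient norms. In every such entry in this section the logged threshold matches Clipping_scale times the median quartile, up to rounding (for the entry above, 2.0 * 2.230e+02 ≈ 4.461e+02). The sketch below reproduces that bookkeeping over a window of recent norms; it illustrates the logged relationship only and is not the optimizer's actual implementation — the helper and the synthetic norms are hypothetical.

```python
import numpy as np

def clipping_stats(recent_grad_norms, clipping_scale=2.0):
    """Quartiles of recent gradient norms, a clipping threshold set to
    clipping_scale * median, and the fraction of norms above it.
    Hypothetical helper that mirrors the logged fields, not optim.py itself."""
    norms = np.asarray(recent_grad_norms, dtype=np.float64)
    quartiles = np.quantile(norms, [0.0, 0.25, 0.5, 0.75, 1.0])
    threshold = clipping_scale * quartiles[2]            # scale * median
    percent_clipped = 100.0 * float(np.mean(norms > threshold))
    return quartiles, threshold, percent_clipped

# Synthetic example (made-up norms, not values from this run):
qs, thr, pct = clipping_stats([150.0, 190.0, 220.0, 260.0, 410.0])
print(qs, thr, pct)   # threshold = 2.0 * 220.0 = 440.0; nothing exceeds it
```

Note that percent-clipped in the log is evidently accumulated over its own interval: some entries report a nonzero value even though the logged maximum sits below the logged threshold, so the sketch only mirrors the quartile/threshold relationship, not the clipping counter.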
2023-06-15 08:14:27,206 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 08:14:37,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=81626.66666666667, ans=0.125 2023-06-15 08:14:38,004 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.39 vs. limit=15.0 2023-06-15 08:14:55,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=81693.33333333333, ans=0.0 2023-06-15 08:15:04,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=81760.0, ans=0.125 2023-06-15 08:15:36,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=81826.66666666667, ans=0.2 2023-06-15 08:15:40,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=81893.33333333333, ans=0.0 2023-06-15 08:15:57,120 INFO [train.py:988] (1/4) Epoch 24, batch 50, loss[loss=0.2258, simple_loss=0.2997, pruned_loss=0.076, over 19704.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3067, pruned_loss=0.08194, over 873270.30 frames. ], batch size: 110, lr: 1.33e-02, grad_scale: 64.0 2023-06-15 08:16:26,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=82026.66666666667, ans=0.0 2023-06-15 08:16:29,074 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.940e+02 2.249e+02 2.625e+02 3.999e+02, threshold=4.499e+02, percent-clipped=0.0 2023-06-15 08:16:35,172 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.71 vs. limit=22.5 2023-06-15 08:17:25,779 INFO [train.py:988] (1/4) Epoch 24, batch 100, loss[loss=0.2272, simple_loss=0.3084, pruned_loss=0.07298, over 19544.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.31, pruned_loss=0.08235, over 1528082.29 frames. ], batch size: 102, lr: 1.33e-02, grad_scale: 64.0 2023-06-15 08:17:39,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=82293.33333333333, ans=0.125 2023-06-15 08:18:32,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=82493.33333333333, ans=0.1 2023-06-15 08:18:54,704 INFO [train.py:988] (1/4) Epoch 24, batch 150, loss[loss=0.2242, simple_loss=0.2981, pruned_loss=0.07513, over 19329.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3109, pruned_loss=0.08226, over 2022419.63 frames. 
], batch size: 98, lr: 1.33e-02, grad_scale: 64.0 2023-06-15 08:19:27,046 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.862e+02 2.076e+02 2.332e+02 3.767e+02, threshold=4.152e+02, percent-clipped=0.0 2023-06-15 08:20:10,102 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 08:20:15,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=82893.33333333333, ans=0.1 2023-06-15 08:20:22,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=82960.0, ans=0.125 2023-06-15 08:20:24,136 INFO [train.py:988] (1/4) Epoch 24, batch 200, loss[loss=0.2366, simple_loss=0.314, pruned_loss=0.07958, over 18609.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3106, pruned_loss=0.08204, over 2418032.77 frames. ], batch size: 80, lr: 1.32e-02, grad_scale: 64.0 2023-06-15 08:20:29,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=82960.0, ans=0.0 2023-06-15 08:20:32,047 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=12.0 2023-06-15 08:21:53,209 INFO [train.py:988] (1/4) Epoch 24, batch 250, loss[loss=0.2351, simple_loss=0.3112, pruned_loss=0.07946, over 18936.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3104, pruned_loss=0.08188, over 2736925.85 frames. ], batch size: 86, lr: 1.32e-02, grad_scale: 64.0 2023-06-15 08:22:05,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=83293.33333333333, ans=0.125 2023-06-15 08:22:24,758 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.954e+02 2.132e+02 2.511e+02 4.253e+02, threshold=4.265e+02, percent-clipped=1.0 2023-06-15 08:22:25,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=83360.0, ans=0.125 2023-06-15 08:23:01,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=83560.0, ans=0.125 2023-06-15 08:23:21,364 INFO [train.py:988] (1/4) Epoch 24, batch 300, loss[loss=0.2277, simple_loss=0.3076, pruned_loss=0.07386, over 19078.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3111, pruned_loss=0.08227, over 2981384.38 frames. 
], batch size: 89, lr: 1.32e-02, grad_scale: 64.0 2023-06-15 08:23:57,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=83760.0, ans=0.125 2023-06-15 08:24:08,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=83760.0, ans=0.0 2023-06-15 08:24:09,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=83760.0, ans=0.125 2023-06-15 08:24:14,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=83826.66666666667, ans=0.125 2023-06-15 08:24:22,218 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 08:24:34,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=83893.33333333333, ans=0.125 2023-06-15 08:24:50,666 INFO [train.py:988] (1/4) Epoch 24, batch 350, loss[loss=0.2185, simple_loss=0.2892, pruned_loss=0.07392, over 20125.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3101, pruned_loss=0.08252, over 3150460.35 frames. ], batch size: 133, lr: 1.32e-02, grad_scale: 64.0 2023-06-15 08:24:53,005 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=83960.0, ans=0.04949747468305833 2023-06-15 08:25:12,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=84026.66666666667, ans=0.125 2023-06-15 08:25:21,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=84026.66666666667, ans=0.125 2023-06-15 08:25:22,579 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 1.988e+02 2.360e+02 2.767e+02 4.250e+02, threshold=4.720e+02, percent-clipped=0.0 2023-06-15 08:25:26,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=84093.33333333333, ans=0.0 2023-06-15 08:25:31,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=84093.33333333333, ans=0.2 2023-06-15 08:25:46,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=84160.0, ans=0.0 2023-06-15 08:25:57,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=84160.0, ans=0.125 2023-06-15 08:26:06,845 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.51 vs. limit=15.0 2023-06-15 08:26:08,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=84226.66666666667, ans=0.125 2023-06-15 08:26:21,079 INFO [train.py:988] (1/4) Epoch 24, batch 400, loss[loss=0.2282, simple_loss=0.302, pruned_loss=0.07721, over 18941.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3099, pruned_loss=0.08289, over 3285655.90 frames. 
], batch size: 86, lr: 1.32e-02, grad_scale: 64.0 2023-06-15 08:26:29,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=84293.33333333333, ans=0.2 2023-06-15 08:26:42,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=84360.0, ans=0.125 2023-06-15 08:27:17,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=84493.33333333333, ans=0.125 2023-06-15 08:27:17,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=84493.33333333333, ans=0.125 2023-06-15 08:27:49,977 INFO [train.py:988] (1/4) Epoch 24, batch 450, loss[loss=0.2513, simple_loss=0.3079, pruned_loss=0.09729, over 20026.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3104, pruned_loss=0.08313, over 3398051.23 frames. ], batch size: 293, lr: 1.31e-02, grad_scale: 64.0 2023-06-15 08:27:58,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=84626.66666666667, ans=0.1 2023-06-15 08:28:04,962 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=4.69 vs. limit=15.0 2023-06-15 08:28:11,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=84693.33333333333, ans=0.0 2023-06-15 08:28:21,512 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.787e+02 2.025e+02 2.287e+02 3.179e+02, threshold=4.050e+02, percent-clipped=0.0 2023-06-15 08:28:50,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=84826.66666666667, ans=0.125 2023-06-15 08:29:12,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=84893.33333333333, ans=0.0 2023-06-15 08:29:15,496 INFO [train.py:988] (1/4) Epoch 24, batch 500, loss[loss=0.2302, simple_loss=0.3022, pruned_loss=0.07907, over 19329.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3092, pruned_loss=0.08264, over 3502603.31 frames. ], batch size: 98, lr: 1.31e-02, grad_scale: 32.0 2023-06-15 08:29:17,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=84960.0, ans=0.0 2023-06-15 08:29:53,117 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.58 vs. limit=15.0 2023-06-15 08:30:31,642 INFO [train.py:988] (1/4) Epoch 25, batch 0, loss[loss=0.244, simple_loss=0.3161, pruned_loss=0.08593, over 19773.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3161, pruned_loss=0.08593, over 19773.00 frames. ], batch size: 115, lr: 1.29e-02, grad_scale: 32.0 2023-06-15 08:30:31,643 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 08:30:37,734 INFO [train.py:1020] (1/4) Epoch 25, validation: loss=0.205, simple_loss=0.3085, pruned_loss=0.05071, over 143649.00 frames. 2023-06-15 08:30:37,735 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 08:30:47,927 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.40 vs. 
limit=15.0 2023-06-15 08:31:39,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=85373.33333333333, ans=0.125 2023-06-15 08:31:44,659 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.894e+02 2.218e+02 2.492e+02 3.446e+02, threshold=4.437e+02, percent-clipped=0.0 2023-06-15 08:31:50,791 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.34 vs. limit=15.0 2023-06-15 08:32:07,412 INFO [train.py:988] (1/4) Epoch 25, batch 50, loss[loss=0.2246, simple_loss=0.3025, pruned_loss=0.07337, over 18479.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3063, pruned_loss=0.08273, over 850667.53 frames. ], batch size: 77, lr: 1.28e-02, grad_scale: 32.0 2023-06-15 08:32:20,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=85506.66666666667, ans=0.1 2023-06-15 08:32:40,291 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=15.0 2023-06-15 08:32:49,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=85640.0, ans=0.125 2023-06-15 08:33:34,519 INFO [train.py:988] (1/4) Epoch 25, batch 100, loss[loss=0.2268, simple_loss=0.3119, pruned_loss=0.07084, over 18454.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3077, pruned_loss=0.08066, over 1501853.45 frames. ], batch size: 77, lr: 1.28e-02, grad_scale: 32.0 2023-06-15 08:33:46,839 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.35 vs. limit=15.0 2023-06-15 08:33:49,030 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.93 vs. limit=15.0 2023-06-15 08:34:39,970 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.831e+02 2.059e+02 2.312e+02 3.649e+02, threshold=4.117e+02, percent-clipped=0.0 2023-06-15 08:34:46,437 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.70 vs. limit=15.0 2023-06-15 08:35:02,336 INFO [train.py:988] (1/4) Epoch 25, batch 150, loss[loss=0.2341, simple_loss=0.2918, pruned_loss=0.08824, over 19942.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3087, pruned_loss=0.08116, over 2002802.49 frames. ], batch size: 294, lr: 1.28e-02, grad_scale: 32.0 2023-06-15 08:35:34,279 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.60 vs. limit=22.5 2023-06-15 08:35:46,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=86306.66666666667, ans=0.1 2023-06-15 08:36:12,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=86440.0, ans=0.0 2023-06-15 08:36:30,647 INFO [train.py:988] (1/4) Epoch 25, batch 200, loss[loss=0.2155, simple_loss=0.2928, pruned_loss=0.06907, over 18567.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.308, pruned_loss=0.08183, over 2403209.60 frames. 
], batch size: 80, lr: 1.28e-02, grad_scale: 32.0 2023-06-15 08:37:08,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff2.min_abs, batch_count=86640.0, ans=0.1 2023-06-15 08:37:26,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=86706.66666666667, ans=0.0 2023-06-15 08:37:27,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=86706.66666666667, ans=0.125 2023-06-15 08:37:30,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=86706.66666666667, ans=15.0 2023-06-15 08:37:34,637 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.840e+02 2.036e+02 2.337e+02 3.806e+02, threshold=4.072e+02, percent-clipped=0.0 2023-06-15 08:37:44,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=86773.33333333333, ans=0.0 2023-06-15 08:37:51,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=86773.33333333333, ans=0.0 2023-06-15 08:37:58,090 INFO [train.py:988] (1/4) Epoch 25, batch 250, loss[loss=0.2548, simple_loss=0.3348, pruned_loss=0.08742, over 18610.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3084, pruned_loss=0.08154, over 2702849.70 frames. ], batch size: 80, lr: 1.28e-02, grad_scale: 32.0 2023-06-15 08:37:58,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=86840.0, ans=0.035 2023-06-15 08:38:03,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=86840.0, ans=0.125 2023-06-15 08:38:24,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=86906.66666666667, ans=0.0 2023-06-15 08:38:40,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=86973.33333333333, ans=0.125 2023-06-15 08:38:59,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=87040.0, ans=0.1 2023-06-15 08:39:06,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=87106.66666666667, ans=0.125 2023-06-15 08:39:14,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=87106.66666666667, ans=0.0 2023-06-15 08:39:14,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=87106.66666666667, ans=0.125 2023-06-15 08:39:25,037 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.46 vs. limit=22.5 2023-06-15 08:39:26,081 INFO [train.py:988] (1/4) Epoch 25, batch 300, loss[loss=0.2433, simple_loss=0.3241, pruned_loss=0.08122, over 18929.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3076, pruned_loss=0.08155, over 2947473.73 frames. 
], batch size: 86, lr: 1.27e-02, grad_scale: 32.0 2023-06-15 08:39:42,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=87240.0, ans=0.0 2023-06-15 08:39:56,945 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.43 vs. limit=15.0 2023-06-15 08:40:05,851 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.06 vs. limit=15.0 2023-06-15 08:40:22,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=87373.33333333333, ans=0.1 2023-06-15 08:40:25,454 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.10 vs. limit=6.0 2023-06-15 08:40:28,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=87373.33333333333, ans=0.95 2023-06-15 08:40:31,127 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.920e+02 2.106e+02 2.371e+02 3.707e+02, threshold=4.212e+02, percent-clipped=0.0 2023-06-15 08:40:31,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=87373.33333333333, ans=0.125 2023-06-15 08:40:54,080 INFO [train.py:988] (1/4) Epoch 25, batch 350, loss[loss=0.2673, simple_loss=0.3461, pruned_loss=0.09425, over 16387.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3088, pruned_loss=0.08104, over 3141163.48 frames. ], batch size: 52, lr: 1.27e-02, grad_scale: 32.0 2023-06-15 08:40:54,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=87506.66666666667, ans=0.125 2023-06-15 08:41:01,475 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.91 vs. limit=15.0 2023-06-15 08:41:04,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=87506.66666666667, ans=0.0 2023-06-15 08:42:04,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=87773.33333333333, ans=0.125 2023-06-15 08:42:21,266 INFO [train.py:988] (1/4) Epoch 25, batch 400, loss[loss=0.2326, simple_loss=0.3078, pruned_loss=0.07866, over 18913.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3085, pruned_loss=0.08099, over 3291961.47 frames. 
], batch size: 86, lr: 1.27e-02, grad_scale: 32.0 2023-06-15 08:42:59,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=87973.33333333333, ans=0.125 2023-06-15 08:43:08,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=87973.33333333333, ans=0.0 2023-06-15 08:43:13,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=88040.0, ans=0.0 2023-06-15 08:43:13,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=88040.0, ans=0.0 2023-06-15 08:43:15,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=88040.0, ans=0.125 2023-06-15 08:43:27,919 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.933e+02 2.148e+02 2.533e+02 3.587e+02, threshold=4.297e+02, percent-clipped=0.0 2023-06-15 08:43:29,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=88040.0, ans=0.1 2023-06-15 08:43:36,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=88106.66666666667, ans=0.125 2023-06-15 08:43:50,358 INFO [train.py:988] (1/4) Epoch 25, batch 450, loss[loss=0.251, simple_loss=0.3401, pruned_loss=0.08094, over 18324.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3082, pruned_loss=0.08117, over 3397631.07 frames. ], batch size: 72, lr: 1.27e-02, grad_scale: 32.0 2023-06-15 08:43:59,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=88173.33333333333, ans=0.125 2023-06-15 08:44:10,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=88240.0, ans=0.125 2023-06-15 08:44:11,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=88240.0, ans=0.125 2023-06-15 08:44:55,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=88373.33333333333, ans=0.0 2023-06-15 08:45:15,640 INFO [train.py:988] (1/4) Epoch 25, batch 500, loss[loss=0.223, simple_loss=0.3109, pruned_loss=0.06759, over 19046.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3083, pruned_loss=0.08087, over 3482907.03 frames. ], batch size: 89, lr: 1.27e-02, grad_scale: 32.0 2023-06-15 08:45:21,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=88506.66666666667, ans=0.0 2023-06-15 08:45:37,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=88573.33333333333, ans=0.1 2023-06-15 08:45:44,814 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.14 vs. limit=22.5 2023-06-15 08:45:52,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=88640.0, ans=0.1 2023-06-15 08:46:29,696 INFO [train.py:988] (1/4) Epoch 26, batch 0, loss[loss=0.2307, simple_loss=0.3134, pruned_loss=0.074, over 18272.00 frames. 
], tot_loss[loss=0.2307, simple_loss=0.3134, pruned_loss=0.074, over 18272.00 frames. ], batch size: 74, lr: 1.24e-02, grad_scale: 32.0 2023-06-15 08:46:29,697 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 08:46:35,672 INFO [train.py:1020] (1/4) Epoch 26, validation: loss=0.2057, simple_loss=0.3076, pruned_loss=0.05187, over 143649.00 frames. 2023-06-15 08:46:35,673 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 08:46:42,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=88720.0, ans=0.125 2023-06-15 08:46:43,667 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.988e+02 2.148e+02 2.357e+02 3.601e+02, threshold=4.296e+02, percent-clipped=0.0 2023-06-15 08:46:54,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=88786.66666666667, ans=0.05 2023-06-15 08:46:56,760 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.86 vs. limit=15.0 2023-06-15 08:46:57,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=88786.66666666667, ans=0.0 2023-06-15 08:47:03,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=88786.66666666667, ans=0.125 2023-06-15 08:47:08,128 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.49 vs. limit=10.0 2023-06-15 08:47:09,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=88853.33333333333, ans=0.1 2023-06-15 08:47:19,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=88853.33333333333, ans=0.125 2023-06-15 08:48:02,841 INFO [train.py:988] (1/4) Epoch 26, batch 50, loss[loss=0.2325, simple_loss=0.2954, pruned_loss=0.08483, over 20272.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.308, pruned_loss=0.0805, over 846362.56 frames. ], batch size: 239, lr: 1.24e-02, grad_scale: 32.0 2023-06-15 08:48:04,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=89053.33333333333, ans=0.125 2023-06-15 08:48:17,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=89053.33333333333, ans=0.125 2023-06-15 08:48:17,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=89053.33333333333, ans=0.2 2023-06-15 08:48:56,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=89253.33333333333, ans=0.125 2023-06-15 08:48:56,763 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.03 vs. 
limit=22.5 2023-06-15 08:49:01,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=89253.33333333333, ans=0.0 2023-06-15 08:49:28,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=89320.0, ans=0.0 2023-06-15 08:49:30,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=89320.0, ans=0.07 2023-06-15 08:49:33,000 INFO [train.py:988] (1/4) Epoch 26, batch 100, loss[loss=0.2393, simple_loss=0.3327, pruned_loss=0.07296, over 17620.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3073, pruned_loss=0.07784, over 1499521.65 frames. ], batch size: 67, lr: 1.24e-02, grad_scale: 32.0 2023-06-15 08:49:33,816 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.71 vs. limit=15.0 2023-06-15 08:49:41,553 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.958e+02 2.207e+02 2.501e+02 3.726e+02, threshold=4.413e+02, percent-clipped=0.0 2023-06-15 08:50:01,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=89453.33333333333, ans=0.125 2023-06-15 08:50:08,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=89520.0, ans=0.2 2023-06-15 08:50:35,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=89586.66666666667, ans=0.125 2023-06-15 08:51:01,452 INFO [train.py:988] (1/4) Epoch 26, batch 150, loss[loss=0.2324, simple_loss=0.3009, pruned_loss=0.08197, over 20128.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3086, pruned_loss=0.07972, over 2004627.68 frames. ], batch size: 133, lr: 1.24e-02, grad_scale: 32.0 2023-06-15 08:51:07,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=89720.0, ans=0.125 2023-06-15 08:51:19,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=89786.66666666667, ans=0.125 2023-06-15 08:51:41,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=89853.33333333333, ans=0.125 2023-06-15 08:51:51,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=89853.33333333333, ans=0.0 2023-06-15 08:52:05,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=89920.0, ans=0.1 2023-06-15 08:52:21,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=89986.66666666667, ans=0.125 2023-06-15 08:52:23,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=89986.66666666667, ans=0.125 2023-06-15 08:52:29,988 INFO [train.py:988] (1/4) Epoch 26, batch 200, loss[loss=0.2316, simple_loss=0.3042, pruned_loss=0.07951, over 20335.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3079, pruned_loss=0.08006, over 2416400.48 frames. 
], batch size: 149, lr: 1.23e-02, grad_scale: 32.0 2023-06-15 08:52:39,018 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.858e+02 1.962e+02 2.261e+02 3.889e+02, threshold=3.924e+02, percent-clipped=0.0 2023-06-15 08:53:23,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=90253.33333333333, ans=0.2 2023-06-15 08:53:58,445 INFO [train.py:988] (1/4) Epoch 26, batch 250, loss[loss=0.2403, simple_loss=0.3163, pruned_loss=0.08211, over 16250.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3087, pruned_loss=0.08025, over 2701997.79 frames. ], batch size: 52, lr: 1.23e-02, grad_scale: 32.0 2023-06-15 08:54:01,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=90386.66666666667, ans=0.1 2023-06-15 08:54:08,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=90386.66666666667, ans=0.2 2023-06-15 08:54:19,912 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.70 vs. limit=15.0 2023-06-15 08:54:43,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=90520.0, ans=0.125 2023-06-15 08:54:55,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=90586.66666666667, ans=0.125 2023-06-15 08:54:57,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=90586.66666666667, ans=0.125 2023-06-15 08:55:08,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=90653.33333333333, ans=0.2 2023-06-15 08:55:26,693 INFO [train.py:988] (1/4) Epoch 26, batch 300, loss[loss=0.2243, simple_loss=0.3023, pruned_loss=0.07317, over 18923.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3082, pruned_loss=0.0799, over 2944736.37 frames. ], batch size: 86, lr: 1.23e-02, grad_scale: 32.0 2023-06-15 08:55:36,592 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.907e+02 2.229e+02 2.657e+02 5.301e+02, threshold=4.457e+02, percent-clipped=1.0 2023-06-15 08:55:40,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=90720.0, ans=15.0 2023-06-15 08:56:55,564 INFO [train.py:988] (1/4) Epoch 26, batch 350, loss[loss=0.2404, simple_loss=0.3277, pruned_loss=0.07654, over 17581.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3072, pruned_loss=0.08027, over 3122026.94 frames. 
], batch size: 67, lr: 1.23e-02, grad_scale: 32.0 2023-06-15 08:56:58,516 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=91053.33333333333, ans=0.0 2023-06-15 08:57:32,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=91186.66666666667, ans=0.125 2023-06-15 08:57:39,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=91186.66666666667, ans=0.5 2023-06-15 08:58:19,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=91320.0, ans=0.125 2023-06-15 08:58:24,524 INFO [train.py:988] (1/4) Epoch 26, batch 400, loss[loss=0.2264, simple_loss=0.2985, pruned_loss=0.07709, over 19703.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3069, pruned_loss=0.07962, over 3264742.92 frames. ], batch size: 110, lr: 1.23e-02, grad_scale: 32.0 2023-06-15 08:58:33,460 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.572e+02 1.996e+02 2.310e+02 2.636e+02 4.226e+02, threshold=4.620e+02, percent-clipped=0.0 2023-06-15 08:58:59,878 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.94 vs. limit=15.0 2023-06-15 08:59:02,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=91520.0, ans=0.2 2023-06-15 08:59:06,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=91520.0, ans=0.0 2023-06-15 08:59:06,853 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.80 vs. limit=15.0 2023-06-15 08:59:20,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=91586.66666666667, ans=0.125 2023-06-15 08:59:23,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=91586.66666666667, ans=0.125 2023-06-15 08:59:32,246 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.51 vs. limit=15.0 2023-06-15 08:59:44,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=91653.33333333333, ans=0.125 2023-06-15 08:59:53,746 INFO [train.py:988] (1/4) Epoch 26, batch 450, loss[loss=0.2378, simple_loss=0.3078, pruned_loss=0.08389, over 20313.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3068, pruned_loss=0.07956, over 3391066.23 frames. ], batch size: 141, lr: 1.23e-02, grad_scale: 32.0 2023-06-15 09:00:21,138 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.95 vs. limit=10.0 2023-06-15 09:01:20,828 INFO [train.py:988] (1/4) Epoch 26, batch 500, loss[loss=0.2662, simple_loss=0.3262, pruned_loss=0.1031, over 20088.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3077, pruned_loss=0.07968, over 3465400.73 frames. 
], batch size: 133, lr: 1.22e-02, grad_scale: 32.0 2023-06-15 09:01:28,971 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 1.890e+02 2.061e+02 2.503e+02 4.030e+02, threshold=4.121e+02, percent-clipped=0.0 2023-06-15 09:01:42,153 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=15.0 2023-06-15 09:01:45,998 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.21 vs. limit=15.0 2023-06-15 09:01:46,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=92120.0, ans=0.0 2023-06-15 09:01:59,101 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.38 vs. limit=15.0 2023-06-15 09:02:43,441 INFO [train.py:988] (1/4) Epoch 27, batch 0, loss[loss=0.2395, simple_loss=0.3213, pruned_loss=0.07887, over 10733.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3213, pruned_loss=0.07887, over 10733.00 frames. ], batch size: 30, lr: 1.20e-02, grad_scale: 32.0 2023-06-15 09:02:43,441 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 09:02:52,298 INFO [train.py:1020] (1/4) Epoch 27, validation: loss=0.2009, simple_loss=0.305, pruned_loss=0.04841, over 143649.00 frames. 2023-06-15 09:02:52,299 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 09:02:58,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=92273.33333333333, ans=0.035 2023-06-15 09:03:40,372 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.11 vs. limit=10.0 2023-06-15 09:03:43,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=92406.66666666667, ans=0.0 2023-06-15 09:03:44,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=92473.33333333333, ans=0.125 2023-06-15 09:03:51,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=92473.33333333333, ans=0.0 2023-06-15 09:04:21,888 INFO [train.py:988] (1/4) Epoch 27, batch 50, loss[loss=0.2336, simple_loss=0.3069, pruned_loss=0.08019, over 19962.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3043, pruned_loss=0.08058, over 855173.58 frames. ], batch size: 126, lr: 1.20e-02, grad_scale: 32.0 2023-06-15 09:04:37,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=92673.33333333333, ans=0.0 2023-06-15 09:04:59,691 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.519e+02 1.842e+02 2.115e+02 2.294e+02 3.126e+02, threshold=4.230e+02, percent-clipped=0.0 2023-06-15 09:05:47,892 INFO [train.py:988] (1/4) Epoch 27, batch 100, loss[loss=0.2216, simple_loss=0.2998, pruned_loss=0.07166, over 19133.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3066, pruned_loss=0.08014, over 1502895.29 frames. 
], batch size: 94, lr: 1.20e-02, grad_scale: 32.0 2023-06-15 09:06:00,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=92940.0, ans=0.125 2023-06-15 09:06:01,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=92940.0, ans=0.125 2023-06-15 09:06:21,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=93073.33333333333, ans=0.05 2023-06-15 09:06:23,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=93073.33333333333, ans=0.0 2023-06-15 09:06:44,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=93140.0, ans=0.0 2023-06-15 09:07:14,882 INFO [train.py:988] (1/4) Epoch 27, batch 150, loss[loss=0.2206, simple_loss=0.2974, pruned_loss=0.07194, over 20105.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3061, pruned_loss=0.07809, over 1996438.43 frames. ], batch size: 133, lr: 1.19e-02, grad_scale: 32.0 2023-06-15 09:07:32,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=93340.0, ans=0.125 2023-06-15 09:07:53,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=93406.66666666667, ans=0.1 2023-06-15 09:07:54,426 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.908e+02 2.202e+02 2.518e+02 3.722e+02, threshold=4.404e+02, percent-clipped=0.0 2023-06-15 09:08:03,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=93406.66666666667, ans=0.125 2023-06-15 09:08:23,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=93473.33333333333, ans=0.0 2023-06-15 09:08:27,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=93540.0, ans=0.5 2023-06-15 09:08:34,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=93540.0, ans=0.0 2023-06-15 09:08:43,686 INFO [train.py:988] (1/4) Epoch 27, batch 200, loss[loss=0.2125, simple_loss=0.2939, pruned_loss=0.06553, over 19093.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3063, pruned_loss=0.0789, over 2401337.49 frames. ], batch size: 94, lr: 1.19e-02, grad_scale: 32.0 2023-06-15 09:08:49,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=93606.66666666667, ans=0.2 2023-06-15 09:09:14,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=93673.33333333333, ans=0.0 2023-06-15 09:10:11,421 INFO [train.py:988] (1/4) Epoch 27, batch 250, loss[loss=0.2288, simple_loss=0.3088, pruned_loss=0.0744, over 18950.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3055, pruned_loss=0.07834, over 2711456.11 frames. 
], batch size: 86, lr: 1.19e-02, grad_scale: 32.0 2023-06-15 09:10:50,261 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.782e+02 1.944e+02 2.302e+02 3.570e+02, threshold=3.888e+02, percent-clipped=0.0 2023-06-15 09:11:06,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=94140.0, ans=0.0 2023-06-15 09:11:08,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=94140.0, ans=0.0 2023-06-15 09:11:13,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=94140.0, ans=0.1 2023-06-15 09:11:13,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=94140.0, ans=0.0 2023-06-15 09:11:39,416 INFO [train.py:988] (1/4) Epoch 27, batch 300, loss[loss=0.2536, simple_loss=0.3329, pruned_loss=0.08717, over 15165.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3055, pruned_loss=0.07844, over 2936259.28 frames. ], batch size: 43, lr: 1.19e-02, grad_scale: 32.0 2023-06-15 09:11:55,924 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.86 vs. limit=15.0 2023-06-15 09:12:28,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=94406.66666666667, ans=0.1 2023-06-15 09:12:35,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=94473.33333333333, ans=0.125 2023-06-15 09:12:46,635 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.09 vs. limit=10.0 2023-06-15 09:12:54,471 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.62 vs. limit=15.0 2023-06-15 09:12:57,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=94540.0, ans=0.125 2023-06-15 09:13:06,376 INFO [train.py:988] (1/4) Epoch 27, batch 350, loss[loss=0.2156, simple_loss=0.2965, pruned_loss=0.0674, over 19487.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3047, pruned_loss=0.07795, over 3119023.79 frames. ], batch size: 105, lr: 1.19e-02, grad_scale: 16.0 2023-06-15 09:13:11,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=94606.66666666667, ans=0.04949747468305833 2023-06-15 09:13:19,424 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.79 vs. 
limit=22.5 2023-06-15 09:13:46,989 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.867e+02 2.054e+02 2.333e+02 3.496e+02, threshold=4.108e+02, percent-clipped=0.0 2023-06-15 09:13:49,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=94740.0, ans=0.0 2023-06-15 09:13:52,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=94740.0, ans=0.125 2023-06-15 09:14:00,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=94806.66666666667, ans=0.125 2023-06-15 09:14:20,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=94873.33333333333, ans=0.025 2023-06-15 09:14:32,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=94873.33333333333, ans=0.5 2023-06-15 09:14:35,060 INFO [train.py:988] (1/4) Epoch 27, batch 400, loss[loss=0.2139, simple_loss=0.2946, pruned_loss=0.06666, over 19713.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3045, pruned_loss=0.07815, over 3274996.67 frames. ], batch size: 110, lr: 1.19e-02, grad_scale: 32.0 2023-06-15 09:14:43,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=94940.0, ans=0.95 2023-06-15 09:14:55,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=95006.66666666667, ans=0.0 2023-06-15 09:15:13,308 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.66 vs. limit=15.0 2023-06-15 09:15:27,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=95140.0, ans=0.2 2023-06-15 09:15:29,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=95140.0, ans=0.125 2023-06-15 09:15:55,891 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.64 vs. limit=12.0 2023-06-15 09:16:02,644 INFO [train.py:988] (1/4) Epoch 27, batch 450, loss[loss=0.2232, simple_loss=0.3013, pruned_loss=0.0726, over 18936.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3047, pruned_loss=0.07795, over 3384368.70 frames. 
], batch size: 86, lr: 1.18e-02, grad_scale: 16.0 2023-06-15 09:16:17,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=95340.0, ans=0.125 2023-06-15 09:16:43,712 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 1.922e+02 2.176e+02 2.834e+02 5.039e+02, threshold=4.352e+02, percent-clipped=1.0 2023-06-15 09:16:44,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=95406.66666666667, ans=0.125 2023-06-15 09:17:14,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=95540.0, ans=0.2 2023-06-15 09:17:20,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=95540.0, ans=0.0 2023-06-15 09:17:27,060 INFO [train.py:988] (1/4) Epoch 27, batch 500, loss[loss=0.2291, simple_loss=0.307, pruned_loss=0.07561, over 19124.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3053, pruned_loss=0.07859, over 3485084.54 frames. ], batch size: 94, lr: 1.18e-02, grad_scale: 16.0 2023-06-15 09:17:34,459 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.38 vs. limit=15.0 2023-06-15 09:18:12,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=95740.0, ans=0.0 2023-06-15 09:18:18,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=95806.66666666667, ans=0.125 2023-06-15 09:18:47,758 INFO [train.py:988] (1/4) Epoch 28, batch 0, loss[loss=0.2193, simple_loss=0.2939, pruned_loss=0.07234, over 19563.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2939, pruned_loss=0.07234, over 19563.00 frames. ], batch size: 102, lr: 1.16e-02, grad_scale: 32.0 2023-06-15 09:18:47,759 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 09:18:53,824 INFO [train.py:1020] (1/4) Epoch 28, validation: loss=0.203, simple_loss=0.307, pruned_loss=0.0495, over 143649.00 frames. 2023-06-15 09:18:53,824 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 09:19:35,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=95960.0, ans=0.0 2023-06-15 09:19:35,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=95960.0, ans=0.1 2023-06-15 09:19:44,302 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.72 vs. limit=15.0 2023-06-15 09:20:05,435 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.841e+02 2.127e+02 2.560e+02 4.411e+02, threshold=4.254e+02, percent-clipped=1.0 2023-06-15 09:20:21,904 INFO [train.py:988] (1/4) Epoch 28, batch 50, loss[loss=0.2205, simple_loss=0.2889, pruned_loss=0.0761, over 20590.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3057, pruned_loss=0.07703, over 856227.74 frames. 
], batch size: 189, lr: 1.16e-02, grad_scale: 32.0 2023-06-15 09:20:34,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=96160.0, ans=0.0 2023-06-15 09:21:01,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=96293.33333333333, ans=0.0 2023-06-15 09:21:31,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=96426.66666666667, ans=0.125 2023-06-15 09:21:32,101 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.96 vs. limit=15.0 2023-06-15 09:21:49,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=96493.33333333333, ans=0.125 2023-06-15 09:21:51,028 INFO [train.py:988] (1/4) Epoch 28, batch 100, loss[loss=0.2419, simple_loss=0.3341, pruned_loss=0.07484, over 17595.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3031, pruned_loss=0.07662, over 1519149.92 frames. ], batch size: 67, lr: 1.16e-02, grad_scale: 32.0 2023-06-15 09:21:51,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=96493.33333333333, ans=0.125 2023-06-15 09:22:04,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=96493.33333333333, ans=0.0 2023-06-15 09:22:08,224 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 09:22:37,348 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-15 09:22:59,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=96760.0, ans=0.05 2023-06-15 09:23:02,549 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.826e+02 2.059e+02 2.321e+02 3.339e+02, threshold=4.117e+02, percent-clipped=0.0 2023-06-15 09:23:18,525 INFO [train.py:988] (1/4) Epoch 28, batch 150, loss[loss=0.235, simple_loss=0.3139, pruned_loss=0.07811, over 18253.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3033, pruned_loss=0.07577, over 2025008.67 frames. ], batch size: 74, lr: 1.16e-02, grad_scale: 32.0 2023-06-15 09:23:56,231 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=96960.0, ans=0.125 2023-06-15 09:24:18,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=97026.66666666667, ans=0.1 2023-06-15 09:24:46,176 INFO [train.py:988] (1/4) Epoch 28, batch 200, loss[loss=0.2295, simple_loss=0.3032, pruned_loss=0.0779, over 19107.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3048, pruned_loss=0.07613, over 2412463.68 frames. ], batch size: 94, lr: 1.15e-02, grad_scale: 32.0 2023-06-15 09:24:46,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=97160.0, ans=0.0 2023-06-15 09:24:50,426 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.56 vs. 
limit=22.5 2023-06-15 09:25:05,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=97226.66666666667, ans=0.125 2023-06-15 09:25:17,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=97226.66666666667, ans=0.0 2023-06-15 09:25:41,312 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=15.0 2023-06-15 09:25:50,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=97360.0, ans=0.2 2023-06-15 09:25:56,380 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.620e+02 1.821e+02 1.977e+02 2.337e+02 4.646e+02, threshold=3.954e+02, percent-clipped=1.0 2023-06-15 09:26:00,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=97426.66666666667, ans=0.1 2023-06-15 09:26:11,653 INFO [train.py:988] (1/4) Epoch 28, batch 250, loss[loss=0.2303, simple_loss=0.3069, pruned_loss=0.07684, over 18312.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3038, pruned_loss=0.07573, over 2712715.56 frames. ], batch size: 74, lr: 1.15e-02, grad_scale: 32.0 2023-06-15 09:26:27,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=97493.33333333333, ans=0.125 2023-06-15 09:26:50,974 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.96 vs. limit=22.5 2023-06-15 09:27:10,838 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.01 vs. limit=15.0 2023-06-15 09:27:25,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=97760.0, ans=0.2 2023-06-15 09:27:39,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=97826.66666666667, ans=0.1 2023-06-15 09:27:41,178 INFO [train.py:988] (1/4) Epoch 28, batch 300, loss[loss=0.2324, simple_loss=0.3152, pruned_loss=0.07481, over 18304.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3035, pruned_loss=0.07613, over 2946667.53 frames. ], batch size: 74, lr: 1.15e-02, grad_scale: 32.0 2023-06-15 09:28:11,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=97893.33333333333, ans=0.125 2023-06-15 09:28:21,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=97960.0, ans=0.1 2023-06-15 09:28:25,426 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-15 09:28:44,928 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.86 vs. 
limit=15.0 2023-06-15 09:28:52,622 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.613e+02 1.947e+02 2.273e+02 2.757e+02 4.812e+02, threshold=4.546e+02, percent-clipped=3.0 2023-06-15 09:29:07,334 INFO [train.py:988] (1/4) Epoch 28, batch 350, loss[loss=0.2175, simple_loss=0.2984, pruned_loss=0.06832, over 19456.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.303, pruned_loss=0.07591, over 3137101.01 frames. ], batch size: 105, lr: 1.15e-02, grad_scale: 32.0 2023-06-15 09:29:26,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=98226.66666666667, ans=0.125 2023-06-15 09:29:32,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=98226.66666666667, ans=0.125 2023-06-15 09:29:39,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=98226.66666666667, ans=0.07 2023-06-15 09:29:50,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=98293.33333333333, ans=15.0 2023-06-15 09:29:57,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=98293.33333333333, ans=0.0 2023-06-15 09:30:33,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=98493.33333333333, ans=0.125 2023-06-15 09:30:34,472 INFO [train.py:988] (1/4) Epoch 28, batch 400, loss[loss=0.2285, simple_loss=0.3194, pruned_loss=0.06885, over 17582.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3037, pruned_loss=0.07616, over 3266558.02 frames. ], batch size: 67, lr: 1.15e-02, grad_scale: 32.0 2023-06-15 09:30:40,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=98493.33333333333, ans=0.125 2023-06-15 09:30:52,381 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.75 vs. limit=6.0 2023-06-15 09:30:53,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=98560.0, ans=0.125 2023-06-15 09:31:05,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=98560.0, ans=0.0 2023-06-15 09:31:14,103 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.48 vs. 
limit=15.0 2023-06-15 09:31:23,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=98626.66666666667, ans=0.0 2023-06-15 09:31:28,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=98693.33333333333, ans=0.125 2023-06-15 09:31:44,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=98760.0, ans=0.2 2023-06-15 09:31:48,682 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.911e+02 2.084e+02 2.355e+02 3.960e+02, threshold=4.169e+02, percent-clipped=0.0 2023-06-15 09:32:01,904 INFO [train.py:988] (1/4) Epoch 28, batch 450, loss[loss=0.2127, simple_loss=0.292, pruned_loss=0.06669, over 18787.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3035, pruned_loss=0.07636, over 3396054.94 frames. ], batch size: 83, lr: 1.15e-02, grad_scale: 32.0 2023-06-15 09:32:20,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=98893.33333333333, ans=0.2 2023-06-15 09:32:29,957 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.55 vs. limit=22.5 2023-06-15 09:32:46,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=98960.0, ans=0.2 2023-06-15 09:32:46,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=98960.0, ans=0.1 2023-06-15 09:32:48,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=98960.0, ans=0.2 2023-06-15 09:32:49,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=98960.0, ans=0.125 2023-06-15 09:33:04,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=99026.66666666667, ans=0.2 2023-06-15 09:33:11,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=99093.33333333333, ans=0.125 2023-06-15 09:33:20,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=99093.33333333333, ans=0.2 2023-06-15 09:33:28,911 INFO [train.py:988] (1/4) Epoch 28, batch 500, loss[loss=0.2426, simple_loss=0.328, pruned_loss=0.07859, over 15543.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3036, pruned_loss=0.07587, over 3482307.69 frames. ], batch size: 44, lr: 1.15e-02, grad_scale: 32.0 2023-06-15 09:33:39,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=99160.0, ans=0.2 2023-06-15 09:34:13,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=99293.33333333333, ans=0.0 2023-06-15 09:34:16,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=99360.0, ans=0.0 2023-06-15 09:34:46,565 INFO [train.py:988] (1/4) Epoch 29, batch 0, loss[loss=0.2343, simple_loss=0.3133, pruned_loss=0.07768, over 18441.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3133, pruned_loss=0.07768, over 18441.00 frames. 
], batch size: 77, lr: 1.12e-02, grad_scale: 32.0 2023-06-15 09:34:46,566 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 09:34:52,720 INFO [train.py:1020] (1/4) Epoch 29, validation: loss=0.2012, simple_loss=0.3049, pruned_loss=0.04872, over 143649.00 frames. 2023-06-15 09:34:52,722 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 09:34:54,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=99380.0, ans=0.125 2023-06-15 09:35:07,433 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.808e+02 2.025e+02 2.226e+02 3.535e+02, threshold=4.050e+02, percent-clipped=0.0 2023-06-15 09:35:11,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=99446.66666666667, ans=0.2 2023-06-15 09:35:43,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=99580.0, ans=0.0 2023-06-15 09:36:20,603 INFO [train.py:988] (1/4) Epoch 29, batch 50, loss[loss=0.2399, simple_loss=0.3148, pruned_loss=0.08255, over 20298.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3021, pruned_loss=0.07446, over 862938.56 frames. ], batch size: 149, lr: 1.12e-02, grad_scale: 32.0 2023-06-15 09:36:52,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=99780.0, ans=0.5 2023-06-15 09:37:08,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=99846.66666666667, ans=0.1 2023-06-15 09:37:37,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=99980.0, ans=0.0 2023-06-15 09:37:37,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=99980.0, ans=0.0 2023-06-15 09:37:48,025 INFO [train.py:988] (1/4) Epoch 29, batch 100, loss[loss=0.2502, simple_loss=0.3175, pruned_loss=0.09145, over 19995.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3013, pruned_loss=0.0748, over 1539462.96 frames. ], batch size: 126, lr: 1.12e-02, grad_scale: 32.0 2023-06-15 09:38:03,420 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.850e+02 2.039e+02 2.478e+02 3.886e+02, threshold=4.079e+02, percent-clipped=0.0 2023-06-15 09:38:38,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=100246.66666666667, ans=0.125 2023-06-15 09:38:52,662 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.28 vs. limit=15.0 2023-06-15 09:39:15,406 INFO [train.py:988] (1/4) Epoch 29, batch 150, loss[loss=0.2204, simple_loss=0.3008, pruned_loss=0.06996, over 18468.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3018, pruned_loss=0.07612, over 2025754.57 frames. ], batch size: 77, lr: 1.12e-02, grad_scale: 32.0 2023-06-15 09:39:20,870 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 09:39:21,214 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.86 vs. 
limit=15.0 2023-06-15 09:39:22,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=100380.0, ans=0.125 2023-06-15 09:39:50,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=100513.33333333333, ans=0.125 2023-06-15 09:39:52,965 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.75 vs. limit=6.0 2023-06-15 09:40:23,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=100646.66666666667, ans=0.0 2023-06-15 09:40:29,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=100646.66666666667, ans=0.125 2023-06-15 09:40:29,595 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.26 vs. limit=15.0 2023-06-15 09:40:40,944 INFO [train.py:988] (1/4) Epoch 29, batch 200, loss[loss=0.2664, simple_loss=0.3473, pruned_loss=0.09271, over 16341.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3029, pruned_loss=0.07612, over 2422787.54 frames. ], batch size: 52, lr: 1.12e-02, grad_scale: 32.0 2023-06-15 09:40:57,020 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.792e+02 2.001e+02 2.365e+02 3.519e+02, threshold=4.002e+02, percent-clipped=0.0 2023-06-15 09:40:59,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=100780.0, ans=0.125 2023-06-15 09:41:02,985 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.66 vs. limit=15.0 2023-06-15 09:41:21,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=100846.66666666667, ans=0.0 2023-06-15 09:41:41,791 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.19 vs. limit=22.5 2023-06-15 09:41:49,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=100980.0, ans=0.125 2023-06-15 09:42:08,460 INFO [train.py:988] (1/4) Epoch 29, batch 250, loss[loss=0.2096, simple_loss=0.283, pruned_loss=0.06813, over 20502.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3023, pruned_loss=0.07551, over 2735803.05 frames. ], batch size: 160, lr: 1.12e-02, grad_scale: 32.0 2023-06-15 09:42:18,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=101046.66666666667, ans=0.2 2023-06-15 09:42:43,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=101180.0, ans=0.0 2023-06-15 09:42:55,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=101180.0, ans=0.125 2023-06-15 09:42:59,515 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.94 vs. 
limit=10.0 2023-06-15 09:43:18,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=101313.33333333333, ans=0.125 2023-06-15 09:43:35,038 INFO [train.py:988] (1/4) Epoch 29, batch 300, loss[loss=0.217, simple_loss=0.3057, pruned_loss=0.06422, over 17015.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3013, pruned_loss=0.07521, over 2968364.71 frames. ], batch size: 60, lr: 1.12e-02, grad_scale: 32.0 2023-06-15 09:43:37,241 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.26 vs. limit=22.5 2023-06-15 09:43:46,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=101380.0, ans=0.125 2023-06-15 09:43:50,904 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.806e+02 2.034e+02 2.286e+02 3.270e+02, threshold=4.068e+02, percent-clipped=0.0 2023-06-15 09:44:06,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=101446.66666666667, ans=0.07 2023-06-15 09:44:06,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=101446.66666666667, ans=0.025 2023-06-15 09:44:08,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=101513.33333333333, ans=0.1 2023-06-15 09:44:11,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=101513.33333333333, ans=0.125 2023-06-15 09:45:02,599 INFO [train.py:988] (1/4) Epoch 29, batch 350, loss[loss=0.2234, simple_loss=0.3042, pruned_loss=0.07129, over 20091.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3012, pruned_loss=0.07549, over 3141335.61 frames. ], batch size: 133, lr: 1.11e-02, grad_scale: 16.0 2023-06-15 09:45:12,393 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.62 vs. limit=10.0 2023-06-15 09:45:16,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=101713.33333333333, ans=0.0 2023-06-15 09:45:27,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=101780.0, ans=0.2 2023-06-15 09:45:28,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=101780.0, ans=0.125 2023-06-15 09:45:44,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=101846.66666666667, ans=0.05 2023-06-15 09:45:47,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=101846.66666666667, ans=0.1 2023-06-15 09:45:48,263 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.26 vs. 
limit=15.0 2023-06-15 09:46:18,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=101980.0, ans=0.125 2023-06-15 09:46:21,547 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.24 vs. limit=22.5 2023-06-15 09:46:29,148 INFO [train.py:988] (1/4) Epoch 29, batch 400, loss[loss=0.2244, simple_loss=0.2885, pruned_loss=0.08012, over 20036.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3008, pruned_loss=0.07508, over 3288558.70 frames. ], batch size: 293, lr: 1.11e-02, grad_scale: 32.0 2023-06-15 09:46:32,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=102046.66666666667, ans=0.0 2023-06-15 09:46:46,570 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 1.969e+02 2.274e+02 2.602e+02 3.577e+02, threshold=4.548e+02, percent-clipped=0.0 2023-06-15 09:46:53,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=102113.33333333333, ans=0.125 2023-06-15 09:46:57,065 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 09:47:00,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=102113.33333333333, ans=0.125 2023-06-15 09:47:36,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=102313.33333333333, ans=0.0 2023-06-15 09:47:36,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=102313.33333333333, ans=0.1 2023-06-15 09:47:50,478 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-15 09:47:55,552 INFO [train.py:988] (1/4) Epoch 29, batch 450, loss[loss=0.2162, simple_loss=0.2946, pruned_loss=0.06889, over 19852.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3013, pruned_loss=0.07551, over 3389901.54 frames. ], batch size: 115, lr: 1.11e-02, grad_scale: 32.0 2023-06-15 09:48:14,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=102446.66666666667, ans=0.1 2023-06-15 09:48:17,214 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.95 vs. limit=10.0 2023-06-15 09:48:59,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=102580.0, ans=0.0 2023-06-15 09:49:16,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=102646.66666666667, ans=0.125 2023-06-15 09:49:20,221 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.74 vs. limit=15.0 2023-06-15 09:49:20,994 INFO [train.py:988] (1/4) Epoch 29, batch 500, loss[loss=0.223, simple_loss=0.2933, pruned_loss=0.07634, over 20535.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.301, pruned_loss=0.0749, over 3484133.25 frames. 
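The tot_loss[...] aggregates above carry fractional frame counts that grow by shrinking increments over epoch 29 (862938.56, 1539462.96, ..., 3484133.25), which is the shape a decayed running total produces rather than a plain cumulative sum. The sketch below shows one such accumulator; the decay value and structure are assumptions for illustration, not the exact tracker in train.py.

class RunningLoss:
    # Decayed running totals of loss and frame counts; loss_per_frame is the
    # kind of value a tot_loss[...] summary would report.
    def __init__(self, decay: float = 0.999):
        self.decay = decay
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss_sum: float, batch_frames: float) -> None:
        self.loss_sum = self.decay * self.loss_sum + batch_loss_sum
        self.frames = self.decay * self.frames + batch_frames

    @property
    def loss_per_frame(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)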
], batch size: 173, lr: 1.11e-02, grad_scale: 32.0 2023-06-15 09:49:37,631 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.900e+02 2.167e+02 2.548e+02 3.918e+02, threshold=4.335e+02, percent-clipped=0.0 2023-06-15 09:49:47,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=102780.0, ans=0.125 2023-06-15 09:49:47,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=102780.0, ans=0.0 2023-06-15 09:49:52,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=102846.66666666667, ans=0.125 2023-06-15 09:49:55,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=102846.66666666667, ans=0.125 2023-06-15 09:49:57,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=102846.66666666667, ans=0.125 2023-06-15 09:50:09,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=102913.33333333333, ans=0.2 2023-06-15 09:50:35,330 INFO [train.py:988] (1/4) Epoch 30, batch 0, loss[loss=0.2458, simple_loss=0.3018, pruned_loss=0.0949, over 20284.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3018, pruned_loss=0.0949, over 20284.00 frames. ], batch size: 239, lr: 1.09e-02, grad_scale: 32.0 2023-06-15 09:50:35,331 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 09:50:41,560 INFO [train.py:1020] (1/4) Epoch 30, validation: loss=0.2006, simple_loss=0.3036, pruned_loss=0.04881, over 143649.00 frames. 2023-06-15 09:50:41,561 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 09:50:52,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=102926.66666666667, ans=0.1 2023-06-15 09:51:04,498 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.74 vs. limit=6.0 2023-06-15 09:52:02,299 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.53 vs. limit=22.5 2023-06-15 09:52:04,319 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-06-15 09:52:08,091 INFO [train.py:988] (1/4) Epoch 30, batch 50, loss[loss=0.2371, simple_loss=0.3063, pruned_loss=0.08392, over 20016.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2986, pruned_loss=0.07644, over 864621.66 frames. 
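The "Maximum memory allocated so far is 13795MB" entries track the peak CUDA allocation seen by this rank. PyTorch exposes that counter directly, so a line of the following kind reproduces the message (the exact call site in train.py is not shown in this log):

import torch

peak_mb = torch.cuda.max_memory_allocated() // (1024 * 1024)
print(f"Maximum memory allocated so far is {peak_mb}MB")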
], batch size: 126, lr: 1.09e-02, grad_scale: 32.0 2023-06-15 09:52:27,423 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-15 09:52:55,893 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.895e+02 2.155e+02 2.482e+02 4.117e+02, threshold=4.310e+02, percent-clipped=0.0 2023-06-15 09:53:07,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=103460.0, ans=0.0 2023-06-15 09:53:18,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=103526.66666666667, ans=0.125 2023-06-15 09:53:34,737 INFO [train.py:988] (1/4) Epoch 30, batch 100, loss[loss=0.2201, simple_loss=0.2911, pruned_loss=0.0746, over 20614.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3001, pruned_loss=0.07569, over 1530340.43 frames. ], batch size: 189, lr: 1.09e-02, grad_scale: 32.0 2023-06-15 09:53:43,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=103593.33333333333, ans=0.2 2023-06-15 09:53:56,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=103660.0, ans=0.125 2023-06-15 09:53:59,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=103660.0, ans=0.0 2023-06-15 09:54:20,387 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 09:54:44,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=103860.0, ans=0.05 2023-06-15 09:54:49,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=103860.0, ans=0.0 2023-06-15 09:55:01,983 INFO [train.py:988] (1/4) Epoch 30, batch 150, loss[loss=0.2134, simple_loss=0.2961, pruned_loss=0.06536, over 18934.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3002, pruned_loss=0.07539, over 2016469.07 frames. ], batch size: 86, lr: 1.09e-02, grad_scale: 32.0 2023-06-15 09:55:40,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=104060.0, ans=0.125 2023-06-15 09:55:44,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=104060.0, ans=0.125 2023-06-15 09:55:51,200 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.525e+02 1.792e+02 2.087e+02 2.377e+02 3.180e+02, threshold=4.174e+02, percent-clipped=0.0 2023-06-15 09:55:52,256 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.77 vs. 
limit=10.0 2023-06-15 09:56:02,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=104126.66666666667, ans=6.0 2023-06-15 09:56:06,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=104126.66666666667, ans=0.125 2023-06-15 09:56:11,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=104193.33333333333, ans=0.1 2023-06-15 09:56:19,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=104193.33333333333, ans=0.0 2023-06-15 09:56:24,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=104193.33333333333, ans=0.2 2023-06-15 09:56:29,873 INFO [train.py:988] (1/4) Epoch 30, batch 200, loss[loss=0.2274, simple_loss=0.3085, pruned_loss=0.07314, over 18632.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.299, pruned_loss=0.07374, over 2424702.43 frames. ], batch size: 80, lr: 1.08e-02, grad_scale: 32.0 2023-06-15 09:56:52,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=104326.66666666667, ans=0.125 2023-06-15 09:57:52,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=104526.66666666667, ans=0.2 2023-06-15 09:57:54,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=104593.33333333333, ans=0.125 2023-06-15 09:57:56,081 INFO [train.py:988] (1/4) Epoch 30, batch 250, loss[loss=0.2051, simple_loss=0.2888, pruned_loss=0.06073, over 18924.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2994, pruned_loss=0.07378, over 2736175.93 frames. ], batch size: 86, lr: 1.08e-02, grad_scale: 32.0 2023-06-15 09:57:59,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=104593.33333333333, ans=0.125 2023-06-15 09:58:31,843 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.90 vs. limit=15.0 2023-06-15 09:58:45,943 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.806e+02 1.918e+02 2.139e+02 3.492e+02, threshold=3.836e+02, percent-clipped=0.0 2023-06-15 09:58:46,928 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.10 vs. limit=15.0 2023-06-15 09:59:02,264 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.28 vs. limit=15.0 2023-06-15 09:59:10,352 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.78 vs. limit=15.0 2023-06-15 09:59:23,427 INFO [train.py:988] (1/4) Epoch 30, batch 300, loss[loss=0.2342, simple_loss=0.3217, pruned_loss=0.07328, over 17590.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2998, pruned_loss=0.07407, over 2976974.25 frames. 
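In the optim.py entries, the reported clipping threshold is tied to the logged quartiles: it equals Clipping_scale (2.0) times the median gradient norm, e.g. 2.0 * 1.918e+02 = 3.836e+02 and 2.0 * 1.941e+02 = 3.882e+02 in the two entries above. A minimal sketch of median-based clipping in that spirit follows; the buffer size and method names are illustrative, not the actual optimizer code.

from collections import deque
import torch

class MedianGradClipper:
    # Clip gradients to clipping_scale * median of recently observed gradient
    # norms, mirroring the "Clipping_scale=2.0 ... threshold=..." entries.
    def __init__(self, clipping_scale: float = 2.0, history: int = 128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=history)   # recent total gradient norms

    def clip_(self, parameters) -> bool:
        params = [p for p in parameters if p.grad is not None]
        total_norm = torch.norm(torch.stack([p.grad.norm() for p in params])).item()
        self.norms.append(total_norm)
        median = sorted(self.norms)[len(self.norms) // 2]
        threshold = self.clipping_scale * median
        if total_norm > threshold:           # such steps feed "percent-clipped"
            for p in params:
                p.grad.mul_(threshold / total_norm)
            return True
        return False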
], batch size: 67, lr: 1.08e-02, grad_scale: 32.0 2023-06-15 09:59:42,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=104993.33333333333, ans=0.125 2023-06-15 10:00:10,294 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.22 vs. limit=15.0 2023-06-15 10:00:11,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=105060.0, ans=0.0 2023-06-15 10:00:32,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=105193.33333333333, ans=0.2 2023-06-15 10:00:43,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=105193.33333333333, ans=0.2 2023-06-15 10:00:46,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=105193.33333333333, ans=0.1 2023-06-15 10:00:49,823 INFO [train.py:988] (1/4) Epoch 30, batch 350, loss[loss=0.2286, simple_loss=0.2989, pruned_loss=0.0792, over 20269.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3007, pruned_loss=0.07436, over 3157744.26 frames. ], batch size: 141, lr: 1.08e-02, grad_scale: 32.0 2023-06-15 10:01:38,548 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.770e+02 1.941e+02 2.233e+02 2.810e+02, threshold=3.882e+02, percent-clipped=0.0 2023-06-15 10:02:00,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=105526.66666666667, ans=0.0 2023-06-15 10:02:15,108 INFO [train.py:988] (1/4) Epoch 30, batch 400, loss[loss=0.2378, simple_loss=0.2967, pruned_loss=0.0894, over 19922.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.301, pruned_loss=0.0746, over 3307414.19 frames. ], batch size: 294, lr: 1.08e-02, grad_scale: 32.0 2023-06-15 10:02:15,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=105593.33333333333, ans=0.0 2023-06-15 10:02:57,389 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.59 vs. limit=15.0 2023-06-15 10:03:40,281 INFO [train.py:988] (1/4) Epoch 30, batch 450, loss[loss=0.1984, simple_loss=0.2807, pruned_loss=0.05808, over 18943.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2998, pruned_loss=0.07385, over 3419794.81 frames. ], batch size: 86, lr: 1.08e-02, grad_scale: 32.0 2023-06-15 10:04:07,943 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.75 vs. limit=15.0 2023-06-15 10:04:12,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=105993.33333333333, ans=0.0 2023-06-15 10:04:28,187 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.828e+02 2.027e+02 2.323e+02 3.183e+02, threshold=4.054e+02, percent-clipped=0.0 2023-06-15 10:04:43,200 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=17.15 vs. limit=15.0 2023-06-15 10:05:04,414 INFO [train.py:988] (1/4) Epoch 30, batch 500, loss[loss=0.2625, simple_loss=0.3527, pruned_loss=0.08618, over 18316.00 frames. 
], tot_loss[loss=0.2242, simple_loss=0.3, pruned_loss=0.07416, over 3500368.66 frames. ], batch size: 72, lr: 1.08e-02, grad_scale: 32.0 2023-06-15 10:05:13,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=106260.0, ans=0.125 2023-06-15 10:05:17,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=106260.0, ans=0.0 2023-06-15 10:05:42,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=106393.33333333333, ans=0.125 2023-06-15 10:05:48,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=106393.33333333333, ans=0.07 2023-06-15 10:05:50,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=106393.33333333333, ans=0.09899494936611666 2023-06-15 10:06:22,317 INFO [train.py:988] (1/4) Epoch 31, batch 0, loss[loss=0.23, simple_loss=0.3208, pruned_loss=0.06957, over 18302.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3208, pruned_loss=0.06957, over 18302.00 frames. ], batch size: 72, lr: 1.06e-02, grad_scale: 32.0 2023-06-15 10:06:22,317 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 10:06:28,538 INFO [train.py:1020] (1/4) Epoch 31, validation: loss=0.2014, simple_loss=0.3032, pruned_loss=0.0498, over 143649.00 frames. 2023-06-15 10:06:28,539 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 10:06:35,407 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=106480.0, ans=0.125 2023-06-15 10:06:41,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=106480.0, ans=0.1 2023-06-15 10:07:24,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=106680.0, ans=0.125 2023-06-15 10:07:27,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_na.min_abs, batch_count=106680.0, ans=0.02 2023-06-15 10:07:33,151 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.78 vs. limit=15.0 2023-06-15 10:07:39,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=106746.66666666667, ans=0.125 2023-06-15 10:07:48,110 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.719e+02 1.932e+02 2.142e+02 3.238e+02, threshold=3.865e+02, percent-clipped=0.0 2023-06-15 10:07:56,266 INFO [train.py:988] (1/4) Epoch 31, batch 50, loss[loss=0.2065, simple_loss=0.2848, pruned_loss=0.06406, over 19527.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3003, pruned_loss=0.07259, over 839401.64 frames. 
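The many ScheduledFloat entries (conv_skip_rate, dropout_p, balancer probs, bypass scale_min, and so on) report hyperparameters whose value is looked up from the current batch_count, e.g. ans=0.125 or ans=0.2 at batch_count near 106000 above. A piecewise-linear schedule over batch count, sketched below, reproduces that behaviour; the breakpoints here are made up for illustration and are not the schedules defined in scaling.py.

class PiecewiseLinearSchedule:
    # A float that depends on batch_count: linear interpolation between
    # (batch_count, value) breakpoints, held constant outside the range.
    def __init__(self, *points):
        self.points = sorted(points)

    def __call__(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)

# Hypothetical dropout schedule: anneal from 0.3 to 0.1, then hold.
dropout_p = PiecewiseLinearSchedule((0.0, 0.3), (20000.0, 0.1))
print(dropout_p(106260.0))   # 0.1, the kind of value the ans= field reports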
], batch size: 102, lr: 1.06e-02, grad_scale: 32.0 2023-06-15 10:07:59,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=106813.33333333333, ans=0.125 2023-06-15 10:08:41,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=106946.66666666667, ans=0.125 2023-06-15 10:08:46,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=107013.33333333333, ans=0.125 2023-06-15 10:08:55,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=107013.33333333333, ans=0.0 2023-06-15 10:09:22,803 INFO [train.py:988] (1/4) Epoch 31, batch 100, loss[loss=0.2265, simple_loss=0.3005, pruned_loss=0.07622, over 20284.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3004, pruned_loss=0.07249, over 1482361.02 frames. ], batch size: 141, lr: 1.05e-02, grad_scale: 32.0 2023-06-15 10:09:32,606 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.76 vs. limit=22.5 2023-06-15 10:09:51,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=107213.33333333333, ans=0.0 2023-06-15 10:10:33,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=107413.33333333333, ans=0.125 2023-06-15 10:10:40,188 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.809e+02 2.076e+02 2.464e+02 3.768e+02, threshold=4.152e+02, percent-clipped=0.0 2023-06-15 10:10:41,931 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.89 vs. limit=15.0 2023-06-15 10:10:49,499 INFO [train.py:988] (1/4) Epoch 31, batch 150, loss[loss=0.2184, simple_loss=0.2985, pruned_loss=0.06915, over 19139.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2987, pruned_loss=0.07197, over 2011945.69 frames. ], batch size: 94, lr: 1.05e-02, grad_scale: 32.0 2023-06-15 10:11:11,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=107546.66666666667, ans=0.09899494936611666 2023-06-15 10:12:08,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=107746.66666666667, ans=0.0 2023-06-15 10:12:15,584 INFO [train.py:988] (1/4) Epoch 31, batch 200, loss[loss=0.2035, simple_loss=0.2914, pruned_loss=0.0578, over 19223.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2994, pruned_loss=0.07244, over 2398456.00 frames. ], batch size: 92, lr: 1.05e-02, grad_scale: 32.0 2023-06-15 10:12:15,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=107813.33333333333, ans=0.0 2023-06-15 10:12:33,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=107880.0, ans=0.1 2023-06-15 10:12:40,385 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.78 vs. 
limit=15.0 2023-06-15 10:13:33,060 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.862e+02 2.218e+02 2.544e+02 4.456e+02, threshold=4.436e+02, percent-clipped=1.0 2023-06-15 10:13:41,913 INFO [train.py:988] (1/4) Epoch 31, batch 250, loss[loss=0.247, simple_loss=0.3414, pruned_loss=0.07626, over 18311.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3, pruned_loss=0.07255, over 2698552.37 frames. ], batch size: 72, lr: 1.05e-02, grad_scale: 32.0 2023-06-15 10:13:44,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=108146.66666666667, ans=0.0 2023-06-15 10:14:13,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=108213.33333333333, ans=0.125 2023-06-15 10:14:23,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=108280.0, ans=0.0 2023-06-15 10:14:46,474 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.59 vs. limit=22.5 2023-06-15 10:14:48,160 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.75 vs. limit=12.0 2023-06-15 10:14:56,288 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.30 vs. limit=15.0 2023-06-15 10:15:06,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=108413.33333333333, ans=0.125 2023-06-15 10:15:07,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=108480.0, ans=0.0 2023-06-15 10:15:09,059 INFO [train.py:988] (1/4) Epoch 31, batch 300, loss[loss=0.2158, simple_loss=0.2943, pruned_loss=0.06866, over 19463.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.299, pruned_loss=0.07287, over 2946190.76 frames. ], batch size: 105, lr: 1.05e-02, grad_scale: 32.0 2023-06-15 10:15:11,911 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.48 vs. limit=12.0 2023-06-15 10:15:24,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=108480.0, ans=10.0 2023-06-15 10:15:37,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=108546.66666666667, ans=0.125 2023-06-15 10:15:55,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=108613.33333333333, ans=0.0 2023-06-15 10:16:04,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=108680.0, ans=0.1 2023-06-15 10:16:10,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=108680.0, ans=0.0 2023-06-15 10:16:17,749 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.44 vs. 
limit=15.0 2023-06-15 10:16:26,655 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.814e+02 2.012e+02 2.294e+02 3.766e+02, threshold=4.023e+02, percent-clipped=0.0 2023-06-15 10:16:34,829 INFO [train.py:988] (1/4) Epoch 31, batch 350, loss[loss=0.2234, simple_loss=0.3046, pruned_loss=0.07108, over 19726.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2989, pruned_loss=0.07239, over 3135414.42 frames. ], batch size: 110, lr: 1.05e-02, grad_scale: 32.0 2023-06-15 10:16:43,318 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.80 vs. limit=6.0 2023-06-15 10:16:58,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=108880.0, ans=0.125 2023-06-15 10:17:21,422 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.91 vs. limit=22.5 2023-06-15 10:17:34,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=109013.33333333333, ans=0.1 2023-06-15 10:18:00,826 INFO [train.py:988] (1/4) Epoch 31, batch 400, loss[loss=0.2225, simple_loss=0.2902, pruned_loss=0.07744, over 20657.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2991, pruned_loss=0.07227, over 3268106.82 frames. ], batch size: 211, lr: 1.05e-02, grad_scale: 32.0 2023-06-15 10:18:29,026 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-15 10:19:06,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=109346.66666666667, ans=0.125 2023-06-15 10:19:13,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=109413.33333333333, ans=0.125 2023-06-15 10:19:14,373 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.01 vs. limit=12.0 2023-06-15 10:19:20,543 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.873e+02 2.079e+02 2.420e+02 3.208e+02, threshold=4.158e+02, percent-clipped=0.0 2023-06-15 10:19:27,287 INFO [train.py:988] (1/4) Epoch 31, batch 450, loss[loss=0.2238, simple_loss=0.3042, pruned_loss=0.0717, over 18501.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2987, pruned_loss=0.07216, over 3386553.32 frames. ], batch size: 77, lr: 1.05e-02, grad_scale: 32.0 2023-06-15 10:19:50,771 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.11 vs. limit=15.0 2023-06-15 10:19:54,818 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.94 vs. limit=22.5 2023-06-15 10:20:07,906 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 10:20:22,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=109680.0, ans=0.2 2023-06-15 10:20:31,531 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.97 vs. 
limit=15.0 2023-06-15 10:20:35,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=109746.66666666667, ans=0.0 2023-06-15 10:20:42,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=109746.66666666667, ans=0.1 2023-06-15 10:20:48,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=109746.66666666667, ans=0.125 2023-06-15 10:20:51,576 INFO [train.py:988] (1/4) Epoch 31, batch 500, loss[loss=0.2216, simple_loss=0.3065, pruned_loss=0.06833, over 16741.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2983, pruned_loss=0.07188, over 3477474.95 frames. ], batch size: 59, lr: 1.04e-02, grad_scale: 32.0 2023-06-15 10:20:55,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=109813.33333333333, ans=0.0 2023-06-15 10:21:09,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=109880.0, ans=0.0 2023-06-15 10:22:07,563 INFO [train.py:988] (1/4) Epoch 32, batch 0, loss[loss=0.2148, simple_loss=0.2947, pruned_loss=0.06739, over 19536.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2947, pruned_loss=0.06739, over 19536.00 frames. ], batch size: 102, lr: 1.03e-02, grad_scale: 32.0 2023-06-15 10:22:07,564 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 10:22:13,531 INFO [train.py:1020] (1/4) Epoch 32, validation: loss=0.1996, simple_loss=0.3022, pruned_loss=0.04853, over 143649.00 frames. 2023-06-15 10:22:13,532 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 10:22:15,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=110026.66666666667, ans=0.125 2023-06-15 10:22:16,937 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.51 vs. limit=12.0 2023-06-15 10:22:17,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=110026.66666666667, ans=0.125 2023-06-15 10:22:25,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=110026.66666666667, ans=0.2 2023-06-15 10:22:33,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=110093.33333333333, ans=0.125 2023-06-15 10:22:37,962 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.793e+02 2.026e+02 2.443e+02 4.216e+02, threshold=4.052e+02, percent-clipped=1.0 2023-06-15 10:22:40,405 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.77 vs. 
limit=15.0 2023-06-15 10:22:58,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=110160.0, ans=0.1 2023-06-15 10:23:01,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=110160.0, ans=0.125 2023-06-15 10:23:06,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=110226.66666666667, ans=0.125 2023-06-15 10:23:11,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=110226.66666666667, ans=0.2 2023-06-15 10:23:18,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=110226.66666666667, ans=10.0 2023-06-15 10:23:40,691 INFO [train.py:988] (1/4) Epoch 32, batch 50, loss[loss=0.2158, simple_loss=0.2945, pruned_loss=0.06857, over 18777.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2972, pruned_loss=0.07062, over 848346.29 frames. ], batch size: 83, lr: 1.03e-02, grad_scale: 32.0 2023-06-15 10:23:58,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=110426.66666666667, ans=0.125 2023-06-15 10:24:00,665 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.59 vs. limit=22.5 2023-06-15 10:24:09,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=110426.66666666667, ans=0.2 2023-06-15 10:24:22,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=110493.33333333333, ans=0.125 2023-06-15 10:25:08,054 INFO [train.py:988] (1/4) Epoch 32, batch 100, loss[loss=0.2108, simple_loss=0.2951, pruned_loss=0.06322, over 19069.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2982, pruned_loss=0.07436, over 1509893.31 frames. ], batch size: 89, lr: 1.02e-02, grad_scale: 32.0 2023-06-15 10:25:31,498 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.473e+02 1.727e+02 1.862e+02 2.037e+02 3.271e+02, threshold=3.724e+02, percent-clipped=0.0 2023-06-15 10:25:48,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=110826.66666666667, ans=0.0 2023-06-15 10:25:51,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=110826.66666666667, ans=0.1 2023-06-15 10:26:10,606 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.03 vs. limit=6.0 2023-06-15 10:26:17,106 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 10:26:34,705 INFO [train.py:988] (1/4) Epoch 32, batch 150, loss[loss=0.2152, simple_loss=0.299, pruned_loss=0.06571, over 18776.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.299, pruned_loss=0.07338, over 2016811.11 frames. 
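The Whitening entries compare a measured statistic of a module's activations against a limit (the limit itself follows a schedule, as the whitening_limit ScheduledFloat entries elsewhere in this log show). One simple whiteness statistic with the right qualitative behaviour for the single-group case is sketched below: it equals 1.0 when the channel covariance is a multiple of the identity and grows with the eigenvalue spread. This is an illustrative measure, not necessarily the exact formula used by scaling.py.

import torch

def whitening_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels) activations.
    # Returns n * sum(C**2) / trace(C)**2 for the channel covariance C.
    # By Cauchy-Schwarz this is >= 1, with equality iff C is a multiple of the
    # identity, so larger values mean the activations are less white.
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]
    n = cov.shape[0]
    return (n * (cov * cov).sum() / cov.trace() ** 2).item()

x = torch.randn(2000, 256)                           # nearly white channels
print(whitening_metric(x))                           # close to 1
print(whitening_metric(x @ torch.randn(256, 256)))   # noticeably larger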
], batch size: 83, lr: 1.02e-02, grad_scale: 32.0 2023-06-15 10:26:38,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=111026.66666666667, ans=0.2 2023-06-15 10:27:06,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=111093.33333333333, ans=0.0 2023-06-15 10:27:34,051 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.35 vs. limit=12.0 2023-06-15 10:27:58,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=111293.33333333333, ans=0.125 2023-06-15 10:28:00,902 INFO [train.py:988] (1/4) Epoch 32, batch 200, loss[loss=0.2067, simple_loss=0.2861, pruned_loss=0.06367, over 19519.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2976, pruned_loss=0.07277, over 2420769.02 frames. ], batch size: 102, lr: 1.02e-02, grad_scale: 32.0 2023-06-15 10:28:02,225 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.90 vs. limit=22.5 2023-06-15 10:28:12,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=111360.0, ans=15.0 2023-06-15 10:28:24,822 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.849e+02 2.097e+02 2.469e+02 3.862e+02, threshold=4.194e+02, percent-clipped=1.0 2023-06-15 10:28:42,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=111493.33333333333, ans=0.0 2023-06-15 10:29:09,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=111626.66666666667, ans=0.125 2023-06-15 10:29:11,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=111626.66666666667, ans=0.2 2023-06-15 10:29:26,809 INFO [train.py:988] (1/4) Epoch 32, batch 250, loss[loss=0.2039, simple_loss=0.3001, pruned_loss=0.05387, over 15015.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2973, pruned_loss=0.07221, over 2706264.27 frames. ], batch size: 43, lr: 1.02e-02, grad_scale: 32.0 2023-06-15 10:29:46,317 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 10:30:19,882 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 10:30:23,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=111893.33333333333, ans=0.0 2023-06-15 10:30:53,807 INFO [train.py:988] (1/4) Epoch 32, batch 300, loss[loss=0.2089, simple_loss=0.2938, pruned_loss=0.06205, over 19075.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2972, pruned_loss=0.07245, over 2937571.97 frames. 
], batch size: 94, lr: 1.02e-02, grad_scale: 32.0 2023-06-15 10:30:57,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=112026.66666666667, ans=0.0 2023-06-15 10:31:06,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=112026.66666666667, ans=0.0 2023-06-15 10:31:18,117 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.817e+02 2.017e+02 2.252e+02 3.365e+02, threshold=4.033e+02, percent-clipped=0.0 2023-06-15 10:32:20,596 INFO [train.py:988] (1/4) Epoch 32, batch 350, loss[loss=0.2244, simple_loss=0.2965, pruned_loss=0.07613, over 20493.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2965, pruned_loss=0.0725, over 3123272.67 frames. ], batch size: 160, lr: 1.02e-02, grad_scale: 32.0 2023-06-15 10:32:20,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=112360.0, ans=0.125 2023-06-15 10:33:04,498 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.26 vs. limit=12.0 2023-06-15 10:33:45,063 INFO [train.py:988] (1/4) Epoch 32, batch 400, loss[loss=0.2074, simple_loss=0.2924, pruned_loss=0.06123, over 19336.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2978, pruned_loss=0.07193, over 3249872.41 frames. ], batch size: 98, lr: 1.02e-02, grad_scale: 32.0 2023-06-15 10:33:45,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=112693.33333333333, ans=0.02 2023-06-15 10:33:49,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=112693.33333333333, ans=0.125 2023-06-15 10:33:55,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=112693.33333333333, ans=0.125 2023-06-15 10:34:09,525 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.921e+02 2.168e+02 2.475e+02 4.297e+02, threshold=4.337e+02, percent-clipped=1.0 2023-06-15 10:34:45,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=112893.33333333333, ans=0.04949747468305833 2023-06-15 10:34:57,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=112960.0, ans=0.1 2023-06-15 10:35:02,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=112960.0, ans=0.2 2023-06-15 10:35:08,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=112960.0, ans=0.07 2023-06-15 10:35:11,172 INFO [train.py:988] (1/4) Epoch 32, batch 450, loss[loss=0.2107, simple_loss=0.2942, pruned_loss=0.06361, over 19536.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2969, pruned_loss=0.07131, over 3384771.79 frames. ], batch size: 102, lr: 1.02e-02, grad_scale: 32.0 2023-06-15 10:35:28,483 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.00 vs. 
limit=10.0 2023-06-15 10:35:31,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=113093.33333333333, ans=0.125 2023-06-15 10:35:36,197 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 10:36:16,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=113226.66666666667, ans=0.125 2023-06-15 10:36:36,132 INFO [train.py:988] (1/4) Epoch 32, batch 500, loss[loss=0.2022, simple_loss=0.2817, pruned_loss=0.06139, over 19822.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2967, pruned_loss=0.07111, over 3486003.42 frames. ], batch size: 115, lr: 1.01e-02, grad_scale: 32.0 2023-06-15 10:36:47,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=113360.0, ans=0.1 2023-06-15 10:36:59,361 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.786e+02 2.074e+02 2.314e+02 3.487e+02, threshold=4.148e+02, percent-clipped=0.0 2023-06-15 10:37:12,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=113493.33333333333, ans=10.0 2023-06-15 10:37:44,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=113573.33333333333, ans=0.1 2023-06-15 10:37:52,775 INFO [train.py:988] (1/4) Epoch 33, batch 0, loss[loss=0.2199, simple_loss=0.3104, pruned_loss=0.0647, over 11384.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.3104, pruned_loss=0.0647, over 11384.00 frames. ], batch size: 32, lr: 9.98e-03, grad_scale: 32.0 2023-06-15 10:37:52,776 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 10:37:58,955 INFO [train.py:1020] (1/4) Epoch 33, validation: loss=0.2021, simple_loss=0.3035, pruned_loss=0.05038, over 143649.00 frames. 2023-06-15 10:37:58,956 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 10:38:43,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=113706.66666666667, ans=0.1 2023-06-15 10:39:11,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=113840.0, ans=0.125 2023-06-15 10:39:25,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=113906.66666666667, ans=0.125 2023-06-15 10:39:26,797 INFO [train.py:988] (1/4) Epoch 33, batch 50, loss[loss=0.2063, simple_loss=0.2898, pruned_loss=0.06142, over 18296.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2945, pruned_loss=0.07078, over 852949.24 frames. 
], batch size: 74, lr: 9.96e-03, grad_scale: 32.0 2023-06-15 10:39:27,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=113906.66666666667, ans=0.2 2023-06-15 10:39:53,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=113973.33333333333, ans=0.2 2023-06-15 10:40:00,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=114040.0, ans=10.0 2023-06-15 10:40:22,220 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.812e+02 2.020e+02 2.321e+02 4.264e+02, threshold=4.041e+02, percent-clipped=1.0 2023-06-15 10:40:26,571 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.43 vs. limit=22.5 2023-06-15 10:40:53,380 INFO [train.py:988] (1/4) Epoch 33, batch 100, loss[loss=0.2253, simple_loss=0.2961, pruned_loss=0.07726, over 20318.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2947, pruned_loss=0.06988, over 1511218.16 frames. ], batch size: 149, lr: 9.95e-03, grad_scale: 32.0 2023-06-15 10:41:01,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=114240.0, ans=0.0 2023-06-15 10:41:07,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=114240.0, ans=0.0 2023-06-15 10:41:10,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=114306.66666666667, ans=0.0 2023-06-15 10:41:16,817 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.77 vs. limit=22.5 2023-06-15 10:41:17,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=114306.66666666667, ans=0.125 2023-06-15 10:41:36,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=114373.33333333333, ans=0.0 2023-06-15 10:41:43,450 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=114440.0, ans=0.04949747468305833 2023-06-15 10:42:19,607 INFO [train.py:988] (1/4) Epoch 33, batch 150, loss[loss=0.2096, simple_loss=0.2953, pruned_loss=0.06193, over 19222.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2964, pruned_loss=0.06893, over 2005640.81 frames. ], batch size: 92, lr: 9.94e-03, grad_scale: 32.0 2023-06-15 10:42:30,587 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-06-15 10:43:07,702 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.74 vs. 
limit=12.0 2023-06-15 10:43:09,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=114773.33333333333, ans=0.5 2023-06-15 10:43:14,644 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.854e+02 2.036e+02 2.409e+02 3.930e+02, threshold=4.072e+02, percent-clipped=0.0 2023-06-15 10:43:33,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=114840.0, ans=0.125 2023-06-15 10:43:45,584 INFO [train.py:988] (1/4) Epoch 33, batch 200, loss[loss=0.2103, simple_loss=0.2932, pruned_loss=0.06373, over 19204.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2957, pruned_loss=0.06899, over 2416893.74 frames. ], batch size: 92, lr: 9.93e-03, grad_scale: 32.0 2023-06-15 10:44:46,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=115106.66666666667, ans=0.125 2023-06-15 10:45:05,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=115173.33333333333, ans=0.025 2023-06-15 10:45:11,580 INFO [train.py:988] (1/4) Epoch 33, batch 250, loss[loss=0.2331, simple_loss=0.2983, pruned_loss=0.08396, over 20702.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2966, pruned_loss=0.06925, over 2716823.04 frames. ], batch size: 211, lr: 9.92e-03, grad_scale: 32.0 2023-06-15 10:45:17,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=115240.0, ans=0.0 2023-06-15 10:45:37,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=115306.66666666667, ans=0.125 2023-06-15 10:45:37,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=115306.66666666667, ans=0.2 2023-06-15 10:45:44,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=115373.33333333333, ans=0.125 2023-06-15 10:45:48,182 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=115373.33333333333, ans=0.125 2023-06-15 10:45:49,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=115373.33333333333, ans=0.035 2023-06-15 10:45:55,521 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.03 vs. limit=6.0 2023-06-15 10:46:06,696 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.758e+02 1.982e+02 2.404e+02 3.997e+02, threshold=3.964e+02, percent-clipped=0.0 2023-06-15 10:46:13,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=115440.0, ans=0.1 2023-06-15 10:46:37,244 INFO [train.py:988] (1/4) Epoch 33, batch 300, loss[loss=0.2051, simple_loss=0.2863, pruned_loss=0.06193, over 19057.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2962, pruned_loss=0.06943, over 2959741.18 frames. 
], batch size: 89, lr: 9.90e-03, grad_scale: 32.0 2023-06-15 10:46:49,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=115573.33333333333, ans=0.1 2023-06-15 10:46:51,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=115573.33333333333, ans=0.0 2023-06-15 10:46:54,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=115640.0, ans=0.09899494936611666 2023-06-15 10:47:03,639 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.74 vs. limit=15.0 2023-06-15 10:47:13,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=115706.66666666667, ans=0.125 2023-06-15 10:47:42,498 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.52 vs. limit=6.0 2023-06-15 10:48:02,183 INFO [train.py:988] (1/4) Epoch 33, batch 350, loss[loss=0.218, simple_loss=0.288, pruned_loss=0.07401, over 20632.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2959, pruned_loss=0.06913, over 3142994.51 frames. ], batch size: 189, lr: 9.89e-03, grad_scale: 32.0 2023-06-15 10:48:14,075 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.18 vs. limit=15.0 2023-06-15 10:48:21,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=115973.33333333333, ans=0.0 2023-06-15 10:48:23,730 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 10:48:34,426 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=22.5 2023-06-15 10:48:57,406 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.859e+02 2.087e+02 2.471e+02 4.224e+02, threshold=4.174e+02, percent-clipped=1.0 2023-06-15 10:49:02,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=116106.66666666667, ans=0.0 2023-06-15 10:49:10,166 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-15 10:49:16,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=116173.33333333333, ans=0.0 2023-06-15 10:49:19,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=116173.33333333333, ans=0.125 2023-06-15 10:49:21,130 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=15.0 2023-06-15 10:49:28,267 INFO [train.py:988] (1/4) Epoch 33, batch 400, loss[loss=0.2047, simple_loss=0.2889, pruned_loss=0.06021, over 19872.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2962, pruned_loss=0.06974, over 3293885.82 frames. 
], batch size: 120, lr: 9.88e-03, grad_scale: 32.0 2023-06-15 10:49:39,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=116240.0, ans=0.125 2023-06-15 10:50:03,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=116373.33333333333, ans=0.0 2023-06-15 10:50:06,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=116373.33333333333, ans=0.125 2023-06-15 10:50:06,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=116373.33333333333, ans=0.2 2023-06-15 10:50:08,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=116373.33333333333, ans=0.1 2023-06-15 10:50:35,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=116506.66666666667, ans=0.125 2023-06-15 10:50:37,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=116506.66666666667, ans=0.0 2023-06-15 10:50:47,089 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.66 vs. limit=15.0 2023-06-15 10:50:48,204 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 10:50:53,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=116573.33333333333, ans=0.125 2023-06-15 10:50:54,726 INFO [train.py:988] (1/4) Epoch 33, batch 450, loss[loss=0.2475, simple_loss=0.3105, pruned_loss=0.09226, over 20308.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2961, pruned_loss=0.06966, over 3410570.46 frames. ], batch size: 149, lr: 9.87e-03, grad_scale: 32.0 2023-06-15 10:51:14,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=116640.0, ans=0.125 2023-06-15 10:51:50,045 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.786e+02 1.951e+02 2.072e+02 2.874e+02, threshold=3.901e+02, percent-clipped=0.0 2023-06-15 10:52:00,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=116840.0, ans=0.2 2023-06-15 10:52:06,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=116840.0, ans=0.1 2023-06-15 10:52:07,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=116840.0, ans=0.5 2023-06-15 10:52:17,728 INFO [train.py:988] (1/4) Epoch 33, batch 500, loss[loss=0.225, simple_loss=0.2941, pruned_loss=0.07796, over 19946.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2965, pruned_loss=0.06973, over 3473095.79 frames. 
], batch size: 126, lr: 9.86e-03, grad_scale: 32.0 2023-06-15 10:52:19,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=116906.66666666667, ans=0.2 2023-06-15 10:52:24,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=116906.66666666667, ans=0.0 2023-06-15 10:52:28,453 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.63 vs. limit=22.5 2023-06-15 10:53:34,840 INFO [train.py:988] (1/4) Epoch 34, batch 0, loss[loss=0.2194, simple_loss=0.2948, pruned_loss=0.072, over 20559.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2948, pruned_loss=0.072, over 20559.00 frames. ], batch size: 173, lr: 9.70e-03, grad_scale: 32.0 2023-06-15 10:53:34,841 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 10:53:41,140 INFO [train.py:1020] (1/4) Epoch 34, validation: loss=0.2011, simple_loss=0.3024, pruned_loss=0.04991, over 143649.00 frames. 2023-06-15 10:53:41,141 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 10:53:47,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=117126.66666666667, ans=0.2 2023-06-15 10:54:14,980 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=117260.0, ans=0.125 2023-06-15 10:54:49,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=117326.66666666667, ans=0.0 2023-06-15 10:55:10,948 INFO [train.py:988] (1/4) Epoch 34, batch 50, loss[loss=0.1867, simple_loss=0.2696, pruned_loss=0.05187, over 19557.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2968, pruned_loss=0.07043, over 860371.45 frames. ], batch size: 102, lr: 9.69e-03, grad_scale: 16.0 2023-06-15 10:55:12,572 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.849e+02 2.100e+02 2.383e+02 3.120e+02, threshold=4.200e+02, percent-clipped=0.0 2023-06-15 10:55:25,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=117460.0, ans=0.0 2023-06-15 10:56:02,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=117593.33333333333, ans=0.5 2023-06-15 10:56:13,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=117660.0, ans=0.125 2023-06-15 10:56:21,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=117726.66666666667, ans=0.125 2023-06-15 10:56:40,502 INFO [train.py:988] (1/4) Epoch 34, batch 100, loss[loss=0.2094, simple_loss=0.2945, pruned_loss=0.06214, over 19479.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.296, pruned_loss=0.06888, over 1516111.59 frames. ], batch size: 105, lr: 9.68e-03, grad_scale: 16.0 2023-06-15 10:56:50,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=117793.33333333333, ans=0.0 2023-06-15 10:56:58,843 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.06 vs. 
limit=15.0 2023-06-15 10:57:01,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=117860.0, ans=0.125 2023-06-15 10:57:59,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=118060.0, ans=0.0 2023-06-15 10:58:01,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=118060.0, ans=0.125 2023-06-15 10:58:09,682 INFO [train.py:988] (1/4) Epoch 34, batch 150, loss[loss=0.2132, simple_loss=0.2938, pruned_loss=0.0663, over 19822.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2968, pruned_loss=0.07026, over 2011597.31 frames. ], batch size: 115, lr: 9.67e-03, grad_scale: 16.0 2023-06-15 10:58:11,351 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.765e+02 1.971e+02 2.282e+02 3.738e+02, threshold=3.942e+02, percent-clipped=0.0 2023-06-15 10:58:52,010 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 10:58:55,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=118260.0, ans=0.2 2023-06-15 10:59:18,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=118326.66666666667, ans=0.09899494936611666 2023-06-15 10:59:37,547 INFO [train.py:988] (1/4) Epoch 34, batch 200, loss[loss=0.2304, simple_loss=0.312, pruned_loss=0.07439, over 18612.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2969, pruned_loss=0.06984, over 2399426.47 frames. ], batch size: 80, lr: 9.65e-03, grad_scale: 16.0 2023-06-15 10:59:42,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=118460.0, ans=0.125 2023-06-15 10:59:42,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=118460.0, ans=0.125 2023-06-15 10:59:46,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=118460.0, ans=0.025 2023-06-15 11:00:02,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=118526.66666666667, ans=0.125 2023-06-15 11:00:34,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=118660.0, ans=0.0 2023-06-15 11:00:58,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=118726.66666666667, ans=0.125 2023-06-15 11:01:07,017 INFO [train.py:988] (1/4) Epoch 34, batch 250, loss[loss=0.2199, simple_loss=0.2983, pruned_loss=0.07073, over 19060.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2961, pruned_loss=0.06961, over 2720807.90 frames. 
], batch size: 89, lr: 9.64e-03, grad_scale: 16.0 2023-06-15 11:01:09,079 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.911e+02 2.159e+02 2.494e+02 3.715e+02, threshold=4.319e+02, percent-clipped=0.0 2023-06-15 11:01:10,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=118793.33333333333, ans=0.0 2023-06-15 11:01:14,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=118793.33333333333, ans=0.1 2023-06-15 11:01:27,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=118860.0, ans=0.125 2023-06-15 11:01:36,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=118860.0, ans=0.0 2023-06-15 11:01:37,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=118860.0, ans=0.125 2023-06-15 11:01:42,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=118926.66666666667, ans=0.0 2023-06-15 11:02:00,072 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=15.0 2023-06-15 11:02:04,742 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=118993.33333333333, ans=0.0 2023-06-15 11:02:06,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=118993.33333333333, ans=0.125 2023-06-15 11:02:14,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=118993.33333333333, ans=0.125 2023-06-15 11:02:34,399 INFO [train.py:988] (1/4) Epoch 34, batch 300, loss[loss=0.2333, simple_loss=0.3187, pruned_loss=0.07394, over 16959.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2961, pruned_loss=0.06926, over 2944536.95 frames. ], batch size: 60, lr: 9.63e-03, grad_scale: 16.0 2023-06-15 11:03:05,992 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.23 vs. limit=10.0 2023-06-15 11:03:20,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=119260.0, ans=0.125 2023-06-15 11:03:53,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=119393.33333333333, ans=0.0 2023-06-15 11:04:03,794 INFO [train.py:988] (1/4) Epoch 34, batch 350, loss[loss=0.2074, simple_loss=0.2889, pruned_loss=0.06293, over 19471.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2952, pruned_loss=0.06902, over 3136487.95 frames. 
], batch size: 105, lr: 9.62e-03, grad_scale: 16.0 2023-06-15 11:04:03,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=119460.0, ans=0.0 2023-06-15 11:04:05,498 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.557e+02 1.947e+02 2.157e+02 2.600e+02 3.674e+02, threshold=4.313e+02, percent-clipped=0.0 2023-06-15 11:04:15,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=119460.0, ans=0.0 2023-06-15 11:04:16,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=119460.0, ans=0.125 2023-06-15 11:04:52,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=119593.33333333333, ans=0.1 2023-06-15 11:05:27,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=119726.66666666667, ans=0.2 2023-06-15 11:05:34,488 INFO [train.py:988] (1/4) Epoch 34, batch 400, loss[loss=0.1954, simple_loss=0.2832, pruned_loss=0.0538, over 19538.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2953, pruned_loss=0.06887, over 3271502.33 frames. ], batch size: 102, lr: 9.61e-03, grad_scale: 32.0 2023-06-15 11:05:46,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=119793.33333333333, ans=0.0 2023-06-15 11:06:03,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=119860.0, ans=0.125 2023-06-15 11:06:19,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=119926.66666666667, ans=0.125 2023-06-15 11:06:22,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=119926.66666666667, ans=0.125 2023-06-15 11:06:26,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=119993.33333333333, ans=0.0 2023-06-15 11:06:46,240 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 11:06:49,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=120060.0, ans=0.05 2023-06-15 11:07:03,749 INFO [train.py:988] (1/4) Epoch 34, batch 450, loss[loss=0.234, simple_loss=0.3135, pruned_loss=0.0772, over 15128.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2945, pruned_loss=0.06897, over 3384829.02 frames. ], batch size: 43, lr: 9.60e-03, grad_scale: 32.0 2023-06-15 11:07:05,359 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.842e+02 2.161e+02 2.491e+02 3.686e+02, threshold=4.322e+02, percent-clipped=0.0 2023-06-15 11:07:07,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=120126.66666666667, ans=0.125 2023-06-15 11:07:25,999 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.05 vs. limit=12.0 2023-06-15 11:08:15,458 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.00 vs. 
limit=10.0 2023-06-15 11:08:19,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=120393.33333333333, ans=0.125 2023-06-15 11:08:28,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=120460.0, ans=0.125 2023-06-15 11:08:29,406 INFO [train.py:988] (1/4) Epoch 34, batch 500, loss[loss=0.2077, simple_loss=0.2904, pruned_loss=0.0625, over 19540.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2946, pruned_loss=0.06916, over 3478620.60 frames. ], batch size: 102, lr: 9.59e-03, grad_scale: 32.0 2023-06-15 11:08:50,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=120526.66666666667, ans=0.0 2023-06-15 11:09:43,652 INFO [train.py:988] (1/4) Epoch 35, batch 0, loss[loss=0.2152, simple_loss=0.2955, pruned_loss=0.06746, over 19865.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2955, pruned_loss=0.06746, over 19865.00 frames. ], batch size: 120, lr: 9.44e-03, grad_scale: 32.0 2023-06-15 11:09:43,652 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 11:09:49,767 INFO [train.py:1020] (1/4) Epoch 35, validation: loss=0.2016, simple_loss=0.3016, pruned_loss=0.05077, over 143649.00 frames. 2023-06-15 11:09:49,768 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 11:10:21,594 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.525e+02 1.805e+02 2.012e+02 2.315e+02 3.975e+02, threshold=4.025e+02, percent-clipped=0.0 2023-06-15 11:10:48,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=120880.0, ans=10.0 2023-06-15 11:11:06,997 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.33 vs. limit=15.0 2023-06-15 11:11:17,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=121013.33333333333, ans=0.0 2023-06-15 11:11:18,994 INFO [train.py:988] (1/4) Epoch 35, batch 50, loss[loss=0.2296, simple_loss=0.3023, pruned_loss=0.07848, over 19958.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2964, pruned_loss=0.0682, over 861414.61 frames. ], batch size: 126, lr: 9.43e-03, grad_scale: 32.0 2023-06-15 11:11:47,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=121080.0, ans=0.125 2023-06-15 11:11:50,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=121080.0, ans=0.125 2023-06-15 11:11:59,004 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.41 vs. limit=10.0 2023-06-15 11:12:09,672 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.41 vs. limit=12.0 2023-06-15 11:12:25,413 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.78 vs. 
limit=15.0 2023-06-15 11:12:40,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=121280.0, ans=0.1 2023-06-15 11:12:47,371 INFO [train.py:988] (1/4) Epoch 35, batch 100, loss[loss=0.2223, simple_loss=0.3106, pruned_loss=0.067, over 18476.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2961, pruned_loss=0.06789, over 1503436.07 frames. ], batch size: 77, lr: 9.42e-03, grad_scale: 32.0 2023-06-15 11:13:01,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=121346.66666666667, ans=0.1 2023-06-15 11:13:18,552 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.621e+02 1.883e+02 2.096e+02 2.428e+02 4.337e+02, threshold=4.193e+02, percent-clipped=1.0 2023-06-15 11:13:35,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=121480.0, ans=10.0 2023-06-15 11:13:42,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=121546.66666666667, ans=0.125 2023-06-15 11:14:15,098 INFO [train.py:988] (1/4) Epoch 35, batch 150, loss[loss=0.2397, simple_loss=0.3194, pruned_loss=0.07995, over 18471.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2959, pruned_loss=0.0685, over 2000095.12 frames. ], batch size: 77, lr: 9.41e-03, grad_scale: 32.0 2023-06-15 11:14:24,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=121680.0, ans=0.0 2023-06-15 11:14:59,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=121813.33333333333, ans=0.125 2023-06-15 11:15:01,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=121813.33333333333, ans=0.2 2023-06-15 11:15:12,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=121880.0, ans=0.1 2023-06-15 11:15:43,545 INFO [train.py:988] (1/4) Epoch 35, batch 200, loss[loss=0.2126, simple_loss=0.2937, pruned_loss=0.06574, over 19089.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2966, pruned_loss=0.06869, over 2388466.73 frames. ], batch size: 89, lr: 9.40e-03, grad_scale: 32.0 2023-06-15 11:15:45,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=122013.33333333333, ans=0.09899494936611666 2023-06-15 11:16:14,791 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.845e+02 2.048e+02 2.405e+02 3.914e+02, threshold=4.095e+02, percent-clipped=0.0 2023-06-15 11:16:24,312 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.84 vs. limit=22.5 2023-06-15 11:17:01,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=122280.0, ans=0.125 2023-06-15 11:17:09,621 INFO [train.py:988] (1/4) Epoch 35, batch 250, loss[loss=0.2311, simple_loss=0.2682, pruned_loss=0.09697, over 17126.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2957, pruned_loss=0.06851, over 2703825.90 frames. 
], batch size: 391, lr: 9.38e-03, grad_scale: 32.0 2023-06-15 11:17:29,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=122413.33333333333, ans=0.125 2023-06-15 11:17:56,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=122480.0, ans=0.1 2023-06-15 11:18:04,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=122546.66666666667, ans=0.125 2023-06-15 11:18:07,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=122546.66666666667, ans=0.125 2023-06-15 11:18:18,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=122613.33333333333, ans=0.0 2023-06-15 11:18:36,372 INFO [train.py:988] (1/4) Epoch 35, batch 300, loss[loss=0.1997, simple_loss=0.2819, pruned_loss=0.05872, over 19802.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2956, pruned_loss=0.0688, over 2928838.04 frames. ], batch size: 115, lr: 9.37e-03, grad_scale: 32.0 2023-06-15 11:18:58,311 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.70 vs. limit=6.0 2023-06-15 11:19:06,912 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.736e+02 1.889e+02 2.139e+02 2.972e+02, threshold=3.778e+02, percent-clipped=0.0 2023-06-15 11:19:50,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=122946.66666666667, ans=0.125 2023-06-15 11:20:01,986 INFO [train.py:988] (1/4) Epoch 35, batch 350, loss[loss=0.211, simple_loss=0.2927, pruned_loss=0.06465, over 18272.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2944, pruned_loss=0.06868, over 3125238.80 frames. ], batch size: 74, lr: 9.36e-03, grad_scale: 32.0 2023-06-15 11:20:07,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=123013.33333333333, ans=0.125 2023-06-15 11:21:03,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=123213.33333333333, ans=0.1 2023-06-15 11:21:24,505 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=22.5 2023-06-15 11:21:28,670 INFO [train.py:988] (1/4) Epoch 35, batch 400, loss[loss=0.2012, simple_loss=0.2865, pruned_loss=0.05796, over 19102.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2938, pruned_loss=0.06895, over 3284640.07 frames. ], batch size: 94, lr: 9.35e-03, grad_scale: 32.0 2023-06-15 11:21:59,506 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.852e+02 2.101e+02 2.531e+02 3.269e+02, threshold=4.203e+02, percent-clipped=0.0 2023-06-15 11:22:05,954 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.90 vs. 
limit=15.0 2023-06-15 11:22:06,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=123480.0, ans=0.025 2023-06-15 11:22:31,137 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.85 vs. limit=15.0 2023-06-15 11:22:54,581 INFO [train.py:988] (1/4) Epoch 35, batch 450, loss[loss=0.1996, simple_loss=0.2921, pruned_loss=0.05351, over 18289.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2941, pruned_loss=0.0686, over 3375054.68 frames. ], batch size: 74, lr: 9.34e-03, grad_scale: 32.0 2023-06-15 11:22:58,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=123680.0, ans=0.09899494936611666 2023-06-15 11:23:16,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=123746.66666666667, ans=0.0 2023-06-15 11:23:22,436 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.61 vs. limit=22.5 2023-06-15 11:23:46,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=123880.0, ans=0.0 2023-06-15 11:24:14,800 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.14 vs. limit=15.0 2023-06-15 11:24:18,935 INFO [train.py:988] (1/4) Epoch 35, batch 500, loss[loss=0.2382, simple_loss=0.3321, pruned_loss=0.07215, over 15411.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2943, pruned_loss=0.06868, over 3464079.18 frames. ], batch size: 44, lr: 9.33e-03, grad_scale: 32.0 2023-06-15 11:24:19,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=124013.33333333333, ans=0.0 2023-06-15 11:24:19,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=124013.33333333333, ans=0.0 2023-06-15 11:24:30,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=124013.33333333333, ans=0.95 2023-06-15 11:24:46,351 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=124080.0, ans=0.0 2023-06-15 11:24:47,763 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.833e+02 2.002e+02 2.181e+02 2.864e+02, threshold=4.004e+02, percent-clipped=0.0 2023-06-15 11:25:33,792 INFO [train.py:988] (1/4) Epoch 36, batch 0, loss[loss=0.2463, simple_loss=0.3136, pruned_loss=0.08945, over 19982.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3136, pruned_loss=0.08945, over 19982.00 frames. ], batch size: 126, lr: 9.19e-03, grad_scale: 32.0 2023-06-15 11:25:33,793 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 11:25:39,900 INFO [train.py:1020] (1/4) Epoch 36, validation: loss=0.2014, simple_loss=0.3017, pruned_loss=0.05055, over 143649.00 frames. 
2023-06-15 11:25:39,901 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 11:26:06,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=124293.33333333333, ans=0.09899494936611666 2023-06-15 11:26:08,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=124293.33333333333, ans=0.1 2023-06-15 11:26:29,443 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-06-15 11:26:32,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=124426.66666666667, ans=10.0 2023-06-15 11:27:05,086 INFO [train.py:988] (1/4) Epoch 36, batch 50, loss[loss=0.2022, simple_loss=0.284, pruned_loss=0.06022, over 19856.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2937, pruned_loss=0.06662, over 851108.36 frames. ], batch size: 120, lr: 9.18e-03, grad_scale: 32.0 2023-06-15 11:27:16,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_na.min_abs, batch_count=124560.0, ans=0.02 2023-06-15 11:27:19,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=124560.0, ans=0.2 2023-06-15 11:27:27,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=124626.66666666667, ans=0.125 2023-06-15 11:27:41,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=124693.33333333333, ans=0.125 2023-06-15 11:27:45,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=124693.33333333333, ans=0.05 2023-06-15 11:27:49,136 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.69 vs. limit=22.5 2023-06-15 11:28:02,606 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 11:28:03,315 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.93 vs. limit=15.0 2023-06-15 11:28:07,139 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.578e+02 1.827e+02 2.009e+02 2.333e+02 3.474e+02, threshold=4.018e+02, percent-clipped=0.0 2023-06-15 11:28:31,569 INFO [train.py:988] (1/4) Epoch 36, batch 100, loss[loss=0.2175, simple_loss=0.2879, pruned_loss=0.07348, over 20329.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2916, pruned_loss=0.06631, over 1503020.51 frames. 
], batch size: 149, lr: 9.17e-03, grad_scale: 32.0 2023-06-15 11:28:58,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=124960.0, ans=0.07 2023-06-15 11:29:00,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=124960.0, ans=0.07 2023-06-15 11:29:50,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=125160.0, ans=0.1 2023-06-15 11:29:58,758 INFO [train.py:988] (1/4) Epoch 36, batch 150, loss[loss=0.2302, simple_loss=0.3221, pruned_loss=0.06915, over 15658.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2929, pruned_loss=0.06736, over 2000829.04 frames. ], batch size: 44, lr: 9.16e-03, grad_scale: 16.0 2023-06-15 11:30:11,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=125226.66666666667, ans=0.2 2023-06-15 11:30:37,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=125360.0, ans=0.125 2023-06-15 11:30:48,181 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.73 vs. limit=15.0 2023-06-15 11:31:02,865 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.954e+02 2.200e+02 2.726e+02 5.615e+02, threshold=4.401e+02, percent-clipped=3.0 2023-06-15 11:31:18,565 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.41 vs. limit=10.0 2023-06-15 11:31:25,678 INFO [train.py:988] (1/4) Epoch 36, batch 200, loss[loss=0.222, simple_loss=0.3085, pruned_loss=0.06775, over 16448.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2931, pruned_loss=0.06778, over 2393959.36 frames. ], batch size: 52, lr: 9.15e-03, grad_scale: 16.0 2023-06-15 11:31:32,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=125560.0, ans=0.125 2023-06-15 11:31:33,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=125560.0, ans=0.125 2023-06-15 11:31:35,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=125560.0, ans=0.125 2023-06-15 11:32:15,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=125760.0, ans=0.0 2023-06-15 11:32:17,785 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.76 vs. limit=15.0 2023-06-15 11:32:52,631 INFO [train.py:988] (1/4) Epoch 36, batch 250, loss[loss=0.239, simple_loss=0.3131, pruned_loss=0.08246, over 19540.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2925, pruned_loss=0.06793, over 2710772.90 frames. 
], batch size: 102, lr: 9.14e-03, grad_scale: 16.0 2023-06-15 11:33:21,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=125960.0, ans=0.0 2023-06-15 11:33:23,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=125960.0, ans=0.2 2023-06-15 11:33:54,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=126093.33333333333, ans=0.125 2023-06-15 11:33:56,231 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.785e+02 1.964e+02 2.188e+02 3.452e+02, threshold=3.927e+02, percent-clipped=0.0 2023-06-15 11:34:18,451 INFO [train.py:988] (1/4) Epoch 36, batch 300, loss[loss=0.2093, simple_loss=0.297, pruned_loss=0.0608, over 18762.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2925, pruned_loss=0.06746, over 2944802.55 frames. ], batch size: 83, lr: 9.13e-03, grad_scale: 16.0 2023-06-15 11:34:39,477 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=15.0 2023-06-15 11:34:44,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=126293.33333333333, ans=0.0 2023-06-15 11:35:13,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=126426.66666666667, ans=0.1 2023-06-15 11:35:45,203 INFO [train.py:988] (1/4) Epoch 36, batch 350, loss[loss=0.1996, simple_loss=0.2696, pruned_loss=0.06486, over 20593.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2925, pruned_loss=0.06754, over 3125969.52 frames. ], batch size: 189, lr: 9.12e-03, grad_scale: 16.0 2023-06-15 11:36:01,075 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.95 vs. limit=15.0 2023-06-15 11:36:50,196 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.822e+02 2.085e+02 2.267e+02 3.723e+02, threshold=4.169e+02, percent-clipped=0.0 2023-06-15 11:36:50,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=126760.0, ans=0.125 2023-06-15 11:37:00,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=126826.66666666667, ans=0.0 2023-06-15 11:37:13,984 INFO [train.py:988] (1/4) Epoch 36, batch 400, loss[loss=0.2097, simple_loss=0.2918, pruned_loss=0.06379, over 19528.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2922, pruned_loss=0.06702, over 3265059.98 frames. ], batch size: 102, lr: 9.11e-03, grad_scale: 32.0 2023-06-15 11:37:30,227 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.48 vs. 
limit=12.0 2023-06-15 11:37:46,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=127026.66666666667, ans=0.125 2023-06-15 11:37:53,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=127026.66666666667, ans=0.125 2023-06-15 11:38:03,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=127093.33333333333, ans=0.0 2023-06-15 11:38:40,679 INFO [train.py:988] (1/4) Epoch 36, batch 450, loss[loss=0.2113, simple_loss=0.2928, pruned_loss=0.06495, over 19805.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2927, pruned_loss=0.06737, over 3385722.31 frames. ], batch size: 115, lr: 9.10e-03, grad_scale: 32.0 2023-06-15 11:39:29,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=127360.0, ans=10.0 2023-06-15 11:39:34,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=127426.66666666667, ans=0.1 2023-06-15 11:39:34,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=127426.66666666667, ans=0.125 2023-06-15 11:39:43,812 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.789e+02 1.972e+02 2.291e+02 3.379e+02, threshold=3.945e+02, percent-clipped=0.0 2023-06-15 11:40:05,569 INFO [train.py:988] (1/4) Epoch 36, batch 500, loss[loss=0.2537, simple_loss=0.3335, pruned_loss=0.08698, over 16660.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.293, pruned_loss=0.06785, over 3463058.84 frames. ], batch size: 59, lr: 9.09e-03, grad_scale: 32.0 2023-06-15 11:40:20,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=127626.66666666667, ans=0.1 2023-06-15 11:40:29,279 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.65 vs. limit=15.0 2023-06-15 11:40:52,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=127760.0, ans=0.2 2023-06-15 11:41:22,934 INFO [train.py:988] (1/4) Epoch 37, batch 0, loss[loss=0.218, simple_loss=0.2796, pruned_loss=0.07814, over 19995.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2796, pruned_loss=0.07814, over 19995.00 frames. ], batch size: 294, lr: 8.96e-03, grad_scale: 32.0 2023-06-15 11:41:22,934 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 11:41:29,079 INFO [train.py:1020] (1/4) Epoch 37, validation: loss=0.2017, simple_loss=0.3019, pruned_loss=0.05073, over 143649.00 frames. 
2023-06-15 11:41:29,081 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 11:41:32,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=127780.0, ans=0.2 2023-06-15 11:41:36,626 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=127780.0, ans=0.0 2023-06-15 11:41:41,495 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 11:41:55,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=127846.66666666667, ans=0.0 2023-06-15 11:41:59,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=127846.66666666667, ans=0.0 2023-06-15 11:42:10,386 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.41 vs. limit=15.0 2023-06-15 11:42:22,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=127980.0, ans=0.1 2023-06-15 11:42:33,985 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.20 vs. limit=10.0 2023-06-15 11:42:38,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=128046.66666666667, ans=0.125 2023-06-15 11:42:50,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=128046.66666666667, ans=0.09899494936611666 2023-06-15 11:42:55,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=128113.33333333333, ans=0.07 2023-06-15 11:42:56,855 INFO [train.py:988] (1/4) Epoch 37, batch 50, loss[loss=0.2146, simple_loss=0.3036, pruned_loss=0.06277, over 17060.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2908, pruned_loss=0.06865, over 857185.41 frames. ], batch size: 60, lr: 8.95e-03, grad_scale: 32.0 2023-06-15 11:43:04,216 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.676e+02 1.887e+02 2.171e+02 3.433e+02, threshold=3.773e+02, percent-clipped=0.0 2023-06-15 11:43:20,407 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=128180.0, ans=0.125 2023-06-15 11:43:27,502 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.86 vs. limit=12.0 2023-06-15 11:44:07,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=128380.0, ans=0.0 2023-06-15 11:44:17,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=128380.0, ans=0.2 2023-06-15 11:44:24,259 INFO [train.py:988] (1/4) Epoch 37, batch 100, loss[loss=0.2003, simple_loss=0.2801, pruned_loss=0.06024, over 18942.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2913, pruned_loss=0.0667, over 1504707.45 frames. 
], batch size: 86, lr: 8.94e-03, grad_scale: 32.0 2023-06-15 11:44:32,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=128446.66666666667, ans=0.0 2023-06-15 11:44:46,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=128513.33333333333, ans=0.125 2023-06-15 11:44:58,346 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-15 11:45:00,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=128580.0, ans=0.0 2023-06-15 11:45:02,499 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.37 vs. limit=22.5 2023-06-15 11:45:14,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=128580.0, ans=0.0 2023-06-15 11:45:44,429 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.34 vs. limit=15.0 2023-06-15 11:45:52,373 INFO [train.py:988] (1/4) Epoch 37, batch 150, loss[loss=0.2053, simple_loss=0.3033, pruned_loss=0.05365, over 15454.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2931, pruned_loss=0.06692, over 2012567.54 frames. ], batch size: 44, lr: 8.93e-03, grad_scale: 32.0 2023-06-15 11:45:56,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=128780.0, ans=0.0 2023-06-15 11:45:59,530 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 1.837e+02 2.114e+02 2.317e+02 3.549e+02, threshold=4.229e+02, percent-clipped=0.0 2023-06-15 11:46:30,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=128913.33333333333, ans=15.0 2023-06-15 11:46:51,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=128980.0, ans=0.0 2023-06-15 11:47:01,787 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-15 11:47:07,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_na.min_abs, batch_count=129046.66666666667, ans=0.02 2023-06-15 11:47:20,596 INFO [train.py:988] (1/4) Epoch 37, batch 200, loss[loss=0.2005, simple_loss=0.2855, pruned_loss=0.05774, over 19870.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2929, pruned_loss=0.0669, over 2407040.64 frames. ], batch size: 120, lr: 8.92e-03, grad_scale: 32.0 2023-06-15 11:47:47,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=129180.0, ans=0.125 2023-06-15 11:47:50,968 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2023-06-15 11:48:37,491 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.12 vs. 
limit=22.5 2023-06-15 11:48:48,313 INFO [train.py:988] (1/4) Epoch 37, batch 250, loss[loss=0.2022, simple_loss=0.2868, pruned_loss=0.05876, over 18476.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2923, pruned_loss=0.06668, over 2718211.17 frames. ], batch size: 77, lr: 8.91e-03, grad_scale: 32.0 2023-06-15 11:48:54,949 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.981e+02 2.345e+02 2.837e+02 3.921e+02, threshold=4.691e+02, percent-clipped=0.0 2023-06-15 11:49:03,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=129446.66666666667, ans=0.125 2023-06-15 11:49:13,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=129513.33333333333, ans=0.0 2023-06-15 11:49:57,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=129713.33333333333, ans=0.1 2023-06-15 11:50:08,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=129713.33333333333, ans=0.125 2023-06-15 11:50:16,937 INFO [train.py:988] (1/4) Epoch 37, batch 300, loss[loss=0.1993, simple_loss=0.2832, pruned_loss=0.05766, over 19829.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2906, pruned_loss=0.06702, over 2972295.97 frames. ], batch size: 115, lr: 8.90e-03, grad_scale: 32.0 2023-06-15 11:50:21,404 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=15.0 2023-06-15 11:50:38,893 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 11:50:49,109 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=15.0 2023-06-15 11:51:01,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=129913.33333333333, ans=0.0 2023-06-15 11:51:14,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=129980.0, ans=0.125 2023-06-15 11:51:32,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=130046.66666666667, ans=0.0 2023-06-15 11:51:45,597 INFO [train.py:988] (1/4) Epoch 37, batch 350, loss[loss=0.2265, simple_loss=0.2938, pruned_loss=0.07958, over 20228.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2904, pruned_loss=0.06698, over 3163090.93 frames. 
], batch size: 239, lr: 8.89e-03, grad_scale: 32.0 2023-06-15 11:51:52,277 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.921e+02 2.094e+02 2.447e+02 3.479e+02, threshold=4.189e+02, percent-clipped=0.0 2023-06-15 11:52:12,972 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 11:52:13,219 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=130180.0, ans=0.125 2023-06-15 11:52:16,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=130180.0, ans=0.0 2023-06-15 11:52:23,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=130246.66666666667, ans=0.04949747468305833 2023-06-15 11:52:51,804 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.02 vs. limit=15.0 2023-06-15 11:52:54,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=130380.0, ans=0.1 2023-06-15 11:53:11,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=130446.66666666667, ans=0.0 2023-06-15 11:53:13,109 INFO [train.py:988] (1/4) Epoch 37, batch 400, loss[loss=0.1813, simple_loss=0.2684, pruned_loss=0.04707, over 18617.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2893, pruned_loss=0.06646, over 3310019.54 frames. ], batch size: 80, lr: 8.88e-03, grad_scale: 32.0 2023-06-15 11:53:38,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=130513.33333333333, ans=0.1 2023-06-15 11:53:53,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=130580.0, ans=0.2 2023-06-15 11:54:12,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=130646.66666666667, ans=0.2 2023-06-15 11:54:18,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=130646.66666666667, ans=0.2 2023-06-15 11:54:29,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=130713.33333333333, ans=0.025 2023-06-15 11:54:43,385 INFO [train.py:988] (1/4) Epoch 37, batch 450, loss[loss=0.2103, simple_loss=0.2917, pruned_loss=0.06441, over 19339.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2902, pruned_loss=0.06657, over 3406308.77 frames. 
], batch size: 98, lr: 8.87e-03, grad_scale: 16.0 2023-06-15 11:54:47,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=130780.0, ans=0.125 2023-06-15 11:54:51,578 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.554e+02 1.773e+02 2.041e+02 2.312e+02 3.124e+02, threshold=4.082e+02, percent-clipped=0.0 2023-06-15 11:54:57,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=130780.0, ans=0.1 2023-06-15 11:54:59,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=130846.66666666667, ans=0.125 2023-06-15 11:55:15,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=130846.66666666667, ans=0.125 2023-06-15 11:55:55,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=131046.66666666667, ans=0.1 2023-06-15 11:56:00,159 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.30 vs. limit=10.0 2023-06-15 11:56:09,198 INFO [train.py:988] (1/4) Epoch 37, batch 500, loss[loss=0.2086, simple_loss=0.2817, pruned_loss=0.0677, over 20803.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2901, pruned_loss=0.06624, over 3495088.56 frames. ], batch size: 211, lr: 8.86e-03, grad_scale: 16.0 2023-06-15 11:56:19,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=131113.33333333334, ans=0.0 2023-06-15 11:56:24,682 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=131180.0, ans=0.125 2023-06-15 11:56:27,508 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.67 vs. limit=15.0 2023-06-15 11:57:24,722 INFO [train.py:988] (1/4) Epoch 38, batch 0, loss[loss=0.2131, simple_loss=0.2936, pruned_loss=0.06629, over 19097.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2936, pruned_loss=0.06629, over 19097.00 frames. ], batch size: 89, lr: 8.73e-03, grad_scale: 32.0 2023-06-15 11:57:24,723 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 11:57:31,182 INFO [train.py:1020] (1/4) Epoch 38, validation: loss=0.2046, simple_loss=0.3024, pruned_loss=0.05337, over 143649.00 frames. 
2023-06-15 11:57:31,184 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 11:57:40,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=131326.66666666666, ans=0.0 2023-06-15 11:57:45,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=131326.66666666666, ans=0.125 2023-06-15 11:57:47,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=131393.33333333334, ans=0.0 2023-06-15 11:58:14,484 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.554e+02 1.971e+02 2.221e+02 2.654e+02 3.969e+02, threshold=4.441e+02, percent-clipped=0.0 2023-06-15 11:58:16,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=131460.0, ans=0.1 2023-06-15 11:58:39,613 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.13 vs. limit=10.0 2023-06-15 11:58:41,113 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.55 vs. limit=15.0 2023-06-15 11:59:00,151 INFO [train.py:988] (1/4) Epoch 38, batch 50, loss[loss=0.2194, simple_loss=0.3014, pruned_loss=0.06873, over 18778.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2905, pruned_loss=0.06442, over 858540.63 frames. ], batch size: 83, lr: 8.72e-03, grad_scale: 16.0 2023-06-15 11:59:16,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=131726.66666666666, ans=0.0 2023-06-15 11:59:22,091 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.62 vs. limit=22.5 2023-06-15 11:59:24,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=131726.66666666666, ans=0.125 2023-06-15 11:59:35,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=131793.33333333334, ans=0.125 2023-06-15 11:59:50,842 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 12:00:07,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=131926.66666666666, ans=0.0 2023-06-15 12:00:08,521 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.15 vs. limit=15.0 2023-06-15 12:00:10,087 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.54 vs. limit=22.5 2023-06-15 12:00:17,742 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=131926.66666666666, ans=0.025 2023-06-15 12:00:26,389 INFO [train.py:988] (1/4) Epoch 38, batch 100, loss[loss=0.2098, simple_loss=0.2893, pruned_loss=0.06516, over 20272.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2898, pruned_loss=0.06465, over 1527325.57 frames. 
], batch size: 141, lr: 8.71e-03, grad_scale: 16.0 2023-06-15 12:00:41,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=132060.0, ans=0.0 2023-06-15 12:01:07,824 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.839e+02 2.066e+02 2.376e+02 4.240e+02, threshold=4.131e+02, percent-clipped=0.0 2023-06-15 12:01:23,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=132193.33333333334, ans=0.0 2023-06-15 12:01:52,144 INFO [train.py:988] (1/4) Epoch 38, batch 150, loss[loss=0.2106, simple_loss=0.2889, pruned_loss=0.06613, over 20079.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2899, pruned_loss=0.06424, over 2025398.38 frames. ], batch size: 133, lr: 8.70e-03, grad_scale: 16.0 2023-06-15 12:01:54,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=132326.66666666666, ans=0.0 2023-06-15 12:02:00,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=132326.66666666666, ans=0.125 2023-06-15 12:02:28,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=132460.0, ans=0.0 2023-06-15 12:03:17,130 INFO [train.py:988] (1/4) Epoch 38, batch 200, loss[loss=0.2124, simple_loss=0.2971, pruned_loss=0.06382, over 19112.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2903, pruned_loss=0.06483, over 2435227.29 frames. ], batch size: 94, lr: 8.69e-03, grad_scale: 16.0 2023-06-15 12:03:58,556 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.792e+02 1.952e+02 2.227e+02 3.208e+02, threshold=3.904e+02, percent-clipped=0.0 2023-06-15 12:04:14,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=132860.0, ans=0.125 2023-06-15 12:04:31,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=132926.66666666666, ans=0.04949747468305833 2023-06-15 12:04:43,043 INFO [train.py:988] (1/4) Epoch 38, batch 250, loss[loss=0.1819, simple_loss=0.2677, pruned_loss=0.0481, over 19846.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2909, pruned_loss=0.06481, over 2731898.26 frames. ], batch size: 115, lr: 8.68e-03, grad_scale: 16.0 2023-06-15 12:05:06,303 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.30 vs. limit=22.5 2023-06-15 12:05:36,896 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.51 vs. limit=15.0 2023-06-15 12:06:10,259 INFO [train.py:988] (1/4) Epoch 38, batch 300, loss[loss=0.2104, simple_loss=0.293, pruned_loss=0.06394, over 19888.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2896, pruned_loss=0.06492, over 2987425.29 frames. 
], batch size: 120, lr: 8.67e-03, grad_scale: 16.0 2023-06-15 12:06:54,540 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.862e+02 2.054e+02 2.308e+02 3.545e+02, threshold=4.107e+02, percent-clipped=0.0 2023-06-15 12:07:28,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=133593.33333333334, ans=0.1 2023-06-15 12:07:31,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=133593.33333333334, ans=0.1 2023-06-15 12:07:39,137 INFO [train.py:988] (1/4) Epoch 38, batch 350, loss[loss=0.2029, simple_loss=0.2872, pruned_loss=0.05926, over 18275.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2904, pruned_loss=0.06519, over 3148146.81 frames. ], batch size: 74, lr: 8.66e-03, grad_scale: 16.0 2023-06-15 12:07:50,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=133660.0, ans=0.125 2023-06-15 12:08:09,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=133726.66666666666, ans=0.04949747468305833 2023-06-15 12:09:05,465 INFO [train.py:988] (1/4) Epoch 38, batch 400, loss[loss=0.2132, simple_loss=0.2977, pruned_loss=0.06433, over 20145.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2902, pruned_loss=0.06502, over 3276007.86 frames. ], batch size: 133, lr: 8.65e-03, grad_scale: 32.0 2023-06-15 12:09:13,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=133993.33333333334, ans=0.0 2023-06-15 12:09:20,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=133993.33333333334, ans=0.125 2023-06-15 12:09:20,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=133993.33333333334, ans=0.0 2023-06-15 12:09:23,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=134060.0, ans=0.05 2023-06-15 12:09:39,870 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.73 vs. limit=15.0 2023-06-15 12:09:49,034 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.764e+02 2.036e+02 2.337e+02 3.432e+02, threshold=4.071e+02, percent-clipped=0.0 2023-06-15 12:10:29,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=134260.0, ans=0.125 2023-06-15 12:10:30,083 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.70 vs. limit=15.0 2023-06-15 12:10:32,630 INFO [train.py:988] (1/4) Epoch 38, batch 450, loss[loss=0.1928, simple_loss=0.2786, pruned_loss=0.05351, over 19867.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2902, pruned_loss=0.06532, over 3403494.61 frames. 
], batch size: 120, lr: 8.65e-03, grad_scale: 16.0 2023-06-15 12:11:00,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=134393.33333333334, ans=0.125 2023-06-15 12:11:25,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=134526.66666666666, ans=0.125 2023-06-15 12:11:31,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=134526.66666666666, ans=0.125 2023-06-15 12:11:46,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=134593.33333333334, ans=0.0 2023-06-15 12:11:46,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=134593.33333333334, ans=0.2 2023-06-15 12:11:56,111 INFO [train.py:988] (1/4) Epoch 38, batch 500, loss[loss=0.227, simple_loss=0.3151, pruned_loss=0.06941, over 17615.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2905, pruned_loss=0.06498, over 3491496.81 frames. ], batch size: 67, lr: 8.64e-03, grad_scale: 16.0 2023-06-15 12:12:07,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=134660.0, ans=10.0 2023-06-15 12:12:35,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=134793.33333333334, ans=0.1 2023-06-15 12:12:37,181 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.567e+02 1.813e+02 2.102e+02 2.344e+02 3.588e+02, threshold=4.203e+02, percent-clipped=0.0 2023-06-15 12:13:08,851 INFO [train.py:988] (1/4) Epoch 39, batch 0, loss[loss=0.1805, simple_loss=0.2659, pruned_loss=0.04758, over 19705.00 frames. ], tot_loss[loss=0.1805, simple_loss=0.2659, pruned_loss=0.04758, over 19705.00 frames. ], batch size: 110, lr: 8.52e-03, grad_scale: 32.0 2023-06-15 12:13:08,851 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 12:13:14,990 INFO [train.py:1020] (1/4) Epoch 39, validation: loss=0.2008, simple_loss=0.3008, pruned_loss=0.05042, over 143649.00 frames. 2023-06-15 12:13:14,990 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 12:13:36,263 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.74 vs. limit=15.0 2023-06-15 12:14:05,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=135073.33333333334, ans=0.125 2023-06-15 12:14:16,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=135073.33333333334, ans=0.125 2023-06-15 12:14:19,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=135073.33333333334, ans=0.125 2023-06-15 12:14:27,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=135140.0, ans=0.2 2023-06-15 12:14:42,559 INFO [train.py:988] (1/4) Epoch 39, batch 50, loss[loss=0.1999, simple_loss=0.2885, pruned_loss=0.0557, over 18781.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2876, pruned_loss=0.06409, over 868590.24 frames. 
], batch size: 83, lr: 8.51e-03, grad_scale: 16.0 2023-06-15 12:14:43,687 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.94 vs. limit=6.0 2023-06-15 12:15:11,399 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.20 vs. limit=22.5 2023-06-15 12:15:14,831 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. limit=6.0 2023-06-15 12:15:58,883 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.823e+02 2.055e+02 2.289e+02 2.991e+02, threshold=4.109e+02, percent-clipped=0.0 2023-06-15 12:16:09,599 INFO [train.py:988] (1/4) Epoch 39, batch 100, loss[loss=0.1924, simple_loss=0.2772, pruned_loss=0.0538, over 19461.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2892, pruned_loss=0.06475, over 1518668.78 frames. ], batch size: 105, lr: 8.50e-03, grad_scale: 16.0 2023-06-15 12:16:37,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=135606.66666666666, ans=0.125 2023-06-15 12:16:46,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=135673.33333333334, ans=0.0 2023-06-15 12:16:58,471 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.82 vs. limit=15.0 2023-06-15 12:16:59,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=135740.0, ans=0.125 2023-06-15 12:17:35,697 INFO [train.py:988] (1/4) Epoch 39, batch 150, loss[loss=0.2509, simple_loss=0.3393, pruned_loss=0.08126, over 10907.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2911, pruned_loss=0.0647, over 2012252.62 frames. ], batch size: 31, lr: 8.49e-03, grad_scale: 16.0 2023-06-15 12:17:41,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=135873.33333333334, ans=0.0 2023-06-15 12:18:16,846 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.68 vs. limit=6.0 2023-06-15 12:18:35,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=136073.33333333334, ans=0.125 2023-06-15 12:18:40,832 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.65 vs. limit=22.5 2023-06-15 12:18:52,162 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.763e+02 2.014e+02 2.292e+02 3.195e+02, threshold=4.028e+02, percent-clipped=0.0 2023-06-15 12:19:03,517 INFO [train.py:988] (1/4) Epoch 39, batch 200, loss[loss=0.1996, simple_loss=0.2528, pruned_loss=0.07314, over 16924.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2902, pruned_loss=0.06418, over 2404404.28 frames. ], batch size: 391, lr: 8.48e-03, grad_scale: 16.0 2023-06-15 12:19:14,983 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.80 vs. 
limit=15.0 2023-06-15 12:19:46,024 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.73 vs. limit=15.0 2023-06-15 12:20:31,131 INFO [train.py:988] (1/4) Epoch 39, batch 250, loss[loss=0.2053, simple_loss=0.2888, pruned_loss=0.06087, over 18762.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2902, pruned_loss=0.06416, over 2713993.94 frames. ], batch size: 83, lr: 8.47e-03, grad_scale: 16.0 2023-06-15 12:20:58,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=136606.66666666666, ans=0.125 2023-06-15 12:21:11,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=136673.33333333334, ans=0.125 2023-06-15 12:21:12,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=136673.33333333334, ans=0.125 2023-06-15 12:21:17,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=136673.33333333334, ans=0.0 2023-06-15 12:21:28,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=136740.0, ans=0.2 2023-06-15 12:21:43,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=136806.66666666666, ans=0.125 2023-06-15 12:21:48,758 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.779e+02 1.952e+02 2.172e+02 3.259e+02, threshold=3.903e+02, percent-clipped=0.0 2023-06-15 12:21:56,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=136806.66666666666, ans=0.05 2023-06-15 12:21:59,970 INFO [train.py:988] (1/4) Epoch 39, batch 300, loss[loss=0.1881, simple_loss=0.2784, pruned_loss=0.04888, over 19682.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.29, pruned_loss=0.06373, over 2945369.53 frames. ], batch size: 110, lr: 8.46e-03, grad_scale: 16.0 2023-06-15 12:22:04,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=136873.33333333334, ans=0.0 2023-06-15 12:22:09,407 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=136873.33333333334, ans=0.125 2023-06-15 12:22:31,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=136940.0, ans=0.125 2023-06-15 12:23:01,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=137073.33333333334, ans=0.1 2023-06-15 12:23:26,371 INFO [train.py:988] (1/4) Epoch 39, batch 350, loss[loss=0.1965, simple_loss=0.2828, pruned_loss=0.05515, over 19772.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2896, pruned_loss=0.06395, over 3120472.46 frames. 
], batch size: 115, lr: 8.45e-03, grad_scale: 16.0 2023-06-15 12:23:26,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=137206.66666666666, ans=0.125 2023-06-15 12:23:30,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=137206.66666666666, ans=0.1 2023-06-15 12:23:37,365 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.59 vs. limit=15.0 2023-06-15 12:23:39,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=137206.66666666666, ans=0.1 2023-06-15 12:23:46,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=137273.33333333334, ans=0.2 2023-06-15 12:24:04,869 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=22.5 2023-06-15 12:24:06,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=137340.0, ans=0.125 2023-06-15 12:24:38,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=137473.33333333334, ans=0.125 2023-06-15 12:24:44,643 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.611e+02 1.816e+02 2.065e+02 2.366e+02 3.841e+02, threshold=4.130e+02, percent-clipped=0.0 2023-06-15 12:24:49,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=137473.33333333334, ans=0.125 2023-06-15 12:24:55,366 INFO [train.py:988] (1/4) Epoch 39, batch 400, loss[loss=0.2212, simple_loss=0.2865, pruned_loss=0.07798, over 20207.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2893, pruned_loss=0.06409, over 3264418.07 frames. ], batch size: 239, lr: 8.44e-03, grad_scale: 32.0 2023-06-15 12:25:08,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=137540.0, ans=0.2 2023-06-15 12:25:34,074 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.54 vs. limit=22.5 2023-06-15 12:25:41,482 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.16 vs. limit=15.0 2023-06-15 12:26:24,761 INFO [train.py:988] (1/4) Epoch 39, batch 450, loss[loss=0.1882, simple_loss=0.2782, pruned_loss=0.04913, over 19686.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2895, pruned_loss=0.06385, over 3385579.52 frames. 
], batch size: 110, lr: 8.44e-03, grad_scale: 16.0 2023-06-15 12:26:31,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=137873.33333333334, ans=0.0 2023-06-15 12:26:46,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=137940.0, ans=0.125 2023-06-15 12:26:46,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=137940.0, ans=0.125 2023-06-15 12:26:56,928 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.34 vs. limit=15.0 2023-06-15 12:27:15,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=138073.33333333334, ans=0.1 2023-06-15 12:27:41,496 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.826e+02 2.179e+02 2.454e+02 3.798e+02, threshold=4.358e+02, percent-clipped=0.0 2023-06-15 12:27:49,806 INFO [train.py:988] (1/4) Epoch 39, batch 500, loss[loss=0.194, simple_loss=0.2805, pruned_loss=0.05379, over 18797.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2897, pruned_loss=0.06391, over 3472450.44 frames. ], batch size: 83, lr: 8.43e-03, grad_scale: 16.0 2023-06-15 12:28:12,816 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.16 vs. limit=15.0 2023-06-15 12:28:30,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=138340.0, ans=0.125 2023-06-15 12:28:31,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=138340.0, ans=0.2 2023-06-15 12:29:07,667 INFO [train.py:988] (1/4) Epoch 40, batch 0, loss[loss=0.2239, simple_loss=0.302, pruned_loss=0.07292, over 20278.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.302, pruned_loss=0.07292, over 20278.00 frames. ], batch size: 141, lr: 8.31e-03, grad_scale: 32.0 2023-06-15 12:29:07,668 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 12:29:13,805 INFO [train.py:1020] (1/4) Epoch 40, validation: loss=0.2011, simple_loss=0.3008, pruned_loss=0.05073, over 143649.00 frames. 2023-06-15 12:29:13,807 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 12:29:19,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=138420.0, ans=0.0 2023-06-15 12:29:54,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=138553.33333333334, ans=0.125 2023-06-15 12:30:07,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=138620.0, ans=0.125 2023-06-15 12:30:22,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=138686.66666666666, ans=0.0 2023-06-15 12:30:27,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=138686.66666666666, ans=0.125 2023-06-15 12:30:42,335 INFO [train.py:988] (1/4) Epoch 40, batch 50, loss[loss=0.202, simple_loss=0.2896, pruned_loss=0.05722, over 19113.00 frames. 
], tot_loss[loss=0.2105, simple_loss=0.2899, pruned_loss=0.06551, over 848220.58 frames. ], batch size: 94, lr: 8.31e-03, grad_scale: 32.0 2023-06-15 12:30:50,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=138753.33333333334, ans=0.035 2023-06-15 12:31:05,858 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.755e+02 2.069e+02 2.333e+02 3.346e+02, threshold=4.138e+02, percent-clipped=0.0 2023-06-15 12:31:13,003 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.98 vs. limit=15.0 2023-06-15 12:31:20,827 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.97 vs. limit=12.0 2023-06-15 12:31:31,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=138886.66666666666, ans=0.2 2023-06-15 12:31:34,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=138953.33333333334, ans=0.125 2023-06-15 12:32:12,100 INFO [train.py:988] (1/4) Epoch 40, batch 100, loss[loss=0.2143, simple_loss=0.2769, pruned_loss=0.0759, over 19983.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2881, pruned_loss=0.06461, over 1520721.15 frames. ], batch size: 293, lr: 8.30e-03, grad_scale: 32.0 2023-06-15 12:32:27,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=139153.33333333334, ans=0.125 2023-06-15 12:32:27,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=139153.33333333334, ans=0.125 2023-06-15 12:32:53,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=139220.0, ans=0.0 2023-06-15 12:33:00,071 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.81 vs. limit=6.0 2023-06-15 12:33:26,526 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.46 vs. limit=22.5 2023-06-15 12:33:27,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=139353.33333333334, ans=0.125 2023-06-15 12:33:32,450 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=139353.33333333334, ans=0.125 2023-06-15 12:33:38,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=139353.33333333334, ans=0.025 2023-06-15 12:33:41,070 INFO [train.py:988] (1/4) Epoch 40, batch 150, loss[loss=0.2064, simple_loss=0.2861, pruned_loss=0.06337, over 18938.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2883, pruned_loss=0.06438, over 2025581.20 frames. ], batch size: 86, lr: 8.29e-03, grad_scale: 32.0 2023-06-15 12:33:47,356 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.67 vs. limit=6.0 2023-06-15 12:33:55,491 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.64 vs. 
limit=15.0 2023-06-15 12:34:00,851 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.95 vs. limit=12.0 2023-06-15 12:34:03,180 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.849e+02 1.991e+02 2.229e+02 4.188e+02, threshold=3.982e+02, percent-clipped=1.0 2023-06-15 12:34:42,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=139620.0, ans=0.125 2023-06-15 12:35:09,532 INFO [train.py:988] (1/4) Epoch 40, batch 200, loss[loss=0.2017, simple_loss=0.2903, pruned_loss=0.05657, over 18652.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2867, pruned_loss=0.06371, over 2435811.94 frames. ], batch size: 80, lr: 8.28e-03, grad_scale: 32.0 2023-06-15 12:35:14,756 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-06-15 12:35:19,816 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.80 vs. limit=22.5 2023-06-15 12:35:27,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=139820.0, ans=0.125 2023-06-15 12:35:42,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=139820.0, ans=0.1 2023-06-15 12:35:54,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=139886.66666666666, ans=0.0 2023-06-15 12:36:01,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=139953.33333333334, ans=0.125 2023-06-15 12:36:33,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=140020.0, ans=0.125 2023-06-15 12:36:33,787 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.71 vs. limit=15.0 2023-06-15 12:36:37,811 INFO [train.py:988] (1/4) Epoch 40, batch 250, loss[loss=0.2077, simple_loss=0.2836, pruned_loss=0.06591, over 19977.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2878, pruned_loss=0.0633, over 2746910.24 frames. ], batch size: 126, lr: 8.27e-03, grad_scale: 32.0 2023-06-15 12:36:39,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=140086.66666666666, ans=0.125 2023-06-15 12:36:48,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=140086.66666666666, ans=0.125 2023-06-15 12:36:49,609 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.85 vs. limit=22.5 2023-06-15 12:37:01,100 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.820e+02 2.078e+02 2.421e+02 4.152e+02, threshold=4.155e+02, percent-clipped=1.0 2023-06-15 12:37:36,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=140286.66666666666, ans=0.125 2023-06-15 12:38:08,103 INFO [train.py:988] (1/4) Epoch 40, batch 300, loss[loss=0.2108, simple_loss=0.2993, pruned_loss=0.06118, over 16795.00 frames. 
], tot_loss[loss=0.2071, simple_loss=0.2884, pruned_loss=0.06289, over 2975975.11 frames. ], batch size: 59, lr: 8.26e-03, grad_scale: 32.0 2023-06-15 12:38:08,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=140420.0, ans=0.1 2023-06-15 12:38:24,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=140486.66666666666, ans=0.125 2023-06-15 12:38:26,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=140486.66666666666, ans=0.125 2023-06-15 12:38:29,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=140486.66666666666, ans=0.04949747468305833 2023-06-15 12:39:08,177 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.72 vs. limit=15.0 2023-06-15 12:39:38,309 INFO [train.py:988] (1/4) Epoch 40, batch 350, loss[loss=0.1986, simple_loss=0.2791, pruned_loss=0.05902, over 19311.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2891, pruned_loss=0.06291, over 3164152.68 frames. ], batch size: 98, lr: 8.25e-03, grad_scale: 32.0 2023-06-15 12:39:38,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=140753.33333333334, ans=0.0 2023-06-15 12:39:39,489 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.01 vs. limit=22.5 2023-06-15 12:40:01,670 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.757e+02 1.917e+02 2.241e+02 2.935e+02, threshold=3.834e+02, percent-clipped=0.0 2023-06-15 12:40:03,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=140820.0, ans=0.125 2023-06-15 12:40:22,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=140886.66666666666, ans=0.125 2023-06-15 12:40:27,963 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-06-15 12:40:33,032 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.57 vs. limit=15.0 2023-06-15 12:40:33,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=140953.33333333334, ans=15.0 2023-06-15 12:40:34,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=140953.33333333334, ans=0.0 2023-06-15 12:40:48,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=141020.0, ans=0.125 2023-06-15 12:40:50,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=141020.0, ans=0.0 2023-06-15 12:40:56,534 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.96 vs. 
limit=22.5 2023-06-15 12:41:08,539 INFO [train.py:988] (1/4) Epoch 40, batch 400, loss[loss=0.2295, simple_loss=0.3149, pruned_loss=0.07204, over 18256.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2891, pruned_loss=0.06309, over 3305371.18 frames. ], batch size: 74, lr: 8.24e-03, grad_scale: 32.0 2023-06-15 12:41:10,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=141086.66666666666, ans=0.125 2023-06-15 12:41:31,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=141153.33333333334, ans=0.1 2023-06-15 12:41:33,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=141153.33333333334, ans=0.125 2023-06-15 12:41:44,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=141220.0, ans=15.0 2023-06-15 12:41:45,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=141220.0, ans=0.025 2023-06-15 12:42:16,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=141353.33333333334, ans=0.0 2023-06-15 12:42:36,458 INFO [train.py:988] (1/4) Epoch 40, batch 450, loss[loss=0.2255, simple_loss=0.3162, pruned_loss=0.06737, over 15548.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2885, pruned_loss=0.06276, over 3419480.92 frames. ], batch size: 44, lr: 8.24e-03, grad_scale: 32.0 2023-06-15 12:42:59,295 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.768e+02 1.885e+02 2.206e+02 3.327e+02, threshold=3.770e+02, percent-clipped=0.0 2023-06-15 12:43:19,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=141553.33333333334, ans=0.05 2023-06-15 12:43:22,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=141553.33333333334, ans=0.125 2023-06-15 12:43:36,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=141620.0, ans=0.125 2023-06-15 12:43:43,384 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.21 vs. limit=22.5 2023-06-15 12:43:54,761 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.91 vs. limit=10.0 2023-06-15 12:44:00,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=141686.66666666666, ans=0.1 2023-06-15 12:44:00,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=141686.66666666666, ans=0.0 2023-06-15 12:44:03,575 INFO [train.py:988] (1/4) Epoch 40, batch 500, loss[loss=0.2047, simple_loss=0.2831, pruned_loss=0.06311, over 20426.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2879, pruned_loss=0.063, over 3505741.39 frames. 
], batch size: 160, lr: 8.23e-03, grad_scale: 32.0 2023-06-15 12:44:15,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=141753.33333333334, ans=0.0 2023-06-15 12:44:20,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=141820.0, ans=0.125 2023-06-15 12:44:27,805 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=12.0 2023-06-15 12:44:28,909 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-15 12:44:35,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=141886.66666666666, ans=0.0 2023-06-15 12:44:51,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=141953.33333333334, ans=0.0 2023-06-15 12:45:20,953 INFO [train.py:988] (1/4) Epoch 41, batch 0, loss[loss=0.2017, simple_loss=0.2848, pruned_loss=0.05929, over 19195.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2848, pruned_loss=0.05929, over 19195.00 frames. ], batch size: 92, lr: 8.12e-03, grad_scale: 32.0 2023-06-15 12:45:20,954 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 12:45:28,024 INFO [train.py:1020] (1/4) Epoch 41, validation: loss=0.2002, simple_loss=0.2999, pruned_loss=0.05026, over 143649.00 frames. 2023-06-15 12:45:28,025 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 12:45:45,219 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.51 vs. limit=12.0 2023-06-15 12:45:46,543 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.14 vs. limit=15.0 2023-06-15 12:46:12,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=142106.66666666666, ans=0.07 2023-06-15 12:46:18,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=142106.66666666666, ans=0.0 2023-06-15 12:46:21,184 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.355e+02 1.817e+02 2.110e+02 2.443e+02 3.477e+02, threshold=4.219e+02, percent-clipped=0.0 2023-06-15 12:46:21,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=142173.33333333334, ans=0.2 2023-06-15 12:46:41,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=142240.0, ans=0.0 2023-06-15 12:46:57,289 INFO [train.py:988] (1/4) Epoch 41, batch 50, loss[loss=0.2021, simple_loss=0.2893, pruned_loss=0.05749, over 18646.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2866, pruned_loss=0.06279, over 855755.96 frames. ], batch size: 80, lr: 8.11e-03, grad_scale: 32.0 2023-06-15 12:47:12,619 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.54 vs. 
limit=15.0 2023-06-15 12:47:19,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=142373.33333333334, ans=0.125 2023-06-15 12:47:23,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=142373.33333333334, ans=0.125 2023-06-15 12:47:25,240 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.20 vs. limit=12.0 2023-06-15 12:47:43,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=142440.0, ans=0.125 2023-06-15 12:48:05,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=142506.66666666666, ans=0.2 2023-06-15 12:48:08,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=142573.33333333334, ans=0.1 2023-06-15 12:48:18,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=142573.33333333334, ans=0.125 2023-06-15 12:48:25,565 INFO [train.py:988] (1/4) Epoch 41, batch 100, loss[loss=0.2166, simple_loss=0.2811, pruned_loss=0.07606, over 19834.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2881, pruned_loss=0.0634, over 1495432.79 frames. ], batch size: 293, lr: 8.10e-03, grad_scale: 32.0 2023-06-15 12:48:46,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=142706.66666666666, ans=0.0 2023-06-15 12:48:50,057 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=22.5 2023-06-15 12:48:59,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=142773.33333333334, ans=0.125 2023-06-15 12:49:06,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=142773.33333333334, ans=0.125 2023-06-15 12:49:15,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=142773.33333333334, ans=0.125 2023-06-15 12:49:16,684 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.21 vs. limit=15.0 2023-06-15 12:49:18,941 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.448e+02 1.848e+02 2.100e+02 2.504e+02 3.647e+02, threshold=4.200e+02, percent-clipped=0.0 2023-06-15 12:49:22,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=142840.0, ans=0.125 2023-06-15 12:49:53,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=142973.33333333334, ans=0.125 2023-06-15 12:49:54,453 INFO [train.py:988] (1/4) Epoch 41, batch 150, loss[loss=0.2052, simple_loss=0.3065, pruned_loss=0.052, over 15623.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2881, pruned_loss=0.06298, over 2014635.23 frames. 
], batch size: 44, lr: 8.09e-03, grad_scale: 32.0 2023-06-15 12:50:26,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=143040.0, ans=0.0 2023-06-15 12:50:26,755 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.90 vs. limit=15.0 2023-06-15 12:50:36,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=143106.66666666666, ans=0.09899494936611666 2023-06-15 12:51:24,270 INFO [train.py:988] (1/4) Epoch 41, batch 200, loss[loss=0.1993, simple_loss=0.2863, pruned_loss=0.05611, over 19344.00 frames. ], tot_loss[loss=0.207, simple_loss=0.288, pruned_loss=0.06303, over 2413422.08 frames. ], batch size: 98, lr: 8.09e-03, grad_scale: 32.0 2023-06-15 12:51:26,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=143306.66666666666, ans=0.0 2023-06-15 12:51:37,829 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.29 vs. limit=15.0 2023-06-15 12:51:46,578 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.72 vs. limit=6.0 2023-06-15 12:51:58,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=143440.0, ans=0.0 2023-06-15 12:52:17,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=143506.66666666666, ans=10.0 2023-06-15 12:52:18,417 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.761e+02 1.969e+02 2.341e+02 3.526e+02, threshold=3.938e+02, percent-clipped=0.0 2023-06-15 12:52:30,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=143506.66666666666, ans=0.125 2023-06-15 12:52:44,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=143573.33333333334, ans=0.125 2023-06-15 12:52:54,199 INFO [train.py:988] (1/4) Epoch 41, batch 250, loss[loss=0.1965, simple_loss=0.2813, pruned_loss=0.05591, over 19390.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2901, pruned_loss=0.06279, over 2711407.42 frames. ], batch size: 98, lr: 8.08e-03, grad_scale: 32.0 2023-06-15 12:54:08,341 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.03 vs. limit=15.0 2023-06-15 12:54:24,708 INFO [train.py:988] (1/4) Epoch 41, batch 300, loss[loss=0.2068, simple_loss=0.2869, pruned_loss=0.0633, over 18608.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2897, pruned_loss=0.06263, over 2956793.45 frames. ], batch size: 80, lr: 8.07e-03, grad_scale: 32.0 2023-06-15 12:54:43,749 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.13 vs. limit=15.0 2023-06-15 12:54:45,486 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.13 vs. 
limit=10.0 2023-06-15 12:55:19,255 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.842e+02 2.030e+02 2.348e+02 3.359e+02, threshold=4.059e+02, percent-clipped=0.0 2023-06-15 12:55:32,565 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.80 vs. limit=15.0 2023-06-15 12:55:38,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=144240.0, ans=0.125 2023-06-15 12:55:39,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=144240.0, ans=0.2 2023-06-15 12:55:52,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=144240.0, ans=0.125 2023-06-15 12:55:54,944 INFO [train.py:988] (1/4) Epoch 41, batch 350, loss[loss=0.2009, simple_loss=0.2828, pruned_loss=0.05953, over 20289.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2896, pruned_loss=0.06274, over 3145463.91 frames. ], batch size: 141, lr: 8.06e-03, grad_scale: 32.0 2023-06-15 12:55:58,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=144306.66666666666, ans=0.2 2023-06-15 12:56:09,069 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.44 vs. limit=10.0 2023-06-15 12:56:41,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=144440.0, ans=0.0 2023-06-15 12:56:54,071 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.00 vs. limit=10.0 2023-06-15 12:57:14,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=144573.33333333334, ans=0.125 2023-06-15 12:57:25,294 INFO [train.py:988] (1/4) Epoch 41, batch 400, loss[loss=0.2229, simple_loss=0.2668, pruned_loss=0.08946, over 16911.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2888, pruned_loss=0.06243, over 3298348.06 frames. ], batch size: 391, lr: 8.05e-03, grad_scale: 32.0 2023-06-15 12:57:30,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=144640.0, ans=0.0 2023-06-15 12:57:35,677 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.67 vs. limit=15.0 2023-06-15 12:57:36,769 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 12:57:41,016 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.91 vs. 
limit=6.0 2023-06-15 12:57:53,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=144706.66666666666, ans=0.125 2023-06-15 12:58:01,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten.whitening_limit, batch_count=144773.33333333334, ans=15.0 2023-06-15 12:58:17,901 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.778e+02 1.933e+02 2.211e+02 3.033e+02, threshold=3.866e+02, percent-clipped=0.0 2023-06-15 12:58:30,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=144840.0, ans=0.125 2023-06-15 12:58:32,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=144840.0, ans=0.125 2023-06-15 12:58:53,411 INFO [train.py:988] (1/4) Epoch 41, batch 450, loss[loss=0.2056, simple_loss=0.2843, pruned_loss=0.06341, over 20104.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2883, pruned_loss=0.0623, over 3404149.05 frames. ], batch size: 133, lr: 8.04e-03, grad_scale: 32.0 2023-06-15 12:58:55,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=144973.33333333334, ans=0.125 2023-06-15 12:59:18,300 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.02 vs. limit=15.0 2023-06-15 12:59:19,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=145040.0, ans=0.125 2023-06-15 12:59:47,212 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.64 vs. limit=15.0 2023-06-15 13:00:03,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=145240.0, ans=0.125 2023-06-15 13:00:17,855 INFO [train.py:988] (1/4) Epoch 41, batch 500, loss[loss=0.2151, simple_loss=0.288, pruned_loss=0.07109, over 20569.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2873, pruned_loss=0.06232, over 3472501.22 frames. ], batch size: 173, lr: 8.04e-03, grad_scale: 32.0 2023-06-15 13:00:21,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=145306.66666666666, ans=0.0 2023-06-15 13:01:01,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=145440.0, ans=0.125 2023-06-15 13:01:01,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=145440.0, ans=0.09899494936611666 2023-06-15 13:01:03,614 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.35 vs. 
limit=15.0 2023-06-15 13:01:07,583 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.500e+02 1.783e+02 1.958e+02 2.209e+02 2.904e+02, threshold=3.915e+02, percent-clipped=0.0 2023-06-15 13:01:07,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=145506.66666666666, ans=0.2 2023-06-15 13:01:34,508 INFO [train.py:988] (1/4) Epoch 42, batch 0, loss[loss=0.2174, simple_loss=0.2972, pruned_loss=0.06885, over 19965.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2972, pruned_loss=0.06885, over 19965.00 frames. ], batch size: 126, lr: 7.93e-03, grad_scale: 32.0 2023-06-15 13:01:34,509 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 13:01:40,652 INFO [train.py:1020] (1/4) Epoch 42, validation: loss=0.1999, simple_loss=0.2992, pruned_loss=0.05028, over 143649.00 frames. 2023-06-15 13:01:40,652 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 13:01:54,194 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.53 vs. limit=15.0 2023-06-15 13:01:59,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=145586.66666666666, ans=0.125 2023-06-15 13:02:07,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=145586.66666666666, ans=10.0 2023-06-15 13:02:07,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=145586.66666666666, ans=6.0 2023-06-15 13:02:08,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=145586.66666666666, ans=0.125 2023-06-15 13:02:19,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=145653.33333333334, ans=0.0 2023-06-15 13:02:19,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=145653.33333333334, ans=0.0 2023-06-15 13:03:10,732 INFO [train.py:988] (1/4) Epoch 42, batch 50, loss[loss=0.2193, simple_loss=0.2767, pruned_loss=0.08094, over 20004.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.286, pruned_loss=0.06177, over 842366.14 frames. ], batch size: 293, lr: 7.93e-03, grad_scale: 32.0 2023-06-15 13:03:47,696 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.35 vs. limit=15.0 2023-06-15 13:04:08,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=146053.33333333334, ans=0.1 2023-06-15 13:04:08,924 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.85 vs. limit=15.0 2023-06-15 13:04:28,192 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.30 vs. 
limit=15.0 2023-06-15 13:04:35,829 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.815e+02 2.002e+02 2.267e+02 3.167e+02, threshold=4.003e+02, percent-clipped=0.0 2023-06-15 13:04:39,177 INFO [train.py:988] (1/4) Epoch 42, batch 100, loss[loss=0.2184, simple_loss=0.3035, pruned_loss=0.06665, over 18275.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2883, pruned_loss=0.06153, over 1488243.59 frames. ], batch size: 74, lr: 7.92e-03, grad_scale: 32.0 2023-06-15 13:04:49,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=146186.66666666666, ans=0.0 2023-06-15 13:05:30,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=146386.66666666666, ans=0.125 2023-06-15 13:05:48,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=146453.33333333334, ans=0.2 2023-06-15 13:06:08,126 INFO [train.py:988] (1/4) Epoch 42, batch 150, loss[loss=0.2144, simple_loss=0.3066, pruned_loss=0.06107, over 17666.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2875, pruned_loss=0.06092, over 2006151.99 frames. ], batch size: 67, lr: 7.91e-03, grad_scale: 32.0 2023-06-15 13:06:10,404 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 13:06:33,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=146586.66666666666, ans=0.2 2023-06-15 13:06:38,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=146586.66666666666, ans=0.125 2023-06-15 13:06:57,000 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-15 13:07:20,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=146786.66666666666, ans=0.0 2023-06-15 13:07:28,605 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.12 vs. limit=15.0 2023-06-15 13:07:34,334 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.782e+02 1.960e+02 2.224e+02 3.500e+02, threshold=3.921e+02, percent-clipped=0.0 2023-06-15 13:07:37,719 INFO [train.py:988] (1/4) Epoch 42, batch 200, loss[loss=0.2337, simple_loss=0.3271, pruned_loss=0.07015, over 17618.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2884, pruned_loss=0.06116, over 2387972.78 frames. 
], batch size: 67, lr: 7.90e-03, grad_scale: 32.0 2023-06-15 13:07:58,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=146920.0, ans=0.0 2023-06-15 13:08:44,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=147053.33333333334, ans=0.125 2023-06-15 13:08:45,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=147053.33333333334, ans=0.02 2023-06-15 13:08:51,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=147120.0, ans=0.1 2023-06-15 13:09:05,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=147186.66666666666, ans=0.125 2023-06-15 13:09:07,790 INFO [train.py:988] (1/4) Epoch 42, batch 250, loss[loss=0.2236, simple_loss=0.3099, pruned_loss=0.06868, over 18434.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2893, pruned_loss=0.06159, over 2700230.46 frames. ], batch size: 77, lr: 7.89e-03, grad_scale: 32.0 2023-06-15 13:09:10,317 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.26 vs. limit=15.0 2023-06-15 13:09:30,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_na.min_abs, batch_count=147253.33333333334, ans=0.02 2023-06-15 13:09:32,552 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.42 vs. limit=12.0 2023-06-15 13:09:52,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=147320.0, ans=0.1 2023-06-15 13:10:18,351 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 13:10:32,066 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.737e+02 1.907e+02 2.072e+02 2.821e+02, threshold=3.814e+02, percent-clipped=0.0 2023-06-15 13:10:32,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=147453.33333333334, ans=0.125 2023-06-15 13:10:36,352 INFO [train.py:988] (1/4) Epoch 42, batch 300, loss[loss=0.2134, simple_loss=0.2915, pruned_loss=0.06764, over 18943.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2882, pruned_loss=0.06219, over 2946939.47 frames. ], batch size: 86, lr: 7.88e-03, grad_scale: 32.0 2023-06-15 13:10:38,427 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=147520.0, ans=0.1 2023-06-15 13:10:54,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=147586.66666666666, ans=0.04949747468305833 2023-06-15 13:11:04,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=147586.66666666666, ans=0.0 2023-06-15 13:12:05,699 INFO [train.py:988] (1/4) Epoch 42, batch 350, loss[loss=0.1923, simple_loss=0.2811, pruned_loss=0.05178, over 19468.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2883, pruned_loss=0.06189, over 3113538.29 frames. 
], batch size: 105, lr: 7.88e-03, grad_scale: 32.0 2023-06-15 13:12:26,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=147920.0, ans=0.05 2023-06-15 13:12:30,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=147920.0, ans=0.125 2023-06-15 13:12:59,883 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-15 13:13:15,489 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.32 vs. limit=15.0 2023-06-15 13:13:30,807 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.764e+02 1.957e+02 2.256e+02 2.981e+02, threshold=3.914e+02, percent-clipped=0.0 2023-06-15 13:13:34,296 INFO [train.py:988] (1/4) Epoch 42, batch 400, loss[loss=0.2264, simple_loss=0.2974, pruned_loss=0.07768, over 20292.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2875, pruned_loss=0.06183, over 3256828.56 frames. ], batch size: 149, lr: 7.87e-03, grad_scale: 32.0 2023-06-15 13:13:45,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=148186.66666666666, ans=0.125 2023-06-15 13:14:10,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=148320.0, ans=0.125 2023-06-15 13:14:30,174 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.85 vs. limit=15.0 2023-06-15 13:14:37,864 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.86 vs. limit=6.0 2023-06-15 13:15:00,264 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 13:15:03,387 INFO [train.py:988] (1/4) Epoch 42, batch 450, loss[loss=0.2083, simple_loss=0.2572, pruned_loss=0.07973, over 16919.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2869, pruned_loss=0.06167, over 3369055.79 frames. ], batch size: 391, lr: 7.86e-03, grad_scale: 32.0 2023-06-15 13:15:14,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=148520.0, ans=0.035 2023-06-15 13:15:16,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=148520.0, ans=0.125 2023-06-15 13:15:18,886 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.00 vs. 
limit=15.0 2023-06-15 13:15:30,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=148586.66666666666, ans=0.125 2023-06-15 13:16:23,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=148786.66666666666, ans=0.0 2023-06-15 13:16:26,114 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.894e+02 2.076e+02 2.350e+02 3.042e+02, threshold=4.151e+02, percent-clipped=0.0 2023-06-15 13:16:26,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=148786.66666666666, ans=0.1 2023-06-15 13:16:29,399 INFO [train.py:988] (1/4) Epoch 42, batch 500, loss[loss=0.2234, simple_loss=0.32, pruned_loss=0.06347, over 15208.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2863, pruned_loss=0.06178, over 3467618.41 frames. ], batch size: 43, lr: 7.85e-03, grad_scale: 32.0 2023-06-15 13:16:40,433 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.85 vs. limit=15.0 2023-06-15 13:17:02,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=148986.66666666666, ans=0.125 2023-06-15 13:17:16,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=148986.66666666666, ans=0.125 2023-06-15 13:17:18,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=149053.33333333334, ans=0.1 2023-06-15 13:17:19,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=149053.33333333334, ans=0.125 2023-06-15 13:17:51,441 INFO [train.py:988] (1/4) Epoch 43, batch 0, loss[loss=0.198, simple_loss=0.2778, pruned_loss=0.05915, over 20660.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2778, pruned_loss=0.05915, over 20660.00 frames. ], batch size: 211, lr: 7.76e-03, grad_scale: 32.0 2023-06-15 13:17:51,441 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 13:17:57,702 INFO [train.py:1020] (1/4) Epoch 43, validation: loss=0.2014, simple_loss=0.3004, pruned_loss=0.05115, over 143649.00 frames. 2023-06-15 13:17:57,703 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 13:18:18,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=149140.0, ans=0.1 2023-06-15 13:18:42,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=149206.66666666666, ans=0.125 2023-06-15 13:18:48,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=149206.66666666666, ans=0.2 2023-06-15 13:19:09,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=149340.0, ans=0.125 2023-06-15 13:19:26,757 INFO [train.py:988] (1/4) Epoch 43, batch 50, loss[loss=0.1879, simple_loss=0.2802, pruned_loss=0.04785, over 19071.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2859, pruned_loss=0.05992, over 848048.39 frames. 
], batch size: 89, lr: 7.75e-03, grad_scale: 32.0 2023-06-15 13:19:53,436 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.770e+02 1.939e+02 2.280e+02 3.061e+02, threshold=3.878e+02, percent-clipped=0.0 2023-06-15 13:20:11,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=149540.0, ans=0.95 2023-06-15 13:20:16,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=149540.0, ans=0.125 2023-06-15 13:20:16,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=149540.0, ans=0.2 2023-06-15 13:20:22,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=149606.66666666666, ans=0.05 2023-06-15 13:20:48,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=149673.33333333334, ans=0.125 2023-06-15 13:20:55,041 INFO [train.py:988] (1/4) Epoch 43, batch 100, loss[loss=0.2063, simple_loss=0.2812, pruned_loss=0.06573, over 20519.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2851, pruned_loss=0.06055, over 1505903.40 frames. ], batch size: 189, lr: 7.74e-03, grad_scale: 32.0 2023-06-15 13:21:31,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=149873.33333333334, ans=0.125 2023-06-15 13:21:42,012 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.77 vs. limit=22.5 2023-06-15 13:21:46,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=149940.0, ans=0.0 2023-06-15 13:21:50,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=149940.0, ans=0.09899494936611666 2023-06-15 13:22:23,159 INFO [train.py:988] (1/4) Epoch 43, batch 150, loss[loss=0.2098, simple_loss=0.2962, pruned_loss=0.06169, over 18937.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2859, pruned_loss=0.06038, over 1985857.55 frames. ], batch size: 86, lr: 7.73e-03, grad_scale: 32.0 2023-06-15 13:22:29,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=150073.33333333334, ans=0.05 2023-06-15 13:22:50,683 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 1.767e+02 1.914e+02 2.114e+02 3.326e+02, threshold=3.828e+02, percent-clipped=0.0 2023-06-15 13:23:08,534 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.18 vs. limit=15.0 2023-06-15 13:23:15,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=150273.33333333334, ans=0.125 2023-06-15 13:23:50,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=150406.66666666666, ans=0.125 2023-06-15 13:23:51,960 INFO [train.py:988] (1/4) Epoch 43, batch 200, loss[loss=0.1973, simple_loss=0.2777, pruned_loss=0.05849, over 19769.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.285, pruned_loss=0.0616, over 2382200.70 frames. 
], batch size: 115, lr: 7.72e-03, grad_scale: 32.0 2023-06-15 13:23:59,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=150406.66666666666, ans=0.125 2023-06-15 13:24:01,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=150406.66666666666, ans=0.1 2023-06-15 13:24:03,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=150406.66666666666, ans=0.0 2023-06-15 13:24:08,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=150473.33333333334, ans=0.125 2023-06-15 13:24:33,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=150540.0, ans=0.04949747468305833 2023-06-15 13:24:48,141 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.00 vs. limit=22.5 2023-06-15 13:24:49,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=150606.66666666666, ans=0.0 2023-06-15 13:25:02,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=150673.33333333334, ans=0.0 2023-06-15 13:25:05,326 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.33 vs. limit=15.0 2023-06-15 13:25:20,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=150740.0, ans=0.2 2023-06-15 13:25:21,390 INFO [train.py:988] (1/4) Epoch 43, batch 250, loss[loss=0.2208, simple_loss=0.3113, pruned_loss=0.06513, over 16267.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2852, pruned_loss=0.06085, over 2692022.53 frames. ], batch size: 52, lr: 7.72e-03, grad_scale: 32.0 2023-06-15 13:25:40,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=150806.66666666666, ans=0.0 2023-06-15 13:25:47,993 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.795e+02 2.042e+02 2.211e+02 3.400e+02, threshold=4.084e+02, percent-clipped=0.0 2023-06-15 13:25:50,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=150806.66666666666, ans=0.0 2023-06-15 13:26:03,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=150873.33333333334, ans=0.125 2023-06-15 13:26:44,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=151006.66666666666, ans=0.125 2023-06-15 13:26:47,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=151006.66666666666, ans=0.125 2023-06-15 13:26:50,352 INFO [train.py:988] (1/4) Epoch 43, batch 300, loss[loss=0.1951, simple_loss=0.28, pruned_loss=0.05515, over 19100.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2851, pruned_loss=0.06065, over 2931626.36 frames. 
], batch size: 94, lr: 7.71e-03, grad_scale: 32.0 2023-06-15 13:26:50,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=151073.33333333334, ans=0.125 2023-06-15 13:26:57,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=151073.33333333334, ans=0.2 2023-06-15 13:27:26,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=151206.66666666666, ans=0.1 2023-06-15 13:28:04,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=151340.0, ans=0.125 2023-06-15 13:28:17,965 INFO [train.py:988] (1/4) Epoch 43, batch 350, loss[loss=0.2043, simple_loss=0.2883, pruned_loss=0.06012, over 19353.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2855, pruned_loss=0.06123, over 3123864.42 frames. ], batch size: 98, lr: 7.70e-03, grad_scale: 64.0 2023-06-15 13:28:44,433 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.378e+02 1.733e+02 1.920e+02 2.080e+02 2.736e+02, threshold=3.841e+02, percent-clipped=0.0 2023-06-15 13:29:25,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=151606.66666666666, ans=0.125 2023-06-15 13:29:47,187 INFO [train.py:988] (1/4) Epoch 43, batch 400, loss[loss=0.2111, simple_loss=0.2957, pruned_loss=0.06321, over 18798.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2862, pruned_loss=0.06179, over 3270235.80 frames. ], batch size: 83, lr: 7.69e-03, grad_scale: 32.0 2023-06-15 13:30:27,621 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0 2023-06-15 13:30:42,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=151940.0, ans=0.1 2023-06-15 13:30:45,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=151940.0, ans=0.04949747468305833 2023-06-15 13:30:53,085 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.80 vs. limit=6.0 2023-06-15 13:31:16,481 INFO [train.py:988] (1/4) Epoch 43, batch 450, loss[loss=0.1737, simple_loss=0.2628, pruned_loss=0.04224, over 19812.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2859, pruned_loss=0.06139, over 3361916.46 frames. ], batch size: 115, lr: 7.69e-03, grad_scale: 32.0 2023-06-15 13:31:17,640 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=12.0 2023-06-15 13:31:43,598 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.899e+02 2.101e+02 2.447e+02 3.803e+02, threshold=4.202e+02, percent-clipped=0.0 2023-06-15 13:31:50,660 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.01 vs. 
limit=10.0 2023-06-15 13:32:00,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=152206.66666666666, ans=0.025 2023-06-15 13:32:34,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=152340.0, ans=0.2 2023-06-15 13:32:42,715 INFO [train.py:988] (1/4) Epoch 43, batch 500, loss[loss=0.1995, simple_loss=0.2882, pruned_loss=0.05541, over 19109.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2862, pruned_loss=0.06117, over 3443872.94 frames. ], batch size: 94, lr: 7.68e-03, grad_scale: 32.0 2023-06-15 13:33:03,697 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.54 vs. limit=15.0 2023-06-15 13:33:10,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=152473.33333333334, ans=0.2 2023-06-15 13:33:15,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=152540.0, ans=0.0 2023-06-15 13:34:02,662 INFO [train.py:988] (1/4) Epoch 44, batch 0, loss[loss=0.2035, simple_loss=0.2811, pruned_loss=0.06294, over 20142.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2811, pruned_loss=0.06294, over 20142.00 frames. ], batch size: 133, lr: 7.58e-03, grad_scale: 32.0 2023-06-15 13:34:02,663 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 13:34:08,925 INFO [train.py:1020] (1/4) Epoch 44, validation: loss=0.204, simple_loss=0.3011, pruned_loss=0.05343, over 143649.00 frames. 2023-06-15 13:34:08,925 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 13:34:27,616 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-15 13:34:38,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=152693.33333333334, ans=0.07 2023-06-15 13:34:38,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=152693.33333333334, ans=10.0 2023-06-15 13:35:06,298 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.836e+02 2.115e+02 2.307e+02 4.215e+02, threshold=4.230e+02, percent-clipped=1.0 2023-06-15 13:35:08,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=152826.66666666666, ans=0.1 2023-06-15 13:35:35,483 INFO [train.py:988] (1/4) Epoch 44, batch 50, loss[loss=0.1995, simple_loss=0.2841, pruned_loss=0.0574, over 19213.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2869, pruned_loss=0.05992, over 851133.32 frames. 
], batch size: 92, lr: 7.58e-03, grad_scale: 32.0 2023-06-15 13:35:40,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=152960.0, ans=0.125 2023-06-15 13:35:49,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=152960.0, ans=0.1 2023-06-15 13:35:52,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=153026.66666666666, ans=0.0 2023-06-15 13:35:54,793 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.31 vs. limit=15.0 2023-06-15 13:36:41,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=153160.0, ans=0.1 2023-06-15 13:36:45,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=153226.66666666666, ans=0.0 2023-06-15 13:37:03,158 INFO [train.py:988] (1/4) Epoch 44, batch 100, loss[loss=0.2045, simple_loss=0.2771, pruned_loss=0.06589, over 20256.00 frames. ], tot_loss[loss=0.205, simple_loss=0.287, pruned_loss=0.06148, over 1487725.77 frames. ], batch size: 239, lr: 7.57e-03, grad_scale: 32.0 2023-06-15 13:37:08,513 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.35 vs. limit=6.0 2023-06-15 13:37:11,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=153293.33333333334, ans=0.0 2023-06-15 13:37:23,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=153360.0, ans=0.2 2023-06-15 13:37:36,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=153360.0, ans=0.125 2023-06-15 13:37:46,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=153426.66666666666, ans=0.1 2023-06-15 13:38:03,099 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.855e+02 2.123e+02 2.461e+02 3.591e+02, threshold=4.246e+02, percent-clipped=0.0 2023-06-15 13:38:05,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=153493.33333333334, ans=0.125 2023-06-15 13:38:18,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=153560.0, ans=0.125 2023-06-15 13:38:21,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=153560.0, ans=0.0 2023-06-15 13:38:29,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=153560.0, ans=0.035 2023-06-15 13:38:31,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=153626.66666666666, ans=0.125 2023-06-15 13:38:32,994 INFO [train.py:988] (1/4) Epoch 44, batch 150, loss[loss=0.2022, simple_loss=0.2888, pruned_loss=0.05779, over 18311.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2852, pruned_loss=0.06069, over 1999499.89 frames. 
], batch size: 74, lr: 7.56e-03, grad_scale: 32.0 2023-06-15 13:39:22,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=153760.0, ans=0.125 2023-06-15 13:39:23,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=153760.0, ans=0.125 2023-06-15 13:39:47,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=153893.33333333334, ans=0.125 2023-06-15 13:40:02,722 INFO [train.py:988] (1/4) Epoch 44, batch 200, loss[loss=0.1983, simple_loss=0.2784, pruned_loss=0.05913, over 19968.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2856, pruned_loss=0.06083, over 2367681.95 frames. ], batch size: 126, lr: 7.56e-03, grad_scale: 32.0 2023-06-15 13:40:14,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=153960.0, ans=0.125 2023-06-15 13:40:22,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=154026.66666666666, ans=0.1 2023-06-15 13:41:02,108 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.737e+02 1.890e+02 2.056e+02 2.866e+02, threshold=3.780e+02, percent-clipped=0.0 2023-06-15 13:41:02,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=154160.0, ans=0.125 2023-06-15 13:41:30,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=154293.33333333334, ans=0.125 2023-06-15 13:41:32,065 INFO [train.py:988] (1/4) Epoch 44, batch 250, loss[loss=0.2001, simple_loss=0.283, pruned_loss=0.0586, over 18925.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2836, pruned_loss=0.06005, over 2685247.02 frames. ], batch size: 86, lr: 7.55e-03, grad_scale: 32.0 2023-06-15 13:41:39,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=154293.33333333334, ans=0.0 2023-06-15 13:41:54,971 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 13:42:21,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=154426.66666666666, ans=0.0 2023-06-15 13:42:32,785 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.61 vs. limit=10.0 2023-06-15 13:42:39,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=154493.33333333334, ans=0.125 2023-06-15 13:42:48,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=154560.0, ans=0.025 2023-06-15 13:43:00,303 INFO [train.py:988] (1/4) Epoch 44, batch 300, loss[loss=0.1892, simple_loss=0.2795, pruned_loss=0.04943, over 18793.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2844, pruned_loss=0.05992, over 2920629.59 frames. 
], batch size: 83, lr: 7.54e-03, grad_scale: 32.0 2023-06-15 13:43:24,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=154693.33333333334, ans=0.125 2023-06-15 13:43:34,949 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 13:43:35,134 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-06-15 13:43:48,152 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.93 vs. limit=6.0 2023-06-15 13:43:52,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=154826.66666666666, ans=0.125 2023-06-15 13:43:56,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=154826.66666666666, ans=0.025 2023-06-15 13:44:00,067 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.826e+02 2.061e+02 2.441e+02 3.264e+02, threshold=4.122e+02, percent-clipped=0.0 2023-06-15 13:44:30,153 INFO [train.py:988] (1/4) Epoch 44, batch 350, loss[loss=0.2074, simple_loss=0.2551, pruned_loss=0.07982, over 16945.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2846, pruned_loss=0.05948, over 3119942.21 frames. ], batch size: 392, lr: 7.53e-03, grad_scale: 32.0 2023-06-15 13:44:34,540 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.54 vs. limit=15.0 2023-06-15 13:44:44,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=154960.0, ans=0.1 2023-06-15 13:44:46,260 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.20 vs. limit=15.0 2023-06-15 13:45:14,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=155093.33333333334, ans=0.1 2023-06-15 13:45:20,742 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=155093.33333333334, ans=0.125 2023-06-15 13:45:22,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=155160.0, ans=0.0 2023-06-15 13:45:27,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=155160.0, ans=0.125 2023-06-15 13:45:59,552 INFO [train.py:988] (1/4) Epoch 44, batch 400, loss[loss=0.2073, simple_loss=0.2836, pruned_loss=0.0655, over 20277.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2856, pruned_loss=0.06027, over 3243048.40 frames. 
], batch size: 141, lr: 7.53e-03, grad_scale: 32.0 2023-06-15 13:46:08,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=155293.33333333334, ans=0.0 2023-06-15 13:46:13,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=155293.33333333334, ans=0.125 2023-06-15 13:46:27,423 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.31 vs. limit=10.0 2023-06-15 13:46:28,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=155360.0, ans=0.125 2023-06-15 13:46:28,919 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.50 vs. limit=15.0 2023-06-15 13:46:52,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=155493.33333333334, ans=0.125 2023-06-15 13:46:54,455 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.27 vs. limit=15.0 2023-06-15 13:46:57,030 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.791e+02 1.946e+02 2.236e+02 4.124e+02, threshold=3.892e+02, percent-clipped=1.0 2023-06-15 13:47:04,599 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 13:47:04,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=155493.33333333334, ans=0.125 2023-06-15 13:47:26,932 INFO [train.py:988] (1/4) Epoch 44, batch 450, loss[loss=0.2087, simple_loss=0.2921, pruned_loss=0.06261, over 18292.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2858, pruned_loss=0.06054, over 3359052.48 frames. ], batch size: 74, lr: 7.52e-03, grad_scale: 32.0 2023-06-15 13:47:27,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=155626.66666666666, ans=0.125 2023-06-15 13:47:42,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=155693.33333333334, ans=0.125 2023-06-15 13:47:47,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=155693.33333333334, ans=0.2 2023-06-15 13:47:52,592 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.28 vs. 
limit=15.0 2023-06-15 13:47:57,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=155693.33333333334, ans=0.1 2023-06-15 13:47:59,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=155693.33333333334, ans=0.2 2023-06-15 13:48:16,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten.whitening_limit, batch_count=155760.0, ans=15.0 2023-06-15 13:48:34,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=155893.33333333334, ans=0.125 2023-06-15 13:48:36,897 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.50 vs. limit=10.0 2023-06-15 13:48:47,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=155893.33333333334, ans=0.125 2023-06-15 13:48:52,721 INFO [train.py:988] (1/4) Epoch 44, batch 500, loss[loss=0.2052, simple_loss=0.3002, pruned_loss=0.05509, over 17661.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2855, pruned_loss=0.06007, over 3454683.28 frames. ], batch size: 67, lr: 7.51e-03, grad_scale: 32.0 2023-06-15 13:49:09,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=156026.66666666666, ans=0.125 2023-06-15 13:49:25,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=156093.33333333334, ans=0.2 2023-06-15 13:49:30,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=156093.33333333334, ans=0.1 2023-06-15 13:50:05,185 INFO [train.py:988] (1/4) Epoch 45, batch 0, loss[loss=0.1908, simple_loss=0.2765, pruned_loss=0.05258, over 19336.00 frames. ], tot_loss[loss=0.1908, simple_loss=0.2765, pruned_loss=0.05258, over 19336.00 frames. ], batch size: 98, lr: 7.42e-03, grad_scale: 32.0 2023-06-15 13:50:05,186 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 13:50:12,417 INFO [train.py:1020] (1/4) Epoch 45, validation: loss=0.2006, simple_loss=0.2992, pruned_loss=0.05105, over 143649.00 frames. 
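The per-batch [train.py:988] entries above report the current batch's loss[...] next to a tot_loss[...] accumulated "over N frames", and each validation entry (such as the Epoch 45 one just above) is likewise an average over 143649.00 frames. Below is a minimal sketch of a frame-weighted running average that would produce this style of reporting; the class and method names are illustrative assumptions, not the actual train.py code.

```python
# Illustrative sketch only: a frame-weighted running average consistent with
# the "tot_loss[... over N frames]" entries in this log. Class/field names
# are assumptions for the example, not icefall's train.py implementation.
class FrameWeightedLoss:
    def __init__(self):
        self.frames = 0.0
        self.sums = {"loss": 0.0, "simple_loss": 0.0, "pruned_loss": 0.0}

    def update(self, batch_losses, batch_frames):
        # Accumulate frame-weighted sums so each batch contributes in
        # proportion to the number of acoustic frames it contains.
        self.frames += batch_frames
        for name, value in batch_losses.items():
            self.sums[name] += value * batch_frames

    def averages(self):
        return {name: total / self.frames for name, total in self.sums.items()}


# Values taken from the Epoch 45 entries in this log, purely as sample input.
tracker = FrameWeightedLoss()
tracker.update({"loss": 0.1908, "simple_loss": 0.2765, "pruned_loss": 0.05258}, 19336.0)
tracker.update({"loss": 0.1926, "simple_loss": 0.2826, "pruned_loss": 0.05127}, 18774.0)
print(tracker.averages(), "over", tracker.frames, "frames")
```

Averaging per frame rather than per batch is what keeps tot_loss comparable across the batch sizes seen in this log, which range from a few dozen to several hundred cuts.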
2023-06-15 13:50:12,418 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 13:50:14,144 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.838e+02 2.044e+02 2.323e+02 3.630e+02, threshold=4.088e+02, percent-clipped=0.0 2023-06-15 13:50:28,466 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 13:51:04,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=156373.33333333334, ans=0.015 2023-06-15 13:51:09,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=156373.33333333334, ans=0.1 2023-06-15 13:51:13,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=156373.33333333334, ans=0.125 2023-06-15 13:51:19,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=156373.33333333334, ans=0.05 2023-06-15 13:51:24,660 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.91 vs. limit=6.0 2023-06-15 13:51:25,457 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 13:51:27,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=156440.0, ans=0.0 2023-06-15 13:51:41,662 INFO [train.py:988] (1/4) Epoch 45, batch 50, loss[loss=0.1926, simple_loss=0.2826, pruned_loss=0.05127, over 18774.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2812, pruned_loss=0.05765, over 872209.51 frames. ], batch size: 83, lr: 7.41e-03, grad_scale: 32.0 2023-06-15 13:51:57,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=156573.33333333334, ans=0.125 2023-06-15 13:52:34,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=156706.66666666666, ans=0.04949747468305833 2023-06-15 13:52:46,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=156706.66666666666, ans=0.0 2023-06-15 13:53:10,637 INFO [train.py:988] (1/4) Epoch 45, batch 100, loss[loss=0.1964, simple_loss=0.2851, pruned_loss=0.05389, over 18588.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2822, pruned_loss=0.06007, over 1516869.05 frames. ], batch size: 80, lr: 7.41e-03, grad_scale: 32.0 2023-06-15 13:53:12,189 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.891e+02 2.087e+02 2.341e+02 3.228e+02, threshold=4.175e+02, percent-clipped=0.0 2023-06-15 13:53:14,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=156840.0, ans=0.125 2023-06-15 13:53:37,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=156906.66666666666, ans=0.1 2023-06-15 13:54:03,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=157040.0, ans=0.2 2023-06-15 13:54:07,695 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.47 vs. 
limit=15.0 2023-06-15 13:54:08,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=157040.0, ans=0.125 2023-06-15 13:54:08,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=157040.0, ans=0.1 2023-06-15 13:54:29,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=157106.66666666666, ans=0.125 2023-06-15 13:54:31,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=157106.66666666666, ans=0.0 2023-06-15 13:54:38,939 INFO [train.py:988] (1/4) Epoch 45, batch 150, loss[loss=0.1903, simple_loss=0.2784, pruned_loss=0.05115, over 19693.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2831, pruned_loss=0.05982, over 2020036.15 frames. ], batch size: 110, lr: 7.40e-03, grad_scale: 32.0 2023-06-15 13:54:56,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=157240.0, ans=0.125 2023-06-15 13:55:13,125 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=4.88 vs. limit=15.0 2023-06-15 13:55:43,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=157373.33333333334, ans=0.125 2023-06-15 13:55:53,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=157440.0, ans=0.125 2023-06-15 13:55:59,110 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.51 vs. limit=12.0 2023-06-15 13:56:05,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=157506.66666666666, ans=0.125 2023-06-15 13:56:07,427 INFO [train.py:988] (1/4) Epoch 45, batch 200, loss[loss=0.1849, simple_loss=0.2736, pruned_loss=0.04814, over 19498.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2837, pruned_loss=0.05951, over 2408511.77 frames. ], batch size: 102, lr: 7.39e-03, grad_scale: 32.0 2023-06-15 13:56:09,112 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.822e+02 1.971e+02 2.216e+02 3.677e+02, threshold=3.943e+02, percent-clipped=0.0 2023-06-15 13:56:10,081 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.17 vs. limit=15.0 2023-06-15 13:56:54,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=157640.0, ans=0.0 2023-06-15 13:57:10,651 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.90 vs. limit=15.0 2023-06-15 13:57:11,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=157706.66666666666, ans=0.0 2023-06-15 13:57:15,812 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.83 vs. limit=10.0 2023-06-15 13:57:15,881 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.32 vs. 
limit=15.0 2023-06-15 13:57:35,504 INFO [train.py:988] (1/4) Epoch 45, batch 250, loss[loss=0.1945, simple_loss=0.2749, pruned_loss=0.0571, over 20217.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2829, pruned_loss=0.05915, over 2716675.76 frames. ], batch size: 141, lr: 7.39e-03, grad_scale: 32.0 2023-06-15 13:57:57,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=157906.66666666666, ans=0.0 2023-06-15 13:58:21,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=157973.33333333334, ans=0.1 2023-06-15 13:58:26,905 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.58 vs. limit=22.5 2023-06-15 13:58:58,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=158106.66666666666, ans=0.0 2023-06-15 13:59:02,824 INFO [train.py:988] (1/4) Epoch 45, batch 300, loss[loss=0.1934, simple_loss=0.2799, pruned_loss=0.05344, over 19801.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.283, pruned_loss=0.05883, over 2964698.80 frames. ], batch size: 115, lr: 7.38e-03, grad_scale: 32.0 2023-06-15 13:59:04,379 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.818e+02 2.016e+02 2.367e+02 3.194e+02, threshold=4.032e+02, percent-clipped=0.0 2023-06-15 13:59:21,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=158240.0, ans=0.0 2023-06-15 13:59:25,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=158240.0, ans=0.125 2023-06-15 14:00:31,876 INFO [train.py:988] (1/4) Epoch 45, batch 350, loss[loss=0.1899, simple_loss=0.2819, pruned_loss=0.04894, over 19220.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2822, pruned_loss=0.05866, over 3150368.92 frames. 
], batch size: 92, lr: 7.37e-03, grad_scale: 32.0 2023-06-15 14:00:32,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=158506.66666666666, ans=0.07 2023-06-15 14:00:46,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=158506.66666666666, ans=0.0 2023-06-15 14:00:50,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=158573.33333333334, ans=0.125 2023-06-15 14:01:15,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=158640.0, ans=0.0 2023-06-15 14:01:25,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=158706.66666666666, ans=0.125 2023-06-15 14:01:28,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=158706.66666666666, ans=0.0 2023-06-15 14:01:39,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=158706.66666666666, ans=0.125 2023-06-15 14:01:48,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=158773.33333333334, ans=0.1 2023-06-15 14:02:00,911 INFO [train.py:988] (1/4) Epoch 45, batch 400, loss[loss=0.1999, simple_loss=0.278, pruned_loss=0.06087, over 20583.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2824, pruned_loss=0.05846, over 3291919.90 frames. ], batch size: 189, lr: 7.36e-03, grad_scale: 32.0 2023-06-15 14:02:02,524 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.377e+02 1.797e+02 1.967e+02 2.284e+02 3.128e+02, threshold=3.934e+02, percent-clipped=0.0 2023-06-15 14:02:23,284 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.08 vs. limit=15.0 2023-06-15 14:02:37,308 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-15 14:02:52,443 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.04 vs. limit=10.0 2023-06-15 14:02:52,581 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.33 vs. limit=15.0 2023-06-15 14:03:08,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=159040.0, ans=0.1 2023-06-15 14:03:25,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=159106.66666666666, ans=0.02 2023-06-15 14:03:28,290 INFO [train.py:988] (1/4) Epoch 45, batch 450, loss[loss=0.2054, simple_loss=0.2855, pruned_loss=0.06267, over 19776.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2836, pruned_loss=0.05856, over 3384364.45 frames. 
], batch size: 115, lr: 7.36e-03, grad_scale: 16.0 2023-06-15 14:03:28,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=159173.33333333334, ans=0.025 2023-06-15 14:04:03,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=159306.66666666666, ans=0.125 2023-06-15 14:04:21,344 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.29 vs. limit=15.0 2023-06-15 14:04:46,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=159440.0, ans=0.125 2023-06-15 14:04:49,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=159440.0, ans=0.125 2023-06-15 14:04:52,859 INFO [train.py:988] (1/4) Epoch 45, batch 500, loss[loss=0.2079, simple_loss=0.3063, pruned_loss=0.05478, over 15136.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2836, pruned_loss=0.0589, over 3447142.77 frames. ], batch size: 43, lr: 7.35e-03, grad_scale: 16.0 2023-06-15 14:04:56,041 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.833e+02 2.042e+02 2.435e+02 3.752e+02, threshold=4.085e+02, percent-clipped=0.0 2023-06-15 14:05:03,876 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.36 vs. limit=8.0 2023-06-15 14:05:12,413 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=15.0 2023-06-15 14:05:33,037 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=159640.0, ans=0.125 2023-06-15 14:05:33,711 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.58 vs. limit=15.0 2023-06-15 14:06:05,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=159726.66666666666, ans=0.0 2023-06-15 14:06:16,356 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.31 vs. limit=15.0 2023-06-15 14:06:17,152 INFO [train.py:988] (1/4) Epoch 46, batch 0, loss[loss=0.1895, simple_loss=0.2775, pruned_loss=0.05078, over 19673.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.2775, pruned_loss=0.05078, over 19673.00 frames. ], batch size: 110, lr: 7.27e-03, grad_scale: 32.0 2023-06-15 14:06:17,152 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 14:06:23,570 INFO [train.py:1020] (1/4) Epoch 46, validation: loss=0.2018, simple_loss=0.3001, pruned_loss=0.05177, over 143649.00 frames. 
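Note on the ScheduledFloat records above: each one reports the current value ("ans") of a scheduled hyperparameter (a dropout probability, skip rate, balancer prob, etc.) at the given batch_count. The snippet below is a minimal sketch, assuming a piecewise-linear schedule over batch_count; the class name ScheduledFloatSketch and the breakpoints are invented for illustration and are not the icefall implementation.

import bisect

class ScheduledFloatSketch:
    """Illustrative only: a float that follows (batch_count, value) breakpoints."""
    def __init__(self, *points):
        # points: (batch_count, value) pairs; kept sorted by batch_count
        self.points = sorted(points)

    def value_at(self, batch_count: float) -> float:
        xs = [p[0] for p in self.points]
        ys = [p[1] for p in self.points]
        if batch_count <= xs[0]:
            return ys[0]
        if batch_count >= xs[-1]:
            return ys[-1]
        i = bisect.bisect_right(xs, batch_count)
        # linear interpolation between the two surrounding breakpoints
        x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# hypothetical skip-rate schedule that decays as training progresses
skip_rate = ScheduledFloatSketch((0.0, 0.5), (4000.0, 0.25), (20000.0, 0.025))
print(skip_rate.value_at(156373.33))  # past the last breakpoint -> 0.025

At a batch_count this late in training (around 156k), such a schedule would already sit at its final value, which is consistent with the small, stable "ans" values being logged here.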
2023-06-15 14:06:23,571 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 14:06:44,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=159793.33333333334, ans=0.125 2023-06-15 14:06:45,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=159793.33333333334, ans=0.2 2023-06-15 14:07:12,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=159860.0, ans=0.1 2023-06-15 14:07:12,373 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-15 14:07:13,038 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=15.0 2023-06-15 14:07:18,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=159926.66666666666, ans=0.125 2023-06-15 14:07:39,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=159993.33333333334, ans=0.125 2023-06-15 14:07:53,066 INFO [train.py:988] (1/4) Epoch 46, batch 50, loss[loss=0.2055, simple_loss=0.2943, pruned_loss=0.0583, over 16302.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.284, pruned_loss=0.05753, over 860357.40 frames. ], batch size: 52, lr: 7.26e-03, grad_scale: 32.0 2023-06-15 14:08:25,951 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.799e+02 2.021e+02 2.409e+02 3.297e+02, threshold=4.042e+02, percent-clipped=0.0 2023-06-15 14:09:19,623 INFO [train.py:988] (1/4) Epoch 46, batch 100, loss[loss=0.1987, simple_loss=0.2791, pruned_loss=0.05917, over 20612.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2849, pruned_loss=0.05867, over 1508305.63 frames. ], batch size: 173, lr: 7.25e-03, grad_scale: 16.0 2023-06-15 14:09:41,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=160460.0, ans=0.125 2023-06-15 14:10:01,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=160526.66666666666, ans=0.0 2023-06-15 14:10:17,695 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.28 vs. limit=22.5 2023-06-15 14:10:21,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=160593.33333333334, ans=0.0 2023-06-15 14:10:28,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=160660.0, ans=0.0 2023-06-15 14:10:39,527 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.46 vs. limit=15.0 2023-06-15 14:10:45,277 INFO [train.py:988] (1/4) Epoch 46, batch 150, loss[loss=0.2067, simple_loss=0.2906, pruned_loss=0.06141, over 11095.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2829, pruned_loss=0.05781, over 2024665.73 frames. 
], batch size: 31, lr: 7.24e-03, grad_scale: 16.0 2023-06-15 14:10:53,245 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.32 vs. limit=15.0 2023-06-15 14:11:03,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=160793.33333333334, ans=0.0 2023-06-15 14:11:04,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=160793.33333333334, ans=0.125 2023-06-15 14:11:18,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=160860.0, ans=0.0 2023-06-15 14:11:19,616 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.724e+02 1.880e+02 2.096e+02 2.711e+02, threshold=3.761e+02, percent-clipped=0.0 2023-06-15 14:11:20,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=160860.0, ans=0.0 2023-06-15 14:12:11,677 INFO [train.py:988] (1/4) Epoch 46, batch 200, loss[loss=0.1823, simple_loss=0.2723, pruned_loss=0.04613, over 19053.00 frames. ], tot_loss[loss=0.1999, simple_loss=0.284, pruned_loss=0.05785, over 2392897.36 frames. ], batch size: 89, lr: 7.24e-03, grad_scale: 16.0 2023-06-15 14:12:33,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=161126.66666666666, ans=0.09899494936611666 2023-06-15 14:12:54,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=161193.33333333334, ans=0.2 2023-06-15 14:13:33,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=161326.66666666666, ans=0.125 2023-06-15 14:13:37,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=161326.66666666666, ans=0.2 2023-06-15 14:13:38,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=161393.33333333334, ans=0.1 2023-06-15 14:13:40,049 INFO [train.py:988] (1/4) Epoch 46, batch 250, loss[loss=0.2038, simple_loss=0.2872, pruned_loss=0.06014, over 20483.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2837, pruned_loss=0.05824, over 2698551.69 frames. ], batch size: 160, lr: 7.23e-03, grad_scale: 16.0 2023-06-15 14:13:43,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=161393.33333333334, ans=0.125 2023-06-15 14:13:45,807 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.91 vs. limit=15.0 2023-06-15 14:14:01,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=161460.0, ans=0.0 2023-06-15 14:14:04,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=161460.0, ans=0.1 2023-06-15 14:14:06,857 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.31 vs. 
limit=12.0 2023-06-15 14:14:15,412 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.802e+02 2.040e+02 2.477e+02 3.551e+02, threshold=4.080e+02, percent-clipped=0.0 2023-06-15 14:14:39,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=161593.33333333334, ans=0.0 2023-06-15 14:15:07,644 INFO [train.py:988] (1/4) Epoch 46, batch 300, loss[loss=0.2037, simple_loss=0.29, pruned_loss=0.05873, over 18302.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2837, pruned_loss=0.05859, over 2935824.73 frames. ], batch size: 74, lr: 7.22e-03, grad_scale: 16.0 2023-06-15 14:15:08,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=161726.66666666666, ans=0.1 2023-06-15 14:15:09,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=161726.66666666666, ans=0.0 2023-06-15 14:15:26,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=161793.33333333334, ans=0.125 2023-06-15 14:15:33,724 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.34 vs. limit=15.0 2023-06-15 14:15:42,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=161860.0, ans=0.125 2023-06-15 14:15:54,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=161860.0, ans=0.125 2023-06-15 14:16:05,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=161926.66666666666, ans=0.0 2023-06-15 14:16:35,643 INFO [train.py:988] (1/4) Epoch 46, batch 350, loss[loss=0.2055, simple_loss=0.2844, pruned_loss=0.06328, over 19073.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2833, pruned_loss=0.05851, over 3130001.28 frames. ], batch size: 94, lr: 7.22e-03, grad_scale: 16.0 2023-06-15 14:16:41,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=162060.0, ans=0.125 2023-06-15 14:16:46,027 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=162060.0, ans=0.125 2023-06-15 14:17:10,328 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.494e+02 1.820e+02 1.979e+02 2.259e+02 3.140e+02, threshold=3.959e+02, percent-clipped=0.0 2023-06-15 14:17:28,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=162260.0, ans=10.0 2023-06-15 14:18:04,281 INFO [train.py:988] (1/4) Epoch 46, batch 400, loss[loss=0.1874, simple_loss=0.2759, pruned_loss=0.0495, over 18609.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2834, pruned_loss=0.05848, over 3276304.87 frames. 
], batch size: 80, lr: 7.21e-03, grad_scale: 32.0 2023-06-15 14:18:14,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=162393.33333333334, ans=0.2 2023-06-15 14:18:53,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=162526.66666666666, ans=0.95 2023-06-15 14:19:07,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=162593.33333333334, ans=0.125 2023-06-15 14:19:07,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=162593.33333333334, ans=0.125 2023-06-15 14:19:33,235 INFO [train.py:988] (1/4) Epoch 46, batch 450, loss[loss=0.1869, simple_loss=0.2773, pruned_loss=0.04831, over 19527.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2838, pruned_loss=0.0583, over 3380388.87 frames. ], batch size: 102, lr: 7.20e-03, grad_scale: 32.0 2023-06-15 14:19:42,432 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-15 14:20:07,068 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.857e+02 2.090e+02 2.387e+02 3.299e+02, threshold=4.180e+02, percent-clipped=0.0 2023-06-15 14:20:41,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=162993.33333333334, ans=0.0 2023-06-15 14:20:42,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=162993.33333333334, ans=0.1 2023-06-15 14:20:48,518 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.41 vs. limit=15.0 2023-06-15 14:20:57,269 INFO [train.py:988] (1/4) Epoch 46, batch 500, loss[loss=0.1878, simple_loss=0.2587, pruned_loss=0.05844, over 20366.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2839, pruned_loss=0.05845, over 3457149.49 frames. ], batch size: 239, lr: 7.20e-03, grad_scale: 32.0 2023-06-15 14:21:35,089 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 14:22:12,879 INFO [train.py:988] (1/4) Epoch 47, batch 0, loss[loss=0.198, simple_loss=0.2752, pruned_loss=0.06041, over 20192.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2752, pruned_loss=0.06041, over 20192.00 frames. ], batch size: 239, lr: 7.11e-03, grad_scale: 32.0 2023-06-15 14:22:12,879 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 14:22:19,331 INFO [train.py:1020] (1/4) Epoch 47, validation: loss=0.2046, simple_loss=0.3006, pruned_loss=0.05427, over 143649.00 frames. 
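Note on the optim.py "Clipping_scale=2.0, grad-norm quartiles ..." records: the five numbers read as min / 25% / median / 75% / max of recent per-batch gradient norms, and the reported threshold equals clipping_scale (2.0) times the median (e.g. 2.0 x 2.044e+02 = 4.088e+02 earlier in this log), with percent-clipped the share of recent norms above that threshold. The snippet below is a rough sketch under that reading of the log, not the actual optimizer code; clipping_report and the example norms are hypothetical.

import torch

def clipping_report(recent_norms: torch.Tensor, clipping_scale: float = 2.0):
    # quartile summary of recent per-batch gradient norms (min, 25%, 50%, 75%, max)
    qs = torch.quantile(recent_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * qs[2]          # 2.0 x median, as the log suggests
    pct_clipped = 100.0 * (recent_norms > threshold).float().mean()
    return qs, threshold, pct_clipped

norms = torch.tensor([155.8, 183.8, 204.4, 232.3, 363.0])  # illustrative values only
qs, thr, pct = clipping_report(norms)
print(qs.tolist(), float(thr), float(pct))  # threshold ~408.8, percent-clipped 0.0

Under this reading, percent-clipped staying at 0.0 in most records simply means no recent batch produced a gradient norm more than twice the rolling median, i.e. training is numerically stable at this point.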
2023-06-15 14:22:19,331 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 14:22:37,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=163346.66666666666, ans=0.0 2023-06-15 14:22:44,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=163346.66666666666, ans=0.2 2023-06-15 14:23:20,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=163480.0, ans=0.125 2023-06-15 14:23:23,393 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.765e+02 1.978e+02 2.218e+02 3.606e+02, threshold=3.956e+02, percent-clipped=0.0 2023-06-15 14:23:31,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=163546.66666666666, ans=0.125 2023-06-15 14:23:46,071 INFO [train.py:988] (1/4) Epoch 47, batch 50, loss[loss=0.2097, simple_loss=0.2843, pruned_loss=0.06752, over 20307.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2826, pruned_loss=0.05837, over 876259.42 frames. ], batch size: 141, lr: 7.11e-03, grad_scale: 32.0 2023-06-15 14:23:49,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=163613.33333333334, ans=0.125 2023-06-15 14:24:04,362 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.03 vs. limit=15.0 2023-06-15 14:24:11,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=163680.0, ans=0.125 2023-06-15 14:24:26,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=163746.66666666666, ans=10.0 2023-06-15 14:24:43,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=163813.33333333334, ans=0.0 2023-06-15 14:24:47,154 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 14:25:13,272 INFO [train.py:988] (1/4) Epoch 47, batch 100, loss[loss=0.2112, simple_loss=0.284, pruned_loss=0.06923, over 20281.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2828, pruned_loss=0.05837, over 1529185.72 frames. ], batch size: 149, lr: 7.10e-03, grad_scale: 32.0 2023-06-15 14:25:57,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=164080.0, ans=0.0 2023-06-15 14:26:00,111 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.30 vs. 
limit=22.5 2023-06-15 14:26:04,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=164146.66666666666, ans=0.125 2023-06-15 14:26:10,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=164146.66666666666, ans=0.125 2023-06-15 14:26:16,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=164146.66666666666, ans=0.125 2023-06-15 14:26:17,518 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.736e+02 1.928e+02 2.284e+02 3.416e+02, threshold=3.856e+02, percent-clipped=0.0 2023-06-15 14:26:41,348 INFO [train.py:988] (1/4) Epoch 47, batch 150, loss[loss=0.1927, simple_loss=0.2705, pruned_loss=0.05742, over 20646.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2818, pruned_loss=0.05775, over 2030169.04 frames. ], batch size: 211, lr: 7.10e-03, grad_scale: 32.0 2023-06-15 14:26:43,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=164280.0, ans=0.125 2023-06-15 14:26:45,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=164280.0, ans=0.125 2023-06-15 14:27:22,453 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.47 vs. limit=15.0 2023-06-15 14:27:23,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=164413.33333333334, ans=15.0 2023-06-15 14:27:42,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=164480.0, ans=0.125 2023-06-15 14:28:08,858 INFO [train.py:988] (1/4) Epoch 47, batch 200, loss[loss=0.1849, simple_loss=0.2744, pruned_loss=0.04772, over 18608.00 frames. ], tot_loss[loss=0.199, simple_loss=0.2821, pruned_loss=0.05792, over 2417173.12 frames. ], batch size: 80, lr: 7.09e-03, grad_scale: 32.0 2023-06-15 14:28:33,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=164680.0, ans=0.125 2023-06-15 14:28:45,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=164746.66666666666, ans=0.125 2023-06-15 14:29:14,110 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.808e+02 2.037e+02 2.330e+02 4.045e+02, threshold=4.073e+02, percent-clipped=1.0 2023-06-15 14:29:17,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=164880.0, ans=0.2 2023-06-15 14:29:35,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten.whitening_limit, batch_count=164946.66666666666, ans=15.0 2023-06-15 14:29:36,301 INFO [train.py:988] (1/4) Epoch 47, batch 250, loss[loss=0.1716, simple_loss=0.2615, pruned_loss=0.04082, over 19877.00 frames. ], tot_loss[loss=0.1984, simple_loss=0.2816, pruned_loss=0.05761, over 2731810.88 frames. 
], batch size: 120, lr: 7.08e-03, grad_scale: 32.0 2023-06-15 14:29:46,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=164946.66666666666, ans=0.125 2023-06-15 14:29:53,829 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-06-15 14:29:56,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=165013.33333333334, ans=0.2 2023-06-15 14:30:15,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=165080.0, ans=0.125 2023-06-15 14:30:20,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=165080.0, ans=0.125 2023-06-15 14:30:49,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=165213.33333333334, ans=0.025 2023-06-15 14:31:01,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=165213.33333333334, ans=0.125 2023-06-15 14:31:04,899 INFO [train.py:988] (1/4) Epoch 47, batch 300, loss[loss=0.1904, simple_loss=0.2678, pruned_loss=0.05654, over 20509.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2817, pruned_loss=0.05792, over 2949412.69 frames. ], batch size: 189, lr: 7.08e-03, grad_scale: 32.0 2023-06-15 14:31:16,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=165280.0, ans=15.0 2023-06-15 14:31:27,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=165346.66666666666, ans=0.125 2023-06-15 14:31:46,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=165413.33333333334, ans=0.0 2023-06-15 14:32:06,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=165480.0, ans=0.125 2023-06-15 14:32:11,340 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.863e+02 2.079e+02 2.322e+02 4.081e+02, threshold=4.157e+02, percent-clipped=1.0 2023-06-15 14:32:22,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=165546.66666666666, ans=0.125 2023-06-15 14:32:29,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=165546.66666666666, ans=0.125 2023-06-15 14:32:33,711 INFO [train.py:988] (1/4) Epoch 47, batch 350, loss[loss=0.1929, simple_loss=0.2719, pruned_loss=0.05691, over 20531.00 frames. ], tot_loss[loss=0.1977, simple_loss=0.2807, pruned_loss=0.05736, over 3143180.51 frames. ], batch size: 173, lr: 7.07e-03, grad_scale: 32.0 2023-06-15 14:32:42,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=165613.33333333334, ans=0.125 2023-06-15 14:32:52,706 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.37 vs. 
limit=15.0 2023-06-15 14:32:52,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2.whitening_limit, batch_count=165680.0, ans=15.0 2023-06-15 14:33:03,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=165680.0, ans=0.125 2023-06-15 14:34:01,132 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.81 vs. limit=6.0 2023-06-15 14:34:03,142 INFO [train.py:988] (1/4) Epoch 47, batch 400, loss[loss=0.166, simple_loss=0.2546, pruned_loss=0.0387, over 19851.00 frames. ], tot_loss[loss=0.1979, simple_loss=0.2804, pruned_loss=0.05766, over 3277635.52 frames. ], batch size: 120, lr: 7.06e-03, grad_scale: 32.0 2023-06-15 14:34:07,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=165946.66666666666, ans=0.0 2023-06-15 14:35:08,763 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.617e+02 1.781e+02 2.023e+02 2.326e+02 3.205e+02, threshold=4.047e+02, percent-clipped=0.0 2023-06-15 14:35:10,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys.whitening_limit, batch_count=166146.66666666666, ans=6.0 2023-06-15 14:35:20,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=166213.33333333334, ans=0.0 2023-06-15 14:35:32,249 INFO [train.py:988] (1/4) Epoch 47, batch 450, loss[loss=0.2106, simple_loss=0.2898, pruned_loss=0.06574, over 18487.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2812, pruned_loss=0.0574, over 3383173.08 frames. ], batch size: 77, lr: 7.06e-03, grad_scale: 32.0 2023-06-15 14:36:04,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=166346.66666666666, ans=0.125 2023-06-15 14:36:08,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=166413.33333333334, ans=0.1 2023-06-15 14:36:26,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=166480.0, ans=0.0 2023-06-15 14:36:49,984 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.67 vs. limit=10.0 2023-06-15 14:36:58,235 INFO [train.py:988] (1/4) Epoch 47, batch 500, loss[loss=0.2148, simple_loss=0.3111, pruned_loss=0.05922, over 18319.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2819, pruned_loss=0.0574, over 3462027.92 frames. ], batch size: 72, lr: 7.05e-03, grad_scale: 32.0 2023-06-15 14:37:01,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=166613.33333333334, ans=0.2 2023-06-15 14:37:02,748 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.38 vs. limit=12.0 2023-06-15 14:37:21,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=166680.0, ans=0.125 2023-06-15 14:37:29,989 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.56 vs. 
limit=6.0 2023-06-15 14:37:39,979 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.09 vs. limit=15.0 2023-06-15 14:37:42,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=166746.66666666666, ans=0.2 2023-06-15 14:38:09,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=166826.66666666666, ans=0.125 2023-06-15 14:38:12,644 INFO [train.py:988] (1/4) Epoch 48, batch 0, loss[loss=0.1937, simple_loss=0.2741, pruned_loss=0.05664, over 20308.00 frames. ], tot_loss[loss=0.1937, simple_loss=0.2741, pruned_loss=0.05664, over 20308.00 frames. ], batch size: 239, lr: 6.97e-03, grad_scale: 32.0 2023-06-15 14:38:12,644 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 14:38:18,656 INFO [train.py:1020] (1/4) Epoch 48, validation: loss=0.1998, simple_loss=0.298, pruned_loss=0.05082, over 143649.00 frames. 2023-06-15 14:38:18,657 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 14:38:26,900 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.746e+02 1.946e+02 2.285e+02 3.541e+02, threshold=3.892e+02, percent-clipped=0.0 2023-06-15 14:38:48,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=166893.33333333334, ans=0.125 2023-06-15 14:39:47,297 INFO [train.py:988] (1/4) Epoch 48, batch 50, loss[loss=0.1958, simple_loss=0.2722, pruned_loss=0.05968, over 20104.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2806, pruned_loss=0.05683, over 857837.42 frames. ], batch size: 239, lr: 6.96e-03, grad_scale: 32.0 2023-06-15 14:40:28,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=167293.33333333334, ans=0.125 2023-06-15 14:40:43,081 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.94 vs. limit=15.0 2023-06-15 14:40:53,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=167360.0, ans=0.125 2023-06-15 14:40:53,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=167360.0, ans=0.0 2023-06-15 14:41:03,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=167426.66666666666, ans=0.1 2023-06-15 14:41:15,606 INFO [train.py:988] (1/4) Epoch 48, batch 100, loss[loss=0.2034, simple_loss=0.2918, pruned_loss=0.05755, over 16778.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2805, pruned_loss=0.05721, over 1499586.98 frames. 
], batch size: 59, lr: 6.96e-03, grad_scale: 32.0 2023-06-15 14:41:25,076 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.839e+02 2.012e+02 2.249e+02 3.194e+02, threshold=4.023e+02, percent-clipped=0.0 2023-06-15 14:41:31,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=167560.0, ans=0.95 2023-06-15 14:41:37,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=167560.0, ans=0.125 2023-06-15 14:41:50,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=167626.66666666666, ans=0.5 2023-06-15 14:42:00,538 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-15 14:42:02,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=167626.66666666666, ans=10.0 2023-06-15 14:42:22,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=167693.33333333334, ans=0.125 2023-06-15 14:42:22,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=167693.33333333334, ans=0.125 2023-06-15 14:42:43,937 INFO [train.py:988] (1/4) Epoch 48, batch 150, loss[loss=0.2173, simple_loss=0.3089, pruned_loss=0.06289, over 16281.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2813, pruned_loss=0.05737, over 2006036.70 frames. ], batch size: 52, lr: 6.95e-03, grad_scale: 32.0 2023-06-15 14:42:55,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=167826.66666666666, ans=0.0 2023-06-15 14:43:06,677 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.52 vs. limit=15.0 2023-06-15 14:43:20,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=167960.0, ans=0.2 2023-06-15 14:43:32,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=167960.0, ans=0.0 2023-06-15 14:43:43,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=168026.66666666666, ans=0.2 2023-06-15 14:43:44,152 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.88 vs. limit=15.0 2023-06-15 14:43:54,672 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.61 vs. limit=22.5 2023-06-15 14:43:59,846 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.37 vs. limit=15.0 2023-06-15 14:44:12,976 INFO [train.py:988] (1/4) Epoch 48, batch 200, loss[loss=0.1955, simple_loss=0.2852, pruned_loss=0.05288, over 19461.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2809, pruned_loss=0.05756, over 2379637.71 frames. 
], batch size: 105, lr: 6.95e-03, grad_scale: 32.0 2023-06-15 14:44:21,898 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.752e+02 1.958e+02 2.186e+02 2.989e+02, threshold=3.915e+02, percent-clipped=0.0 2023-06-15 14:44:22,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=168160.0, ans=0.0 2023-06-15 14:44:33,199 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.78 vs. limit=15.0 2023-06-15 14:45:41,717 INFO [train.py:988] (1/4) Epoch 48, batch 250, loss[loss=0.2053, simple_loss=0.2719, pruned_loss=0.06938, over 19923.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2824, pruned_loss=0.05745, over 2690497.62 frames. ], batch size: 293, lr: 6.94e-03, grad_scale: 32.0 2023-06-15 14:45:47,552 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-06-15 14:45:50,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=168493.33333333334, ans=0.125 2023-06-15 14:46:15,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=168626.66666666666, ans=0.2 2023-06-15 14:46:15,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=168626.66666666666, ans=0.025 2023-06-15 14:46:18,373 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.90 vs. limit=15.0 2023-06-15 14:47:05,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=168760.0, ans=0.2 2023-06-15 14:47:07,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=168760.0, ans=0.0 2023-06-15 14:47:10,237 INFO [train.py:988] (1/4) Epoch 48, batch 300, loss[loss=0.2035, simple_loss=0.2813, pruned_loss=0.06284, over 20451.00 frames. ], tot_loss[loss=0.1977, simple_loss=0.2814, pruned_loss=0.05706, over 2940624.27 frames. ], batch size: 160, lr: 6.93e-03, grad_scale: 32.0 2023-06-15 14:47:19,386 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.766e+02 2.086e+02 2.478e+02 4.078e+02, threshold=4.173e+02, percent-clipped=1.0 2023-06-15 14:47:19,924 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=168826.66666666666, ans=0.0 2023-06-15 14:47:21,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=168826.66666666666, ans=0.125 2023-06-15 14:47:31,234 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.47 vs. limit=6.0 2023-06-15 14:47:54,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=168960.0, ans=0.0 2023-06-15 14:48:14,940 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.31 vs. 
limit=15.0 2023-06-15 14:48:38,579 INFO [train.py:988] (1/4) Epoch 48, batch 350, loss[loss=0.1962, simple_loss=0.2794, pruned_loss=0.05649, over 19531.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.2811, pruned_loss=0.05688, over 3113865.29 frames. ], batch size: 102, lr: 6.93e-03, grad_scale: 32.0 2023-06-15 14:49:09,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=169226.66666666666, ans=0.2 2023-06-15 14:49:27,934 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.77 vs. limit=22.5 2023-06-15 14:50:04,885 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-15 14:50:05,737 INFO [train.py:988] (1/4) Epoch 48, batch 400, loss[loss=0.2027, simple_loss=0.2833, pruned_loss=0.06109, over 20311.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.2817, pruned_loss=0.05725, over 3263970.93 frames. ], batch size: 149, lr: 6.92e-03, grad_scale: 32.0 2023-06-15 14:50:06,661 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.27 vs. limit=15.0 2023-06-15 14:50:13,966 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.792e+02 1.982e+02 2.238e+02 3.664e+02, threshold=3.964e+02, percent-clipped=0.0 2023-06-15 14:50:19,304 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.60 vs. limit=5.0 2023-06-15 14:50:26,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=169560.0, ans=0.2 2023-06-15 14:51:03,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=169693.33333333334, ans=0.125 2023-06-15 14:51:10,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=169693.33333333334, ans=0.2 2023-06-15 14:51:26,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=169760.0, ans=0.125 2023-06-15 14:51:32,753 INFO [train.py:988] (1/4) Epoch 48, batch 450, loss[loss=0.2294, simple_loss=0.322, pruned_loss=0.06845, over 16981.00 frames. ], tot_loss[loss=0.1977, simple_loss=0.2812, pruned_loss=0.05716, over 3394440.54 frames. ], batch size: 60, lr: 6.91e-03, grad_scale: 32.0 2023-06-15 14:52:32,774 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.23 vs. limit=6.0 2023-06-15 14:52:48,420 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-15 14:52:49,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=170093.33333333334, ans=10.0 2023-06-15 14:52:57,836 INFO [train.py:988] (1/4) Epoch 48, batch 500, loss[loss=0.1915, simple_loss=0.2663, pruned_loss=0.05834, over 20348.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2811, pruned_loss=0.05705, over 3490478.99 frames. 
], batch size: 239, lr: 6.91e-03, grad_scale: 32.0 2023-06-15 14:53:06,276 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.838e+02 2.045e+02 2.451e+02 3.381e+02, threshold=4.090e+02, percent-clipped=0.0 2023-06-15 14:53:34,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=170293.33333333334, ans=0.0 2023-06-15 14:53:42,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=170293.33333333334, ans=0.125 2023-06-15 14:54:12,515 INFO [train.py:988] (1/4) Epoch 49, batch 0, loss[loss=0.2103, simple_loss=0.2869, pruned_loss=0.06686, over 19947.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2869, pruned_loss=0.06686, over 19947.00 frames. ], batch size: 126, lr: 6.83e-03, grad_scale: 32.0 2023-06-15 14:54:12,516 INFO [train.py:1011] (1/4) Computing validation loss 2023-06-15 14:54:19,028 INFO [train.py:1020] (1/4) Epoch 49, validation: loss=0.2025, simple_loss=0.2999, pruned_loss=0.05253, over 143649.00 frames. 2023-06-15 14:54:19,029 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB 2023-06-15 14:54:38,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=170440.0, ans=0.125 2023-06-15 14:54:43,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=170440.0, ans=0.035 2023-06-15 14:55:00,742 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.10 vs. limit=15.0 2023-06-15 14:55:17,656 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=170573.33333333334, ans=0.5 2023-06-15 14:55:20,114 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.80 vs. limit=15.0 2023-06-15 14:55:44,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=170640.0, ans=0.125 2023-06-15 14:55:47,873 INFO [train.py:988] (1/4) Epoch 49, batch 50, loss[loss=0.1903, simple_loss=0.2786, pruned_loss=0.05104, over 18643.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2807, pruned_loss=0.0585, over 858243.03 frames. ], batch size: 80, lr: 6.83e-03, grad_scale: 32.0 2023-06-15 14:56:05,257 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.51 vs. 
limit=15.0 2023-06-15 14:56:09,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=170773.33333333334, ans=0.125 2023-06-15 14:56:24,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=170840.0, ans=0.2 2023-06-15 14:56:24,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=170840.0, ans=0.1 2023-06-15 14:56:25,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=170840.0, ans=0.125 2023-06-15 14:56:31,039 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.705e+02 1.886e+02 2.163e+02 3.210e+02, threshold=3.772e+02, percent-clipped=0.0 2023-06-15 14:56:41,413 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=4.70 vs. limit=15.0 2023-06-15 14:57:16,254 INFO [train.py:988] (1/4) Epoch 49, batch 100, loss[loss=0.1875, simple_loss=0.2791, pruned_loss=0.04792, over 18315.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2802, pruned_loss=0.05701, over 1506683.14 frames. ], batch size: 74, lr: 6.82e-03, grad_scale: 32.0 2023-06-15 14:57:23,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=171040.0, ans=0.025 2023-06-15 14:57:49,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=171106.66666666666, ans=0.0 2023-06-15 14:58:07,728 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.20 vs. limit=10.0 2023-06-15 14:58:44,454 INFO [train.py:988] (1/4) Epoch 49, batch 150, loss[loss=0.1986, simple_loss=0.2845, pruned_loss=0.05636, over 19122.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2802, pruned_loss=0.0574, over 2020217.57 frames. ], batch size: 94, lr: 6.81e-03, grad_scale: 32.0 2023-06-15 14:58:59,106 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.84 vs. limit=6.0 2023-06-15 14:59:26,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=171506.66666666666, ans=0.2 2023-06-15 14:59:27,435 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.782e+02 1.923e+02 2.190e+02 3.146e+02, threshold=3.845e+02, percent-clipped=0.0 2023-06-15 15:00:13,681 INFO [train.py:988] (1/4) Epoch 49, batch 200, loss[loss=0.2042, simple_loss=0.3032, pruned_loss=0.05256, over 17635.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2813, pruned_loss=0.0574, over 2405783.26 frames. ], batch size: 67, lr: 6.81e-03, grad_scale: 32.0 2023-06-15 15:00:15,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=171706.66666666666, ans=0.125 2023-06-15 15:00:20,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=171706.66666666666, ans=0.125 2023-06-15 15:00:22,160 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.18 vs. 
limit=12.0 2023-06-15 15:01:03,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=171840.0, ans=15.0 2023-06-15 15:01:38,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=171973.33333333334, ans=10.0 2023-06-15 15:01:41,454 INFO [train.py:988] (1/4) Epoch 49, batch 250, loss[loss=0.1911, simple_loss=0.2821, pruned_loss=0.05002, over 19848.00 frames. ], tot_loss[loss=0.1979, simple_loss=0.2811, pruned_loss=0.05729, over 2719227.32 frames. ], batch size: 115, lr: 6.80e-03, grad_scale: 32.0 2023-06-15 15:01:42,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=172040.0, ans=0.125 2023-06-15 15:02:16,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=172173.33333333334, ans=0.125 2023-06-15 15:02:22,900 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.787e+02 2.024e+02 2.612e+02 4.231e+02, threshold=4.048e+02, percent-clipped=3.0 2023-06-15 15:02:27,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=172173.33333333334, ans=0.1 2023-06-15 15:03:01,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=172306.66666666666, ans=0.125 2023-06-15 15:03:09,733 INFO [train.py:988] (1/4) Epoch 49, batch 300, loss[loss=0.2207, simple_loss=0.3186, pruned_loss=0.06139, over 17836.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.2812, pruned_loss=0.05683, over 2959779.39 frames. ], batch size: 68, lr: 6.80e-03, grad_scale: 32.0 2023-06-15 15:03:34,882 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.63 vs. limit=15.0 2023-06-15 15:04:29,749 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.11 vs. limit=12.0 2023-06-15 15:04:32,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=172640.0, ans=0.2 2023-06-15 15:04:38,778 INFO [train.py:988] (1/4) Epoch 49, batch 350, loss[loss=0.1934, simple_loss=0.2737, pruned_loss=0.0566, over 18599.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2812, pruned_loss=0.05689, over 3144249.46 frames. 
], batch size: 80, lr: 6.79e-03, grad_scale: 32.0 2023-06-15 15:04:42,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=172706.66666666666, ans=0.04949747468305833 2023-06-15 15:04:51,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=172706.66666666666, ans=0.95 2023-06-15 15:05:02,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=172773.33333333334, ans=0.125 2023-06-15 15:05:05,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=172773.33333333334, ans=0.2 2023-06-15 15:05:22,054 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.759e+02 1.916e+02 2.182e+02 3.623e+02, threshold=3.831e+02, percent-clipped=0.0 2023-06-15 15:05:30,320 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.52 vs. limit=15.0 2023-06-15 15:06:07,880 INFO [train.py:988] (1/4) Epoch 49, batch 400, loss[loss=0.2136, simple_loss=0.3057, pruned_loss=0.06073, over 16792.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.2806, pruned_loss=0.0571, over 3278929.70 frames. ], batch size: 59, lr: 6.78e-03, grad_scale: 32.0 2023-06-15 15:06:08,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=173040.0, ans=0.0 2023-06-15 15:06:36,925 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=6.0 2023-06-15 15:06:49,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=173173.33333333334, ans=0.125 2023-06-15 15:07:13,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=173240.0, ans=0.1 2023-06-15 15:07:37,454 INFO [train.py:988] (1/4) Epoch 49, batch 450, loss[loss=0.163, simple_loss=0.2504, pruned_loss=0.03779, over 18772.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.2805, pruned_loss=0.05689, over 3377194.86 frames. ], batch size: 83, lr: 6.78e-03, grad_scale: 32.0 2023-06-15 15:07:37,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=173373.33333333334, ans=0.125 2023-06-15 15:08:07,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=173440.0, ans=0.05 2023-06-15 15:08:20,133 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.766e+02 1.952e+02 2.304e+02 4.249e+02, threshold=3.904e+02, percent-clipped=1.0 2023-06-15 15:08:38,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=173573.33333333334, ans=0.1 2023-06-15 15:08:42,509 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.59 vs. limit=22.5 2023-06-15 15:09:04,007 INFO [train.py:988] (1/4) Epoch 49, batch 500, loss[loss=0.1961, simple_loss=0.2784, pruned_loss=0.05687, over 20128.00 frames. 
2023-06-15 15:09:27,531 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-15 15:09:34,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=173773.33333333334, ans=0.125
2023-06-15 15:09:42,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=173840.0, ans=0.0
2023-06-15 15:09:48,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=173840.0, ans=0.125
2023-06-15 15:10:17,334 INFO [train.py:988] (1/4) Epoch 50, batch 0, loss[loss=0.1863, simple_loss=0.2749, pruned_loss=0.04882, over 19875.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.2749, pruned_loss=0.04882, over 19875.00 frames. ], batch size: 120, lr: 6.70e-03, grad_scale: 32.0
2023-06-15 15:10:17,335 INFO [train.py:1011] (1/4) Computing validation loss
2023-06-15 15:10:23,501 INFO [train.py:1020] (1/4) Epoch 50, validation: loss=0.202, simple_loss=0.299, pruned_loss=0.05252, over 143649.00 frames.
2023-06-15 15:10:23,502 INFO [train.py:1021] (1/4) Maximum memory allocated so far is 13795MB
2023-06-15 15:11:07,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=174060.0, ans=0.09899494936611666
2023-06-15 15:11:09,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=174060.0, ans=0.2
2023-06-15 15:11:12,933 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.09 vs. limit=12.0
2023-06-15 15:11:24,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=174126.66666666666, ans=0.0
2023-06-15 15:11:31,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=174193.33333333334, ans=0.0
2023-06-15 15:11:33,361 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.53 vs. limit=22.5
2023-06-15 15:11:33,691 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.99 vs. limit=15.0
2023-06-15 15:11:33,963 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.754e+02 1.986e+02 2.353e+02 3.317e+02, threshold=3.972e+02, percent-clipped=0.0
2023-06-15 15:11:34,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=174193.33333333334, ans=0.0
2023-06-15 15:11:36,615 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.52 vs. limit=6.0
2023-06-15 15:11:50,943 INFO [train.py:988] (1/4) Epoch 50, batch 50, loss[loss=0.1791, simple_loss=0.2728, pruned_loss=0.0427, over 19100.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2808, pruned_loss=0.05538, over 848561.02 frames. ], batch size: 94, lr: 6.69e-03, grad_scale: 32.0
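At the epoch boundary above, training pauses to compute a validation loss ("Computing validation loss" ... "Epoch 50, validation: loss=0.202, ..., over 143649.00 frames."); the reported numbers are frame-weighted averages over the entire dev set, which is why the frame count reflects the dev data rather than a single batch. A minimal sketch of that loop follows; compute_validation_loss and the loss_fn callback are illustrative names, not the functions in train.py.

    import torch

    @torch.no_grad()
    def compute_validation_loss(model, valid_loader, loss_fn) -> dict:
        """Frame-weighted average of loss terms over the whole dev set (sketch).

        `loss_fn(model, batch)` is assumed to return a (dict of per-frame losses,
        num_frames) pair for one batch; it stands in for whatever loss
        computation the training script actually uses.
        """
        model.eval()
        sums: dict = {}
        total_frames = 0.0
        for batch in valid_loader:
            losses, num_frames = loss_fn(model, batch)
            for name, value in losses.items():
                sums[name] = sums.get(name, 0.0) + float(value) * num_frames
            total_frames += num_frames
        model.train()
        return {name: s / total_frames for name, s in sums.items()}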
2023-06-15 15:12:12,099 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-15 15:12:46,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=174460.0, ans=0.125
2023-06-15 15:13:17,589 INFO [train.py:988] (1/4) Epoch 50, batch 100, loss[loss=0.1944, simple_loss=0.2779, pruned_loss=0.05542, over 19474.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.2778, pruned_loss=0.05535, over 1529640.96 frames. ], batch size: 105, lr: 6.69e-03, grad_scale: 32.0
2023-06-15 15:13:40,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=174660.0, ans=0.05
2023-06-15 15:13:40,855 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.76 vs. limit=6.0
2023-06-15 15:13:55,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=174726.66666666666, ans=0.2
2023-06-15 15:13:58,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=174726.66666666666, ans=0.0
2023-06-15 15:14:28,213 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.801e+02 1.997e+02 2.286e+02 3.614e+02, threshold=3.994e+02, percent-clipped=0.0
2023-06-15 15:14:33,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=174860.0, ans=0.125
2023-06-15 15:14:43,489 INFO [train.py:988] (1/4) Epoch 50, batch 150, loss[loss=0.2071, simple_loss=0.291, pruned_loss=0.06164, over 18295.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2783, pruned_loss=0.05595, over 2038458.14 frames. ], batch size: 74, lr: 6.68e-03, grad_scale: 32.0
2023-06-15 15:14:56,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=174926.66666666666, ans=0.125
2023-06-15 15:16:06,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=175193.33333333334, ans=0.125
2023-06-15 15:16:09,952 INFO [train.py:988] (1/4) Epoch 50, batch 200, loss[loss=0.1792, simple_loss=0.2662, pruned_loss=0.04616, over 18941.00 frames. ], tot_loss[loss=0.1955, simple_loss=0.2792, pruned_loss=0.05592, over 2417892.17 frames. ], batch size: 86, lr: 6.68e-03, grad_scale: 32.0
2023-06-15 15:16:32,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=175326.66666666666, ans=0.125
2023-06-15 15:17:23,285 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.770e+02 1.950e+02 2.283e+02 3.292e+02, threshold=3.901e+02, percent-clipped=0.0
2023-06-15 15:17:33,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=175526.66666666666, ans=0.0
2023-06-15 15:17:37,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=175593.33333333334, ans=0.125
2023-06-15 15:17:38,344 INFO [train.py:988] (1/4) Epoch 50, batch 250, loss[loss=0.2081, simple_loss=0.2879, pruned_loss=0.0641, over 20261.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2793, pruned_loss=0.05578, over 2731179.15 frames. ], batch size: 141, lr: 6.67e-03, grad_scale: 32.0
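The frequent [scaling.py:182] "ScheduledFloat: name=..., batch_count=..., ans=..." entries report hyperparameters (dropout probabilities, skip rates, balancer probabilities, whitening limits, bypass scales) whose current value is a function of how many batches have been processed, which lets regularization strength be scheduled over the course of training. A piecewise-linear schedule keyed on batch_count reproduces the shape of these entries; the class below is a sketch of that idea with made-up breakpoints, not the ScheduledFloat implementation in scaling.py.

    from bisect import bisect_right

    class PiecewiseSchedule:
        """Value that interpolates linearly between (batch_count, value) breakpoints.

        Sketch of a schedule like the ones behind the 'ScheduledFloat' log lines;
        the breakpoints in the example are invented, not values from this run.
        """

        def __init__(self, *points):
            self.xs = [x for x, _ in points]
            self.ys = [y for _, y in points]

        def __call__(self, batch_count: float) -> float:
            i = bisect_right(self.xs, batch_count)
            if i == 0:
                return self.ys[0]
            if i == len(self.xs):
                return self.ys[-1]
            x0, x1 = self.xs[i - 1], self.xs[i]
            y0, y1 = self.ys[i - 1], self.ys[i]
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

    # e.g. a dropout that starts at 0.3 and decays to 0.1 by batch 20000,
    # then stays at 0.1 for the rest of training:
    dropout_p = PiecewiseSchedule((0.0, 0.3), (20000.0, 0.1))
    print(dropout_p(174660.0))   # 0.1 once batch_count is past the last breakpoint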
2023-06-15 15:18:01,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=175660.0, ans=0.1
2023-06-15 15:18:05,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=175660.0, ans=0.04949747468305833
2023-06-15 15:18:14,846 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.41 vs. limit=22.5
2023-06-15 15:18:18,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=175726.66666666666, ans=0.125
2023-06-15 15:18:33,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=175793.33333333334, ans=0.125
2023-06-15 15:18:38,142 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.61 vs. limit=12.0
2023-06-15 15:18:53,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=175860.0, ans=0.2
2023-06-15 15:19:00,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=175860.0, ans=0.07
2023-06-15 15:19:06,262 INFO [train.py:988] (1/4) Epoch 50, batch 300, loss[loss=0.1867, simple_loss=0.2729, pruned_loss=0.05022, over 18310.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.2798, pruned_loss=0.056, over 2952663.63 frames. ], batch size: 74, lr: 6.66e-03, grad_scale: 16.0
2023-06-15 15:19:06,918 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.78 vs. limit=22.5
2023-06-15 15:19:14,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=175926.66666666666, ans=0.125
2023-06-15 15:19:26,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=175993.33333333334, ans=0.125
2023-06-15 15:19:30,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=175993.33333333334, ans=0.1
2023-06-15 15:19:52,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=176060.0, ans=0.2
2023-06-15 15:20:16,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=176193.33333333334, ans=0.0
2023-06-15 15:20:21,165 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.741e+02 2.011e+02 2.290e+02 4.002e+02, threshold=4.022e+02, percent-clipped=1.0
2023-06-15 15:20:34,830 INFO [train.py:988] (1/4) Epoch 50, batch 350, loss[loss=0.2033, simple_loss=0.2865, pruned_loss=0.06004, over 20460.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2788, pruned_loss=0.05551, over 3151202.35 frames. ], batch size: 160, lr: 6.66e-03, grad_scale: 16.0
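The grad_scale field in the batch summaries (32.0 in most entries above, dropping to 16.0 around batches 300-350 of epoch 50) is the dynamic loss-scaling factor used for fp16 training: it is halved when scaled gradients overflow and grown again after a run of successful steps, so it moves between powers of two over the log. The generic PyTorch AMP pattern below shows where such a scale lives in a training step; it is a sketch, not this project's training loop, which may manage scaling differently.

    import torch
    from torch.cuda.amp import GradScaler, autocast

    def train_step(model, inputs, targets, loss_fn, optimizer, scaler: GradScaler):
        """One fp16 training step with dynamic loss scaling (generic sketch)."""
        optimizer.zero_grad(set_to_none=True)
        with autocast():                      # run the forward pass in reduced precision
            loss = loss_fn(model(inputs), targets)
        scaler.scale(loss).backward()         # backward on the scaled loss
        scaler.step(optimizer)                # skips the update if gradients overflowed
        scaler.update()                       # halves or grows the scale as needed
        return loss.detach(), scaler.get_scale()

    # scaler = GradScaler()  # the scale then adapts automatically as training proceeds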
2023-06-15 15:20:46,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=176260.0, ans=15.0
2023-06-15 15:20:56,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=176326.66666666666, ans=0.1
2023-06-15 15:21:06,244 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.32 vs. limit=15.0
2023-06-15 15:21:29,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=176460.0, ans=0.0
2023-06-15 15:21:42,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=176460.0, ans=0.2
2023-06-15 15:22:03,799 INFO [train.py:988] (1/4) Epoch 50, batch 400, loss[loss=0.1996, simple_loss=0.2924, pruned_loss=0.05336, over 19201.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2793, pruned_loss=0.05575, over 3288517.57 frames. ], batch size: 92, lr: 6.65e-03, grad_scale: 32.0
2023-06-15 15:22:05,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=176593.33333333334, ans=0.125
2023-06-15 15:22:23,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=176660.0, ans=0.125
2023-06-15 15:22:43,482 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.97 vs. limit=10.0
2023-06-15 15:23:17,303 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.749e+02 1.909e+02 2.121e+02 2.900e+02, threshold=3.818e+02, percent-clipped=0.0
2023-06-15 15:23:32,288 INFO [train.py:988] (1/4) Epoch 50, batch 450, loss[loss=0.1767, simple_loss=0.2667, pruned_loss=0.04335, over 19107.00 frames. ], tot_loss[loss=0.1956, simple_loss=0.2799, pruned_loss=0.05561, over 3386533.76 frames. ], batch size: 94, lr: 6.65e-03, grad_scale: 32.0
2023-06-15 15:23:35,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=176926.66666666666, ans=0.125
2023-06-15 15:23:49,295 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0
2023-06-15 15:24:08,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=177060.0, ans=0.125
2023-06-15 15:24:57,650 INFO [train.py:988] (1/4) Epoch 50, batch 500, loss[loss=0.2, simple_loss=0.2821, pruned_loss=0.05896, over 19945.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.2806, pruned_loss=0.05644, over 3475540.96 frames. ], batch size: 126, lr: 6.64e-03, grad_scale: 32.0
2023-06-15 15:25:13,663 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.86 vs. limit=22.5
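The lr field decays smoothly within an epoch (6.70e-03 at batch 0 of epoch 50 down to 6.64e-03 by batch 500 above) and also steps down from one epoch to the next, so the schedule depends on both the batch index and the epoch index. A schedule with that shape can be written as the product of two slowly decaying factors, as in the sketch below; the exponents, the scale constants, and the base_lr value in the example are assumptions chosen to illustrate the form, not the exact schedule of this run.

    def lr_for(base_lr: float, batch: int, epoch: float,
               batch_scale: float = 7500.0, epoch_scale: float = 5.0) -> float:
        """Learning rate that decays with both batch count and epoch count (sketch).

        Shape only: the rate falls slowly within an epoch and drops further each
        epoch, matching the behaviour of the 'lr: ...' fields in this log. The
        exponents and constants are illustrative assumptions.
        """
        batch_factor = ((batch ** 2 + batch_scale ** 2) / batch_scale ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + epoch_scale ** 2) / epoch_scale ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor

    # The value decreases monotonically in both arguments (base_lr here is illustrative):
    assert lr_for(0.05, 1000, 1.0) > lr_for(0.05, 50000, 10.0) > lr_for(0.05, 172000, 49.0)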
2023-06-15 15:25:24,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=177326.66666666666, ans=0.125
2023-06-15 15:25:29,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=177393.33333333334, ans=0.0
2023-06-15 15:25:51,617 INFO [train.py:1201] (1/4) Done!