- Stabilizing Transformer Training by Preventing Attention Entropy Collapse • arXiv:2303.06296 • Published Mar 11, 2023
- The Role of Entropy and Reconstruction in Multi-View Self-Supervised Learning • arXiv:2307.10907 • Published Jul 20, 2023
- Position Prediction as an Effective Pretraining Strategy • arXiv:2207.07611 • Published Jul 15, 2022
- Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models • arXiv:2501.12370 • Published Jan 2025
- Theory, Analysis, and Best Practices for Sigmoid Self-Attention • arXiv:2409.04431 • Published Sep 6, 2024