An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels Paper • 2406.09415 • Published Jun 13, 2024 • 51
Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion Paper • 2406.04338 • Published Jun 6, 2024 • 39
Byte Latent Transformer: Patches Scale Better Than Tokens Paper • 2412.09871 • Published Dec 13, 2024 • 94
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining Paper • 2501.00958 • Published Jan 1, 2025 • 101
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token Paper • 2501.03895 • Published Jan 7, 2025 • 52
MatAnyone: Stable Video Matting with Consistent Memory Propagation Paper • 2501.14677 • Published Jan 24, 2025 • 32
ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features Paper • 2502.04320 • Published Feb 6, 2025 • 35
EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling Paper • 2502.09509 • Published Feb 13, 2025 • 7
Distill Any Depth: Distillation Creates a Stronger Monocular Depth Estimator Paper • 2502.19204 • Published 26 days ago • 11
UniTok: A Unified Tokenizer for Visual Generation and Understanding Paper • 2502.20321 • Published 25 days ago • 29
How far can we go with ImageNet for Text-to-Image generation? Paper • 2502.21318 • Published 24 days ago • 25
AI-Invented Tonal Languages: Preventing a Machine Lingua Franca Beyond Human Understanding Paper • 2503.01063 • Published 21 days ago • 5
Fine-Tuning Small Language Models for Domain-Specific AI: An Edge AI Perspective Paper • 2503.01933 • Published 21 days ago • 11
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM Paper • 2503.04724 • Published 18 days ago • 66
Forgetting Transformer: Softmax Attention with a Forget Gate Paper • 2503.02130 • Published 21 days ago • 28
AnyMoLe: Any Character Motion In-betweening Leveraging Video Diffusion Models Paper • 2503.08417 • Published 13 days ago • 7
Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models Paper • 2503.09573 • Published 12 days ago • 59
The Curse of Conditions: Analyzing and Improving Optimal Transport for Conditional Flow-Based Generation Paper • 2503.10636 • Published 11 days ago • 3
ReCamMaster: Camera-Controlled Generative Rendering from A Single Video Paper • 2503.11647 • Published 10 days ago • 117