-
A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation
Paper • 2310.16656 • Published • 44 -
CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images
Paper • 2310.16825 • Published • 33 -
Matryoshka Diffusion Models
Paper • 2310.15111 • Published • 42 -
I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
Paper • 2311.04145 • Published • 35
Collections
Discover the best community collections!
Collections including paper arxiv:2402.12376
-
Woodpecker: Hallucination Correction for Multimodal Large Language Models
Paper • 2310.16045 • Published • 16 -
HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
Paper • 2310.14566 • Published • 27 -
SILC: Improving Vision Language Pretraining with Self-Distillation
Paper • 2310.13355 • Published • 9 -
Conditional Diffusion Distillation
Paper • 2310.01407 • Published • 20
-
LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models
Paper • 2309.15103 • Published • 42 -
Neural Network Diffusion
Paper • 2402.13144 • Published • 95 -
Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation
Paper • 2402.10210 • Published • 35 -
FiT: Flexible Vision Transformer for Diffusion Model
Paper • 2402.12376 • Published • 48
-
InstructDiffusion: A Generalist Modeling Interface for Vision Tasks
Paper • 2309.03895 • Published • 14 -
ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning
Paper • 2309.16650 • Published • 10 -
CCEdit: Creative and Controllable Video Editing via Diffusion Models
Paper • 2309.16496 • Published • 9 -
FreeNoise: Tuning-Free Longer Video Diffusion Via Noise Rescheduling
Paper • 2310.15169 • Published • 10
-
Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts
Paper • 2309.04354 • Published • 15 -
Vision Transformers Need Registers
Paper • 2309.16588 • Published • 79 -
AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models
Paper • 2309.16414 • Published • 19 -
MotionLM: Multi-Agent Motion Forecasting as Language Modeling
Paper • 2309.16534 • Published • 15