Submitted by chuyi777 90 REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models · 1 author 2
Submitted by zhangshaolei 49 LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token · 4 authors 4
Submitted by LXT 42 Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos · 10 authors 2
Submitted by LiquidAmmonia 40 MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models · 9 authors 2
Submitted by akhaliq 23 Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control · 12 authors 2
Submitted by Forceless 19 PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides · 9 authors 3
Submitted by tnlin 16 OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis · 13 authors 4
Submitted by BoZhang 14 Dolphin: Closed-loop Open-ended Auto-research through Thinking, Practice, and Feedback · 9 authors 3
Submitted by julianjuaner 14 Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers · 7 authors 2
Submitted by yyqoni 9 Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model · 8 authors 2
Submitted by ozbro 9 MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting · 6 authors 2
Submitted by mjbuehler 8 Graph-Aware Isomorphic Attention for Adaptive Dynamics in Transformers · 1 author 2
Submitted by Tvaranka 5 MagicFace: High-Fidelity Facial Expression Editing with Action-Unit Control · 5 authors 2
Submitted by WenhaoWang 3 Generalizable Origin Identification for Text-Guided Image-to-Image Diffusion Models · 6 authors 2