Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos Paper • 2501.04001 • Published Jan 7, 2025 • 43
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing Paper • 2412.19806 • Published Oct 8, 2024 • 1
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation Paper • 2412.07589 • Published Dec 10, 2024 • 47
HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing Paper • 2412.04280 • Published Dec 5, 2024 • 14
Reasoning Implicit Sentiment with Chain-of-Thought Prompting Paper • 2305.11255 • Published May 18, 2023 • 1
MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter Paper • 2310.12798 • Published Oct 19, 2023 • 4
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning Paper • 2311.18651 • Published Nov 30, 2023
LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation Paper • 2308.05095 • Published Aug 9, 2023
Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models Paper • 2308.13812 • Published Aug 26, 2023 • 1
Faithful Logical Reasoning via Symbolic Chain-of-Thought Paper • 2405.18357 • Published May 28, 2024 • 2
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding Paper • 2406.19389 • Published Jun 27, 2024 • 52
PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis Paper • 2408.09481 • Published Aug 18, 2024 • 1
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning Paper • 2402.11435 • Published Feb 18, 2024
What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration Paper • 2410.20482 • Published Oct 27, 2024 • 1