- HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems
  Paper • 2411.02959 • Published • 68
- "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization
  Paper • 2411.02355 • Published • 48
- CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmented Generation
  Paper • 2410.23090 • Published • 54
- RARe: Retrieval Augmented Retrieval with In-Context Examples
  Paper • 2410.20088 • Published • 5

Collections including paper arxiv:2412.04467

- Differential Transformer
  Paper • 2410.05258 • Published • 169
- PaliGemma 2: A Family of Versatile VLMs for Transfer
  Paper • 2412.03555 • Published • 126
- VisionZip: Longer is Better but Not Necessary in Vision Language Models
  Paper • 2412.04467 • Published • 107
- o1-Coder: an o1 Replication for Coding
  Paper • 2412.00154 • Published • 43

- Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
  Paper • 2410.13848 • Published • 33
- DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control
  Paper • 2410.13830 • Published • 24
- VisionZip: Longer is Better but Not Necessary in Vision Language Models
  Paper • 2412.04467 • Published • 107

- LongVILA: Scaling Long-Context Visual Language Models for Long Videos
  Paper • 2408.10188 • Published • 51
- xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
  Paper • 2408.08872 • Published • 98
- Building and better understanding vision-language models: insights and future directions
  Paper • 2408.12637 • Published • 125
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
  Paper • 2408.12528 • Published • 51

- LLM Pruning and Distillation in Practice: The Minitron Approach
  Paper • 2408.11796 • Published • 58
- TableBench: A Comprehensive and Complex Benchmark for Table Question Answering
  Paper • 2408.09174 • Published • 52
- To Code, or Not To Code? Exploring Impact of Code in Pre-training
  Paper • 2408.10914 • Published • 42
- Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications
  Paper • 2408.11878 • Published • 56

- A Closer Look into Mixture-of-Experts in Large Language Models
  Paper • 2406.18219 • Published • 16
- VisionZip: Longer is Better but Not Necessary in Vision Language Models
  Paper • 2412.04467 • Published • 107
- p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay
  Paper • 2412.04449 • Published • 6
- ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
  Paper • 2412.14711 • Published • 16

- Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
  Paper • 2401.14405 • Published • 13
- CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
  Paper • 2406.18521 • Published • 29
- xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
  Paper • 2408.12590 • Published • 36
- Law of Vision Representation in MLLMs
  Paper • 2408.16357 • Published • 93

- What matters when building vision-language models?
  Paper • 2405.02246 • Published • 102
- An Introduction to Vision-Language Modeling
  Paper • 2405.17247 • Published • 87
- DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark
  Paper • 2405.19707 • Published • 7
- Scaling Up Your Kernels: Large Kernel Design in ConvNets towards Universal Representations
  Paper • 2410.08049 • Published • 8

- Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
  Paper • 2405.08748 • Published • 22
- Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
  Paper • 2405.10300 • Published • 28
- Chameleon: Mixed-Modal Early-Fusion Foundation Models
  Paper • 2405.09818 • Published • 130
- OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
  Paper • 2405.11143 • Published • 36

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 26
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 13
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 41
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 22