- Hydragen: High-Throughput LLM Inference with Shared Prefixes
  Paper • 2402.05099 • Published • 20
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting
  Paper • 2402.13720 • Published • 7
- Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
  Paper • 2405.12981 • Published • 32
- Your Transformer is Secretly Linear
  Paper • 2405.12250 • Published • 153
Collections including paper arxiv:2410.00907

- TinyGSM: achieving >80% on GSM8k with small language models
  Paper • 2312.09241 • Published • 39
- ShortGPT: Layers in Large Language Models are More Redundant Than You Expect
  Paper • 2403.03853 • Published • 63
- Gamba: Marry Gaussian Splatting with Mamba for single view 3D reconstruction
  Paper • 2403.18795 • Published • 20
- Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models
  Paper • 2404.04478 • Published • 13