- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  Paper • 1810.04805 • Published • 17
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
  Paper • 1907.11692 • Published • 7
- Language Models are Few-Shot Learners
  Paper • 2005.14165 • Published • 13
- OPT: Open Pre-trained Transformer Language Models
  Paper • 2205.01068 • Published • 2
Collections
Collections including paper arxiv:2401.04088
- Mixtral of Experts
  Paper • 2401.04088 • Published • 158
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
  Paper • 2401.15947 • Published • 51
- MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
  Paper • 2401.04081 • Published • 71
- EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models
  Paper • 2308.14352 • Published

- Nemotron-4 15B Technical Report
  Paper • 2402.16819 • Published • 45
- Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
  Paper • 2402.19427 • Published • 55
- RWKV: Reinventing RNNs for the Transformer Era
  Paper • 2305.13048 • Published • 17
- Reformer: The Efficient Transformer
  Paper • 2001.04451 • Published

- Chain-of-Thought Reasoning Without Prompting
  Paper • 2402.10200 • Published • 105
- How to Train Data-Efficient LLMs
  Paper • 2402.09668 • Published • 42
- BitDelta: Your Fine-Tune May Only Be Worth One Bit
  Paper • 2402.10193 • Published • 22
- A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts
  Paper • 2402.09727 • Published • 38

- BlackMamba: Mixture of Experts for State-Space Models
  Paper • 2402.01771 • Published • 25
- OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models
  Paper • 2402.01739 • Published • 27
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
  Paper • 2401.15947 • Published • 51
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
  Paper • 2401.06066 • Published • 51

- Simple linear attention language models balance the recall-throughput tradeoff
  Paper • 2402.18668 • Published • 20
- Linear Transformers with Learnable Kernel Functions are Better In-Context Models
  Paper • 2402.10644 • Published • 81
- Repeat After Me: Transformers are Better than State Space Models at Copying
  Paper • 2402.01032 • Published • 24
- Zoology: Measuring and Improving Recall in Efficient Language Models
  Paper • 2312.04927 • Published • 2

- MambaByte: Token-free Selective State Space Model
  Paper • 2401.13660 • Published • 56
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
  Paper • 2401.10774 • Published • 55
- Self-Rewarding Language Models
  Paper • 2401.10020 • Published • 147
- Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding
  Paper • 2401.12954 • Published • 30