Collections including paper arxiv:2404.11912

Speculative Decoding

Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting

Paper • 2404.18911 • Published Apr 29, 2024 • 30
Accelerating LLM Inference with Staged Speculative Decoding

Paper • 2308.04623 • Published Aug 8, 2023 • 25
An Emulator for Fine-Tuning Large Language Models using Small Language Models

Paper • 2310.12962 • Published Oct 19, 2023 • 13
The Curious Case of Neural Text Degeneration

Paper • 1904.09751 • Published Apr 22, 2019 • 3

XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference

Paper • 2404.15420 • Published Apr 23, 2024 • 8
OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework

Paper • 2404.14619 • Published Apr 22, 2024 • 127
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Paper • 2404.14219 • Published Apr 22, 2024 • 256
How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study

Paper • 2404.14047 • Published Apr 22, 2024 • 45

Papers - Speculative Decoding - Draft - Model - SpecInfer

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

Paper • 2404.11912 • Published Apr 18, 2024 • 17
SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification

Paper • 2305.09781 • Published May 16, 2023 • 4

Papers - Speculative Decoding - Long Context

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

Paper • 2404.11912 • Published Apr 18, 2024 • 17

Papers - Speculative Decoding - Draft - Base Model - JF68M

we utilize a 4K retrieval cache as an intermediate draft cache in our hierarchical system, while leveraging the JackFram/Llama68M (JF68M) [28] model

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

Paper • 2404.11912 • Published Apr 18, 2024 • 17

Papers - Inference - Speculative Decoding - Draft - KV Cache

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

Paper • 2404.11912 • Published Apr 18, 2024 • 17

Papers - KV Cache

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

Paper • 2404.11912 • Published Apr 18, 2024 • 17
SnapKV: LLM Knows What You are Looking for Before Generation

Paper • 2404.14469 • Published Apr 22, 2024 • 24
LLM in a flash: Efficient Large Language Model Inference with Limited Memory

Paper • 2312.11514 • Published Dec 12, 2023 • 259
An Evolved Universal Transformer Memory

Paper • 2410.13166 • Published Oct 17, 2024 • 3

Papers - Speculative Decoding - KV Cache

we recognize two memory bottlenecks: model weights and KV cache, and the latter gradually becomes the bottleneck as context length increases

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

Paper • 2404.11912 • Published Apr 18, 2024 • 17
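
The note on this collection (the KV cache overtaking model weights as the memory bottleneck) can be illustrated with a rough back-of-the-envelope calculation. The sketch below assumes a Llama-2-7B-like shape (32 layers, 32 KV heads, head dimension 128, fp16 storage, roughly 7 billion parameters). These figures and the function names are illustrative assumptions and are not taken from this page.

```python
import math

# Assumed model shape (hypothetical, Llama-2-7B-like); adjust for other models.
N_LAYERS = 32
N_KV_HEADS = 32
HEAD_DIM = 128
BYTES_PER_ELEM = 2          # fp16
N_PARAMS = 7_000_000_000    # rough parameter count, an assumption


def kv_cache_bytes(seq_len: int, batch: int = 1) -> int:
    """Bytes needed to cache keys and values for `seq_len` tokens."""
    # Factor 2: one key tensor plus one value tensor per layer.
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
    return per_token * seq_len * batch


def weight_bytes() -> int:
    """Bytes needed to hold the model weights at the assumed precision."""
    return N_PARAMS * BYTES_PER_ELEM


def crossover_tokens() -> int:
    """Smallest context length at which the KV cache is at least as large as the weights."""
    return math.ceil(weight_bytes() / kv_cache_bytes(1))


if __name__ == "__main__":
    gib = 2 ** 30
    print(f"weights: {weight_bytes() / gib:.1f} GiB")
    for ctx in (4_096, 32_768, 131_072):
        print(f"context {ctx:>7}: KV cache {kv_cache_bytes(ctx) / gib:.1f} GiB")
    print(f"KV cache matches weight size at about {crossover_tokens()} tokens")
```

Under these assumptions the cache grows linearly with context length while the weights stay fixed, so past a few tens of thousands of tokens each decoding step spends more memory traffic on the cache than on the weights.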

Papers - Inference - Speculative Decoding - KV Cache

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

Paper • 2404.11912 • Published Apr 18, 2024 • 17

Papers - University - Carnegie Mellon University

Can large language models explore in-context?

Paper • 2403.15371 • Published Mar 22, 2024 • 32
Long-context LLMs Struggle with Long In-context Learning

Paper • 2404.02060 • Published Apr 2, 2024 • 36
PIQA: Reasoning about Physical Commonsense in Natural Language

Paper • 1911.11641 • Published Nov 26, 2019 • 2
AQuA: A Benchmarking Tool for Label Quality Assessment

Paper • 2306.09467 • Published Jun 15, 2023 • 1
