|
--- |
|
title: README |
|
emoji: π’ |
|
colorFrom: purple |
|
colorTo: purple |
|
sdk: static |
|
pinned: false |
|
--- |
|
|
|
Text Generation Inference (TGI) is a solution built for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation using Tensor Parallelism and dynamic batching for the most popular open-source LLMs, including StarCoder, BLOOM, GPT-NeoX, Llama, and T5. It is already used by customers such as IBM, Grammarly, and the Open-Assistant initiative, and implements optimizations for all supported model architectures, including:
|
|
|
- Tensor Parallelism and custom CUDA kernels
|
- Optimized transformers code for inference using flash-attention and Paged Attention on the most popular architectures |
|
- Quantization with bitsandbytes or GPTQ
|
- Continuous batching of incoming requests for increased total throughput |
|
- Accelerated weight loading (start-up time) with safetensors |
|
- Logits warpers (temperature scaling, top-k, repetition penalty, ...)
|
- Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
|
- Stop sequences and log probabilities
|
- Token streaming using Server-Sent Events (SSE) |
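
Several of the features above (logits warpers, stop sequences, token streaming) are exposed as request parameters on the server's `/generate` and `/generate_stream` endpoints. The sketch below builds such a request body; the endpoint URL and port are assumptions about your deployment, not fixed values.

```python
# Sketch: building a request for a running TGI server.
# Assumes a server is listening at http://localhost:8080 (adjust as needed).
import json

def build_generate_payload(prompt, temperature=0.7, top_k=50,
                           repetition_penalty=1.1, max_new_tokens=64,
                           stop=None):
    """Build the JSON body for TGI's /generate endpoint.

    The keys under "parameters" mirror the logits warpers and stop
    sequences listed above (temperature scaling, top-k, repetition penalty).
    """
    return {
        "inputs": prompt,
        "parameters": {
            "temperature": temperature,
            "top_k": top_k,
            "repetition_penalty": repetition_penalty,
            "max_new_tokens": max_new_tokens,
            "stop": stop or [],
        },
    }

payload = build_generate_payload("def fibonacci(n):", stop=["\n\n"])
print(json.dumps(payload, indent=2))

# With `requests` installed, a blocking call would look like:
#   r = requests.post("http://localhost:8080/generate", json=payload)
#   print(r.json()["generated_text"])
# Token streaming uses /generate_stream instead, which emits
# Server-Sent Events ("data: {...}" lines) one token at a time.
```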
|
|
|
<img width="300px" src="https://huggingface.co/spaces/text-generation-inference/README/resolve/main/architecture.jpg" /> |
|
|
|
## Currently optimized architectures |
|
|
|
- [BLOOM](https://huggingface.co/bigscience/bloom) |
|
- [FLAN-T5](https://huggingface.co/google/flan-t5-xxl) |
|
- [Galactica](https://huggingface.co/facebook/galactica-120b) |
|
- [GPT-Neox](https://huggingface.co/EleutherAI/gpt-neox-20b) |
|
- [Llama](https://github.com/facebookresearch/llama) |
|
- [OPT](https://huggingface.co/facebook/opt-66b) |
|
- [SantaCoder](https://huggingface.co/bigcode/santacoder) |
|
- [StarCoder](https://huggingface.co/bigcode/starcoder)
|
- [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) |
|
- [Falcon 40B](https://huggingface.co/tiiuae/falcon-40b) |
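
Any of the models above can be served with the official TGI container. A minimal launch sketch (the image tag and flags may vary across releases; the model id here is only an example):

```shell
# Sketch: serving a supported model with the TGI Docker image.
model=bigscience/bloom-560m   # swap in any supported model id

docker run --gpus all --shm-size 1g -p 8080:80 \
    -v "$PWD/data:/data" \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id "$model"
```

The `-v` mount caches downloaded weights between runs, and `--shm-size` gives the sharded workers enough shared memory for tensor-parallel communication.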
|
|
|
## Check out the source code
|
- the server backend: https://github.com/huggingface/text-generation-inference |
|
- the Chat UI: https://huggingface.co/spaces/text-generation-inference/chat-ui |
|
|
|
## Check out examples |
|
|
|
- [Introducing the Hugging Face LLM Inference Container for Amazon SageMaker](https://huggingface.co/blog/sagemaker-huggingface-llm) |
|
- [Deploy LLMs with Hugging Face Inference Endpoints](https://huggingface.co/blog/inference-endpoints-llm) |
|
|
|
|