|
--- |
|
title: README |
|
emoji: π’ |
|
colorFrom: purple |
|
colorTo: purple |
|
sdk: static |
|
pinned: false |
|
--- |
|
|
|
Text Generation Inference (TGI) is a solution built for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation using Tensor Parallelism and dynamic batching for the most popular open-source LLMs, including StarCoder, BLOOM, GPT-NeoX, Llama, and T5. It is already used by customers such as IBM, Grammarly, and the Open-Assistant initiative, and implements optimizations for all supported model architectures, including:
|
|
|
- Tensor Parallelism and custom CUDA kernels
|
- Optimized transformers code for inference using flash-attention and Paged Attention on the most popular architectures |
|
- Quantization with bitsandbytes or GPTQ
|
- Continuous batching of incoming requests for increased total throughput |
|
- Accelerated weight loading (start-up time) with safetensors |
|
- Logits warpers (temperature scaling, top-k, repetition penalty, ...)
|
- Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
|
- Stop sequences and log probabilities
|
- Token streaming using Server-Sent Events (SSE) |
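
Several of the features above (logits warpers, stop sequences, token streaming) are exposed as request parameters on the server's `/generate` and `/generate_stream` endpoints. The sketch below builds such a request body; the endpoint URL and port are assumptions about your deployment, not fixed values.

```python
# Sketch: building a request for a running TGI server.
# Assumes a server is listening at http://localhost:8080 (adjust as needed).
import json

def build_generate_payload(prompt, temperature=0.7, top_k=50,
                           repetition_penalty=1.1, max_new_tokens=64,
                           stop=None):
    """Build the JSON body for TGI's /generate endpoint.

    The keys under "parameters" mirror the logits warpers and stop
    sequences listed above (temperature scaling, top-k, repetition penalty).
    """
    return {
        "inputs": prompt,
        "parameters": {
            "temperature": temperature,
            "top_k": top_k,
            "repetition_penalty": repetition_penalty,
            "max_new_tokens": max_new_tokens,
            "stop": stop or [],
        },
    }

payload = build_generate_payload("def fibonacci(n):", stop=["\n\n"])
print(json.dumps(payload, indent=2))

# With `requests` installed, a blocking call would look like:
#   r = requests.post("http://localhost:8080/generate", json=payload)
#   print(r.json()["generated_text"])
# Token streaming uses /generate_stream instead, which emits
# Server-Sent Events ("data: {...}" lines) one token at a time.
```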
|
|
|
<img width="300px" src="https://huggingface.co/spaces/text-generation-inference/README/resolve/main/architecture.jpg" /> |
|
|
|
## Currently optimized architectures |
|
|
|
- [BLOOM](https://huggingface.co/bigscience/bloom) |
|
- [FLAN-T5](https://huggingface.co/google/flan-t5-xxl) |
|
- [Galactica](https://huggingface.co/facebook/galactica-120b) |
|
- [GPT-Neox](https://huggingface.co/EleutherAI/gpt-neox-20b) |
|
- [Llama](https://github.com/facebookresearch/llama) |
|
- [OPT](https://huggingface.co/facebook/opt-66b) |
|
- [SantaCoder](https://huggingface.co/bigcode/santacoder) |
|
- [StarCoder](https://huggingface.co/bigcode/starcoder)
|
- [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) |
|
- [Falcon 40B](https://huggingface.co/tiiuae/falcon-40b) |
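
Any of the models above can be served with the official TGI container. A minimal launch sketch (the image tag and flags may vary across releases; the model id here is only an example):

```shell
# Sketch: serving a supported model with the TGI Docker image.
model=bigscience/bloom-560m   # swap in any supported model id

docker run --gpus all --shm-size 1g -p 8080:80 \
    -v "$PWD/data:/data" \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id "$model"
```

The `-v` mount caches downloaded weights between runs, and `--shm-size` gives the sharded workers enough shared memory for tensor-parallel communication.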
|
|
|
## Check out the source code
|
- the server backend: https://github.com/huggingface/text-generation-inference |
|
- the Chat UI: https://huggingface.co/spaces/text-generation-inference/chat-ui |
|
|
|
## Check out examples |
|
|
|
- [Introducing the Hugging Face LLM Inference Container for Amazon SageMaker](https://huggingface.co/blog/sagemaker-huggingface-llm) |
|
- [Deploy LLMs with Hugging Face Inference Endpoints](https://huggingface.co/blog/inference-endpoints-llm) |
|
|
|
|