Unleashing the Full Potential of ERNIE4.5 using FastDeploy
ERNIE4.5 models released by Baidu have been widely adopted by the community, and the latest ERNIE-4.5-21B-A3B-Thinking shows its advantages as a lightweight model across several benchmarks. While a model's quality is crucial, the real challenge for developers is achieving efficient deployment without compromising the development experience. This is where FastDeploy, a purpose-built solution engineered to unlock the full potential of ERNIE4.5 models, becomes essential.
More than just a new open-source project, FastDeploy is a technology forged in Baidu's demanding, large-scale production environments. It has been used internally to deploy models ranging from tens of billions to trillions of parameters at scale, proving its robustness and scalability.
Performance Engineered for Your Workload
FastDeploy is not simply an inference server; it's a high-performance engine meticulously designed to address the most critical challenges of LLM deployment, particularly the immense complexity and resource demands of long-context models.
- Extreme Quantization, Simplified. Deploying a massive model like ERNIE4.5 can be daunting, but FastDeploy makes it practical. Beyond standard quantization, we're pioneering techniques like CCQ (Convolutional Code for Extreme Low-bit Quantization). This method packs four 2-bit weights into each INT8 value, a scheme known as WINT2, so you can deploy the 2-bit ERNIE-4.5-300B-A47B model on a single GPU and eliminate the complexities of inter-card communication. During inference, the weights are decoded and dequantized in real time to BF16, which is used for all calculations (see the packing sketch after this list).
- PLAS. FastDeploy introduces PLAS (Pluggable Lightweight Attention for Sparsity), an innovative attention mechanism that improves performance and efficiency. PLAS intelligently selects the most relevant parts of the context, dramatically reducing computation; this yields a 48% QPS improvement for the ERNIE4.5-21B-A3B model while maintaining near-lossless accuracy. The optimization can be applied post-training with only a small, learnable MLP layer, leaving the original model weights untouched (a block-selection sketch also follows this list).
- A Full Toolkit for Maximum Throughput. Peak performance is a multi-faceted challenge. FastDeploy includes a suite of features such as Prefill-Decode Disaggregation, Speculative Decoding, and Context Caching, all working in harmony to maximize throughput and minimize latency for ERNIE4.5.
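To make WINT2 concrete, here is a minimal, hypothetical packing sketch in NumPy. It is not FastDeploy's actual CCQ codec (the convolutional coding is more involved, and production kernels decode straight to BF16 on the GPU); it only illustrates how four 2-bit codes fit into one byte and how a scale recovers approximate weights at inference time.

import numpy as np

def pack_wint2(codes: np.ndarray) -> np.ndarray:
    """Pack 2-bit codes (values 0..3) into uint8, four codes per byte."""
    c = codes.reshape(-1, 4).astype(np.uint8)  # length must be a multiple of 4
    return c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)

def unpack_wint2(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_wint2: recover the four 2-bit codes from each byte."""
    return np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1).reshape(-1)

# Toy round trip: quantize, pack, then dequantize at "inference" time.
weights = np.random.randn(8).astype(np.float32)
scale = np.abs(weights).max() / 1.5                  # maps codes {0..3} onto the weight range
codes = np.clip(np.round(weights / scale + 1.5), 0, 3).astype(np.uint8)
packed = pack_wint2(codes)                           # 8 weights stored in just 2 bytes
dequant = (unpack_wint2(packed).astype(np.float32) - 1.5) * scale
print(packed.nbytes, np.abs(weights - dequant).max())

The storage math is the point: each uint8 carries four weights, which is why a 300B-parameter model shrinks enough to fit on a single GPU.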
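Likewise, here is a rough PyTorch sketch of the block-sparse idea behind PLAS. The block size, scorer architecture, and top-k policy are illustrative assumptions, not FastDeploy's implementation: a tiny learnable MLP summarizes each key block, the query scores the summaries, and full attention runs only over the best-scoring blocks.

import torch
import torch.nn.functional as F

BLOCK, TOP_K, DIM = 64, 4, 128  # illustrative sizes, not PLAS's actual settings

class BlockScorer(torch.nn.Module):
    """Small trainable MLP that maps each key block to a summary vector."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(dim, dim), torch.nn.ReLU(), torch.nn.Linear(dim, dim)
        )
    def forward(self, k_blocks):               # (num_blocks, BLOCK, DIM)
        return self.mlp(k_blocks.mean(dim=1))  # (num_blocks, DIM)

def sparse_attention(q, k, v, scorer):
    """q: (DIM,), k/v: (seq, DIM); attends over only TOP_K key/value blocks."""
    kb = k.view(-1, BLOCK, DIM)
    vb = v.view(-1, BLOCK, DIM)
    scores = scorer(kb) @ q                    # relevance of each block to this query
    idx = scores.topk(min(TOP_K, kb.size(0))).indices
    k_sel = kb[idx].reshape(-1, DIM)           # full attention, but only on selected blocks
    v_sel = vb[idx].reshape(-1, DIM)
    attn = F.softmax(k_sel @ q / DIM ** 0.5, dim=0)
    return attn @ v_sel                        # (DIM,)

q = torch.randn(DIM)
k, v = torch.randn(1024, DIM), torch.randn(1024, DIM)
out = sparse_attention(q, k, v, BlockScorer(DIM))
print(out.shape)  # torch.Size([128])

Because only the scorer's small MLP is trained, the base model weights stay frozen, which is what makes this kind of optimization applicable post-training.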
A Specialized Tool That Works with Your Ecosystem
As a developer, you're likely more familiar with the PyTorch and vLLM ecosystems. This is precisely why we’ve built FastDeploy to be seamlessly interoperable. While it is a specialized, performance-optimized toolkit for ERNIE4.5 models, it is designed to work within the workflows you already know and love.
FastDeploy is fully compatible with the OpenAI API protocol and aligns perfectly with the vLLM interface. This means you get the best of both worlds: you can leverage your existing knowledge and tools to quickly adopt FastDeploy and gain all the performance benefits without a steep learning curve.
Quick Start
The WINT2 quantized model can be automatically downloaded, and quantization-related configuration is included in the model's config.json file. Therefore, you don't need to specify the --quantization flag when starting the inference service. The following example is from the FastDeploy WINT2 documentation:
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-2Bits-Paddle \
--tensor-parallel-size 1 \
--use-cudagraph \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-seqs 256
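Once the server is up, any OpenAI-compatible client can talk to it. Here is a minimal sketch using the official openai Python package; the base_url below assumes the server is listening on port 8000, so adjust it (and the api_key placeholder) to match your deployment:

from openai import OpenAI

# Point base_url at your running api_server; the port here is an assumption.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
resp = client.chat.completions.create(
    model="baidu/ERNIE-4.5-300B-A47B-2Bits-Paddle",
    messages=[{"role": "user", "content": "Introduce ERNIE4.5 in one sentence."}],
)
print(resp.choices[0].message.content)

If you already have scripts written against vLLM's OpenAI-compatible server, the same request shape works here unchanged.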
Isn't it time to move beyond general-purpose solutions and use a tool engineered for peak performance? Start unleashing the full power of ERNIE4.5 with FastDeploy; the details are in our GitHub repository: FastDeploy.