metadata

pipeline_tag: text-to-image
license: other
license_name: sai-nc-community
license_link: https://huggingface.co/stabilityai/sdxl-turbo/blob/main/LICENSE.TXT
base_model: stabilityai/sdxl-turbo
language:
  - en
tags:
  - stable-diffusion
  - sdxl
  - onnxruntime
  - onnx
  - text-to-image

Stable Diffusion XL Turbo for ONNX Runtime

Introduction

This repository hosts optimized versions of SDXL Turbo that accelerate inference with the ONNX Runtime CUDA execution provider.

See the usage instructions for how to run the SDXL pipeline with the ONNX files hosted in this repository.

Model Description

Developed by: Stability AI
Model type: Diffusion-based text-to-image generative model
License: STABILITY AI NON-COMMERCIAL RESEARCH COMMUNITY LICENSE
Model Description: This is a conversion of the SDXL-Turbo model for ONNX Runtime inference with the CUDA execution provider.

The VAE decoder is converted from sdxl-vae-fp16-fix. There are slight discrepancies between its output and that of the original VAE, but the decoded images should be close enough for most purposes.

The Canny control net is converted from diffusers/controlnet-canny-sdxl-1.0.

Performance Comparison

Latency for SDXL-Turbo

Below is the average latency of generating a 512x512 image on an NVIDIA A100-SXM4-80GB GPU:

Engine	Batch Size	Steps	PyTorch 2.1	ONNX Runtime CUDA
Static	1	1	109.4 ms	43.9 ms
Static	4	1	247.0 ms	121.1 ms
Static	1	4	171.1 ms	97.5 ms
Static	4	4	390.5 ms	248.0 ms

Static means the engine is built for a specific combination of batch size and image size, and CUDA graphs are used to speed up inference. For PyTorch 2.1, the UNet uses the channels-last (NHWC) memory format and is compiled with torch.compile using mode reduce-overhead. See the benchmark script for details.
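For orientation, the PyTorch baseline configuration described above might look roughly like the sketch below. It assumes the diffusers library and the stabilityai/sdxl-turbo checkpoint. The names build_baseline_pipeline and BASELINE_KWARGS are hypothetical, and the actual benchmark script may differ in its details.

```python
# Call arguments for a single-step 512x512 generation. SDXL-Turbo is run
# without classifier-free guidance, hence guidance_scale=0.0.
BASELINE_KWARGS = {
    "num_inference_steps": 1,
    "guidance_scale": 0.0,
    "height": 512,
    "width": 512,
}


def build_baseline_pipeline(model_id="stabilityai/sdxl-turbo"):
    """Hypothetical helper: SDXL-Turbo in PyTorch with an NHWC, compiled UNet."""
    # Imported lazily so this module loads even without the GPU stack installed.
    import torch
    from diffusers import AutoPipelineForText2Image

    pipe = AutoPipelineForText2Image.from_pretrained(
        model_id, torch_dtype=torch.float16, variant="fp16"
    ).to("cuda")
    # Switch the UNet weights to the channels-last (NHWC) memory format.
    pipe.unet.to(memory_format=torch.channels_last)
    # reduce-overhead mode relies on CUDA graphs to cut kernel launch overhead.
    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead")
    return pipe
```

On a machine with a CUDA GPU, calling build_baseline_pipeline()(prompt="...", **BASELINE_KWARGS).images[0] would return a PIL image. The first few calls are slow because they include compilation, so any timing should start after a warm-up run.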

Latency for SDXL-Turbo with Canny Control Net

Below is the average latency of generating a 512x512 image with the Canny control net on an NVIDIA A100-SXM4-80GB GPU:

Engine	Batch Size	Steps	PyTorch 2.1	ONNX Runtime CUDA
Static	1	1	160.0 ms	55.3 ms
Static	4	1	314.9 ms	144.4 ms
Static	1	4	251.9 ms	134.9 ms
Static	4	4	514.2 ms	332.6 ms
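The relative speedup is easier to read as a ratio than as two latency columns. The short sketch below recomputes it from the numbers in the two tables above. The dictionary and function names are illustrative only.

```python
# Latencies in milliseconds, copied from the tables above.
# Key: (batch size, steps). Value: (PyTorch 2.1, ONNX Runtime CUDA).
SDXL_TURBO = {
    (1, 1): (109.4, 43.9),
    (4, 1): (247.0, 121.1),
    (1, 4): (171.1, 97.5),
    (4, 4): (390.5, 248.0),
}
SDXL_TURBO_CANNY = {
    (1, 1): (160.0, 55.3),
    (4, 1): (314.9, 144.4),
    (1, 4): (251.9, 134.9),
    (4, 4): (514.2, 332.6),
}


def speedups(table):
    """Return PyTorch latency divided by ONNX Runtime latency, per configuration."""
    return {key: round(pt / ort, 2) for key, (pt, ort) in table.items()}


for name, table in [("SDXL-Turbo", SDXL_TURBO), ("SDXL-Turbo + Canny", SDXL_TURBO_CANNY)]:
    for (batch, steps), ratio in speedups(table).items():
        print(f"{name}: batch={batch} steps={steps} speedup={ratio}x")
```

On these numbers the single-step, batch-1 case shows the largest ratio (about 2.5x without the control net and about 2.9x with it), while the 4-step, batch-4 case shows the smallest (roughly 1.5x to 1.6x).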

Usage Example

Follow the demo instructions. Example steps:

Install nvidia-docker using these instructions.
Clone the onnxruntime repository.

git clone https://github.com/microsoft/onnxruntime
cd onnxruntime

Download the SDXL ONNX files from this repo

git lfs install
git clone https://huggingface.co/tlwu/sdxl-turbo-onnxruntime

To try the Canny control net, check out the model from its branch:

git -C sdxl-turbo-onnxruntime checkout canny_control_net

Launch the docker

docker run --rm -it --gpus all -v $PWD:/workspace nvcr.io/nvidia/pytorch:23.10-py3 /bin/bash

Build ONNX Runtime from source

export CUDACXX=/usr/local/cuda-12.2/bin/nvcc
git config --global --add safe.directory '*'
sh build.sh --config Release  --build_shared_lib --parallel --use_cuda --cuda_version 12.2 \
            --cuda_home /usr/local/cuda-12.2 --cudnn_home /usr/lib/x86_64-linux-gnu/ --build_wheel --skip_tests \
            --use_tensorrt --tensorrt_home /usr/src/tensorrt \
            --cmake_extra_defines onnxruntime_BUILD_UNIT_TESTS=OFF \
            --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=80 \
            --allow_running_as_root
python3 -m pip install build/Linux/Release/dist/onnxruntime_gpu-*-cp310-cp310-linux_x86_64.whl --force-reinstall

If the GPU is not an A100, change CMAKE_CUDA_ARCHITECTURES=80 in the command line to match the GPU's compute capability (for example, 89 for RTX 4090 or 86 for RTX 3090). If your machine has less than 64 GB of memory, replace --parallel with --parallel 4 --nvcc_threads 1 to avoid running out of memory during the build.
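The two adjustments above can be derived mechanically. The following sketch is one way to do it, not part of the official build tooling. The function names are hypothetical, and query_compute_capability assumes an NVIDIA driver recent enough for nvidia-smi to support the compute_cap query field.

```python
import subprocess


def arch_from_compute_capability(cap: str) -> str:
    """Turn a compute capability such as '8.6' into the CMake value '86'."""
    major, minor = cap.strip().split(".")
    return f"{int(major)}{int(minor)}"


def parallel_flags(host_memory_gb: float) -> list[str]:
    """Pick build.sh parallelism flags from host RAM, per the 64 GB guidance."""
    if host_memory_gb < 64:
        return ["--parallel", "4", "--nvcc_threads", "1"]
    return ["--parallel"]


def query_compute_capability() -> str:
    """Ask nvidia-smi for the first GPU's compute capability, e.g. '8.0'.

    Assumption: the installed driver supports the compute_cap query field.
    """
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.splitlines()[0].strip()
```

For example, arch_from_compute_capability("8.9") gives "89", the value to pass as CMAKE_CUDA_ARCHITECTURES for an RTX 4090, and parallel_flags(32) gives the reduced-parallelism flags for a 32 GB machine.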

Install libraries and requirements

python3 -m pip install --upgrade pip
cd /workspace/onnxruntime/python/tools/transformers/models/stable_diffusion
python3 -m pip install -r requirements-cuda12.txt
python3 -m pip install --upgrade polygraphy onnx-graphsurgeon --extra-index-url https://pypi.ngc.nvidia.com
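Before running the demo, it can be worth confirming that the freshly built wheel actually exposes the CUDA execution provider. The sketch below is an optional sanity check rather than part of the demo. The helper names are hypothetical, and onnxruntime.get_available_providers() is the only ONNX Runtime call it relies on.

```python
def cuda_provider_available(providers):
    """Return True if the CUDA execution provider is in the given provider list."""
    return "CUDAExecutionProvider" in providers


def check_onnxruntime():
    """Report the installed ONNX Runtime version and providers, or raise."""
    # Imported lazily so the helper above can be used without onnxruntime installed.
    import onnxruntime as ort

    providers = ort.get_available_providers()
    if not cuda_provider_available(providers):
        raise RuntimeError(
            f"CUDAExecutionProvider not found; available providers: {providers}"
        )
    return ort.__version__, providers
```

Calling check_onnxruntime() inside the container should return the version string and a provider list that includes CUDAExecutionProvider. A RuntimeError suggests that a CPU-only onnxruntime package is shadowing the wheel built above.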

Perform ONNX Runtime optimized inference

python3 demo_txt2img_xl.py \
  "starry night over Golden Gate Bridge by van gogh" \
  --version xl-turbo   \
  --work-dir /workspace/sdxl-turbo-onnxruntime

Generate an image using the canny control net:

wget https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png

python3 demo_txt2img_xl.py --controlnet-type canny --controlnet-scale 0.5 --controlnet-image input_image_vermeer.png \
        --version xl-turbo --height 1024 --width 1024 \
        --work-dir /workspace/sdxl-turbo-onnxruntime \
        "portrait of Mona Lisa with mysterious smile and mountain, river and forest in the background"