tlwu committed on
Commit 8202d1d · 1 Parent(s): 9b4394a

update doc

  1. README.md +5 -5
README.md CHANGED
@@ -48,7 +48,7 @@ Below is average latency of generating an image of size 512x512 using NVIDIA A10
 
 
 Static means the engine is built for the given batch size and image size combination, and CUDA graph is used to speed up.
-
+For PyTorch 2.1, the UNet use channel last (NHWC) format, and compile the UNet with mode `reduce-overhead`. See [benchmark script](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/stable_diffusion/benchmark_controlnet.py) for detail.
 
 #### Latency for SDXL-Turbo with Canny Control Net
 
@@ -56,10 +56,10 @@ Below is average latency of generating an image of size 512x512 with canny contr
 
 | Engine      | Batch Size | Steps | PyTorch 2.1     | ONNX Runtime CUDA |
 |-------------|------------|------ | ----------------|-------------------|
-| Static      | 1          | 1     | 160.0 ms        | 49.3 ms           |
-| Static      | 4          | 1     | 314.9 ms        | 135.3 ms          |
-| Static      | 1          | 4     | 251.9 ms        | 123.3 ms          |
-| Static      | 4          | 4     | 514.2 ms        | 303.3 ms          |
+| Static      | 1          | 1     | 160.0 ms        | 55.3 ms           |
+| Static      | 4          | 1     | 314.9 ms        | 144.4 ms          |
+| Static      | 1          | 4     | 251.9 ms        | 134.9 ms          |
+| Static      | 4          | 4     | 514.2 ms        | 332.6 ms          |
 
 
 ## Usage Example
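
Reviewer note: to read the updated table as relative speedups, a minimal sketch along these lines may help. The latency values are copied from the new rows in this commit. The names `LATENCY_MS`, `speedup`, and `SPEEDUPS` are hypothetical and are not part of the README or the benchmark script.

```python
# Latencies in milliseconds, copied from the updated table in this commit.
# Keys are (batch_size, steps); values are (PyTorch 2.1, ONNX Runtime CUDA).
# All names below are illustrative and do not come from the repository.
LATENCY_MS = {
    (1, 1): (160.0, 55.3),
    (4, 1): (314.9, 144.4),
    (1, 4): (251.9, 134.9),
    (4, 4): (514.2, 332.6),
}


def speedup(pytorch_ms: float, ort_ms: float) -> float:
    """Return how many times faster ONNX Runtime is than PyTorch."""
    return pytorch_ms / ort_ms


# Speedup of ONNX Runtime CUDA over PyTorch 2.1 for each configuration.
SPEEDUPS = {config: speedup(*pair) for config, pair in LATENCY_MS.items()}

if __name__ == "__main__":
    for (batch, steps), ratio in SPEEDUPS.items():
        print(f"batch={batch} steps={steps}: {ratio:.2f}x")
```

With the revised numbers, ONNX Runtime CUDA remains faster than PyTorch 2.1 in all four configurations, and the ratio shrinks as batch size and step count grow.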