update doc
README.md
CHANGED
@@ -48,7 +48,7 @@ Below is average latency of generating an image of size 512x512 using NVIDIA A10
 
 
 Static means the engine is built for the given batch size and image size combination, and CUDA graph is used to speed up.
-
+For PyTorch 2.1, the UNet uses channels-last (NHWC) format and is compiled with mode `reduce-overhead`. See the [benchmark script](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/stable_diffusion/benchmark_controlnet.py) for details.
 
 #### Latency for SDXL-Turbo with Canny Control Net
 
@@ -56,10 +56,10 @@ Below is average latency of generating an image of size 512x512 with canny contr
 
 | Engine | Batch Size | Steps | PyTorch 2.1 | ONNX Runtime CUDA |
 |-------------|------------|------ | ----------------|-------------------|
-| Static | 1 | 1 | 160.0 ms |
-| Static | 4 | 1 | 314.9 ms |
-| Static | 1 | 4 | 251.9 ms |
-| Static | 4 | 4 | 514.2 ms |
+| Static | 1 | 1 | 160.0 ms | 55.3 ms |
+| Static | 4 | 1 | 314.9 ms | 144.4 ms |
+| Static | 1 | 4 | 251.9 ms | 134.9 ms |
+| Static | 4 | 4 | 514.2 ms | 332.6 ms |
 
 
 ## Usage Example
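The speedups implied by the SDXL-Turbo with Canny Control Net table can be worked out directly from its latencies. The following is a minimal sketch with the numbers copied from the updated table rows; the names `LATENCIES_MS` and `speedup` are illustrative only and are not part of the benchmark script.

```python
# (batch size, steps) -> (PyTorch 2.1 latency, ONNX Runtime CUDA latency), in ms.
# Values are copied from the updated table rows in this change.
LATENCIES_MS = {
    (1, 1): (160.0, 55.3),
    (4, 1): (314.9, 144.4),
    (1, 4): (251.9, 134.9),
    (4, 4): (514.2, 332.6),
}


def speedup(pytorch_ms: float, ort_ms: float) -> float:
    """Return the ratio of PyTorch latency to ONNX Runtime latency."""
    return pytorch_ms / ort_ms


for (batch_size, steps), (pytorch_ms, ort_ms) in LATENCIES_MS.items():
    ratio = speedup(pytorch_ms, ort_ms)
    print(f"batch={batch_size} steps={steps}: {ratio:.2f}x")
```

On these measurements the ratio ranges from about 1.5x (batch size 4, 4 steps) to about 2.9x (batch size 1, 1 step).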