Upload README.md with huggingface_hub
README.md CHANGED
```diff
@@ -15,6 +15,10 @@ tags:
 - Stable Diffusion
 - quantization
 - fp8
+- 8-bit
+- e4m3
+- reduced-precision
+base_model: Shakker-Labs/FLUX.1-dev-ControlNet-Union-Pro-2.0
 inference:
   parameters:
     torch_dtype: torch.float8_e4m3fn
@@ -24,8 +28,17 @@ inference:
 
 This repository contains an FP8 quantized version of the [Shakker-Labs/FLUX.1-dev-ControlNet-Union-Pro-2.0](https://huggingface.co/Shakker-Labs/FLUX.1-dev-ControlNet-Union-Pro-2.0) model. **This is NOT a fine-tuned model** but a direct quantization of the original BFloat16 model to FP8 format for optimized inference performance. We provide an [online demo](https://huggingface.co/spaces/Shakker-Labs/FLUX.1-dev-ControlNet-Union-Pro-2.0).
 
-#
-This model has been quantized from the original BFloat16 format to FP8 format.
+# Quantization Details
+This model has been quantized from the original BFloat16 format to FP8 format using PyTorch's native FP8 support. Here are the specifics:
+
+- **Quantization Technique**: Native FP8 quantization
+- **Precision**: E4M3 format (4 bits for exponent, 3 bits for mantissa)
+- **Library Used**: PyTorch's built-in FP8 support
+- **Data Type**: `torch.float8_e4m3fn`
+- **Original Model**: BFloat16 format (Shakker-Labs/FLUX.1-dev-ControlNet-Union-Pro-2.0)
+- **Model Size Reduction**: ~50% smaller than the original model
+
+The benefits of FP8 quantization include:
 - **Reduced Memory Usage**: Approximately 50% smaller model size compared to BFloat16/FP16
 - **Faster Inference**: Potential speed improvements, especially on hardware with FP8 support
 - **Minimal Quality Loss**: Carefully calibrated quantization process to preserve output quality
```