ABDALLALSWAITI committed
Commit ba41584 · verified · 1 Parent(s): 8cf6344

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +15 -2
README.md CHANGED
@@ -15,6 +15,10 @@ tags:
 - Stable Diffusion
 - quantization
 - fp8
+- 8-bit
+- e4m3
+- reduced-precision
+base_model: Shakker-Labs/FLUX.1-dev-ControlNet-Union-Pro-2.0
 inference:
   parameters:
     torch_dtype: torch.float8_e4m3fn
@@ -24,8 +28,17 @@ inference:
 
 This repository contains an FP8 quantized version of the [Shakker-Labs/FLUX.1-dev-ControlNet-Union-Pro-2.0](https://huggingface.co/Shakker-Labs/FLUX.1-dev-ControlNet-Union-Pro-2.0) model. **This is NOT a fine-tuned model** but a direct quantization of the original BFloat16 model to FP8 format for optimized inference performance. We provide an [online demo](https://huggingface.co/spaces/Shakker-Labs/FLUX.1-dev-ControlNet-Union-Pro-2.0).
 
-# FP8 Quantization
-This model has been quantized from the original BFloat16 format to FP8 format. The benefits include:
+# Quantization Details
+This model has been quantized from the original BFloat16 format to FP8 format using PyTorch's native FP8 support. Here are the specifics:
+
+- **Quantization Technique**: Native FP8 quantization
+- **Precision**: E4M3 format (4 bits for exponent, 3 bits for mantissa)
+- **Library Used**: PyTorch's built-in FP8 support
+- **Data Type**: `torch.float8_e4m3fn`
+- **Original Model**: BFloat16 format (Shakker-Labs/FLUX.1-dev-ControlNet-Union-Pro-2.0)
+- **Model Size Reduction**: ~50% smaller than the original model
+
+The benefits of FP8 quantization include:
 - **Reduced Memory Usage**: Approximately 50% smaller model size compared to BFloat16/FP16
 - **Faster Inference**: Potential speed improvements, especially on hardware with FP8 support
 - **Minimal Quality Loss**: Carefully calibrated quantization process to preserve output quality
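
For readers who want to see what the cast described in the new "Quantization Details" section looks like in practice, here is a minimal sketch using plain PyTorch (2.1 or newer, which provides `torch.float8_e4m3fn`). The helper function names are illustrative and are not part of this repository or its commit; the actual checkpoint should be loaded the way the README documents.

```python
# Minimal sketch (not the repo's packaging script): casting BFloat16 weights to
# torch.float8_e4m3fn for storage, then upcasting back to a compute dtype.
import torch


def quantize_state_dict_to_fp8(state_dict: dict) -> dict:
    """Cast floating-point tensors to FP8 (E4M3); leave other tensors untouched."""
    out = {}
    for name, tensor in state_dict.items():
        out[name] = tensor.to(torch.float8_e4m3fn) if tensor.is_floating_point() else tensor
    return out


def dequantize_for_inference(state_dict: dict, dtype=torch.bfloat16) -> dict:
    """Upcast FP8 tensors back to a compute dtype before running the model."""
    return {
        name: t.to(dtype) if t.dtype == torch.float8_e4m3fn else t
        for name, t in state_dict.items()
    }


# Toy example: a single BFloat16 weight shrinks to half its size in FP8.
w_bf16 = torch.randn(1024, 1024, dtype=torch.bfloat16)
w_fp8 = w_bf16.to(torch.float8_e4m3fn)
print(w_bf16.element_size(), "bytes/elem ->", w_fp8.element_size(), "bytes/elem")  # 2 -> 1
```

Note that FP8 here acts mainly as a storage format: unless the hardware and software stack support FP8 compute, the weights are typically upcast (for example to BFloat16) at inference time, which is why the quoted README lists the speedup as a potential benefit rather than a guaranteed one.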