ABDALLALSWAITI committed
Commit ba41584 · verified · 1 Parent(s): 8cf6344

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +15 -2
README.md CHANGED
@@ -15,6 +15,10 @@ tags:
 - Stable Diffusion
 - quantization
 - fp8
+- 8-bit
+- e4m3
+- reduced-precision
+base_model: Shakker-Labs/FLUX.1-dev-ControlNet-Union-Pro-2.0
 inference:
   parameters:
     torch_dtype: torch.float8_e4m3fn
@@ -24,8 +28,17 @@ inference:
 
 This repository contains an FP8 quantized version of the [Shakker-Labs/FLUX.1-dev-ControlNet-Union-Pro-2.0](https://huggingface.co/Shakker-Labs/FLUX.1-dev-ControlNet-Union-Pro-2.0) model. **This is NOT a fine-tuned model** but a direct quantization of the original BFloat16 model to FP8 format for optimized inference performance. We provide an [online demo](https://huggingface.co/spaces/Shakker-Labs/FLUX.1-dev-ControlNet-Union-Pro-2.0).
 
-# FP8 Quantization
-This model has been quantized from the original BFloat16 format to FP8 format. The benefits include:
+# Quantization Details
+This model has been quantized from the original BFloat16 format to FP8 format using PyTorch's native FP8 support. Here are the specifics:
+
+- **Quantization Technique**: Native FP8 quantization
+- **Precision**: E4M3 format (4 bits for exponent, 3 bits for mantissa)
+- **Library Used**: PyTorch's built-in FP8 support
+- **Data Type**: `torch.float8_e4m3fn`
+- **Original Model**: BFloat16 format (Shakker-Labs/FLUX.1-dev-ControlNet-Union-Pro-2.0)
+- **Model Size Reduction**: ~50% smaller than the original model
+
+The benefits of FP8 quantization include:
 - **Reduced Memory Usage**: Approximately 50% smaller model size compared to BFloat16/FP16
 - **Faster Inference**: Potential speed improvements, especially on hardware with FP8 support
 - **Minimal Quality Loss**: Carefully calibrated quantization process to preserve output quality
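
For readers who want to see what the cast described in the new "Quantization Details" section looks like in practice, here is a minimal sketch using plain PyTorch (2.1 or newer, which provides `torch.float8_e4m3fn`). The helper function names are illustrative and are not part of this repository or its commit; the actual checkpoint should be loaded the way the README documents.

```python
# Minimal sketch (not the repo's packaging script): casting BFloat16 weights to
# torch.float8_e4m3fn for storage, then upcasting back to a compute dtype.
import torch


def quantize_state_dict_to_fp8(state_dict: dict) -> dict:
    """Cast floating-point tensors to FP8 (E4M3); leave other tensors untouched."""
    out = {}
    for name, tensor in state_dict.items():
        out[name] = tensor.to(torch.float8_e4m3fn) if tensor.is_floating_point() else tensor
    return out


def dequantize_for_inference(state_dict: dict, dtype=torch.bfloat16) -> dict:
    """Upcast FP8 tensors back to a compute dtype before running the model."""
    return {
        name: t.to(dtype) if t.dtype == torch.float8_e4m3fn else t
        for name, t in state_dict.items()
    }


# Toy example: a single BFloat16 weight shrinks to half its size in FP8.
w_bf16 = torch.randn(1024, 1024, dtype=torch.bfloat16)
w_fp8 = w_bf16.to(torch.float8_e4m3fn)
print(w_bf16.element_size(), "bytes/elem ->", w_fp8.element_size(), "bytes/elem")  # 2 -> 1
```

Note that FP8 here acts mainly as a storage format: unless the hardware and software stack support FP8 compute, the weights are typically upcast (for example to BFloat16) at inference time, which is why the quoted README lists the speedup as a potential benefit rather than a guaranteed one.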