---
license: apache-2.0
datasets:
- evanarlian/imagenet_1k_resized_256
language:
- en
base_model:
- madebyollin/sdxl-vae-fp16-fix
- stabilityai/sdxl-vae
library_name: diffusers
---

# EQ-SDXL-VAE: open-source reproduction of EQ-VAE on SDXL-VAE

**Adv-FT is done and achieves better performance than the original SDXL-VAE!**

Original paper: https://arxiv.org/abs/2502.09509 <br>
Source code of the reproduction: https://github.com/KohakuBlueleaf/HakuLatent
 |
|
|
|
Left: original image, Center: latent PCA to 3dim as RGB, Right: decoded image <br> |
|
Upper one is original VAE, bottome one is EQ-VAE finetuned VAE. |
|
|
|

## Introduction

EQ-VAE, short for **Equivariance Regularized VAE**, is a novel technique introduced in the paper "[Equivariance Regularized Latent Space for Improved Generative Image Modeling](https://arxiv.org/abs/2502.09509)" to enhance the latent spaces of autoencoders used in generative image models. The core idea behind EQ-VAE is to address a critical limitation in standard autoencoders: their lack of equivariance to semantic-preserving transformations like scaling and rotation. This non-equivariance results in unnecessarily complex latent spaces, making it harder for subsequent generative models (like diffusion models) to learn efficiently and achieve optimal performance.
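
To make the idea concrete, the sketch below illustrates the equivariance objective on top of a diffusers `AutoencoderKL`: decoding a transformed latent should reproduce the equally transformed image. This is only a minimal illustration of the regularization described in the paper (showing the MSE part of the objective), not the actual HakuLatent training code; the helper names `apply_transform` and `eq_recon_loss` are made up for this example.

```python
import random
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL

def apply_transform(x: torch.Tensor, k: int, scale: float) -> torch.Tensor:
    """Rotate by k * 90 degrees and rescale; usable on both images and latents."""
    x = torch.rot90(x, k, dims=(-2, -1))
    if scale != 1.0:
        x = F.interpolate(x, scale_factor=scale, mode="bilinear", align_corners=False)
    return x

def eq_recon_loss(vae: AutoencoderKL, image: torch.Tensor) -> torch.Tensor:
    """Equivariance objective: decode(T(encode(x))) should match T(x)."""
    # Sample one semantic-preserving transform and reuse it for both image and latent.
    k = random.choice([0, 1, 2, 3])
    scale = random.choice([0.5, 0.75, 1.0])
    latent = vae.encode(image).latent_dist.sample()
    recon = vae.decode(apply_transform(latent, k, scale)).sample
    return F.mse_loss(recon, apply_transform(image, k, scale))
```

With `k = 0` and `scale = 1.0` this reduces to the ordinary reconstruction loss, which is why the regularization can be mixed into standard VAE fine-tuning without architectural changes.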

This repository provides the model weights of an open-source reproduction of the EQ-VAE method, specifically applied to the **SDXL-VAE**. SDXL-VAE is a powerful variational autoencoder known for its use in the popular Stable Diffusion XL (SDXL) image generation models. By fine-tuning the pre-trained SDXL-VAE with the EQ-VAE regularization, we aim to create a more structured and semantically meaningful latent space. This should lead to benefits such as:

* **Improved Generative Performance:** A simpler, more equivariant latent space is expected to be easier for generative models to learn from, potentially leading to faster training and improved image quality metrics like FID.
* **Enhanced Latent Space Structure:** EQ-VAE encourages the latent representations to respect spatial transformations, resulting in a smoother and more interpretable latent manifold.
* **Compatibility with Existing Models:** EQ-VAE is designed as a regularization technique that can be applied to pre-trained autoencoders without requiring architectural changes or training from scratch, making it a practical and versatile enhancement.

This reproduction allows you to experiment with EQ-VAE on SDXL-VAE, replicate the findings of the original paper, and potentially leverage the benefits of equivariance regularization in your own generative modeling projects. For a deeper understanding of the theoretical background and experimental results, please refer to the original EQ-VAE paper linked above. The source code in the HakuLatent repository provides a straightforward implementation of the EQ-VAE fine-tuning process for any diffusers VAE model.

## Visual Examples

Left: original image, Center: latent PCA projected to 3 dimensions as RGB, Right: decoded image <br>
Top: original SDXL-VAE, Bottom: EQ-VAE finetuned VAE.

|  |  |
| ------------------- | ------------------- |
|  |  |
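
The latent visualization in the middle column above can be reproduced by projecting the latent channels onto their first three principal components and mapping them to RGB. Below is a minimal sketch of that idea (not necessarily the exact script used to render these figures; `latent_to_rgb` is a hypothetical helper):

```python
import torch

def latent_to_rgb(latent: torch.Tensor) -> torch.Tensor:
    """Project a (C, H, W) latent to a (3, H, W) RGB tensor via PCA, normalized to [0, 1]."""
    c, h, w = latent.shape
    pixels = latent.reshape(c, -1).T.float()        # (H*W, C): one latent vector per pixel
    # First three principal directions of the per-pixel latent vectors.
    _, _, v = torch.pca_lowrank(pixels, q=3)
    rgb = (pixels @ v).T.reshape(3, h, w)
    rgb = (rgb - rgb.amin()) / (rgb.amax() - rgb.amin() + 1e-8)
    return rgb

# Example: visualize the latent of one image (latent obtained from vae.encode as in Usage below).
# rgb = latent_to_rgb(vae.encode(image).latent_dist.mode()[0])
```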

## Usage

This model is heavily finetuned from SDXL-VAE and introduces a completely new latent space. YOU CAN'T USE THIS VAE WITH YOUR EXISTING SDXL MODEL.

You can try to finetune your SDXL model on this VAE and expect a better final result, but it may require a lot of training time to get there.

To use this model in your custom code or setup, load it with the `AutoencoderKL` class from the diffusers library:

```python
from diffusers import AutoencoderKL

# Load the EQ-SDXL-VAE weights (fp16, on GPU)
vae = AutoencoderKL.from_pretrained("KBlueLeaf/EQ-SDXL-VAE").cuda().half()
```
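
A minimal encode/decode round trip with the loaded VAE could look like the following (a sketch only; file names are placeholders, and images are assumed to be RGB scaled to [-1, 1] as usual for SD-family VAEs):

```python
import torch
from PIL import Image
from torchvision.transforms.functional import to_pil_image, to_tensor

# `vae` is the AutoencoderKL loaded above; "input.png" is a placeholder path for any RGB image.
image = to_tensor(Image.open("input.png").convert("RGB")) * 2 - 1   # scale to [-1, 1]
image = image.unsqueeze(0).cuda().half()

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()   # (1, 4, H/8, W/8)
    recon = vae.decode(latents).sample                  # back to pixel space

recon = (recon.clamp(-1, 1) + 1) / 2                     # back to [0, 1]
to_pil_image(recon[0].float().cpu()).save("recon.png")
```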

## Training Setup

* Base model: [SDXL-VAE-fp16-fix](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix)
* Dataset: [ImageNet-1k-resized-256](https://huggingface.co/datasets/evanarlian/imagenet_1k_resized_256)
* Batch size: 128 (batch size 8, gradient accumulation 16)
* Samples seen: 3.4M (26,500 optimizer steps on the VAE)
* Discriminator: HakuNLayerDiscriminator with `n_layer=4`
* Discriminator startup steps: 10,000
* Reconstruction loss:
  * MSE loss
  * LPIPS loss
  * [ConvNeXt perceptual loss](https://github.com/sypsyp97/convnext_perceptual_loss)
* Loss weights (combined as in the sketch after this list):
  * recon loss: 1.0
  * adv (disc) loss: 0.5
  * kl div loss: 1e-7
* For Adv FT:
  * recon loss: 1.0
    * MSE loss: 1.5
    * LPIPS loss: 0.5
    * ConvNeXt perceptual loss: 2.0
  * adv loss: 1.0
  * kl div loss: 0.0
  * Encoder frozen
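
For illustration, the weights above combine roughly as in the sketch below (the individual loss terms are assumed to be computed elsewhere, the per-term weights inside the first-stage reconstruction loss are not listed above and are shown unweighted, and `total_vae_loss` is a made-up name, not the actual HakuLatent training loop):

```python
def total_vae_loss(mse, lpips, convnext, adv, kl, adv_ft: bool = False):
    """Combine the individual loss terms with the weights listed above."""
    if not adv_ft:
        # EQ-VAE training stage: recon 1.0, adv 0.5, KL 1e-7.
        recon = mse + lpips + convnext   # per-term weights within recon not specified above
        return 1.0 * recon + 0.5 * adv + 1e-7 * kl
    # Adversarial finetuning stage (encoder frozen): weighted recon, adv 1.0, no KL term.
    recon = 1.5 * mse + 0.5 * lpips + 2.0 * convnext
    return 1.0 * recon + 1.0 * adv
```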

## Evaluation Results

We use the validation and test splits of ImageNet (150k images in total) at 256x256 resolution, with MSE loss, PSNR, LPIPS, and ConvNeXt perceptual loss as our metrics.
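
For reference, PSNR and MSE are directly related; a minimal sketch, assuming images normalized to [0, 1] (not necessarily the exact evaluation script, and `mse_and_psnr` is a hypothetical helper):

```python
import torch

def mse_and_psnr(original: torch.Tensor, reconstructed: torch.Tensor) -> tuple[float, float]:
    """MSE and PSNR for image tensors in [0, 1]; PSNR = 10 * log10(1 / MSE)."""
    mse = torch.mean((original - reconstructed) ** 2)
    psnr = 10 * torch.log10(1.0 / mse)
    return mse.item(), psnr.item()
```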

| Metric   | SDXL-VAE  | EQ-SDXL-VAE | EQ-SDXL-VAE Adv FT |
| -------- | --------- | ----------- | ------------------ |
| MSE Loss | 3.683e-3  | 3.723e-3    | 3.532e-3           |
| PSNR     | 24.4698   | 24.4030     | 24.6364            |
| LPIPS    | 0.1316    | 0.1409      | 0.1299             |
| ConvNeXt | 1.305e-3  | 1.548e-3    | 1.322e-3           |

We can see that after the EQ-VAE training without the adversarial loss, EQ-SDXL-VAE is slightly worse than the original VAE.

After further finetuning with the adversarial loss enabled and the encoder frozen, MSE, PSNR, and LPIPS even improve beyond the original VAE!

**Note**: This repo contains the weights of EQ-SDXL-VAE Adv FT.

## Next step

After the training is done, I will try to train a small T2I model on this VAE to check whether EQ-VAE really helps the training of image generation models.

I will also try to train a simple approximation decoder with only 2x upscaling (or no upscaling) of the latent, for fast previewing (if needed).

## References

[1] [[2502.09509] EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling](https://arxiv.org/abs/2502.09509)

[2] [madebyollin/sdxl-vae-fp16-fix · Hugging Face](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix)

[3] [sypsyp97/convnext_perceptual_loss: a perceptual loss implementation based on the modern ConvNeXt architecture](https://github.com/sypsyp97/convnext_perceptual_loss)

[4] [evanarlian/imagenet_1k_resized_256 · Datasets at Hugging Face](https://huggingface.co/datasets/evanarlian/imagenet_1k_resized_256)

## Cite

```bibtex
@misc{kohakublueleaf_eq_sdxl_vae,
  author       = {Shih-Ying Yeh (KohakuBlueLeaf)},
  title        = {EQ-SDXL-VAE: Equivariance Regularized SDXL Variational Autoencoder},
  year         = {2024},
  howpublished = {Hugging Face model card},
  url          = {https://huggingface.co/KBlueLeaf/EQ-SDXL-VAE},
  note         = {Finetuned SDXL-VAE with EQ-VAE regularization for improved latent space equivariance.}
}
```

## Acknowledgement

* [xiaoqianWX](https://huggingface.co/xiaoqianWX): provided the compute resources.
* [AmericanPresidentJimmyCarter](https://huggingface.co/AmericanPresidentJimmyCarter): provided the implementation of the random affine transformation.