---
license: apache-2.0
datasets:
- evanarlian/imagenet_1k_resized_256
language:
- en
base_model:
- madebyollin/sdxl-vae-fp16-fix
- stabilityai/sdxl-vae
library_name: diffusers
---
# EQ-SDXL-VAE: an open-source reproduction of EQ-VAE on SDXL-VAE

**Adv-FT is done and achieves better performance than the original SDXL-VAE!**

original paper: https://arxiv.org/abs/2502.09509 <br>
source code of the reproduction: https://github.com/KohakuBlueleaf/HakuLatent

![DEMO thumbnail](images/demo.jpg)

Left: original image, Center: latent PCA projected to 3 dimensions as RGB, Right: decoded image <br>
The upper row is the original VAE; the bottom row is the EQ-VAE finetuned VAE.

## Introduction

EQ-VAE, short for **Equivariance Regularized VAE**, is a novel technique introduced in the paper "[Equivariance Regularized Latent Space for Improved Generative Image Modeling](https://arxiv.org/abs/2502.09509)" to enhance the latent spaces of autoencoders used in generative image models.  The core idea behind EQ-VAE is to address a critical limitation in standard autoencoders: their lack of equivariance to semantic-preserving transformations like scaling and rotation. This non-equivariance results in unnecessarily complex latent spaces, making it harder for subsequent generative models (like diffusion models) to learn efficiently and achieve optimal performance.
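
To make the equivariance idea concrete, here is a minimal sketch (not the actual HakuLatent training code): the decoder applied to a spatially transformed latent should reproduce the same transform of the input image. The helper name and the choice of bilinear downscaling as the transform are illustrative, and `vae` is assumed to be a diffusers `AutoencoderKL`.

```python
import torch.nn.functional as F

def eq_recon_loss(vae, x, scale=0.5):
    """Sketch of the EQ-VAE regularization term."""
    # Encode the image and sample a latent.
    latent = vae.encode(x).latent_dist.sample()
    # Apply the same semantic-preserving transform (here: downscaling)
    # to the latent and to the pixel-space image.
    latent_t = F.interpolate(latent, scale_factor=scale, mode="bilinear")
    x_t = F.interpolate(x, scale_factor=scale, mode="bilinear")
    # The decoded transformed latent should match the transformed image.
    return F.mse_loss(vae.decode(latent_t).sample, x_t)
```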

This repository provides the model weights of an open-source reproduction of the EQ-VAE method, specifically applied to the **SDXL-VAE**. SDXL-VAE is a powerful variational autoencoder known for its use in the popular Stable Diffusion XL (SDXL) image generation models. By fine-tuning the pre-trained SDXL-VAE with the EQ-VAE regularization, we aim to create a more structured and semantically meaningful latent space. This should lead to benefits such as:

* **Improved Generative Performance:** A simpler, more equivariant latent space is expected to be easier for generative models to learn from, potentially leading to faster training and improved image quality metrics like FID.
* **Enhanced Latent Space Structure:**  EQ-VAE encourages the latent representations to respect spatial transformations, resulting in a smoother and more interpretable latent manifold.
* **Compatibility with Existing Models:** EQ-VAE is designed as a regularization technique that can be applied to pre-trained autoencoders without requiring architectural changes or training from scratch, making it a practical and versatile enhancement.

This reproduction allows you to experiment with EQ-VAE on SDXL-VAE, replicate the findings of the original paper, and potentially leverage the benefits of equivariance regularization in your own generative modeling projects. For a deeper understanding of the theoretical background and experimental results, please refer to the original EQ-VAE paper linked above. The HakuLatent repository provides a straightforward implementation of the EQ-VAE fine-tuning process for any Diffusers VAE model.

## Visual Examples

Left: original image, Center: latent PCA projected to 3 dimensions as RGB, Right: decoded image <br>
The upper row is the original VAE; the bottom row is the EQ-VAE finetuned VAE.

| ![](images/demo1.jpg) | ![](images/demo5.jpg) |
| ------------------- | ------------------- |
| ![](images/demo3.jpg) | ![](images/demo2.jpg) |

## Usage

This model is heavily finetuned from SDXL-VAE and introduces a completely new latent space. YOU CAN'T USE THIS AS A DROP-IN REPLACEMENT FOR THE VAE IN YOUR EXISTING SDXL MODEL.

You can try finetuning your SDXL model with this VAE and expect a better final result, but it may require a lot of training time to get there.

To use this model in your own code or setup, load it with the `AutoencoderKL` class from the diffusers library:

```python
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("KBlueLeaf/EQ-SDXL-VAE").cuda().half()
...
```
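
For a quick round-trip sanity check, something like the following should work. The image path is a placeholder, and `vae.config.scaling_factor` only matters if you feed the latents to a downstream diffusion model, so it is omitted here:

```python
import torch
from diffusers import AutoencoderKL
from diffusers.utils import load_image
from torchvision.transforms.functional import to_tensor, to_pil_image

vae = AutoencoderKL.from_pretrained("KBlueLeaf/EQ-SDXL-VAE").cuda().half()

# Load an image and map it to [-1, 1], as the VAE expects.
image = load_image("example.png").resize((1024, 1024))  # placeholder path
x = (to_tensor(image) * 2 - 1).unsqueeze(0).cuda().half()

with torch.no_grad():
    latent = vae.encode(x).latent_dist.sample()  # encode into the new latent space
    recon = vae.decode(latent).sample            # decode back to pixels

to_pil_image((recon[0].float().clamp(-1, 1) + 1) / 2).save("recon.png")
```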

## Training Setup

* Base Model: [SDXL-VAE-fp16-fix](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix)
* Dataset: [ImageNet-1k-resized-256](https://huggingface.co/datasets/evanarlian/imagenet_1k_resized_256)
* Batch Size: 128 (bs 8, grad acc 16)
* Samples seen: 3.4M (26,500 optimizer steps on the VAE)
* Discriminator: HakuNLayerDiscriminator with n_layer=4
* Discriminator startup step: 10000
* Reconstruction Loss:
  * MSE loss
  * LPIPS loss
  * [ConvNeXt perceptual Loss](https://github.com/sypsyp97/convnext_perceptual_loss)
* Loss weights (combined as in the sketch after this list):
  * recon loss: 1.0
  * adv(disc) loss: 0.5
  * kl div loss: 1e-7
* For Adv-FT:
  * recon loss: 1.0
    * MSE Loss: 1.5
    * LPIPS Loss: 0.5
    * ConvNeXt perceptual Loss: 2.0
  * adv loss: 1.0
  * kl div loss: 0.0
    * Encoder frozen
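
The sketch below illustrates how these weights combine into the total objective for each stage (the helper names are hypothetical; the actual implementation lives in the HakuLatent repository):

```python
def eq_vae_stage_loss(recon, adv, kl):
    """Stage 1 (EQ-VAE finetuning): weights from the list above."""
    return 1.0 * recon + 0.5 * adv + 1e-7 * kl

def adv_ft_stage_loss(mse, lpips, convnext, adv):
    """Stage 2 (Adv-FT, encoder frozen): per-term recon weights, KL dropped."""
    recon = 1.5 * mse + 0.5 * lpips + 2.0 * convnext
    return 1.0 * recon + 1.0 * adv
```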

## Evaluation Results

We use the validation and test splits of ImageNet (150k images in total) at 256x256 resolution, and report MSE loss, PSNR, LPIPS, and ConvNeXt perceptual loss as our metrics.
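
For reference, a minimal sketch of how the MSE, PSNR, and LPIPS numbers below could be computed per batch (hypothetical helper; it assumes reconstructions and targets in [0, 1], uses the VGG backbone of the `lpips` package as an assumption, and omits the ConvNeXt perceptual loss):

```python
import torch
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net="vgg").cuda()

def batch_metrics(x, recon):
    """Hypothetical helper: x and recon are image batches in [0, 1]."""
    mse = torch.mean((x - recon) ** 2)
    psnr = 10 * torch.log10(1.0 / mse)              # peak value 1.0 for [0, 1] inputs
    lp = lpips_fn(x * 2 - 1, recon * 2 - 1).mean()  # LPIPS expects inputs in [-1, 1]
    return mse.item(), psnr.item(), lp.item()
```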

| Metrics  | SDXL-VAE  | EQ-SDXL-VAE | EQ-SDXL-VAE Adv FT |
| -------- | --------- | ----------- | ------------------ |
| MSE Loss | 3.683e-3  | 3.723e-3    | 3.532e-3           |
| PSNR     | 24.4698   | 24.4030     | 24.6364            |
| LPIPS    | 0.1316    | 0.1409      | 0.1299             |
| ConvNeXt | 1.305e-3  | 1.548e-3    | 1.322e-3           |

We can see that after EQ-VAE training without the adversarial loss, EQ-SDXL-VAE is slightly worse than the original VAE.

After further finetuning with the adversarial loss enabled and the encoder frozen, the MSE, PSNR, and LPIPS even improve beyond the original VAE!

**Note**: This repo contains the weights of EQ-SDXL-VAE Adv-FT.

## Next step

After the training is done, I will try to train a small T2I model on it to check whether EQ-VAE does help the training of image generation models.

Also, I will try to train a simple approximation decoder with only 2x upscaling (or no upscaling) of the latent, for fast previewing (if needed).

## References

[1] [[2502.09509] EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling](https://arxiv.org/abs/2502.09509)

[2] [madebyollin/sdxl-vae-fp16-fix · Hugging Face](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix)

[3] [sypsyp97/convnext_perceptual_loss: This package introduces a perceptual loss implementation based on the modern ConvNeXt architecture.](https://github.com/sypsyp97/convnext_perceptual_loss)

[4] [evanarlian/imagenet_1k_resized_256 · Datasets at Hugging Face](https://huggingface.co/datasets/evanarlian/imagenet_1k_resized_256)

## Cite

```bibtex
@misc{kohakublueleaf_eq_sdxl_vae,
    author       = {Shih-Ying Yeh (KohakuBlueLeaf)},
    title        = {EQ-SDXL-VAE: Equivariance Regularized SDXL Variational Autoencoder},
    year         = {2025},
    howpublished = {Hugging Face model card},
    url          = {https://huggingface.co/KBlueLeaf/EQ-SDXL-VAE},
    note         = {Finetuned SDXL-VAE with EQ-VAE regularization for improved latent space equivariance.}
}
```

## Acknowledgement

* [xiaoqianWX](https://huggingface.co/xiaoqianWX): Provided the compute resources.

* [AmericanPresidentJimmyCarter](https://huggingface.co/AmericanPresidentJimmyCarter): Provided the implementation of the random affine transformation.