## Emuru Convolutional VAE

This repository hosts the Emuru Convolutional VAE described in our paper. The model consists of a convolutional Encoder and Decoder with four layers each, whose output channels are 32, 64, 128, and 256, respectively. The Encoder downsamples an input RGB image \(I \in \mathbb{R}^{3 \times H \times W}\) to a latent representation with a single channel and spatial dimensions \(h \times w\) (where \(h = H/8\) and \(w = W/8\)). This design compresses the style information in the image, enabling a lightweight Transformer Decoder to handle the latent features efficiently.
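As a quick way to see this compression in action, the minimal sketch below (using the Diffusers interface shown in the Usage section further down, with an arbitrary dummy input size) encodes a random image-shaped tensor and prints the latent shape, which should have a single channel and spatial dimensions reduced by a factor of 8.

```python
import torch
from diffusers import AutoencoderKL

# Load the pre-trained Emuru VAE (same checkpoint used throughout this card).
model = AutoencoderKL.from_pretrained("vpippi/emuru_vae")

# Dummy RGB input: batch of 1, height 64, width 768 (both multiples of 8).
x = torch.randn(1, 3, 64, 768)

with torch.no_grad():
    latents = model.encode(x).latent_dist.sample()

# Expect a single-channel latent with H/8 x W/8 spatial size, e.g. (1, 1, 8, 96).
print(latents.shape)
```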
### Training Details

- **Hardware:** NVIDIA RTX 4090
- **Iterations:** 60k
- **Optimizer:** AdamW with a learning rate of 1e-4
- **Loss Components** (combined into the overall objective sketched after this list):
  - **MAE Loss (\(\mathcal{L}_{MAE}\))** with weight 1
  - **WID Loss (\(\mathcal{L}_{WID}\))** with weight 0.005
  - **HTR Loss (\(\mathcal{L}_{HTR}\))** with weight 0.3 (using noisy teacher-forcing with probability 0.3)
  - **KL Loss (\(\mathcal{L}_{KL}\))** with weight \(\beta = 10^{-6}\)
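For reference, assuming these terms enter the objective as a standard weighted sum (with exactly the weights listed above), the overall training loss would read:

\[
\mathcal{L} = \mathcal{L}_{MAE} + 0.005\,\mathcal{L}_{WID} + 0.3\,\mathcal{L}_{HTR} + 10^{-6}\,\mathcal{L}_{KL}
\]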
### Auxiliary Networks

- **Writer Identification:** A ResNet with 6 blocks, trained until achieving 60% accuracy on a synthetic dataset.
- **Handwritten Text Recognition (HTR):** A Transformer Encoder-Decoder trained until reaching a Character Error Rate (CER) of 0.25 on the same synthetic dataset.
### Usage

You can load the pre-trained Emuru VAE using Diffusers' `AutoencoderKL` interface with a single line of code:

```python
from diffusers import AutoencoderKL
model = AutoencoderKL.from_pretrained("vpippi/emuru_vae")
```
The code snippet below demonstrates how to load an RGB image from disk, encode it into the latent space, decode it back to image space, and save the reconstructed image.

---
### Code Example

```python
from diffusers import AutoencoderKL
import torch
from torchvision.transforms.functional import to_tensor, to_pil_image
from PIL import Image

# Load the pre-trained Emuru VAE from the Hugging Face Hub.
model = AutoencoderKL.from_pretrained("vpippi/emuru_vae")

# Function to preprocess an RGB image:
# loads the image, converts it to RGB, and transforms it to a tensor normalized to [0, 1].
def preprocess_image(image_path):
    image = Image.open(image_path).convert("RGB")
    image_tensor = to_tensor(image).unsqueeze(0)  # Add batch dimension
    return image_tensor

# Function to postprocess a tensor back to a PIL image for visualization:
# clamps the tensor to [0, 1] and converts it to a PIL image.
def postprocess_tensor(tensor):
    tensor = torch.clamp(tensor, 0, 1).squeeze(0)  # Remove batch dimension
    return to_pil_image(tensor)

# Example: encode and decode an image.
# Replace with your image path.
image_path = "/path/to/image"
input_image = preprocess_image(image_path)

# Encode the image into the latent space.
# The encode() method returns an object with a 'latent_dist' attribute;
# we sample from this distribution to obtain the latent representation.
with torch.no_grad():
    latent_dist = model.encode(input_image).latent_dist
    latents = latent_dist.sample()

# Decode the latent representation back to image space.
with torch.no_grad():
    reconstructed = model.decode(latents).sample

# Load the original image for comparison.
original_image = Image.open(image_path).convert("RGB")

# Convert the reconstructed tensor back to a PIL image.
reconstructed_image = postprocess_tensor(reconstructed)

# Save the reconstructed image.
reconstructed_image.save("reconstructed_image.png")
```
---

### Testing with Sample Images

If you wish to test the model on sample images hosted on the Hugging Face Hub, you can:

- **Include sample images in your repository:** Place images in a folder (e.g., `samples/`) and reference them in your code.
- **Use the `huggingface_hub` API:** Download images programmatically using the `hf_hub_download` function.
For example, to download the sample image from this repository:

```python
from huggingface_hub import hf_hub_download
from PIL import Image

# Replace 'vpippi/emuru_vae' and 'samples/lam_sample.jpg' with your details.
image_path = hf_hub_download(repo_id="vpippi/emuru_vae", filename="samples/lam_sample.jpg")
sample_image = Image.open(image_path).convert("RGB")
sample_image.show()
```
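As an end-to-end sketch, the snippet below simply combines the two examples above: it downloads the sample image, runs it through the VAE encode/decode roundtrip, and saves the reconstruction (the output filename `lam_sample_reconstructed.png` is arbitrary).

```python
import torch
from diffusers import AutoencoderKL
from huggingface_hub import hf_hub_download
from PIL import Image
from torchvision.transforms.functional import to_tensor, to_pil_image

# Download the sample image and load the VAE (both hosted in this repository).
image_path = hf_hub_download(repo_id="vpippi/emuru_vae", filename="samples/lam_sample.jpg")
model = AutoencoderKL.from_pretrained("vpippi/emuru_vae")

# Encode the sample and decode it back, as in the Code Example above.
image = to_tensor(Image.open(image_path).convert("RGB")).unsqueeze(0)
with torch.no_grad():
    latents = model.encode(image).latent_dist.sample()
    reconstructed = model.decode(latents).sample

# Save the reconstruction for visual comparison with the original sample.
to_pil_image(torch.clamp(reconstructed, 0, 1).squeeze(0)).save("lam_sample_reconstructed.png")
```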
This approach allows you to easily test and demonstrate the capabilities of the Emuru VAE using images hosted on the Hugging Face Hub.

Feel free to modify the preprocessing steps to suit your needs. Enjoy experimenting with the Emuru VAE!