---
language:
- en
tags:
- image-generation
- text-to-image
- vae
- t5
- conditional-generation
- generative-modeling
- image-synthesis
- image-manipulation
- design-prototyping
- research
- educational
license: mit
datasets:
- blowing-up-groundhogs/font-square-v2
metrics:
- FID
- KID
- HWD
- CER
library_name: transformers
---

# Emuru

**Emuru** is a conditional generative model that integrates a T5-based decoder with a Variational Autoencoder (VAE) for image generation conditioned on text and style images. Users combine textual prompts (a style text and a generation text) with a reference style image to synthesize new images.

## Model description

- **Architecture**:
  Emuru uses a [T5ForConditionalGeneration](https://huggingface.co/docs/transformers/model_doc/t5) as its text decoder and an [AutoencoderKL](https://huggingface.co/docs/diffusers/api/models/autoencoderkl) as the VAE backbone. The T5 model encodes the textual prompts together with the partially decoded latent representation and predicts the next latent tokens; the VAE both encodes the initial style image and decodes the predicted latent tokens back into an image (see the sketch after this list).

- **Inputs**:
  1. **Style Image**: A reference image, which Emuru encodes to capture its style and other visual characteristics.
  2. **Style Text**: Text describing the style or context.
  3. **Generation Text**: Text describing the content or object to generate.

- **Outputs**:
  1. A synthesized image that reflects the fused style and text descriptions.

- **Tokenization**:
  Emuru uses an [AutoTokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer) to tokenize the text prompts; the T5 vocabulary and token embeddings are adjusted to match.

- **Usage scenarios**:
  - Stylized text-to-image generation
  - Image manipulation or design prototyping based on textual descriptions
  - Research or educational demonstrations of T5-based generative modeling
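
To make the architecture concrete, here is a minimal sketch of the encode-predict-decode loop described above. It is illustrative only: `predict_next_latent` is a hypothetical stand-in for the T5 decoding step, and the exact way Emuru combines the two prompts and lays out latent tokens is an assumption, not the model's real interface.

```python
import torch

@torch.no_grad()
def emuru_sketch(predict_next_latent, vae, tokenizer,
                 style_text, gen_text, style_img, max_new_tokens=64):
    # Tokenize the combined style and generation texts (how Emuru actually
    # joins the two prompts is an assumption here).
    input_ids = tokenizer(style_text + " " + gen_text, return_tensors="pt").input_ids

    # Encode the reference style image into the VAE latent space
    # (the standard diffusers AutoencoderKL interface).
    latents = vae.encode(style_img).latent_dist.sample()

    # Autoregressively extend the latent sequence: the T5 decoder conditions
    # on the prompt and the latents produced so far, then predicts the next
    # latent token. `predict_next_latent` is a hypothetical callable.
    for _ in range(max_new_tokens):
        next_latent = predict_next_latent(input_ids, latents)
        latents = torch.cat([latents, next_latent], dim=-1)  # grow along width

    # Decode the completed latent sequence back into pixel space.
    return vae.decode(latents).sample
```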

## How to use

Below is a minimal usage example in Python. Load the model with `AutoModel.from_pretrained(...)` and call `.generate(...)` or `.generate_batch(...)` to create images. (Depending on how the repository is configured, `from_pretrained` may also need `trust_remote_code=True`.)

```python
import torch
from PIL import Image
from transformers import AutoModel
from torchvision.transforms import functional as F

# 1. Load the model
model = AutoModel.from_pretrained("blowing-up-groundhogs/emuru")
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)  # Move to the GPU if one is available

# 2. Prepare your inputs
style_text = "A beautiful watercolor style"
gen_text = "A majestic mountain with a rainbow"
style_img = Image.open("my_style_image.png").convert("RGB")

# Convert the style image to a normalized tensor
style_img = F.to_tensor(style_img)
style_img = F.resize(style_img, (64, 64))  # Example size
style_img = F.normalize(style_img, [0.5], [0.5])  # Normalize to [-1, 1]
style_img = style_img.unsqueeze(0).to(device)  # Add a batch dimension

# 3. Generate an image
generated_pil_image = model.generate(
    style_text=style_text,
    gen_text=gen_text,
    style_img=style_img,
    max_new_tokens=64,
)

# 4. Save or display the result
generated_pil_image.save("generated_image.png")
```

### Batch generation

You can also generate a batch of images if you have multiple style texts, generation texts, and style images:

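Here, `img1` and `img2` are assumed to be style images already converted to tensors. A minimal helper for building them, reusing the preprocessing steps from the single-image example (the file names and the fixed 64×64 size are illustrative placeholders):

```python
from PIL import Image
from torchvision.transforms import functional as F

def preprocess_style_image(path, size=(64, 64)):
    # Same steps as the single-image example above: to tensor, resize,
    # and normalize to [-1, 1]. The target size is an example value.
    img = F.to_tensor(Image.open(path).convert("RGB"))
    img = F.resize(img, size)
    return F.normalize(img, [0.5], [0.5])

img1 = preprocess_style_image("style_image_1.png")  # hypothetical file names
img2 = preprocess_style_image("style_image_2.png")
```
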
```python
style_texts = ["Style text 1", "Style text 2"]
gen_texts = ["Gen text 1", "Gen text 2"]

# Stack the preprocessed style images and move them to the model's device
style_imgs = torch.stack([img1, img2], dim=0).to(device)  # (batch_size, C, H, W)
lengths = [img1.size(-1), img2.size(-1)]  # width of each style image

output_images = model.generate_batch(
    style_texts=style_texts,
    gen_texts=gen_texts,
    style_imgs=style_imgs,
    lengths=lengths,
    max_new_tokens=64,
)

# `output_images` is a list of PIL images
for idx, pil_img in enumerate(output_images):
    pil_img.save(f"batch_generated_image_{idx}.png")
```


## Citation

If you use Emuru in your research or wish to refer to it, please cite:

```
...
```