aoxo/RealFormer

Image-to-Image

English

art

aoxo committed on Sep 23, 2024

Commit

e62b5fc

verified ·

1 Parent(s): 4bc5e63

Update README.md

Files changed (1)

README.md +120 -5

@@ -120,17 +120,132 @@ Images and their corresponding style semantic maps were resized to fit the input
 - Number of attention layers: 8
 - Number of transformer encoder layers (feed-forward): 8
 - Number of transformer decoder layers (feed-forward): 8
-- Activation function: ReLU
 - Patch Size: 8
 - Swin Window Size: 7
 - Swin Shift Size: 2
--
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
 ## Evaluation

 - Number of attention layers: 8
 - Number of transformer encoder layers (feed-forward): 8
 - Number of transformer decoder layers (feed-forward): 8
+- Activation function(s): ReLU, GELU
 - Patch Size: 8
 - Swin Window Size: 7
 - Swin Shift Size: 2
+- Style Transfer Module: AdaIN
+#### Speeds, Sizes, Times
+**Model size:** There are currently four versions of the model:
+- v1_1: 224M params
+- v1_2: 200M params
+- v1_3: 93M params
+- v2_1: 2.9M params
+**Architecture:** The latest model, v2_1, introduces Location-based Multi-head Attention (LbMhA) to improve feature extraction at a lower parameter count. Its three predecessors reached a similar level of accuracy without LbMhA layers. The general architecture is as follows:
+```python
+223543305
+DataParallel(
+  (module): ViTImage2Image(
+    (patch_embed): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
+    (encoder_layers): ModuleList(
+      (0-7): 8 x TransformerEncoderBlock(
+        (attn): LocationBasedMultiheadAttention(
+          (q_proj): Linear(in_features=768, out_features=768, bias=True)
+          (k_proj): Linear(in_features=768, out_features=768, bias=True)
+          (v_proj): Linear(in_features=768, out_features=768, bias=True)
+          (out_proj): Linear(in_features=768, out_features=768, bias=True)
+          (dropout): Dropout(p=0.1, inplace=False)
+        )
+        (ff): Sequential(
+          (0): Linear(in_features=768, out_features=3072, bias=True)
+          (1): ReLU()
+          (2): Linear(in_features=3072, out_features=768, bias=True)
+        )
+        (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+        (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+        (adain): AdaIN(
+          (norm): InstanceNorm1d(768, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
+          (fc): Linear(in_features=768, out_features=1536, bias=True)
+        )
+        (dropout): Dropout(p=0.1, inplace=False)
+      )
+    )
+    (decoder_layers): ModuleList(
+      (0-7): 8 x TransformerDecoderBlock(
+        (attn1): LocationBasedMultiheadAttention(
+          (q_proj): Linear(in_features=768, out_features=768, bias=True)
+          (k_proj): Linear(in_features=768, out_features=768, bias=True)
+          (v_proj): Linear(in_features=768, out_features=768, bias=True)
+          (out_proj): Linear(in_features=768, out_features=768, bias=True)
+          (dropout): Dropout(p=0.1, inplace=False)
+        )
+        (attn2): LocationBasedMultiheadAttention(
+          (q_proj): Linear(in_features=768, out_features=768, bias=True)
+          (k_proj): Linear(in_features=768, out_features=768, bias=True)
+          (v_proj): Linear(in_features=768, out_features=768, bias=True)
+          (out_proj): Linear(in_features=768, out_features=768, bias=True)
+          (dropout): Dropout(p=0.1, inplace=False)
+        )
+        (ff): Sequential(
+          (0): Linear(in_features=768, out_features=3072, bias=True)
+          (1): ReLU()
+          (2): Linear(in_features=3072, out_features=768, bias=True)
+        )
+        (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+        (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+        (norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+        (norm4): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+        (adain1): AdaIN(
+          (norm): InstanceNorm1d(768, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
+          (fc): Linear(in_features=768, out_features=1536, bias=True)
+        )
+        (adain2): AdaIN(
+          (norm): InstanceNorm1d(768, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
+          (fc): Linear(in_features=768, out_features=1536, bias=True)
+        )
+        (dropout): Dropout(p=0.1, inplace=False)
+      )
+    )
+    (swin_layers): ModuleList(
+      (0-7): 8 x SwinTransformerBlock(
+        (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+        (attn): MultiheadAttention(
+          (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
+        )
+        (mlp): Sequential(
+          (0): Linear(in_features=768, out_features=3072, bias=True)
+          (1): GELU(approximate='none')
+          (2): Linear(in_features=3072, out_features=768, bias=True)
+        )
+        (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+      )
+    )
+    (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+    (mlp_head): Sequential(
+      (0): Linear(in_features=768, out_features=3072, bias=True)
+      (1): GELU(approximate='none')
+      (2): Linear(in_features=3072, out_features=768, bias=True)
+    )
+    (refinement): RefinementBlock(
+      (conv): Conv2d(768, 3, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
+      (bn): BatchNorm2d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
+      (relu): ReLU(inplace=True)
+    )
+    (style_encoder): Sequential(
+      (0): Conv2d(3, 768, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
+      (1): ReLU()
+      (2): AdaptiveAvgPool2d(output_size=1)
+      (3): Flatten(start_dim=1, end_dim=-1)
+      (4): Linear(in_features=768, out_features=768, bias=True)
+    )
+  )
+)
+```
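The printed summary shows each `AdaIN` module as an `InstanceNorm1d(768, affine=False)` followed by a `Linear(768, 1536)`, but not how the two are combined. Below is a minimal sketch of one common AdaIN formulation, assuming the 1536 outputs of `fc` are split into a per-channel scale and shift computed from a style vector. The `forward` signature, the tensor shapes, and the order of the split are assumptions and are not taken from the repository.

```python
import torch
from torch import nn


class AdaIN(nn.Module):
    """Hypothetical AdaIN layer matching the printed submodule shapes."""

    def __init__(self, dim: int = 768) -> None:
        super().__init__()
        # Parameter-free normalisation, as in the printed summary.
        self.norm = nn.InstanceNorm1d(dim, affine=False)
        # Maps a style vector to 2 * dim values (assumed: scale, then shift).
        self.fc = nn.Linear(dim, 2 * dim)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), style: (batch, dim) -- assumed shapes.
        scale, shift = self.fc(style).chunk(2, dim=-1)
        # InstanceNorm1d expects (batch, channels, length), so swap axes.
        normed = self.norm(x.transpose(1, 2)).transpose(1, 2)
        # Broadcast the per-channel style statistics over all tokens.
        return scale.unsqueeze(1) * normed + shift.unsqueeze(1)


if __name__ == "__main__":
    layer = AdaIN(dim=768)
    tokens = torch.randn(2, 196, 768)  # illustrative token count
    style = torch.randn(2, 768)
    print(layer(tokens, style).shape)
```

In this formulation the content tokens lose their own per-channel mean and variance and take on statistics predicted from the style vector, which is the usual motivation for AdaIN in style transfer.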
+**Training hardware:** Each model was trained on 2 x T4 GPUs (multi-GPU training). For this reason, the linear attention modules were implemented as ring (distributed) attention during training.
+**Total Training Compute Throughput:** 4.13 TFLOPS
+**Total Logged Training Time:** ~210 hours (total across all four models, including overhead)
+**Start Time:** 09-13-2024
+**End Time:** 09-21-2024
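As a rough cross-check, the logged throughput and training time can be multiplied to bound the total training compute. This is a back-of-the-envelope sketch, assuming the 4.13 TFLOPS figure was sustained for the full ~210 logged hours. Because those hours include overhead, the result is best read as an upper estimate. The constant names are illustrative.

```python
# Figures from the model card; sustained throughput is an assumption.
THROUGHPUT_TFLOPS = 4.13
LOGGED_HOURS = 210
SECONDS_PER_HOUR = 3600

# Total floating-point operations = rate (FLOP/s) * duration (s).
total_flop = THROUGHPUT_TFLOPS * 1e12 * LOGGED_HOURS * SECONDS_PER_HOUR

print(f"{total_flop:.2e} FLOP")  # prints "3.12e+18 FLOP"
```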
+**Checkpoint Size:**
+- v1_1: 855 MB
+- v1_2: 764 MB
+- v1_3: 355 MB
+- v2_1: 11 MB
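The checkpoint sizes are close to what the parameter counts predict if the weights are stored as 32-bit floats. The sketch below makes that arithmetic explicit. It assumes 4 bytes per parameter, that the listed "MB" values are binary megabytes (MiB), and that the 11 MB checkpoint belongs to the 2.9M-parameter model; none of these is stated in the model card.

```python
BYTES_PER_PARAM = 4  # assumption: float32 weights
MIB = 2**20          # assumption: listed sizes are binary megabytes

# (rounded parameter count, listed checkpoint size) from the lists above.
models = {
    "v1_1": (224e6, 855),
    "v1_2": (200e6, 764),
    "v1_3": (93e6, 355),
    "v2_1": (2.9e6, 11),
}


def estimate_checkpoint_mib(num_params: float) -> float:
    """Estimate the on-disk size of the raw weights in MiB."""
    return num_params * BYTES_PER_PARAM / MIB


for name, (params, listed) in models.items():
    estimate = estimate_checkpoint_mib(params)
    print(f"{name}: estimated {estimate:.0f} MiB, listed {listed} MB")
```

The estimates land within about 2% of the listed sizes, which is what rounded parameter counts and a small amount of file metadata would produce.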
 ## Evaluation