aoxo · Image-to-Image · English · art

aoxo committed · Commit e956d96 · verified · Parent(s): 720961d

Update README.md

Files changed (1):
  1. README.md +13 -13
README.md CHANGED
@@ -191,7 +191,7 @@ Images and their corresponding style semantic maps were resized to **512 x 512**
  - v1_3: 93M params
  - v2_1: 2.9M params
  - v3: 252.6M params
- - v4: 454.2M params
+ - v4: 651.9M params
 
  **Training hardware:** Each of the models was trained on 2 x T4 GPUs (multi-GPU training). For this reason, linear attention modules were implemented as ring (distributed) attention during training.
 
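The hunk above also notes that the linear attention modules were trained as ring (distributed) attention across the two T4s. A toy, single-process sketch of why linear attention distributes so cheaply: each worker reduces its local K/V shard to a small summary state, and only that state needs to travel around the ring. All names below are illustrative and not taken from the RealFormer codebase.

```python
import torch

def linear_ring_attention(q, k, v, world_size: int = 2):
    """Non-causal linear attention, accumulated shard-by-shard the way a ring
    of `world_size` workers would: each loop iteration stands in for one ring
    step in which a worker folds its local K/V shard into the shared state."""
    phi = lambda t: torch.nn.functional.elu(t) + 1   # positive feature map
    q, k = phi(q), phi(k)
    kv = q.new_zeros(q.shape[0], q.shape[-1], v.shape[-1])  # (B, D, Dv) state
    ksum = q.new_zeros(q.shape[0], 1, q.shape[-1])          # (B, 1, D) state
    for k_s, v_s in zip(k.chunk(world_size, dim=1), v.chunk(world_size, dim=1)):
        kv = kv + k_s.transpose(-2, -1) @ v_s   # fold this shard's K^T V
        ksum = ksum + k_s.sum(dim=1, keepdim=True)
    # Every query can now attend over all keys using only the small state.
    return (q @ kv) / (q @ ksum.transpose(-2, -1)).clamp_min(1e-6)

# Sanity check: 1024 tokens at dim 64, batch of 2.
out = linear_ring_attention(torch.randn(2, 1024, 64), torch.randn(2, 1024, 64),
                            torch.randn(2, 1024, 64))
print(out.shape)  # torch.Size([2, 1024, 64])
```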
@@ -212,10 +212,10 @@ Images and their corresponding style semantic maps were resized to **512 x 512**
  - v3_fp16: 505M
  - v3_bf16: 505M
  - v3_int8: 344M
- - v4: 1.69 GB
- - v4_fp16: 866M
- - v4_bf16: 866M
- - v4_int8: 578M
+ - v4: 2.42 GB
+ - v4_fp16: 1.21 GB
+ - v4_bf16: 1.21 GB
+ - v4_int8: 766M
 
  ## Evaluation Data, Metrics & Results
 
@@ -270,7 +270,7 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
  RealFormer is a Transformer-based, low-latency style-transfer generative model that attempts to reconstruct each frame into a more photorealistic image.
  The objective of RealFormer is to attain a level of real-world detail that even current video games with exhaustive graphics cannot reach.
 
- **Flagship Architecture v4:** The v4 model builds upon the previous version by introducing **Attention Guided Attention (AGA)**, which leverages learned attention weights from a motion-guided cross-attention preprocessing stage. These pre-learned weights, conditioned into the untrained attention mechanism, improve the model's ability to focus on dynamic regions within consecutive frames. Additionally, v4 continues to incorporate **Style Adaptive Layer Normalization (SALN)** to enhance feature extraction. This architecture significantly improves temporal coherence and photorealistic enhancement by transferring knowledge from motion-vector-based attention, without retraining the learned weights, leading to more efficient training and better performance in capturing real-world dynamics.
+ **Flagship Architecture v4:** The v4 model builds upon the previous version by introducing **Attention Guided Attention (AGA)**, which leverages learned attention weights from an **optical-flow-field, motion-guided** cross-attention preprocessing stage. These pre-learned weights, conditioned into the untrained attention mechanism, improve the model's ability to focus on dynamic regions within consecutive frames. Additionally, v4 incorporates a novel **Multi-Scale Style Encoder** to enhance feature extraction, while continuing to leverage **SALN** and **LbMhA**. This architecture significantly improves temporal coherence and photorealistic enhancement by transferring knowledge from motion-vector-based attention, without retraining the learned weights, leading to more efficient training and better performance in capturing real-world dynamics.
 
  ```python
  RealFormerAGA(
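As a reading aid for the AGA description in the hunk above, here is a minimal sketch of the mechanism as described: a frozen, motion-guided cross-attention stage produces attention weights, and a fresh attention layer is conditioned on them. The card does not specify the exact conditioning, so it is rendered here as an additive logit bias; all module and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class MotionGuidedCrossAttention(nn.Module):
    """Preprocessing stage: the current frame's tokens attend over optical-flow
    tokens. Trained once, then frozen; only its attention map is reused."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_tokens, flow_tokens):
        out, weights = self.attn(frame_tokens, flow_tokens, flow_tokens,
                                 need_weights=True, average_attn_weights=True)
        return out, weights  # weights: (B, N_frame, N_flow)

class AGABlock(nn.Module):
    """Untrained attention block conditioned on the frozen, pre-learned
    motion-attention map (assumes N_frame == N_flow token counts)."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, guide):
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.heads, -1).transpose(1, 2)
        k = k.view(B, N, self.heads, -1).transpose(1, 2)
        v = v.view(B, N, self.heads, -1).transpose(1, 2)
        logits = (q @ k.transpose(-2, -1)) * self.scale
        # Condition the fresh attention on the frozen motion-guided map;
        # log() carries the probability map back into logit space.
        logits = logits + torch.log(guide + 1e-6).unsqueeze(1)
        attn = logits.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# Toy usage: 256 tokens of dim 768 from the current frame and the flow field.
frame, flow = torch.randn(2, 256, 768), torch.randn(2, 256, 768)
pre = MotionGuidedCrossAttention()
for p in pre.parameters():
    p.requires_grad_(False)      # frozen: weights are transferred, not retrained
_, guide = pre(frame, flow)
print(AGABlock()(frame, guide).shape)  # torch.Size([2, 256, 768])
```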
@@ -351,17 +351,17 @@ RealFormerAGA(
  (relu): ReLU(inplace=True)
  )
  (final_layer): Conv2d(3, 3, kernel_size=(1, 1), stride=(1, 1))
- (style_encoder): Sequential(
- (0): Conv2d(3, 768, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
- (1): ReLU()
- (2): AdaptiveAvgPool2d(output_size=1)
- (3): Flatten(start_dim=1, end_dim=-1)
- (4): Linear(in_features=768, out_features=768, bias=True)
+ (style_encoder): MultiScaleStyleEncoder(
+ (conv): Conv2d(3, 768, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
+ (pool1): AdaptiveAvgPool2d(output_size=16)
+ (pool2): AdaptiveAvgPool2d(output_size=8)
+ (pool3): AdaptiveAvgPool2d(output_size=4)
+ (fc): Linear(in_features=258048, out_features=768, bias=True)
  )
  )
  ```
 
- **v3 Architecture:** The v3 model introduces Style Adaptive Layer Normalization (SALN) & Location-based Multi-head Attention (LbMhA) to improve feature extraction at a lower parameter count. The two predecessors attained a similar level of accuracy without the LbMhA layers; with SALN, v3 outperforms them by up to ~13%. The general architecture is as follows:
+ **v3 Architecture:** The v3 model introduces **Style Adaptive Layer Normalization (SALN)** & **Location-based Multi-head Attention (LbMhA)** to improve feature extraction at a lower parameter count. The two predecessors attained a similar level of accuracy without the LbMhA layers; with SALN, v3 outperforms them by up to ~13%. The general architecture is as follows:
 
  ```python
  RealFormerv3(
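Below is a sketch of a module that reproduces the printed structure of the new MultiScaleStyleEncoder. The forward pass (concatenating the three pooled scales before the projection) is inferred from the shapes rather than taken from the source: 768 × (16² + 8² + 4²) = 258,048, which matches the fc layer's in_features.

```python
import torch
import torch.nn as nn

class MultiScaleStyleEncoder(nn.Module):
    """Matches the printed v4 style_encoder: one stride-2 conv, three
    adaptive-average-pooled scales, and a linear projection to a 768-d
    style vector. Forward logic is an inference from the layer shapes."""
    def __init__(self, in_ch: int = 3, dim: int = 768):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, dim, kernel_size=3, stride=2, padding=1)
        self.pool1 = nn.AdaptiveAvgPool2d(16)
        self.pool2 = nn.AdaptiveAvgPool2d(8)
        self.pool3 = nn.AdaptiveAvgPool2d(4)
        self.fc = nn.Linear(dim * (16 * 16 + 8 * 8 + 4 * 4), dim)  # 258048 -> 768

    def forward(self, style_img: torch.Tensor) -> torch.Tensor:
        feat = self.conv(style_img)                    # (B, 768, H/2, W/2)
        scales = [self.pool1(feat), self.pool2(feat), self.pool3(feat)]
        flat = torch.cat([s.flatten(1) for s in scales], dim=1)  # (B, 258048)
        return self.fc(flat)                           # (B, 768) style vector

# Shape check against the printed architecture, at the 512 x 512 input size.
enc = MultiScaleStyleEncoder()
print(enc(torch.randn(1, 3, 512, 512)).shape)  # torch.Size([1, 768])
```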
 
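For reference, a minimal sketch of the SALN idea named in the v3 paragraph above: a LayerNorm whose gain and bias are predicted per sample from the style vector. The card does not show RealFormer's exact implementation; this follows the common formulation, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class SALN(nn.Module):
    """Style Adaptive Layer Normalization: normalize without a fixed affine,
    then modulate with (gamma, beta) predicted from the style embedding."""
    def __init__(self, dim: int = 768, style_dim: int = 768):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.affine = nn.Linear(style_dim, 2 * dim)  # predicts gamma and beta

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.affine(style).chunk(2, dim=-1)
        # Broadcast the per-sample style modulation over the token dimension.
        return gamma.unsqueeze(1) * self.norm(x) + beta.unsqueeze(1)

tokens = torch.randn(2, 1024, 768)   # (B, N, C) frame tokens
style = torch.randn(2, 768)          # style vector from the style encoder
print(SALN()(tokens, style).shape)   # torch.Size([2, 1024, 768])
```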