aoxo · Image-to-Image · English · art

aoxo committed · Commit e956d96 · verified · Parent(s): 720961d

Update README.md

Files changed (1):
  1. README.md +13 -13
README.md CHANGED
@@ -191,7 +191,7 @@ Images and their corresponding style semantic maps were resized to **512 x 512**
  - v1_3: 93M params
  - v2_1: 2.9M params
  - v3: 252.6M params
- - v4: 454.2M params
+ - v4: 651.9M params
 
  **Training hardware:** Each of the models was trained on 2 x T4 GPUs (multi-GPU training). For this reason, linear attention modules were implemented as ring (distributed) attention during training.
 
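The hunk above also notes that the linear attention modules were trained as ring (distributed) attention across the two T4s. A toy, single-process sketch of why linear attention distributes so cheaply: each worker reduces its local K/V shard to a small summary state, and only that state needs to travel around the ring. All names below are illustrative and not taken from the RealFormer codebase.

```python
import torch

def linear_ring_attention(q, k, v, world_size: int = 2):
    """Non-causal linear attention, accumulated shard-by-shard the way a ring
    of `world_size` workers would: each loop iteration stands in for one ring
    step in which a worker folds its local K/V shard into the shared state."""
    phi = lambda t: torch.nn.functional.elu(t) + 1   # positive feature map
    q, k = phi(q), phi(k)
    kv = q.new_zeros(q.shape[0], q.shape[-1], v.shape[-1])  # (B, D, Dv) state
    ksum = q.new_zeros(q.shape[0], 1, q.shape[-1])          # (B, 1, D) state
    for k_s, v_s in zip(k.chunk(world_size, dim=1), v.chunk(world_size, dim=1)):
        kv = kv + k_s.transpose(-2, -1) @ v_s   # fold this shard's K^T V
        ksum = ksum + k_s.sum(dim=1, keepdim=True)
    # Every query can now attend over all keys using only the small state.
    return (q @ kv) / (q @ ksum.transpose(-2, -1)).clamp_min(1e-6)

# Sanity check: 1024 tokens at dim 64, batch of 2.
out = linear_ring_attention(torch.randn(2, 1024, 64), torch.randn(2, 1024, 64),
                            torch.randn(2, 1024, 64))
print(out.shape)  # torch.Size([2, 1024, 64])
```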
@@ -212,10 +212,10 @@ Images and their corresponding style semantic maps were resized to **512 x 512**
  - v3_fp16: 505M
  - v3_bf16: 505M
  - v3_int8: 344M
- - v4: 1.69 GB
- - v4_fp16: 866M
- - v4_bf16: 866M
- - v4_int8: 578M
+ - v4: 2.42 GB
+ - v4_fp16: 1.21 GB
+ - v4_bf16: 1.21 GB
+ - v4_int8: 766M
 
  ## Evaluation Data, Metrics & Results
 
@@ -270,7 +270,7 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
  RealFormer is a Transformer-based, low-latency style-transfer generative model that attempts to reconstruct each frame into a more photorealistic image.
  The objective of RealFormer is to attain a level of real-world detail that even current video games with exhaustive graphics cannot reach.
 
- **Flagship Architecture v4:** The v4 model builds upon the previous version by introducing **Attention Guided Attention (AGA)**, which leverages learned attention weights from a motion-guided cross-attention preprocessing stage. These pre-learned weights, conditioned into the untrained attention mechanism, improve the model's ability to focus on dynamic regions within consecutive frames. Additionally, v4 continues to incorporate **Style Adaptive Layer Normalization (SALN)** to enhance feature extraction. This architecture significantly improves temporal coherence and photorealistic enhancement by transferring knowledge from motion-vector-based attention, without retraining the learned weights, leading to more efficient training and better performance in capturing real-world dynamics.
+ **Flagship Architecture v4:** The v4 model builds upon the previous version by introducing **Attention Guided Attention (AGA)**, which leverages learned attention weights from an **optical-flow-field, motion-guided** cross-attention preprocessing stage. These pre-learned weights, conditioned into the untrained attention mechanism, improve the model's ability to focus on dynamic regions within consecutive frames. Additionally, v4 incorporates a novel **Multi-Scale Style Encoder** to enhance feature extraction, while continuing to leverage **SALN** and **LbMhA**. This architecture significantly improves temporal coherence and photorealistic enhancement by transferring knowledge from motion-vector-based attention, without retraining the learned weights, leading to more efficient training and better performance in capturing real-world dynamics.
 
  ```python
  RealFormerAGA(
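As a reading aid for the AGA description in the hunk above, here is a minimal sketch of the mechanism as described: a frozen, motion-guided cross-attention stage produces attention weights, and a fresh attention layer is conditioned on them. The card does not specify the exact conditioning, so it is rendered here as an additive logit bias; all module and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class MotionGuidedCrossAttention(nn.Module):
    """Preprocessing stage: the current frame's tokens attend over optical-flow
    tokens. Trained once, then frozen; only its attention map is reused."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_tokens, flow_tokens):
        out, weights = self.attn(frame_tokens, flow_tokens, flow_tokens,
                                 need_weights=True, average_attn_weights=True)
        return out, weights  # weights: (B, N_frame, N_flow)

class AGABlock(nn.Module):
    """Untrained attention block conditioned on the frozen, pre-learned
    motion-attention map (assumes N_frame == N_flow token counts)."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, guide):
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.heads, -1).transpose(1, 2)
        k = k.view(B, N, self.heads, -1).transpose(1, 2)
        v = v.view(B, N, self.heads, -1).transpose(1, 2)
        logits = (q @ k.transpose(-2, -1)) * self.scale
        # Condition the fresh attention on the frozen motion-guided map;
        # log() carries the probability map back into logit space.
        logits = logits + torch.log(guide + 1e-6).unsqueeze(1)
        attn = logits.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# Toy usage: 256 tokens of dim 768 from the current frame and the flow field.
frame, flow = torch.randn(2, 256, 768), torch.randn(2, 256, 768)
pre = MotionGuidedCrossAttention()
for p in pre.parameters():
    p.requires_grad_(False)      # frozen: weights are transferred, not retrained
_, guide = pre(frame, flow)
print(AGABlock()(frame, guide).shape)  # torch.Size([2, 256, 768])
```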
@@ -351,17 +351,17 @@ RealFormerAGA(
  (relu): ReLU(inplace=True)
  )
  (final_layer): Conv2d(3, 3, kernel_size=(1, 1), stride=(1, 1))
- (style_encoder): Sequential(
- (0): Conv2d(3, 768, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
- (1): ReLU()
- (2): AdaptiveAvgPool2d(output_size=1)
- (3): Flatten(start_dim=1, end_dim=-1)
- (4): Linear(in_features=768, out_features=768, bias=True)
+ (style_encoder): MultiScaleStyleEncoder(
+ (conv): Conv2d(3, 768, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
+ (pool1): AdaptiveAvgPool2d(output_size=16)
+ (pool2): AdaptiveAvgPool2d(output_size=8)
+ (pool3): AdaptiveAvgPool2d(output_size=4)
+ (fc): Linear(in_features=258048, out_features=768, bias=True)
  )
  )
  ```
 
- **v3 Architecture:** The v3 model introduces Style Adaptive Layer Normalization (SALN) & Location-based Multi-head Attention (LbMhA) to improve feature extraction at a lower parameter count. The two predecessors attained a similar level of accuracy without the LbMhA layers; with SALN, v3 outperforms them by up to ~13%. The general architecture is as follows:
+ **v3 Architecture:** The v3 model introduces **Style Adaptive Layer Normalization (SALN)** & **Location-based Multi-head Attention (LbMhA)** to improve feature extraction at a lower parameter count. The two predecessors attained a similar level of accuracy without the LbMhA layers; with SALN, v3 outperforms them by up to ~13%. The general architecture is as follows:
 
  ```python
  RealFormerv3(
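Below is a sketch of a module that reproduces the printed structure of the new MultiScaleStyleEncoder. The forward pass (concatenating the three pooled scales before the projection) is inferred from the shapes rather than taken from the source: 768 × (16² + 8² + 4²) = 258,048, which matches the fc layer's in_features.

```python
import torch
import torch.nn as nn

class MultiScaleStyleEncoder(nn.Module):
    """Matches the printed v4 style_encoder: one stride-2 conv, three
    adaptive-average-pooled scales, and a linear projection to a 768-d
    style vector. Forward logic is an inference from the layer shapes."""
    def __init__(self, in_ch: int = 3, dim: int = 768):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, dim, kernel_size=3, stride=2, padding=1)
        self.pool1 = nn.AdaptiveAvgPool2d(16)
        self.pool2 = nn.AdaptiveAvgPool2d(8)
        self.pool3 = nn.AdaptiveAvgPool2d(4)
        self.fc = nn.Linear(dim * (16 * 16 + 8 * 8 + 4 * 4), dim)  # 258048 -> 768

    def forward(self, style_img: torch.Tensor) -> torch.Tensor:
        feat = self.conv(style_img)                    # (B, 768, H/2, W/2)
        scales = [self.pool1(feat), self.pool2(feat), self.pool3(feat)]
        flat = torch.cat([s.flatten(1) for s in scales], dim=1)  # (B, 258048)
        return self.fc(flat)                           # (B, 768) style vector

# Shape check against the printed architecture, at the 512 x 512 input size.
enc = MultiScaleStyleEncoder()
print(enc(torch.randn(1, 3, 512, 512)).shape)  # torch.Size([1, 768])
```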
 
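For reference, a minimal sketch of the SALN idea named in the v3 paragraph above: a LayerNorm whose gain and bias are predicted per sample from the style vector. The card does not show RealFormer's exact implementation; this follows the common formulation, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class SALN(nn.Module):
    """Style Adaptive Layer Normalization: normalize without a fixed affine,
    then modulate with (gamma, beta) predicted from the style embedding."""
    def __init__(self, dim: int = 768, style_dim: int = 768):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.affine = nn.Linear(style_dim, 2 * dim)  # predicts gamma and beta

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.affine(style).chunk(2, dim=-1)
        # Broadcast the per-sample style modulation over the token dimension.
        return gamma.unsqueeze(1) * self.norm(x) + beta.unsqueeze(1)

tokens = torch.randn(2, 1024, 768)   # (B, N, C) frame tokens
style = torch.randn(2, 768)          # style vector from the style encoder
print(SALN()(tokens, style).shape)   # torch.Size([2, 1024, 768])
```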