Update README.md
README.md (CHANGED)

```diff
@@ -191,7 +191,7 @@ Images and their corresponding style semantic maps were resized to **512 x 512**
 - v1_3: 93M params
 - v2_1: 2.9M params
 - v3: 252.6M params
-- v4:
+- v4: 651.9M params
 
 **Training hardware:** Each of the models was trained on 2 x T4 GPUs (multi-GPU training). For this reason, linear attention modules were implemented as ring (distributed) attention during training.
 
```
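
The ring (distributed) attention mentioned in the hunk above works because, for non-causal linear attention, the key/value statistics are plain sums over the sequence: each GPU can accumulate its own chunk and the partial results combine exactly. The single-process sketch below (illustrative names and feature map, not RealFormer code) shows that chunked accumulation.

```python
import torch

def elu_feature_map(x):
    # Positive feature map commonly used for (non-causal) linear attention.
    return torch.nn.functional.elu(x) + 1

def chunked_linear_attention(q, k, v, num_chunks=2):
    """Linear attention with key/value statistics accumulated chunk by chunk --
    the per-device partial sums a ring (distributed) implementation would
    exchange between the 2 x T4 GPUs. q, k, v: (batch, seq, dim)."""
    q, k = elu_feature_map(q), elu_feature_map(k)
    kv_state = torch.zeros(q.shape[0], q.shape[-1], v.shape[-1], device=q.device)
    k_sum = torch.zeros(q.shape[0], q.shape[-1], device=q.device)
    for k_chunk, v_chunk in zip(k.chunk(num_chunks, dim=1), v.chunk(num_chunks, dim=1)):
        kv_state = kv_state + torch.einsum("bsd,bse->bde", k_chunk, v_chunk)
        k_sum = k_sum + k_chunk.sum(dim=1)
    norm = torch.einsum("bsd,bd->bs", q, k_sum).clamp(min=1e-6)
    return torch.einsum("bsd,bde->bse", q, kv_state) / norm.unsqueeze(-1)

# Splitting into 1 or 2 chunks gives identical outputs, which is why the
# distributed version is exact rather than an approximation.
q, k, v = (torch.randn(1, 10, 16) for _ in range(3))
assert torch.allclose(chunked_linear_attention(q, k, v, num_chunks=1),
                      chunked_linear_attention(q, k, v, num_chunks=2), atol=1e-5)
```
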
```diff
@@ -212,10 +212,10 @@ Images and their corresponding style semantic maps were resized to **512 x 512**
 - v3_fp16: 505M
 - v3_bf16: 505M
 - v3_int8: 344M
-- v4:
-- v4_fp16:
-- v4_bf16:
-- v4_int8:
+- v4: 2.42 GB
+- v4_fp16: 1.21 GB
+- v4_bf16: 1.21 GB
+- v4_int8: 766M
 
 ## Evaluation Data, Metrics & Results
 
```
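
As a rough cross-check on the sizes added above, the fp32 and fp16/bf16 figures follow directly from the 651.9M parameter count; the snippet below is only that back-of-the-envelope arithmetic, not output from the repository.

```python
params = 651.9e6  # v4 parameter count stated above

for name, bytes_per_param in [("fp32", 4), ("fp16 / bf16", 2), ("int8, weights only", 1)]:
    print(f"{name}: ~{params * bytes_per_param / 2**30:.2f} GB")

# fp32 (~2.43 GB) and fp16/bf16 (~1.21 GB) line up with the listed 2.42 GB / 1.21 GB;
# the published int8 checkpoint (766M) sits above the ~0.61 GB lower bound because
# not every tensor is quantized.
```
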
````diff
@@ -270,7 +270,7 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
 RealFormer is a Transformer-based low-latency Style Transfer Generative LM that attempts to reconstruct each frame into a more photorealistic image.
 The objective of RealFormer is to attain the maximum level of real-world detail, which even current video games with exhaustive graphics are not able to reach.
 
-**Flagship Architecture v4:** The v4 model builds upon the previous version by introducing **Attention Guided Attention (AGA)**, which leverages learned attention weights from a motion-guided cross-attention preprocessing stage. These pre-learned weights, conditioned into the untrained attention mechanism, improve the model's ability to focus on dynamic regions within consecutive frames. Additionally, v4
+**Flagship Architecture v4:** The v4 model builds upon the previous version by introducing **Attention Guided Attention (AGA)**, which leverages learned attention weights from an **optical flow field motion-guided** cross-attention preprocessing stage. These pre-learned weights, conditioned into the untrained attention mechanism, improve the model's ability to focus on dynamic regions within consecutive frames. Additionally, v4 incorporates a novel **Multi-Scale Style Encoder** to enhance feature extraction, while also continuing to leverage features from **SALN** and **LbMhA**. This architecture significantly improves temporal coherence and photorealistic enhancement by transferring knowledge from motion vector-based attention without retraining the learned weights, leading to more efficient training and better performance in capturing real-world dynamics.
 
 ```python
 RealFormerAGA(
````
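
The card does not spell out how the pre-learned weights are conditioned into the untrained attention mechanism, so the sketch below shows one plausible reading of AGA: a fresh attention layer whose logits are biased by a frozen attention map produced by the motion-guided preprocessing stage. `GuidedAttention`, `guide`, and `guide_weight` are illustrative names, not RealFormer internals.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedAttention(nn.Module):
    """Toy single-head attention whose logits are conditioned on an external,
    pre-learned attention map (e.g. from a frozen, motion-guided cross-attention
    stage). Layer and argument names are illustrative only."""

    def __init__(self, dim: int, guide_weight: float = 1.0):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        self.guide_weight = nn.Parameter(torch.tensor(guide_weight))

    def forward(self, x: torch.Tensor, guide: torch.Tensor) -> torch.Tensor:
        # x:     (batch, tokens, dim) tokens of the current frame
        # guide: (batch, tokens, tokens) attention weights from the frozen
        #        preprocessing stage (detached, so they are never retrained)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        # Bias the fresh attention logits toward motion-salient regions.
        logits = logits + self.guide_weight * torch.log(guide.detach() + 1e-6)
        attn = F.softmax(logits, dim=-1)
        return self.out_proj(attn @ v)

# Example: guide weights from a stand-in motion cross-attention stage.
x = torch.randn(1, 64, 768)
guide = F.softmax(torch.randn(1, 64, 64), dim=-1)
print(GuidedAttention(768)(x, guide).shape)  # torch.Size([1, 64, 768])
```
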
````diff
@@ -351,17 +351,17 @@ RealFormerAGA(
     (relu): ReLU(inplace=True)
   )
   (final_layer): Conv2d(3, 3, kernel_size=(1, 1), stride=(1, 1))
-  (style_encoder):
-    (
-    (
-    (
-    (
-    (
+  (style_encoder): MultiScaleStyleEncoder(
+    (conv): Conv2d(3, 768, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
+    (pool1): AdaptiveAvgPool2d(output_size=16)
+    (pool2): AdaptiveAvgPool2d(output_size=8)
+    (pool3): AdaptiveAvgPool2d(output_size=4)
+    (fc): Linear(in_features=258048, out_features=768, bias=True)
   )
 )
 ```
 
-**v3 Architecture:** The v3 model introduces Style Adaptive Layer Normalization (SALN) & Location-based Multi-head Attention (LbMhA) to improve feature extraction at lower parameters. The two other predecessors attained a similar level of accuracy without the LbMhA layers, but with SALN, outperformed by upto ~13%. The general architecture is as follows:
+**v3 Architecture:** The v3 model introduces **Style Adaptive Layer Normalization (SALN)** & **Location-based Multi-head Attention (LbMhA)** to improve feature extraction at a lower parameter count. The two other predecessors attained a similar level of accuracy without the LbMhA layers, but with SALN, v3 outperforms them by up to ~13%. The general architecture is as follows:
 
 ```python
 RealFormerv3(
````
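
Read back as a module, the style encoder printed in the hunk above corresponds roughly to the sketch below. The layer shapes come straight from the dump (768 x (16^2 + 8^2 + 4^2) = 258048, matching the `fc` in_features); the forward wiring, in particular the flatten-and-concatenate step, is an assumption since only the layers are shown.

```python
import torch
import torch.nn as nn

class MultiScaleStyleEncoder(nn.Module):
    """Sketch reproducing the printed v4 style-encoder layers; the forward
    pass is assumed (the card only shows the module dump)."""

    def __init__(self, in_channels=3, channels=768, style_dim=768):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, channels, kernel_size=3, stride=2, padding=1)
        self.pool1 = nn.AdaptiveAvgPool2d(16)
        self.pool2 = nn.AdaptiveAvgPool2d(8)
        self.pool3 = nn.AdaptiveAvgPool2d(4)
        # 768 * (16*16 + 8*8 + 4*4) = 258048, matching the printed in_features.
        self.fc = nn.Linear(channels * (16 * 16 + 8 * 8 + 4 * 4), style_dim)

    def forward(self, style_img: torch.Tensor) -> torch.Tensor:
        feat = self.conv(style_img)                               # (B, 768, H/2, W/2)
        scales = [self.pool1(feat), self.pool2(feat), self.pool3(feat)]
        flat = torch.cat([s.flatten(1) for s in scales], dim=1)   # (B, 258048)
        return self.fc(flat)                                      # (B, 768) style embedding

style = torch.randn(1, 3, 512, 512)  # 512 x 512 inputs, as stated in the card
print(MultiScaleStyleEncoder()(style).shape)  # torch.Size([1, 768])
```
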
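
For context on the v3 paragraph above, the snippet below is a generic sketch of Style Adaptive Layer Normalization: the normalization's gain and bias are predicted from the style embedding rather than learned as fixed parameters. It follows the common SALN formulation; RealFormer's exact variant is not shown in the card, and LbMhA is omitted because its definition is not given.

```python
import torch
import torch.nn as nn

class StyleAdaptiveLayerNorm(nn.Module):
    """Generic SALN sketch: LayerNorm whose gain and bias come from a style
    embedding instead of being fixed learned parameters."""

    def __init__(self, feature_dim: int, style_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(feature_dim, elementwise_affine=False)
        self.to_gain_bias = nn.Linear(style_dim, 2 * feature_dim)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, feature_dim), style: (batch, style_dim)
        gain, bias = self.to_gain_bias(style).chunk(2, dim=-1)
        return (1 + gain).unsqueeze(1) * self.norm(x) + bias.unsqueeze(1)

tokens = torch.randn(2, 64, 768)
style = torch.randn(2, 768)  # e.g. the output of a style encoder like the one above
print(StyleAdaptiveLayerNorm(768, 768)(tokens, style).shape)  # torch.Size([2, 64, 768])
```
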