Update README.md
README.md CHANGED
### Model Sources [optional]
- **Dataset:** [Pre-Training Dataset](https://huggingface.co/datasets/aoxo/latent_diffusion_super_sampling), [Calibration Dataset for Grand Theft Auto V](https://huggingface.co/datasets/aoxo/photorealism-style-adapter-gta-v)
- **Repository:** [Swin Transformer](https://github.com/microsoft/Swin-Transformer)
- **Paper:** [Ze Liu et al. (2021)](https://arxiv.org/abs/2103.14030)
### Training Data
The model was trained on the [Pre-Training Dataset](https://huggingface.co/datasets/aoxo/latent_diffusion_super_sampling); the decoder layers were then frozen to fine-tune it on the [Calibration Dataset for Grand Theft Auto V](https://huggingface.co/datasets/aoxo/photorealism-style-adapter-gta-v). The former includes over 400,000 frames of footage from video games such as Watch Dogs 2, Grand Theft Auto V and Cyberpunk, as well as several Hollywood films and high-definition photos. The latter comprises ~25,000 high-definition semantic segmentation map and rendered frame pairs captured in-game from Grand Theft Auto V using a UNet-based semantic segmentation model.
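A minimal sketch of this two-stage schedule is shown below. The `TinyRealFormer` stand-in and its `encoder`/`decoder` attribute names are illustrative assumptions, not the actual classes in this repository:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the encoder/decoder split described above;
# the real RealFormer modules and names may differ.
class TinyRealFormer(nn.Module):
    def __init__(self, embed_dim=768):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=16, dim_feedforward=3072, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=8)
        self.decoder = nn.Linear(embed_dim, embed_dim)

    def forward(self, tokens):
        return self.decoder(self.encoder(tokens))

model = TinyRealFormer()

# Stage 1: pre-train every parameter on the Pre-Training Dataset.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stage 2: freeze the decoder layers, then fine-tune the remaining weights
# on the GTA V calibration pairs.
for p in model.decoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
```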
### Training Procedure
- Optimizer: Adam
- Learning rate: 0.001
- Batch size: 8
- Steps per epoch: 3,125
- Number of epochs: 100
- Total number of steps: 312,500
- Loss function: combined L1 loss, perceptual loss, style transfer loss and total variation loss (sketched below)
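As a rough illustration of how these terms can be combined into one objective; the weights `w_*` and the use of a frozen feature backbone (e.g. VGG) for the perceptual and style terms are assumptions, not the exact training recipe:

```python
import torch
import torch.nn.functional as F

def gram(feat):
    # (B, C, H, W) feature map -> (B, C, C) Gram matrix used by the style term
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def total_variation(img):
    # Penalizes neighbouring-pixel differences to encourage smooth outputs
    return ((img[..., :, 1:] - img[..., :, :-1]).abs().mean()
            + (img[..., 1:, :] - img[..., :-1, :]).abs().mean())

def combined_loss(pred, target, pred_feats, target_feats,
                  w_l1=1.0, w_perc=1.0, w_style=1.0, w_tv=1.0):
    """pred/target: images; *_feats: matching lists of feature maps from a
    frozen backbone. Weights are placeholders, not the published recipe."""
    l1 = F.l1_loss(pred, target)
    perceptual = sum(F.l1_loss(p, t) for p, t in zip(pred_feats, target_feats))
    style = sum(F.l1_loss(gram(p), gram(t)) for p, t in zip(pred_feats, target_feats))
    tv = total_variation(pred)
    return w_l1 * l1 + w_perc * perceptual + w_style * style + w_tv * tv
```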
#### Preprocessing
Images and their corresponding style semantic maps were resized to fit the input-output window dimensions (512 x 512). Bit depth was corrected to 24-bit (3-channel) for images with a depth greater than 24 bits.
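A brief sketch of that step with Pillow; the exact resampling filter and conversion calls used in the actual pipeline are assumptions:

```python
from PIL import Image

def preprocess(path, size=(512, 512)):
    """Resize to the 512 x 512 input-output window and force 24-bit RGB."""
    img = Image.open(path)
    if img.mode != "RGB":   # e.g. RGBA (32-bit) or 16-bit-per-channel inputs
        img = img.convert("RGB")
    return img.resize(size)  # default bicubic resampling; filter choice assumed
```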
#### Training Hyperparameters
- Precision: fp32
- Embedding dimensions: 768
- Hidden dimensions: 3072
- Attention type: Linear Attention (see the sketch after this list)
- Number of attention heads: 16
- Number of attention layers: 8
- Number of transformer encoder layers (feed-forward): 8
- Number of transformer decoder layers (feed-forward): 8
- Activation function: ReLU
- Patch size: 8
- Swin window size: 7
- Swin shift size: 2
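The "Linear Attention" entry refers to kernelized attention that avoids the quadratic softmax. A generic sketch in the style of Katharopoulos et al. (2020) is given below; it is not necessarily the exact variant implemented here, and the token count in the usage example simply follows the 512 x 512 window and patch size 8 listed above:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Generic kernelized linear attention with feature map elu(x) + 1;
    a sketch of 'Linear Attention', not necessarily RealFormer's exact variant.
    q, k, v: (batch, heads, seq_len, head_dim)."""
    q = F.elu(q) + 1
    k = F.elu(k) + 1
    kv = torch.einsum("bhnd,bhne->bhde", k, v)               # sum over positions
    norm = torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps
    return torch.einsum("bhnd,bhde->bhne", q, kv) / norm.unsqueeze(-1)

# Example: 16 heads over 4,096 tokens (a 512 x 512 input with patch size 8),
# head dimension 768 / 16 = 48.
q = k = v = torch.randn(1, 16, 4096, 48)
out = linear_attention(q, k, v)   # shape: (1, 16, 4096, 48)
```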
#### Speeds, Sizes, Times [optional]