aoxo
/

Image-to-Image
English
art
aoxo committed on
Commit
4bc5e63
·
verified ·
1 Parent(s): 7e9318e

Update README.md

Files changed (1)
  1. README.md +24 -9
README.md CHANGED
@@ -29,7 +29,7 @@ RealFormer is an innovative Vision Transformer (ViT) based architecture that com
29
 
30
  ### Model Sources [optional]
31
 
32
- - **Dataset:** [Calibration Dataset for Grand Theft Auto V](https://huggingface.co/datasets/aoxo/photorealism-style-adapter-gta-v), [Pre-Training](https://huggingface.co/datasets/aoxo/latent_diffusion_super_sampling)
33
  - **Repository:** [Swin Transformer](https://github.com/microsoft/Swin-Transformer)
34
  - **Paper:** [Ze Liu et al. (2021)](https://arxiv.org/abs/2103.14030)
35
 
@@ -94,22 +94,37 @@ visualize_tensor(output, "Output Image")
94
 
95
  ### Training Data
96
 
97
- The model was trained on two
98
-
99
- [More Information Needed]
100
 
101
  ### Training Procedure
102
 
103
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
 
 
 
 
 
 
104
 
105
- #### Preprocessing [optional]
106
-
107
- [More Information Needed]
108
 
 
109
 
110
  #### Training Hyperparameters
111
 
112
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
 
 
 
 
 
 
 
 
 
 
 
 
113
 
114
  #### Speeds, Sizes, Times [optional]
115
 
 
29
 
30
  ### Model Sources [optional]
31
 
32
+ - **Dataset:** [Pre-Training Dataset](https://huggingface.co/datasets/aoxo/latent_diffusion_super_sampling), [Calibration Dataset for Grand Theft Auto V](https://huggingface.co/datasets/aoxo/photorealism-style-adapter-gta-v)
33
  - **Repository:** [Swin Transformer](https://github.com/microsoft/Swin-Transformer)
34
  - **Paper:** [Ze Liu et al. (2021)](https://arxiv.org/abs/2103.14030)
35
 
 
94
 
95
  ### Training Data
96
 
97
+ The model was first trained on the [Pre-Training Dataset](https://huggingface.co/datasets/aoxo/latent_diffusion_super_sampling); the decoder layers were then frozen and the model was fine-tuned on the [Calibration Dataset for Grand Theft Auto V](https://huggingface.co/datasets/aoxo/photorealism-style-adapter-gta-v). The former includes over 400,000 frames of footage from video games such as Watch Dogs 2, Grand Theft Auto V, and Cyberpunk, along with several Hollywood films and high-definition photos. The latter comprises ~25,000 pairs of high-definition semantic segmentation maps and rendered frames, obtained from in-game Grand Theft Auto V captures and a UNet-based semantic segmentation model.
 
 
98
 
99
  ### Training Procedure
100
 
101
+ - Optimizer: Adam
102
+ - Learning rate: 0.001
103
+ - Batch size: 8
104
+ - Steps per epoch: 3,125
105
+ - Number of epochs: 100
106
+ - Total number of steps: 312,500
107
+ - Loss function: Combined L1 loss, Perceptual loss, Style Transfer loss, Total Variation loss
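As a quick sanity check on the schedule listed above, the step counts can be reproduced from the batch size, steps per epoch, and epoch count. The sketch below is only an illustration: the class name `TrainingSchedule` and its property names are hypothetical and do not come from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSchedule:
    """Hypothetical container for the schedule values listed in the README."""

    batch_size: int = 8
    steps_per_epoch: int = 3_125
    epochs: int = 100

    @property
    def total_steps(self) -> int:
        # Total optimizer steps over the whole run.
        return self.steps_per_epoch * self.epochs

    @property
    def samples_per_epoch(self) -> int:
        # Number of training samples consumed in one epoch.
        return self.batch_size * self.steps_per_epoch


schedule = TrainingSchedule()
print(schedule.total_steps)        # 312500
print(schedule.samples_per_epoch)  # 25000
```

The total of 312,500 steps agrees with the listed value. The 25,000 samples per epoch are consistent with the approximate size of the calibration dataset described under Training Data, although the README does not state which training stage these numbers refer to.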
108
 
109
+ #### Preprocessing
 
 
110
 
111
+ Images and their corresponding style semantic maps were resized to fit the input-output window dimensions (512 x 512). Images with a bit depth greater than 24 bits were converted to 24-bit (3-channel) color.
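The preprocessing described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the author's pipeline: the README does not say which library or interpolation method was used, so nearest-neighbour resizing is assumed here for brevity, and the helper names `to_rgb24`, `resize_nearest`, and `preprocess` are hypothetical. Images are represented as row-major lists of pixel tuples, and `bits_per_channel` is assumed to be at least 8.

```python
def to_rgb24(pixel, bits_per_channel=8):
    """Keep the first three channels and rescale each one to 8 bits."""
    shift = bits_per_channel - 8  # assumes bits_per_channel >= 8
    return tuple(channel >> shift for channel in pixel[:3])


def resize_nearest(image, size=512):
    """Nearest-neighbour resize of a list of pixel rows to size x size."""
    src_h, src_w = len(image), len(image[0])
    return [
        [image[y * src_h // size][x * src_w // size] for x in range(size)]
        for y in range(size)
    ]


def preprocess(image, bits_per_channel=8, size=512):
    """Resize to the model window, then reduce every pixel to 24-bit RGB."""
    resized = resize_nearest(image, size)
    return [[to_rgb24(p, bits_per_channel) for p in row] for row in resized]


# A 2x2 RGBA image with 16 bits per channel (64 bits per pixel).
sample = [
    [(65535, 0, 0, 65535), (0, 65535, 0, 65535)],
    [(0, 0, 65535, 65535), (65535, 65535, 65535, 65535)],
]
out = preprocess(sample, bits_per_channel=16)
print(len(out), len(out[0]))  # 512 512
print(out[0][0])              # (255, 0, 0)
```

In practice an imaging library would normally handle both steps; the pure-Python version only makes the two operations explicit.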
112
 
113
  #### Training Hyperparameters
114
 
115
+ - Precision: fp32
116
+ - Embedding dimension: 768
117
+ - Hidden dimension: 3072
118
+ - Attention Type: Linear Attention
119
+ - Number of attention heads: 16
120
+ - Number of attention layers: 8
121
+ - Number of transformer encoder layers (feed-forward): 8
122
+ - Number of transformer decoder layers (feed-forward): 8
123
+ - Activation function: ReLU
124
+ - Patch Size: 8
125
+ - Swin Window Size: 7
126
+ - Swin Shift Size: 2
127
+ -
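Several quantities follow directly from the hyperparameters listed above together with the 512 x 512 input size given under Preprocessing. The sketch below derives them; the class name `RealFormerConfig` and its field and property names are hypothetical, chosen only for illustration, and the actual model code may organize these values differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RealFormerConfig:
    """Hypothetical config holding the values listed in the README."""

    image_size: int = 512   # from the Preprocessing section
    channels: int = 3       # 24-bit, 3-channel input
    patch_size: int = 8
    embed_dim: int = 768
    hidden_dim: int = 3072
    num_heads: int = 16
    window_size: int = 7
    shift_size: int = 2

    @property
    def patches_per_side(self) -> int:
        return self.image_size // self.patch_size

    @property
    def num_patches(self) -> int:
        return self.patches_per_side ** 2

    @property
    def values_per_patch(self) -> int:
        # Raw pixel values in one patch before projection to embed_dim.
        return self.patch_size * self.patch_size * self.channels

    @property
    def head_dim(self) -> int:
        return self.embed_dim // self.num_heads

    @property
    def mlp_ratio(self) -> float:
        return self.hidden_dim / self.embed_dim


cfg = RealFormerConfig()
print(cfg.patches_per_side, cfg.num_patches)      # 64 4096
print(cfg.values_per_patch)                       # 192
print(cfg.head_dim, cfg.mlp_ratio)                # 48 4.0
print(cfg.patches_per_side % cfg.window_size)     # 1
```

Note that a 64 x 64 patch grid is not an exact multiple of the window size 7 (remainder 1). The README does not say how the model handles this, so any padding or cropping strategy would be an assumption.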
128
 
129
  #### Speeds, Sizes, Times [optional]
130