aoxo
/

RealFormer

Image-to-Image

English

art

aoxo committed on Sep 24, 2024

Commit

cf42685

verified ·

1 Parent(s): c4b9abd

Update README.md

README.md +76 -101

@@ -140,10 +140,75 @@ Images and their corresponding style semantic maps were resized to fit the input
 - v1_3: 93M params
 - v2_1: 2.9M params
 **Architecture:** The latest model, v2_1, introduces Location-based Multi-head Attention (LbMhA) to improve feature extraction with fewer parameters. Its three predecessors attained a similar level of accuracy without the LbMhA layers. The general architecture is as follows:
 ```python
-223543305
 DataParallel(
   (module): ViTImage2Image(
     (patch_embed): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
@@ -242,115 +307,25 @@ DataParallel(
 )
 ```
-**Training hardware:** Each model was trained on 2 x T4 GPUs (multi-GPU training). For this reason, linear attention modules were implemented as ring (distributed) attention during training.
-**Total Training Compute Throughput:** 4.13 TFLOPS
-**Total Logged Training Time:** ~210 hours (total time split across four models including overhead)
-**Start Time:** 09-13-2024
-**End Time:** 09-21-2024
-**Checkpoint Size:**
-- v1_1: 855 MB
-- v1_2: 764 MB
-- v1_3: 355 MB
-- v2_2: 11 MB
-## Evaluation Data, Metrics & Results
-This section covers information on how the model was evaluated at each stage.
-### Evaluation Data
-Evaluation was performed on real-time footage captured from Grand Theft Auto V, Cyberpunk 2077 and WatchDogs 2.
-### Metrics
-- PSNR (Peak Signal-to-Noise Ratio)
-- Combined loss (L1 loss + Total Variation loss)
-### Results
-- In-game ![ingame-car](ingame-car.jpg)
-- Ours ![ours-car](ours-car.jpg)
-- In-game ![ingame-car2](ingame-car2.png)
-- Ours ![ours-car2](ours-car2.png)
-- In-game ![ingame-roads](ingame-roads.png)
-- Ours ![ours-roads](ours-roads.png)
-- In-game ![ingame-roads2](ingame-roads2.png)
-- Ours ![ours-roads2](ours-roads2.png)
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
 ### Compute Infrastructure
-[More Information Needed]
 #### Hardware
-[More Information Needed]
 #### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
 ## Model Card Contact
-[More Information Needed]

 - v1_3: 93M params
 - v2_1: 2.9M params
+**Training hardware:** Each model was trained on 2 x T4 GPUs (multi-GPU training). For this reason, linear attention modules were implemented as ring (distributed) attention during training.
+**Total Training Compute Throughput:** 4.13 TFLOPS
+**Total Logged Training Time:** ~210 hours (total time split across four models including overhead)
+**Start Time:** 09-13-2024
+**End Time:** 09-21-2024
+**Checkpoint Size:**
+- v1_1: 855 MB
+- v1_2: 764 MB
+- v1_3: 355 MB
+- v2_2: 11 MB
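The checkpoint sizes can be sanity-checked against the parameter counts listed earlier. The sketch below assumes weights are stored as 32-bit floats and that "MB" in the list means MiB (1024² bytes); neither is stated in the card. It also assumes the 11 MB entry, listed here as v2_2, belongs to the 2.9M-parameter model that the parameter list calls v2_1. The function name is hypothetical.

```python
# Rough checkpoint-size estimate from parameter count.
# Assumptions (not stated in the model card): float32 weights,
# sizes reported in MiB, and no significant non-weight payload.
BYTES_PER_FP32 = 4
MIB = 1024 ** 2


def fp32_checkpoint_mib(n_params: int) -> float:
    """Return the approximate size in MiB of n_params float32 weights."""
    return n_params * BYTES_PER_FP32 / MIB


v1_3_mib = fp32_checkpoint_mib(93_000_000)  # card lists 355 MB
v2_mib = fp32_checkpoint_mib(2_900_000)     # card lists 11 MB

print(f"v1_3: {v1_3_mib:.1f} MiB")  # about 354.8
print(f"v2:   {v2_mib:.1f} MiB")    # about 11.1
```

Under these assumptions both estimates land within 1 MiB of the listed sizes, which suggests the checkpoints hold little beyond the raw weights.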
+## Evaluation Data, Metrics & Results
+This section covers information on how the model was evaluated at each stage.
+### Evaluation Data
+Evaluation was performed on real-time footage captured from Grand Theft Auto V, Cyberpunk 2077 and WatchDogs 2.
+### Metrics
+- PSNR (Peak Signal-to-Noise Ratio)
+- Combined loss (L1 loss + Total Variation loss)
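As a minimal sketch of the two metrics above, the following NumPy code computes PSNR and an L1 + Total Variation loss for images stored as `(H, W, C)` float arrays in `[0, 1]`. The function names, the `max_val` default, the anisotropic mean-based form of total variation, and the `tv_weight` knob are illustrative assumptions; the card does not say how the two loss terms are weighted or normalized.

```python
import numpy as np


def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    """Peak Signal-to-Noise Ratio in dB; higher means closer to the target."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))


def total_variation(img: np.ndarray) -> float:
    """Mean absolute difference between vertically and horizontally adjacent pixels."""
    dh = np.abs(img[1:, :, :] - img[:-1, :, :]).mean()
    dw = np.abs(img[:, 1:, :] - img[:, :-1, :]).mean()
    return float(dh + dw)


def combined_loss(pred: np.ndarray, target: np.ndarray, tv_weight: float = 1.0) -> float:
    """L1 reconstruction error plus a TV smoothness penalty on the prediction.

    tv_weight is a hypothetical parameter; 1.0 gives a plain sum of the two terms.
    """
    l1 = np.abs(pred - target).mean()
    return float(l1 + tv_weight * total_variation(pred))


# Usage on a toy pair: a black target and a uniformly gray prediction.
target = np.zeros((4, 4, 3))
pred = np.full((4, 4, 3), 0.1)
print(psnr(pred, target))           # close to 20 dB, since MSE is 0.01
print(combined_loss(pred, target))  # close to 0.1: L1 is 0.1, TV of a flat image is 0
```

The L1 term pulls the output toward the reference frame, while the TV term discourages high-frequency noise in the generated image.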
+### Results
+- In-game ![ingame-car](ingame-car.jpg)
+- Ours ![ours-car](ours-car.jpg)
+- In-game ![ingame-car2](ingame-car2.png)
+- Ours ![ours-car2](ours-car2.png)
+- In-game ![ingame-roads](ingame-roads.png)
+- Ours ![ours-roads](ours-roads.png)
+- In-game ![ingame-roads2](ingame-roads2.png)
+- Ours ![ours-roads2](ours-roads2.png)
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** 2 x Nvidia T4 16GB GPUs
+- **Hours used:** 210 (per GPU); 420 (combined)
+- **Cloud Provider:** Kaggle
+- **Compute Region:** US
+- **Carbon Emitted:** 8.82 kg CO2
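The reported figure can be reproduced with simple arithmetic. The sketch below assumes each T4 draws its full 70 W board power for all 420 combined GPU-hours and that the grid emits 0.3 kg CO2eq per kWh. The carbon intensity is an assumed value chosen because it reproduces the 8.82 kg listed above; the card does not state which inputs were actually used.

```python
# Back-of-the-envelope emissions estimate.
# Assumptions (not stated in the model card): GPUs run at full board power
# for the whole logged time, and a grid intensity of 0.3 kg CO2eq/kWh.
T4_POWER_KW = 0.070          # NVIDIA T4 board power: 70 W
GPU_HOURS = 420              # 2 GPUs x 210 hours, from the list above
CARBON_INTENSITY = 0.3       # kg CO2eq per kWh (assumed)


def estimate_emissions(gpu_hours: float, power_kw: float, intensity: float):
    """Return (energy in kWh, emissions in kg CO2eq)."""
    energy_kwh = gpu_hours * power_kw
    return energy_kwh, energy_kwh * intensity


energy_kwh, co2_kg = estimate_emissions(GPU_HOURS, T4_POWER_KW, CARBON_INTENSITY)
print(f"{energy_kwh:.1f} kWh, {co2_kg:.2f} kg CO2eq")  # 29.4 kWh, 8.82 kg CO2eq
```

Actual emissions depend on real power draw and the data center's energy mix, so this is an upper-bound style estimate on the power side only.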
+## Technical Specifications
+### Model Architecture and Objective
+RealFormer is a Transformer-based, low-latency generative style-transfer model that reconstructs each game frame as a more photorealistic image.
+Its objective is to reach a level of real-world detail that even current video games with exhaustive graphics cannot achieve.
 **Architecture:** The latest model, v2_1, introduces Location-based Multi-head Attention (LbMhA) to improve feature extraction with fewer parameters. Its three predecessors attained a similar level of accuracy without the LbMhA layers. The general architecture is as follows:
 ```python
 DataParallel(
   (module): ViTImage2Image(
     (patch_embed): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
 )
 ```
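To illustrate what the `patch_embed` layer in the printout does: a `Conv2d(3, 768, kernel_size=16, stride=16)` is mathematically equivalent to cutting the image into non-overlapping 16 x 16 patches and applying one linear projection to each flattened patch. The NumPy sketch below shows that equivalent view. The 224 x 224 input size, the random weights, and the function names are assumptions for illustration only; they are not taken from the RealFormer code.

```python
import numpy as np

PATCH = 16       # kernel_size and stride of the printed Conv2d
IN_CH = 3        # input channels of the printed Conv2d
EMBED_DIM = 768  # output channels of the printed Conv2d


def patchify(image: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, c)
    return grid.reshape(-1, patch * patch * c)


def patch_embed(image: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Project every flattened patch to an EMBED_DIM-dimensional token."""
    return patchify(image) @ weight + bias


rng = np.random.default_rng(0)
image = rng.random((224, 224, IN_CH))  # hypothetical input resolution
weight = rng.standard_normal((PATCH * PATCH * IN_CH, EMBED_DIM)) * 0.02
bias = np.zeros(EMBED_DIM)

tokens = patch_embed(image, weight, bias)
n_params = weight.size + bias.size

print(tokens.shape)  # (196, 768): a 14 x 14 grid of patch tokens
print(n_params)      # 590592 weights and biases in this one layer
```

Each 16 x 16 x 3 patch holds 768 values, so the projection is a square 768 x 768 matrix plus a bias, and the token count grows with the square of the input resolution.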
 ### Compute Infrastructure
 #### Hardware
+2 x Nvidia T4 16GB GPUs
 #### Software
+- PyTorch
+- torchvision
+- einops
+- numpy
+- PIL (Python Imaging Library)
+- matplotlib (for visualization)
+## Model Card Authors
+Alosh Denny
 ## Model Card Contact
+[email protected]