Upload 27 files
Browse files
- .gitattributes +15 -0
- README.md +257 -10
- assets/Compare.png +3 -0
- assets/Pipeline.png +3 -0
- assets/Qualitative-1.png +3 -0
- assets/Qualitative-2-1.png +3 -0
- assets/Qualitative-2-2.png +3 -0
- assets/Qualitative-3-1.png +3 -0
- assets/Qualitative-3-2.png +3 -0
- assets/Qualitative-4-1.png +3 -0
- assets/Qualitative-4-2.png +3 -0
- assets/Qualitative-5-1.png +3 -0
- assets/Qualitative-5-2.png +3 -0
- assets/Quantitative.png +3 -0
- assets/Strategy.png +3 -0
- datasets/README.md +14 -0
- datasets/demo/001.mp4 +0 -0
- datasets/demo/002.mp4 +0 -0
- datasets/demo/003.mp4 +0 -0
- datasets/demo/004.mp4 +0 -0
- datasets/demo/005.mp4 +3 -0
- datasets/demo/006.mp4 +3 -0
- datasets/demo/007.mp4 +0 -0
- eval_metrics.py +256 -0
- inference.sh +75 -0
- inference_script.py +754 -0
- pretrained_models/README.md +1 -0
- requirements.txt +20 -0
.gitattributes
CHANGED
@@ -33,3 +33,18 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+assets/Compare.png filter=lfs diff=lfs merge=lfs -text
+assets/Pipeline.png filter=lfs diff=lfs merge=lfs -text
+assets/Qualitative-1.png filter=lfs diff=lfs merge=lfs -text
+assets/Qualitative-2-1.png filter=lfs diff=lfs merge=lfs -text
+assets/Qualitative-2-2.png filter=lfs diff=lfs merge=lfs -text
+assets/Qualitative-3-1.png filter=lfs diff=lfs merge=lfs -text
+assets/Qualitative-3-2.png filter=lfs diff=lfs merge=lfs -text
+assets/Qualitative-4-1.png filter=lfs diff=lfs merge=lfs -text
+assets/Qualitative-4-2.png filter=lfs diff=lfs merge=lfs -text
+assets/Qualitative-5-1.png filter=lfs diff=lfs merge=lfs -text
+assets/Qualitative-5-2.png filter=lfs diff=lfs merge=lfs -text
+assets/Quantitative.png filter=lfs diff=lfs merge=lfs -text
+assets/Strategy.png filter=lfs diff=lfs merge=lfs -text
+datasets/demo/005.mp4 filter=lfs diff=lfs merge=lfs -text
+datasets/demo/006.mp4 filter=lfs diff=lfs merge=lfs -text
README.md
CHANGED
@@ -1,13 +1,260 @@
 ---
-
-
-
-
-
-
-app_file: app.py
-pinned: false
-license: mit
 ---
 
-

# DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution

[Zheng Chen](https://zhengchen1999.github.io/), [Zichen Zou](https://github.com/zzctmd), [Kewei Zhang](), [Xiongfei Su](https://ieeexplore.ieee.org/author/37086348852), [Xin Yuan](https://en.westlake.edu.cn/faculty/xin-yuan.html), [Yong Guo](https://www.guoyongcs.com/), and [Yulun Zhang](http://yulunzhang.com/), "DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution", 2025

<div>
<a href="https://github.com/zhengchen1999/DOVE/releases" target='_blank' style="text-decoration: none;"><img src="https://img.shields.io/github/downloads/zhengchen1999/DOVE/total?color=green&style=flat"></a>
<a href="https://github.com/zhengchen1999/DOVE" target='_blank' style="text-decoration: none;"><img src="https://visitor-badge.laobi.icu/badge?page_id=zhengchen1999/DOVE"></a>
<a href="https://github.com/zhengchen1999/DOVE/stargazers" target='_blank' style="text-decoration: none;"><img src="https://img.shields.io/github/stars/zhengchen1999/DOVE?style=social"></a>
</div>

[[arXiv](https://arxiv.org/abs/2505.16239)] [[supplementary material](https://github.com/zhengchen1999/DOVE/releases/download/v1/Supplementary_Material.pdf)] [[dataset](https://drive.google.com/drive/folders/1e7CyNzfJBa2saWvPr2HI2q_FJhLIc-Ww?usp=drive_link)] [[pretrained models](https://drive.google.com/drive/folders/1wj9jY0fn6prSWJ7BjJOXfxC0bs8skKbQ?usp=sharing)]

#### 🔥🔥🔥 News

- **2025-6-09:** Test datasets, inference scripts, and pretrained models are available. ⭐️⭐️⭐️
- **2025-5-22:** This repo is released.

---

> **Abstract:** Diffusion models have demonstrated promising performance in real-world video super-resolution (VSR). However, the dozens of sampling steps they require make inference extremely slow. Sampling acceleration techniques, particularly single-step sampling, provide a potential solution. Nonetheless, achieving one step in VSR remains challenging due to the high training overhead on video data and stringent fidelity demands. To tackle the above issues, we propose DOVE, an efficient one-step diffusion model for real-world VSR. DOVE is obtained by fine-tuning a pretrained video diffusion model (*i.e.*, CogVideoX). To effectively train DOVE, we introduce the latent–pixel training strategy. The strategy employs a two-stage scheme to gradually adapt the model to the video super-resolution task.
> Meanwhile, we design a video processing pipeline to construct a high-quality dataset tailored for VSR, termed HQ-VSR. Fine-tuning on this dataset further enhances the restoration capability of DOVE. Extensive experiments show that DOVE exhibits comparable or superior performance to multi-step diffusion-based VSR methods. It also offers outstanding inference efficiency, achieving up to a **28×** speed-up over existing methods such as MGLD-VSR.



---

<table border="0" style="width: 100%; text-align: center; margin-top: 20px;">
  <tr>
    <td>
      <video src="https://github.com/user-attachments/assets/4ad0ca78-6cca-48c0-95a5-5d5554093f7d" controls autoplay loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/e5b5d247-28af-43fd-b32c-1f1b5896d9e7" controls autoplay loop></video>
    </td>
  </tr>
</table>

---

### Training Strategy



---

### Video Processing Pipeline



## 🔖 TODO

- [x] Release testing code.
- [x] Release pre-trained models.
- [ ] Release training code.
- [ ] Release video processing pipeline.
- [ ] Release HQ-VSR dataset.
- [ ] Provide WebUI.
- [ ] Provide HuggingFace demo.

## ⚙️ Dependencies

- Python 3.11
- PyTorch >= 2.5.0
- Diffusers

```bash
# Clone the github repo and go to the default directory 'DOVE'.
git clone https://github.com/zhengchen1999/DOVE.git
cd DOVE
conda create -n DOVE python=3.11
conda activate DOVE
pip install -r requirements.txt
pip install diffusers["torch"] transformers
pip install pyiqa
```

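As an optional sanity check after installation, you can confirm that the key packages import and that CUDA is visible (this check is a suggestion, not part of the official setup):

```bash
# Optional environment check; assumes the 'DOVE' conda env is active.
python -c "import torch, diffusers, pyiqa; print('torch', torch.__version__, '| cuda:', torch.cuda.is_available()); print('diffusers', diffusers.__version__)"
```
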
## 🔗 Contents

1. [Datasets](#datasets)
1. [Models](#models)
1. Training
1. [Testing](#testing)
1. [Results](#results)
1. [Acknowledgements](#acknowledgements)

## <a name="datasets"></a>📁 Datasets

### 🗳️ Test Datasets

We provide several real-world and synthetic test datasets for evaluation. All datasets follow a consistent directory structure:

| Dataset | Type | # Videos | Download |
| :------ | :--------: | :------: | :----------------------------------------------------------: |
| UDM10 | Synthetic | 10 | [Google Drive](https://drive.google.com/file/d/1AmGVSCwMm_OFPd3DKgNyTwj0GG2H-tG4/view?usp=drive_link) |
| SPMCS | Synthetic | 30 | [Google Drive](https://drive.google.com/file/d/1b2uktCFPKS-R1fTecWcLFcOnmUFIBNWT/view?usp=drive_link) |
| YouHQ40 | Synthetic | 40 | [Google Drive](https://drive.google.com/file/d/1zO23UCStxL3htPJQcDUUnUeMvDrysLTh/view?usp=sharing) |
| RealVSR | Real-world | 50 | [Google Drive](https://drive.google.com/file/d/1wr4tTiCvQlqdYPeU1dmnjb5KFY4VjGCO/view?usp=drive_link) |
| MVSR4x | Real-world | 15 | [Google Drive](https://drive.google.com/file/d/16sesBD_9Xx_5Grtx18nosBw1w94KlpQt/view?usp=drive_link) |
| VideoLQ | Real-world | 50 | [Google Drive](https://drive.google.com/file/d/1lh0vkU_llxE0un1OigJ0DWPQwt1i68Vn/view?usp=drive_link) |

All datasets are also hosted [here](https://drive.google.com/drive/folders/1yNKG6rtTNtZQY8qL74GoQwA0jgjBUEby?usp=sharing). Make sure the path is correct (`datasets/test/`) before running inference.

The directory structure is as follows:

```shell
datasets/
└── test/
    └── [DatasetName]/
        ├── GT/        # Ground Truth: folder of high-quality frames (one per clip)
        ├── GT-Video/  # Ground Truth (video version): lossless MKV format
        ├── LQ/        # Low-Quality Input: folder of degraded frames (one per clip)
        └── LQ-Video/  # Low-Quality Input (video version): lossless MKV format
```

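As an example of setting up one of these test sets locally, the sketch below downloads the UDM10 archive with the `gdown` tool, using the file ID from the table above, and unpacks it under `datasets/test/`; the archive name and zip format are assumptions, so adjust them to match the actual Drive file:

```bash
# Hypothetical download of UDM10 (file ID taken from the table above).
# Assumes `pip install gdown` and that the shared file is a zip archive.
mkdir -p datasets/test
gdown "https://drive.google.com/uc?id=1AmGVSCwMm_OFPd3DKgNyTwj0GG2H-tG4" -O UDM10.zip
unzip UDM10.zip -d datasets/test/
```
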
## <a name="models"></a>📦 Models

We provide pretrained weights for DOVE and DOVE-2B.

| Model Name | Description | HuggingFace | Google Drive | Visual Results |
| :--------- | :-------------------------------------: | :---------: | :----------------------------------------------------------: | :----------------------------------------------------------: |
| DOVE | Base version, built on CogVideoX1.5-5B | TODO | [Download](https://drive.google.com/file/d/1Nl3XoJndMtpu6KPFcskUTkI0qWBiSXF2/view?usp=drive_link) | [Download](https://drive.google.com/drive/folders/1J92X1amVijH9dNWGQcz-6Cx44B7EipWr?usp=drive_link) |
| DOVE-2B | Smaller version, based on CogVideoX-2B | TODO | TODO | TODO |

> Place downloaded model files into the `pretrained_models/` folder, e.g., `pretrained_models/DOVE`.

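For reference, `inference_script.py` loads this folder with `CogVideoXPipeline.from_pretrained(args.model_path)`, so the checkpoint is expected to be a standard Diffusers pipeline directory. A rough sketch of the layout (the exact subfolders depend on the released files) is:

```shell
pretrained_models/DOVE/
├── model_index.json
├── scheduler/
├── text_encoder/
├── tokenizer/
├── transformer/
└── vae/
```
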
## <a name="testing"></a>🔨 Testing

We provide inference commands below. Before running, make sure to download the corresponding pretrained models and test datasets.

For more options and usage, please refer to [inference_script.py](inference_script.py).

The full testing commands are provided in the shell script: [inference.sh](inference.sh).

```shell
# 🔹 Demo inference
python inference_script.py \
    --input_dir datasets/demo \
    --model_path pretrained_models/DOVE \
    --output_path results/DOVE/demo \
    --is_vae_st \
    --save_format yuv420p

# 🔹 Reproduce paper results
python inference_script.py \
    --input_dir datasets/test/UDM10/LQ-Video \
    --model_path pretrained_models/DOVE \
    --output_path results/DOVE/UDM10 \
    --is_vae_st

# 🔹 Evaluate quantitative metrics
python eval_metrics.py \
    --gt datasets/test/UDM10/GT \
    --pred results/DOVE/UDM10 \
    --metrics psnr,ssim,lpips,dists,clipiqa
```

> 💡 If you encounter out-of-memory (OOM) issues, you can enable chunk-based testing by setting the following parameters: `--tile_size_hw`, `--overlap_hw`, `--chunk_len`, and `--overlap_t`.
>
> 💡 The default save format is `yuv444p`. If playback fails, try `--save_format yuv420p` (this may slightly affect metrics).
>
> **TODO:** Add metric computation scripts for FasterVQA, DOVER, and $E^*_{warp}$.

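As an illustration of the OOM note above, the same UDM10 command can enable spatial tiling and temporal chunking; the tile and chunk sizes below are example values, not recommended settings from the paper:

```bash
# Example low-memory run with spatial tiles and temporal chunks (values are illustrative).
python inference_script.py \
    --input_dir datasets/test/UDM10/LQ-Video \
    --model_path pretrained_models/DOVE \
    --output_path results/DOVE/UDM10 \
    --is_vae_st \
    --tile_size_hw 512 512 \
    --overlap_hw 32 32 \
    --chunk_len 24 \
    --overlap_t 8
```
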
## <a name="results"></a>🔎 Results

We achieve state-of-the-art performance on real-world video super-resolution. Visual results are available at [Google Drive](https://drive.google.com/drive/folders/1J92X1amVijH9dNWGQcz-6Cx44B7EipWr?usp=drive_link).

<details open>
<summary>Quantitative Results (click to expand)</summary>

- Results in Tab. 2 of the main paper

<p align="center">
<img width="900" src="assets/Quantitative.png">
</p>

</details>

<details open>
<summary>Qualitative Results (click to expand)</summary>

- Results in Fig. 4 of the main paper

<p align="center">
<img width="900" src="assets/Qualitative-1.png">
</p>

<details>
<summary>More Qualitative Results</summary>

- More results in Fig. 3 of the supplementary material

<p align="center">
<img width="900" src="assets/Qualitative-2-1.png">
</p>

- More results in Fig. 4 of the supplementary material

<p align="center">
<img width="900" src="assets/Qualitative-2-2.png">
</p>

- More results in Fig. 5 of the supplementary material

<p align="center">
<img width="900" src="assets/Qualitative-3-1.png">
<img width="900" src="assets/Qualitative-3-2.png">
</p>

- More results in Fig. 6 of the supplementary material

<p align="center">
<img width="900" src="assets/Qualitative-4-1.png">
<img width="900" src="assets/Qualitative-4-2.png">
</p>

- More results in Fig. 7 of the supplementary material

<p align="center">
<img width="900" src="assets/Qualitative-5-1.png">
<img width="900" src="assets/Qualitative-5-2.png">
</p>

</details>

</details>

## <a name="citation"></a>📎 Citation

If you find the code helpful in your research or work, please cite the following paper(s).

```
@article{chen2025dove,
  title={DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution},
  author={Chen, Zheng and Zou, Zichen and Zhang, Kewei and Su, Xiongfei and Yuan, Xin and Guo, Yong and Zhang, Yulun},
  journal={arXiv preprint arXiv:2505.16239},
  year={2025}
}
```

## <a name="acknowledgements"></a>💡 Acknowledgements

This project is based on [CogVideo](https://github.com/THUDM/CogVideo) and [Open-Sora](https://github.com/hpcaitech/Open-Sora).

assets/Compare.png
ADDED (image stored with Git LFS)
assets/Pipeline.png
ADDED (image stored with Git LFS)
assets/Qualitative-1.png
ADDED (image stored with Git LFS)
assets/Qualitative-2-1.png
ADDED (image stored with Git LFS)
assets/Qualitative-2-2.png
ADDED (image stored with Git LFS)
assets/Qualitative-3-1.png
ADDED (image stored with Git LFS)
assets/Qualitative-3-2.png
ADDED (image stored with Git LFS)
assets/Qualitative-4-1.png
ADDED (image stored with Git LFS)
assets/Qualitative-4-2.png
ADDED (image stored with Git LFS)
assets/Qualitative-5-1.png
ADDED (image stored with Git LFS)
assets/Qualitative-5-2.png
ADDED (image stored with Git LFS)
assets/Quantitative.png
ADDED (image stored with Git LFS)
assets/Strategy.png
ADDED (image stored with Git LFS)
datasets/README.md
ADDED
@@ -0,0 +1,14 @@
The directory structure is as follows:

```shell
datasets/
├── demo/
└── test/
    └── [DatasetName]/
        ├── GT/        # Ground Truth: folder of high-quality frames (one per clip)
        ├── GT-Video/  # Ground Truth (video version): lossless MKV format
        ├── LQ/        # Low-Quality Input: folder of degraded frames (one per clip)
        └── LQ-Video/  # Low-Quality Input (video version): lossless MKV format
```

All datasets are available [here](https://drive.google.com/drive/folders/1yNKG6rtTNtZQY8qL74GoQwA0jgjBUEby?usp=sharing).
datasets/demo/001.mp4
ADDED
Binary file (62.5 kB)
datasets/demo/002.mp4
ADDED
Binary file (97.3 kB)
datasets/demo/003.mp4
ADDED
Binary file (60.4 kB)
datasets/demo/004.mp4
ADDED
Binary file (79.6 kB)
datasets/demo/005.mp4
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:de2fe395e78a9d556a3763d7a7bdf87102e0c1191e1d146a54d487f78a57d708
size 268870
datasets/demo/006.mp4
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:194615754e590bc57e84f41a11b7bca4d564455ed93b72be700a6300119d34ac
size 206748
datasets/demo/007.mp4
ADDED
Binary file (48.4 kB)
eval_metrics.py
ADDED
@@ -0,0 +1,256 @@
import os
import cv2
import json
import torch
import pyiqa
import numpy as np
from PIL import Image
from tqdm import tqdm
from torchvision import transforms

# 0 ~ 1
to_tensor = transforms.ToTensor()
video_exts = ['.mp4', '.avi', '.mov', '.mkv']
fr_metrics = ['psnr', 'ssim', 'lpips', 'dists']


def is_video_file(filename):
    return any(filename.lower().endswith(ext) for ext in video_exts)


def rgb_to_y(img):
    # Assumes img is [1, 3, H, W] in [0, 1], returns [1, 1, H, W]
    r, g, b = img[:, 0:1], img[:, 1:2], img[:, 2:3]
    y = 0.257 * r + 0.504 * g + 0.098 * b + 0.0625
    return y


def crop_border(img, crop):
    return img[:, :, crop:-crop, crop:-crop]


def read_video_frames(video_path):
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(to_tensor(Image.fromarray(rgb)))
    cap.release()
    return torch.stack(frames)


def read_image_folder(folder_path):
    image_files = sorted([
        os.path.join(folder_path, f) for f in os.listdir(folder_path)
        if f.lower().endswith(('.png', '.jpg', '.jpeg'))
    ])
    frames = [to_tensor(Image.open(p).convert("RGB")) for p in image_files]
    return torch.stack(frames)


def load_sequence(path):
    if os.path.isdir(path):
        return read_image_folder(path)
    elif os.path.isfile(path):
        if is_video_file(path):
            return read_video_frames(path)
        elif path.lower().endswith(('.png', '.jpg', '.jpeg')):
            # Treat image as a single-frame video
            img = to_tensor(Image.open(path).convert("RGB"))
            return img.unsqueeze(0)  # [1, C, H, W]
    raise ValueError(f"Unsupported input: {path}")


def crop_img_center(img, target_h, target_w):
    _, h, w = img.shape
    top = max((h - target_h) // 2, 0)
    left = max((w - target_w) // 2, 0)
    return img[:, top:top+target_h, left:left+target_w]


def crop_img_top_left(img, target_h, target_w):
    # Crop image from top-left corner to (target_h, target_w)
    return img[:, :target_h, :target_w]


def match_resolution(gt_frames, pred_frames, is_center=False, name=None):
    t = min(gt_frames.shape[0], pred_frames.shape[0])
    gt_frames = gt_frames[:t]
    pred_frames = pred_frames[:t]
    _, _, h_g, w_g = gt_frames.shape
    _, _, h_p, w_p = pred_frames.shape

    target_h = min(h_g, h_p)
    target_w = min(w_g, w_p)

    if (h_g != h_p or w_g != w_p) and name:
        if is_center:
            print(f"[{name}] Resolution mismatch detected: GT is ({h_g}, {w_g}), Pred is ({h_p}, {w_p}). Both GT and Pred were center cropped to ({target_h}, {target_w}).")
        else:
            print(f"[{name}] Resolution mismatch detected: GT is ({h_g}, {w_g}), Pred is ({h_p}, {w_p}). Both GT and Pred were top-left cropped to ({target_h}, {target_w}).")

    if is_center:
        gt_frames = torch.stack([crop_img_center(f, target_h, target_w) for f in gt_frames])
        pred_frames = torch.stack([crop_img_center(f, target_h, target_w) for f in pred_frames])
    else:
        gt_frames = torch.stack([crop_img_top_left(f, target_h, target_w) for f in gt_frames])
        pred_frames = torch.stack([crop_img_top_left(f, target_h, target_w) for f in pred_frames])

    return gt_frames, pred_frames


def init_models(metrics, device):
    models = {}
    for name in metrics:
        try:
            models[name] = pyiqa.create_metric(name).to(device).eval()
        except Exception as e:
            print(f"Failed to initialize metric '{name}': {e}")
    return models


def compute_metrics(pred_frames, gt_frames, models, device, batch_mode, crop, test_y_channel):
    if batch_mode:
        pred_batch = pred_frames.to(device)  # [F, C, H, W]
        gt_batch = gt_frames.to(device)      # [F, C, H, W]

        results = {}
        for name, model in models.items():
            if name in fr_metrics:
                pred_eval = pred_batch
                gt_eval = gt_batch
                if crop > 0:
                    pred_eval = crop_border(pred_eval, crop)
                    gt_eval = crop_border(gt_eval, crop)
                if test_y_channel:
                    pred_eval = rgb_to_y(pred_eval)
                    gt_eval = rgb_to_y(gt_eval)
                values = model(pred_eval, gt_eval)  # [F]
            else:
                values = model(pred_batch)  # no-reference
            results[name] = round(values.mean().item(), 4)
        return results

    else:
        results = {name: [] for name in models}
        for pred, gt in zip(pred_frames, gt_frames):
            pred = pred.unsqueeze(0).to(device)
            gt = gt.unsqueeze(0).to(device)

            for name, model in models.items():
                if name in fr_metrics:
                    pred_eval = pred
                    gt_eval = gt
                    if crop > 0:
                        pred_eval = crop_border(pred_eval, crop)
                        gt_eval = crop_border(gt_eval, crop)
                    if test_y_channel:
                        pred_eval = rgb_to_y(pred_eval)
                        gt_eval = rgb_to_y(gt_eval)
                    value = model(pred_eval, gt_eval).item()
                else:
                    value = model(pred).item()
                results[name].append(value)

        return {k: round(np.mean(v), 4) for k, v in results.items()}


def process(gt_root, pred_root, out_path, metrics, batch_mode, crop, test_y_channel, is_center):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Using device: {device}")
    models = init_models(metrics, device)

    has_gt = bool(gt_root and os.path.exists(gt_root))

    if has_gt:
        gt_files = {os.path.splitext(f)[0]: os.path.join(gt_root, f) for f in os.listdir(gt_root)}
    pred_files = {os.path.splitext(f)[0]: os.path.join(pred_root, f) for f in os.listdir(pred_root)}

    pred_names = sorted(pred_files.keys())
    results = {}
    aggregate = {metric: [] for metric in metrics}

    for name in tqdm(pred_names, desc="Evaluating"):
        # # valida
        # name_hr = name.replace('_CAT_A_x4', '').replace('img_', 'img')
        name_hr = name
        if has_gt and name_hr not in gt_files:
            print(f"Skipping {name_hr}: no matching GT file.")
            continue

        pred_path = pred_files[name]
        gt_path = gt_files[name_hr] if has_gt else None

        try:
            pred_frames = load_sequence(pred_path)

            if has_gt:
                gt_frames = load_sequence(gt_path)
                gt_frames, pred_frames = match_resolution(gt_frames, pred_frames, is_center=is_center, name=name)
                scores = compute_metrics(pred_frames, gt_frames, models, device, batch_mode, crop, test_y_channel)
            else:
                nr_models = {k: v for k, v in models.items() if k not in fr_metrics}
                if not nr_models:
                    print(f"Skipping {name}: GT is not provided and no NR-IQA metrics found.")
                    continue
                dummy_gt = pred_frames
                scores = compute_metrics(pred_frames, dummy_gt, nr_models, device, batch_mode, crop, test_y_channel)

            results[name] = scores
            for k in scores:
                aggregate[k].append(scores[k])
        except Exception as e:
            print(f"Error processing {name}: {e}")

    print("\nPer-sample Results:")
    for name in sorted(results):
        print(f"{name}: " + ", ".join(f"{k}={v:.4f}" for k, v in results[name].items()))

    print("\nOverall Average Results:")
    count = len(results)
    if count > 0:
        overall_avg = {k: round(np.mean(v), 4) for k, v in aggregate.items()}
        for k, v in overall_avg.items():
            print(f"{k.upper()}: {v:.4f}")
    else:
        overall_avg = {}
        print("No valid samples were processed.")

    print(f"\nProcessed {count} samples.")

    output = {
        "per_sample": results,
        "average": overall_avg,
        "count": count
    }

    os.makedirs(out_path, exist_ok=True)
    out_name = 'metrics_'
    for metric in metrics:
        out_name += f"{metric}_"
    out_name = out_name.rstrip('_') + '.json'
    out_path = os.path.join(out_path, out_name)

    with open(out_path, 'w') as f:
        json.dump(output, f, indent=2)

    print(f"Results saved to: {out_path}")


if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('--gt', type=str, default='', help='Path to GT folder (optional for NR-IQA)')
    parser.add_argument('--pred', type=str, required=True, help='Path to predicted results folder')
    parser.add_argument('--out', type=str, default='', help='Path to save JSON output (as directory)')
    parser.add_argument('--metrics', type=str, default='psnr,ssim,clipiqa',
                        help='Comma-separated list of metrics: psnr,ssim,clipiqa,lpips,...')
    parser.add_argument('--batch_mode', action='store_true', help='Use batch mode for metrics computation')
    parser.add_argument('--crop', type=int, default=0, help='Crop border size for PSNR/SSIM')
    parser.add_argument('--test_y_channel', action='store_true', help='Use Y channel for PSNR/SSIM')
    parser.add_argument('--is_center', action='store_true', help='Use center crop for PSNR/SSIM')

    args = parser.parse_args()

    if args.out == '':
        out = args.pred
    else:
        out = args.out
    metric_list = [m.strip().lower() for m in args.metrics.split(',')]
    process(args.gt, args.pred, out, metric_list, args.batch_mode, args.crop, args.test_y_channel, args.is_center)
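For clips without ground truth (e.g., VideoLQ), `eval_metrics.py` can also be run with `--gt` left empty so that only no-reference metrics are computed. A minimal sketch (the extra metric name `musiq` is taken from the pyiqa metric list mentioned in `inference_script.py`):

```bash
# No-reference evaluation only (no --gt): full-reference metrics are skipped.
python eval_metrics.py \
    --pred results/DOVE/VideoLQ \
    --metrics clipiqa,musiq \
    --out results/DOVE/VideoLQ
```
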
inference.sh
ADDED
@@ -0,0 +1,75 @@
#!/usr/bin/env bash

# UDM10
python inference_script.py \
    --input_dir datasets/test/UDM10/LQ-Video \
    --model_path pretrained_models/DOVE \
    --output_path results/DOVE/UDM10 \
    --is_vae_st

python eval_metrics.py \
    --gt datasets/test/UDM10/GT \
    --pred results/DOVE/UDM10 \
    --metrics psnr,ssim,lpips,dists,clipiqa

# SPMCS
python inference_script.py \
    --input_dir datasets/test/SPMCS/LQ-Video \
    --model_path pretrained_models/DOVE \
    --output_path results/DOVE/SPMCS \
    --is_vae_st

python eval_metrics.py \
    --gt datasets/test/SPMCS/GT \
    --pred results/DOVE/SPMCS \
    --metrics psnr,ssim,lpips,dists,clipiqa

# YouHQ40
python inference_script.py \
    --input_dir datasets/test/YouHQ40/LQ-Video \
    --model_path pretrained_models/DOVE \
    --output_path results/DOVE/YouHQ40 \
    --is_vae_st

python eval_metrics.py \
    --gt datasets/test/YouHQ40/GT \
    --pred results/DOVE/YouHQ40 \
    --metrics psnr,ssim,lpips,dists,clipiqa

# RealVSR
python inference_script.py \
    --input_dir datasets/test/RealVSR/LQ-Video \
    --model_path pretrained_models/DOVE \
    --output_path results/DOVE/RealVSR \
    --is_vae_st \
    --upscale 1

python eval_metrics.py \
    --gt datasets/test/RealVSR/GT \
    --pred results/DOVE/RealVSR \
    --metrics psnr,ssim,lpips,dists,clipiqa

# MVSR4x
python inference_script.py \
    --input_dir datasets/test/MVSR4x/LQ-Video \
    --model_path pretrained_models/DOVE \
    --output_path results/DOVE/MVSR4x \
    --is_vae_st \
    --upscale 1

python eval_metrics.py \
    --gt datasets/test/MVSR4x/GT \
    --pred results/DOVE/MVSR4x \
    --metrics psnr,ssim,lpips,dists,clipiqa

# VideoLQ (no ground truth; no-reference metric only)
python inference_script.py \
    --input_dir datasets/test/VideoLQ/LQ-Video \
    --model_path pretrained_models/DOVE \
    --output_path results/DOVE/VideoLQ \
    --is_vae_st

python eval_metrics.py \
    --pred results/DOVE/VideoLQ \
    --metrics clipiqa
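On GPUs with limited memory, any of the runs above can additionally pass `--is_cpu_offload`, which switches `inference_script.py` to sequential CPU offload (noticeably slower). An illustrative variant of the UDM10 command:

```bash
# Same UDM10 run with sequential CPU offload enabled (lower VRAM use, slower inference).
python inference_script.py \
    --input_dir datasets/test/UDM10/LQ-Video \
    --model_path pretrained_models/DOVE \
    --output_path results/DOVE/UDM10 \
    --is_vae_st \
    --is_cpu_offload
```
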
inference_script.py
ADDED
@@ -0,0 +1,754 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
from pathlib import Path
|
2 |
+
import argparse
|
3 |
+
import logging
|
4 |
+
|
5 |
+
import torch
|
6 |
+
from torchvision import transforms
|
7 |
+
from torchvision.io import write_video
|
8 |
+
from tqdm import tqdm
|
9 |
+
|
10 |
+
from diffusers import (
|
11 |
+
CogVideoXDPMScheduler,
|
12 |
+
CogVideoXPipeline,
|
13 |
+
)
|
14 |
+
|
15 |
+
from transformers import set_seed
|
16 |
+
from typing import Dict, Tuple
|
17 |
+
from diffusers.models.embeddings import get_3d_rotary_pos_embed
|
18 |
+
|
19 |
+
import json
|
20 |
+
import os
|
21 |
+
import cv2
|
22 |
+
from PIL import Image
|
23 |
+
|
24 |
+
from pathlib import Path
|
25 |
+
import pyiqa
|
26 |
+
import imageio.v3 as iio
|
27 |
+
import glob
|
28 |
+
|
29 |
+
# Must import after torch because this can sometimes lead to a nasty segmentation fault, or stack smashing error
|
30 |
+
# Very few bug reports but it happens. Look in decord Github issues for more relevant information.
|
31 |
+
import decord # isort:skip
|
32 |
+
|
33 |
+
decord.bridge.set_bridge("torch")
|
34 |
+
|
35 |
+
logging.basicConfig(level=logging.INFO)
|
36 |
+
|
37 |
+
# 0 ~ 1
|
38 |
+
to_tensor = transforms.ToTensor()
|
39 |
+
video_exts = ['.mp4', '.avi', '.mov', '.mkv']
|
40 |
+
fr_metrics = ['psnr', 'ssim', 'lpips', 'dists']
|
41 |
+
|
42 |
+
|
43 |
+
def no_grad(func):
|
44 |
+
def wrapper(*args, **kwargs):
|
45 |
+
with torch.no_grad():
|
46 |
+
return func(*args, **kwargs)
|
47 |
+
return wrapper
|
48 |
+
|
49 |
+
|
50 |
+
def is_video_file(filename):
|
51 |
+
return any(filename.lower().endswith(ext) for ext in video_exts)
|
52 |
+
|
53 |
+
|
54 |
+
def read_video_frames(video_path):
|
55 |
+
cap = cv2.VideoCapture(video_path)
|
56 |
+
frames = []
|
57 |
+
while True:
|
58 |
+
ret, frame = cap.read()
|
59 |
+
if not ret:
|
60 |
+
break
|
61 |
+
rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
|
62 |
+
frames.append(to_tensor(Image.fromarray(rgb)))
|
63 |
+
cap.release()
|
64 |
+
return torch.stack(frames)
|
65 |
+
|
66 |
+
|
67 |
+
def read_image_folder(folder_path):
|
68 |
+
image_files = sorted([
|
69 |
+
os.path.join(folder_path, f) for f in os.listdir(folder_path)
|
70 |
+
if f.lower().endswith(('.png', '.jpg', '.jpeg'))
|
71 |
+
])
|
72 |
+
frames = [to_tensor(Image.open(p).convert("RGB")) for p in image_files]
|
73 |
+
return torch.stack(frames)
|
74 |
+
|
75 |
+
|
76 |
+
def load_sequence(path):
|
77 |
+
# return a tensor of shape [F, C, H, W] // 0, 1
|
78 |
+
if os.path.isdir(path):
|
79 |
+
return read_image_folder(path)
|
80 |
+
elif os.path.isfile(path):
|
81 |
+
if is_video_file(path):
|
82 |
+
return read_video_frames(path)
|
83 |
+
elif path.lower().endswith(('.png', '.jpg', '.jpeg')):
|
84 |
+
# Treat image as a single-frame video
|
85 |
+
img = to_tensor(Image.open(path).convert("RGB"))
|
86 |
+
return img.unsqueeze(0) # [1, C, H, W]
|
87 |
+
raise ValueError(f"Unsupported input: {path}")
|
88 |
+
|
89 |
+
@no_grad
|
90 |
+
def compute_metrics(pred_frames, gt_frames, metrics_model, metric_accumulator, file_name):
|
91 |
+
|
92 |
+
print(f"\n\n[{file_name}] Metrics:", end=" ")
|
93 |
+
for name, model in metrics_model.items():
|
94 |
+
scores = []
|
95 |
+
for i in range(pred_frames.shape[0]):
|
96 |
+
pred = pred_frames[i].unsqueeze(0)
|
97 |
+
if gt_frames != None:
|
98 |
+
gt = gt_frames[i].unsqueeze(0)
|
99 |
+
if name in fr_metrics:
|
100 |
+
score = model(pred, gt).item()
|
101 |
+
else:
|
102 |
+
score = model(pred).item()
|
103 |
+
scores.append(score)
|
104 |
+
val = sum(scores) / len(scores)
|
105 |
+
metric_accumulator[name].append(val)
|
106 |
+
print(f"{name.upper()}={val:.4f}", end=" ")
|
107 |
+
print()
|
108 |
+
|
109 |
+
|
110 |
+
def save_frames_as_png(video, output_dir, fps=8):
|
111 |
+
"""
|
112 |
+
Save video frames as PNG sequence.
|
113 |
+
|
114 |
+
Args:
|
115 |
+
video (torch.Tensor): shape [B, C, F, H, W], float in [0, 1]
|
116 |
+
output_dir (str): directory to save PNG files
|
117 |
+
fps (int): kept for API compatibility
|
118 |
+
"""
|
119 |
+
video = video[0] # Remove batch dimension
|
120 |
+
video = video.permute(1, 2, 3, 0) # [F, H, W, C]
|
121 |
+
|
122 |
+
os.makedirs(output_dir, exist_ok=True)
|
123 |
+
frames = (video * 255).clamp(0, 255).to(torch.uint8).cpu().numpy()
|
124 |
+
|
125 |
+
for i, frame in enumerate(frames):
|
126 |
+
filename = os.path.join(output_dir, f"{i:03d}.png")
|
127 |
+
Image.fromarray(frame).save(filename)
|
128 |
+
|
129 |
+
|
130 |
+
def save_video_with_imageio_lossless(video, output_path, fps=8):
|
131 |
+
"""
|
132 |
+
Save a video tensor to .mkv using imageio.v3.imwrite with ffmpeg backend.
|
133 |
+
|
134 |
+
Args:
|
135 |
+
video (torch.Tensor): shape [B, C, F, H, W], float in [0, 1]
|
136 |
+
output_path (str): where to save the .mkv file
|
137 |
+
fps (int): frames per second
|
138 |
+
"""
|
139 |
+
video = video[0]
|
140 |
+
video = video.permute(1, 2, 3, 0)
|
141 |
+
|
142 |
+
frames = (video * 255).clamp(0, 255).to(torch.uint8).cpu().numpy()
|
143 |
+
|
144 |
+
iio.imwrite(
|
145 |
+
output_path,
|
146 |
+
frames,
|
147 |
+
fps=fps,
|
148 |
+
codec='libx264rgb',
|
149 |
+
pixelformat='rgb24',
|
150 |
+
macro_block_size=None,
|
151 |
+
ffmpeg_params=['-crf', '0'],
|
152 |
+
)
|
153 |
+
|
154 |
+
|
155 |
+
def save_video_with_imageio(video, output_path, fps=8, format='yuv444p'):
|
156 |
+
"""
|
157 |
+
Save a video tensor to .mp4 using imageio.v3.imwrite with ffmpeg backend.
|
158 |
+
|
159 |
+
Args:
|
160 |
+
video (torch.Tensor): shape [B, C, F, H, W], float in [0, 1]
|
161 |
+
output_path (str): where to save the .mp4 file
|
162 |
+
fps (int): frames per second
|
163 |
+
"""
|
164 |
+
video = video[0]
|
165 |
+
video = video.permute(1, 2, 3, 0)
|
166 |
+
|
167 |
+
frames = (video * 255).clamp(0, 255).to(torch.uint8).cpu().numpy()
|
168 |
+
|
169 |
+
if format == 'yuv444p':
|
170 |
+
iio.imwrite(
|
171 |
+
output_path,
|
172 |
+
frames,
|
173 |
+
fps=fps,
|
174 |
+
codec='libx264',
|
175 |
+
pixelformat='yuv444p',
|
176 |
+
macro_block_size=None,
|
177 |
+
ffmpeg_params=['-crf', '0'],
|
178 |
+
)
|
179 |
+
else:
|
180 |
+
iio.imwrite(
|
181 |
+
output_path,
|
182 |
+
frames,
|
183 |
+
fps=fps,
|
184 |
+
codec='libx264',
|
185 |
+
pixelformat='yuv420p',
|
186 |
+
macro_block_size=None,
|
187 |
+
ffmpeg_params=['-crf', '10'],
|
188 |
+
)
|
189 |
+
|
190 |
+
|
191 |
+
def preprocess_video_match(
|
192 |
+
video_path: Path | str,
|
193 |
+
is_match: bool = False,
|
194 |
+
) -> torch.Tensor:
|
195 |
+
"""
|
196 |
+
Loads a single video.
|
197 |
+
|
198 |
+
Args:
|
199 |
+
video_path: Path to the video file.
|
200 |
+
Returns:
|
201 |
+
A torch.Tensor with shape [F, C, H, W] where:
|
202 |
+
F = number of frames
|
203 |
+
C = number of channels (3 for RGB)
|
204 |
+
H = height
|
205 |
+
W = width
|
206 |
+
"""
|
207 |
+
if isinstance(video_path, str):
|
208 |
+
video_path = Path(video_path)
|
209 |
+
video_reader = decord.VideoReader(uri=video_path.as_posix())
|
210 |
+
video_num_frames = len(video_reader)
|
211 |
+
frames = video_reader.get_batch(list(range(video_num_frames)))
|
212 |
+
F, H, W, C = frames.shape
|
213 |
+
original_shape = (F, H, W, C)
|
214 |
+
|
215 |
+
pad_f = 0
|
216 |
+
pad_h = 0
|
217 |
+
pad_w = 0
|
218 |
+
|
219 |
+
if is_match:
|
220 |
+
remainder = (F - 1) % 8
|
221 |
+
if remainder != 0:
|
222 |
+
last_frame = frames[-1:]
|
223 |
+
pad_f = 8 - remainder
|
224 |
+
repeated_frames = last_frame.repeat(pad_f, 1, 1, 1)
|
225 |
+
frames = torch.cat([frames, repeated_frames], dim=0)
|
226 |
+
|
227 |
+
pad_h = (16 - H % 16) % 16
|
228 |
+
pad_w = (16 - W % 16) % 16
|
229 |
+
if pad_h > 0 or pad_w > 0:
|
230 |
+
# pad = (w_left, w_right, h_top, h_bottom)
|
231 |
+
frames = torch.nn.functional.pad(frames, pad=(0, 0, 0, pad_w, 0, pad_h)) # pad right and bottom
|
232 |
+
|
233 |
+
# to F, C, H, W
|
234 |
+
return frames.float().permute(0, 3, 1, 2).contiguous(), pad_f, pad_h, pad_w, original_shape
|
235 |
+
|
236 |
+
|
237 |
+
def remove_padding_and_extra_frames(video, pad_F, pad_H, pad_W):
|
238 |
+
if pad_F > 0:
|
239 |
+
video = video[:, :, :-pad_F, :, :]
|
240 |
+
if pad_H > 0:
|
241 |
+
video = video[:, :, :, :-pad_H, :]
|
242 |
+
if pad_W > 0:
|
243 |
+
video = video[:, :, :, :, :-pad_W]
|
244 |
+
|
245 |
+
return video
|
246 |
+
|
247 |
+
|
248 |
+
def make_temporal_chunks(F, chunk_len, overlap_t=8):
|
249 |
+
"""
|
250 |
+
Args:
|
251 |
+
F: total number of frames
|
252 |
+
chunk_len: int, chunk length in time (excluding overlap)
|
253 |
+
overlap: int, number of overlapping frames between chunks
|
254 |
+
Returns:
|
255 |
+
time_chunks: List of (start_t, end_t) tuples
|
256 |
+
"""
|
257 |
+
if chunk_len == 0:
|
258 |
+
return [(0, F)]
|
259 |
+
|
260 |
+
effective_stride = chunk_len - overlap_t
|
261 |
+
if effective_stride <= 0:
|
262 |
+
raise ValueError("chunk_len must be greater than overlap")
|
263 |
+
|
264 |
+
chunk_starts = list(range(0, F - overlap_t, effective_stride))
|
265 |
+
if chunk_starts[-1] + chunk_len < F:
|
266 |
+
chunk_starts.append(F - chunk_len)
|
267 |
+
|
268 |
+
time_chunks = []
|
269 |
+
for i, t_start in enumerate(chunk_starts):
|
270 |
+
t_end = min(t_start + chunk_len, F)
|
271 |
+
time_chunks.append((t_start, t_end))
|
272 |
+
|
273 |
+
if len(time_chunks) >= 2 and time_chunks[-1][1] - time_chunks[-1][0] < chunk_len:
|
274 |
+
last = time_chunks.pop()
|
275 |
+
prev_start, _ = time_chunks[-1]
|
276 |
+
time_chunks[-1] = (prev_start, last[1])
|
277 |
+
|
278 |
+
return time_chunks
|
279 |
+
|
280 |
+
|
281 |
+
def make_spatial_tiles(H, W, tile_size_hw, overlap_hw=(32, 32)):
|
282 |
+
"""
|
283 |
+
Args:
|
284 |
+
H, W: height and width of the frame
|
285 |
+
tile_size_hw: Tuple (tile_height, tile_width)
|
286 |
+
overlap_hw: Tuple (overlap_height, overlap_width)
|
287 |
+
Returns:
|
288 |
+
spatial_tiles: List of (start_h, end_h, start_w, end_w) tuples
|
289 |
+
"""
|
290 |
+
tile_height, tile_width = tile_size_hw
|
291 |
+
overlap_h, overlap_w = overlap_hw
|
292 |
+
|
293 |
+
if tile_height == 0 or tile_width == 0:
|
294 |
+
return [(0, H, 0, W)]
|
295 |
+
|
296 |
+
tile_stride_h = tile_height - overlap_h
|
297 |
+
tile_stride_w = tile_width - overlap_w
|
298 |
+
|
299 |
+
if tile_stride_h <= 0 or tile_stride_w <= 0:
|
300 |
+
raise ValueError("Tile size must be greater than overlap")
|
301 |
+
|
302 |
+
h_tiles = list(range(0, H - overlap_h, tile_stride_h))
|
303 |
+
if not h_tiles or h_tiles[-1] + tile_height < H:
|
304 |
+
h_tiles.append(H - tile_height)
|
305 |
+
|
306 |
+
# Merge last row if needed
|
307 |
+
if len(h_tiles) >= 2 and h_tiles[-1] + tile_height > H:
|
308 |
+
h_tiles.pop()
|
309 |
+
|
310 |
+
w_tiles = list(range(0, W - overlap_w, tile_stride_w))
|
311 |
+
if not w_tiles or w_tiles[-1] + tile_width < W:
|
312 |
+
w_tiles.append(W - tile_width)
|
313 |
+
|
314 |
+
# Merge last column if needed
|
315 |
+
if len(w_tiles) >= 2 and w_tiles[-1] + tile_width > W:
|
316 |
+
w_tiles.pop()
|
317 |
+
|
318 |
+
spatial_tiles = []
|
319 |
+
for h_start in h_tiles:
|
320 |
+
h_end = min(h_start + tile_height, H)
|
321 |
+
if h_end + tile_stride_h > H:
|
322 |
+
h_end = H
|
323 |
+
for w_start in w_tiles:
|
324 |
+
w_end = min(w_start + tile_width, W)
|
325 |
+
if w_end + tile_stride_w > W:
|
326 |
+
w_end = W
|
327 |
+
spatial_tiles.append((h_start, h_end, w_start, w_end))
|
328 |
+
return spatial_tiles
|
329 |
+
|
330 |
+
|
331 |
+
def get_valid_tile_region(t_start, t_end, h_start, h_end, w_start, w_end,
|
332 |
+
video_shape, overlap_t, overlap_h, overlap_w):
|
333 |
+
_, _, F, H, W = video_shape
|
334 |
+
|
335 |
+
t_len = t_end - t_start
|
336 |
+
h_len = h_end - h_start
|
337 |
+
w_len = w_end - w_start
|
338 |
+
|
339 |
+
valid_t_start = 0 if t_start == 0 else overlap_t // 2
|
340 |
+
valid_t_end = t_len if t_end == F else t_len - overlap_t // 2
|
341 |
+
valid_h_start = 0 if h_start == 0 else overlap_h // 2
|
342 |
+
valid_h_end = h_len if h_end == H else h_len - overlap_h // 2
|
343 |
+
valid_w_start = 0 if w_start == 0 else overlap_w // 2
|
344 |
+
valid_w_end = w_len if w_end == W else w_len - overlap_w // 2
|
345 |
+
|
346 |
+
out_t_start = t_start + valid_t_start
|
347 |
+
out_t_end = t_start + valid_t_end
|
348 |
+
out_h_start = h_start + valid_h_start
|
349 |
+
out_h_end = h_start + valid_h_end
|
350 |
+
out_w_start = w_start + valid_w_start
|
351 |
+
out_w_end = w_start + valid_w_end
|
352 |
+
|
353 |
+
return {
|
354 |
+
"valid_t_start": valid_t_start, "valid_t_end": valid_t_end,
|
355 |
+
"valid_h_start": valid_h_start, "valid_h_end": valid_h_end,
|
356 |
+
"valid_w_start": valid_w_start, "valid_w_end": valid_w_end,
|
357 |
+
"out_t_start": out_t_start, "out_t_end": out_t_end,
|
358 |
+
"out_h_start": out_h_start, "out_h_end": out_h_end,
|
359 |
+
"out_w_start": out_w_start, "out_w_end": out_w_end,
|
360 |
+
}
|
361 |
+
|
362 |
+
|
363 |
+
def prepare_rotary_positional_embeddings(
|
364 |
+
height: int,
|
365 |
+
width: int,
|
366 |
+
num_frames: int,
|
367 |
+
transformer_config: Dict,
|
368 |
+
vae_scale_factor_spatial: int,
|
369 |
+
device: torch.device,
|
370 |
+
) -> Tuple[torch.Tensor, torch.Tensor]:
|
371 |
+
|
372 |
+
grid_height = height // (vae_scale_factor_spatial * transformer_config.patch_size)
|
373 |
+
grid_width = width // (vae_scale_factor_spatial * transformer_config.patch_size)
|
374 |
+
|
375 |
+
if transformer_config.patch_size_t is None:
|
376 |
+
base_num_frames = num_frames
|
377 |
+
else:
|
378 |
+
base_num_frames = (
|
379 |
+
num_frames + transformer_config.patch_size_t - 1
|
380 |
+
) // transformer_config.patch_size_t
|
381 |
+
freqs_cos, freqs_sin = get_3d_rotary_pos_embed(
|
382 |
+
embed_dim=transformer_config.attention_head_dim,
|
383 |
+
crops_coords=None,
|
384 |
+
grid_size=(grid_height, grid_width),
|
385 |
+
temporal_size=base_num_frames,
|
386 |
+
grid_type="slice",
|
387 |
+
max_size=(grid_height, grid_width),
|
388 |
+
device=device,
|
389 |
+
)
|
390 |
+
|
391 |
+
return freqs_cos, freqs_sin
|
392 |
+
|
393 |
+
@no_grad
|
394 |
+
def process_video(
|
395 |
+
pipe: CogVideoXPipeline,
|
396 |
+
video: torch.Tensor,
|
397 |
+
prompt: str = '',
|
398 |
+
noise_step: int = 0,
|
399 |
+
sr_noise_step: int = 399,
|
400 |
+
):
|
401 |
+
# SR the video frames based on the prompt.
|
402 |
+
# `num_frames` is the Number of frames to generate.
|
403 |
+
|
404 |
+
# Decode video
|
405 |
+
video = video.to(pipe.vae.device, dtype=pipe.vae.dtype)
|
406 |
+
latent_dist = pipe.vae.encode(video).latent_dist
|
407 |
+
latent = latent_dist.sample() * pipe.vae.config.scaling_factor
|
408 |
+
|
409 |
+
patch_size_t = pipe.transformer.config.patch_size_t
|
410 |
+
if patch_size_t is not None:
|
411 |
+
ncopy = latent.shape[2] % patch_size_t
|
412 |
+
# Copy the first frame ncopy times to match patch_size_t
|
413 |
+
first_frame = latent[:, :, :1, :, :] # Get first frame [B, C, 1, H, W]
|
414 |
+
latent = torch.cat([first_frame.repeat(1, 1, ncopy, 1, 1), latent], dim=2)
|
415 |
+
|
416 |
+
assert latent.shape[2] % patch_size_t == 0
|
417 |
+
|
418 |
+
batch_size, num_channels, num_frames, height, width = latent.shape
|
419 |
+
|
420 |
+
# Get prompt embeddings
|
421 |
+
prompt_token_ids = pipe.tokenizer(
|
422 |
+
prompt,
|
423 |
+
padding="max_length",
|
424 |
+
max_length=pipe.transformer.config.max_text_seq_length,
|
425 |
+
truncation=True,
|
426 |
+
add_special_tokens=True,
|
427 |
+
return_tensors="pt",
|
428 |
+
)
|
429 |
+
prompt_token_ids = prompt_token_ids.input_ids
|
430 |
+
prompt_embedding = pipe.text_encoder(
|
431 |
+
prompt_token_ids.to(latent.device)
|
432 |
+
)[0]
|
433 |
+
_, seq_len, _ = prompt_embedding.shape
|
434 |
+
prompt_embedding = prompt_embedding.view(batch_size, seq_len, -1).to(dtype=latent.dtype)
|
435 |
+
|
436 |
+
latent = latent.permute(0, 2, 1, 3, 4)
|
437 |
+
|
438 |
+
# Add noise to latent (Select)
|
439 |
+
if noise_step != 0:
|
440 |
+
noise = torch.randn_like(latent)
|
441 |
+
add_timesteps = torch.full(
|
442 |
+
(batch_size,),
|
443 |
+
fill_value=noise_step,
|
444 |
+
dtype=torch.long,
|
445 |
+
device=latent.device,
|
446 |
+
)
|
447 |
+
latent = pipe.scheduler.add_noise(latent, noise, add_timesteps)
|
448 |
+
|
449 |
+
timesteps = torch.full(
|
450 |
+
(batch_size,),
|
451 |
+
fill_value=sr_noise_step,
|
452 |
+
dtype=torch.long,
|
453 |
+
device=latent.device,
|
454 |
+
)
|
455 |
+
|
456 |
+
# Prepare rotary embeds
|
457 |
+
vae_scale_factor_spatial = 2 ** (len(pipe.vae.config.block_out_channels) - 1)
|
458 |
+
transformer_config = pipe.transformer.config
|
459 |
+
rotary_emb = (
|
460 |
+
prepare_rotary_positional_embeddings(
|
461 |
+
height=height * vae_scale_factor_spatial,
|
462 |
+
width=width * vae_scale_factor_spatial,
|
463 |
+
num_frames=num_frames,
|
464 |
+
transformer_config=transformer_config,
|
465 |
+
vae_scale_factor_spatial=vae_scale_factor_spatial,
|
466 |
+
device=latent.device,
|
467 |
+
)
|
468 |
+
if pipe.transformer.config.use_rotary_positional_embeddings
|
469 |
+
else None
|
470 |
+
)
|
471 |
+
|
472 |
+
# Predict noise
|
473 |
+
predicted_noise = pipe.transformer(
|
474 |
+
hidden_states=latent,
|
475 |
+
encoder_hidden_states=prompt_embedding,
|
476 |
+
timestep=timesteps,
|
477 |
+
image_rotary_emb=rotary_emb,
|
478 |
+
return_dict=False,
|
479 |
+
)[0]
|
480 |
+
|
481 |
+
latent_generate = pipe.scheduler.get_velocity(
|
482 |
+
predicted_noise, latent, timesteps
|
483 |
+
)
|
484 |
+
|
485 |
+
# generate video
|
486 |
+
if patch_size_t is not None and ncopy > 0:
|
487 |
+
latent_generate = latent_generate[:, ncopy:, :, :, :]
|
488 |
+
|
489 |
+
# [B, C, F, H, W]
|
490 |
+
video_generate = pipe.decode_latents(latent_generate)
|
491 |
+
video_generate = (video_generate * 0.5 + 0.5).clamp(0.0, 1.0)
|
492 |
+
|
493 |
+
return video_generate
|
494 |
+
|
495 |
+
|
496 |
+
if __name__ == "__main__":
|
497 |
+
parser = argparse.ArgumentParser(description="VSR using DOVE")
|
498 |
+
|
499 |
+
parser.add_argument("--input_dir", type=str)
|
500 |
+
|
501 |
+
parser.add_argument("--input_json", type=str, default=None)
|
502 |
+
|
503 |
+
parser.add_argument("--gt_dir", type=str, default=None)
|
504 |
+
|
505 |
+
parser.add_argument("--eval_metrics", type=str, default='') # 'psnr,ssim,lpips,dists,clipiqa,musiq,maniqa,niqe'
|
506 |
+
|
507 |
+
parser.add_argument("--model_path", type=str)
|
508 |
+
|
509 |
+
parser.add_argument("--lora_path", type=str, default=None, help="The path of the LoRA weights to be used")
|
510 |
+
|
511 |
+
parser.add_argument("--output_path", type=str, default="./results", help="The path save generated video")
|
512 |
+
|
513 |
+
parser.add_argument("--fps", type=int, default=16, help="The frames per second for the generated video")
|
514 |
+
|
515 |
+
parser.add_argument("--dtype", type=str, default="bfloat16", help="The data type for computation")
|
516 |
+
|
517 |
+
parser.add_argument("--seed", type=int, default=42, help="The seed for reproducibility")
|
518 |
+
|
519 |
+
parser.add_argument("--upscale_mode", type=str, default="bilinear")
|
520 |
+
|
521 |
+
parser.add_argument("--upscale", type=int, default=4)
|
522 |
+
|
523 |
+
parser.add_argument("--noise_step", type=int, default=0)
|
524 |
+
|
525 |
+
parser.add_argument("--sr_noise_step", type=int, default=399)
|
526 |
+
|
527 |
+
parser.add_argument("--is_cpu_offload", action="store_true", help="Enable CPU offload for the model")
|
528 |
+
|
529 |
+
parser.add_argument("--is_vae_st", action="store_true", help="Enable VAE slicing and tiling")
|
530 |
+
|
531 |
+
parser.add_argument("--png_save", action="store_true", help="Save output as PNG sequence")
|
532 |
+
|
533 |
+
parser.add_argument("--save_format", type=str, default="yuv444p", help="Save output as PNG sequence")
|
534 |
+
|
535 |
+
# Crop and Tiling Parameters
|
536 |
+
parser.add_argument("--tile_size_hw", type=int, nargs=2, default=(0, 0), help="Tile size for spatial tiling (height, width)")
|
537 |
+
|
538 |
+
parser.add_argument("--overlap_hw", type=int, nargs=2, default=(32, 32))
|
539 |
+
|
540 |
+
parser.add_argument("--chunk_len", type=int, default=0, help="Chunk length for temporal chunking")
|
541 |
+
|
542 |
+
parser.add_argument("--overlap_t", type=int, default=8)
|
543 |
+
|
544 |
+
args = parser.parse_args()
|
545 |
+
|
546 |
+
if args.dtype == "float16":
|
547 |
+
dtype = torch.float16
|
548 |
+
elif args.dtype == "bfloat16":
|
549 |
+
dtype = torch.bfloat16
|
550 |
+
elif args.dtype == "float32":
|
551 |
+
dtype = torch.float32
|
552 |
+
else:
|
553 |
+
raise ValueError("Invalid dtype. Choose from 'float16', 'bfloat16', or 'float32'.")
|
554 |
+
|
555 |
+
if args.chunk_len > 0:
|
556 |
+
print(f"Chunking video into {args.chunk_len} frames with {args.overlap_t} overlap")
|
557 |
+
overlap_t = args.overlap_t
|
558 |
+
else:
|
559 |
+
overlap_t = 0
|
560 |
+
if args.tile_size_hw != (0, 0):
|
561 |
+
print(f"Tiling video into {args.tile_size_hw} frames with {args.overlap_hw} overlap")
|
562 |
+
overlap_hw = args.overlap_hw
|
563 |
+
else:
|
564 |
+
overlap_hw = (0, 0)
|
565 |
+
|
566 |
+
# Set seed
set_seed(args.seed)

if args.input_json is not None:
    with open(args.input_json, 'r') as f:
        video_prompt_dict = json.load(f)
else:
    video_prompt_dict = {}

# Get all video files from input directory
video_files = []
for ext in video_exts:
    video_files.extend(glob.glob(os.path.join(args.input_dir, f'*{ext}')))
video_files = sorted(video_files)  # Sort files for consistent ordering

if not video_files:
    raise ValueError(f"No video files found in {args.input_dir}")

os.makedirs(args.output_path, exist_ok=True)

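# Note: video_exts is defined earlier in the script; something like ['.mp4', '.mkv'] (an assumption
# here, not confirmed by this hunk) would make the glob above pick up both MP4 and MKV inputs, which
# matches the '.mkv' -> '.mp4' rename applied when saving results below.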
# 1. Load the pre-trained CogVideoX pipeline with the precision selected via --dtype (bfloat16 by default).
# To run on multiple GPUs, pass device_map="balanced" to from_pretrained() and remove the
# CPU-offload / .to("cuda") handling below.
pipe = CogVideoXPipeline.from_pretrained(args.model_path, torch_dtype=dtype)

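# A multi-GPU variant might look like the following (illustrative only; device_map support
# depends on the installed diffusers version):
#   pipe = CogVideoXPipeline.from_pretrained(args.model_path, torch_dtype=dtype, device_map="balanced")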
# If you are using LoRA weights, load and fuse them here.
if args.lora_path:
    print(f"Loading LoRA weights from {args.lora_path}")
    pipe.load_lora_weights(
        args.lora_path, weight_name="pytorch_lora_weights.safetensors", adapter_name="test_1"
    )
    pipe.fuse_lora(components=["transformer"], lora_scale=1.0)  # lora_scale = lora_alpha / rank

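# For intuition on lora_scale = lora_alpha / rank: a hypothetical adapter trained with rank=128 and
# lora_alpha=64 would be fused with lora_scale=0.5; the value 1.0 above corresponds to lora_alpha == rank.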
# 2. Set the scheduler.
# Can be changed to `CogVideoXDPMScheduler` or `CogVideoXDDIMScheduler`.
# We recommend `CogVideoXDDIMScheduler` for CogVideoX-2B and
# `CogVideoXDPMScheduler` for CogVideoX-5B / CogVideoX-5B-I2V.

# pipe.scheduler = CogVideoXDDIMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
pipe.scheduler = CogVideoXDPMScheduler.from_config(
    pipe.scheduler.config, timestep_spacing="trailing"
)

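# Background note (general diffusers behavior, not specific to this repo): timestep_spacing="trailing"
# spaces the inference timesteps from the end of the training schedule, so sampling starts from the
# final (highest-noise) training timestep instead of skipping it.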
# 3. Optionally enable CPU offload for the model.
# Turn this off if you have multiple GPUs or enough GPU memory (e.g. an H100): keeping the
# pipeline on the GPU via .to("cuda") makes inference faster.

if args.is_cpu_offload:
    # pipe.enable_model_cpu_offload()
    pipe.enable_sequential_cpu_offload()
else:
    pipe.to("cuda")

if args.is_vae_st:
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()

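# Rough trade-off (general diffusers behavior, not measured here): enable_sequential_cpu_offload()
# keeps only the currently executing sub-module on the GPU (lowest memory, slowest), whereas the
# commented-out enable_model_cpu_offload() moves whole components and is usually noticeably faster.
# VAE slicing/tiling mainly reduces peak memory during decoding at a small speed cost.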
# pipe.transformer.eval()
# torch.set_grad_enabled(False)

# 4. Set up the metrics.
if args.eval_metrics != '':
    metrics_list = [m.strip().lower() for m in args.eval_metrics.split(',')]
    metrics_models = {}
    for name in metrics_list:
        try:
            metrics_models[name] = pyiqa.create_metric(name).to(pipe.device).eval()
        except Exception as e:
            print(f"Failed to initialize metric '{name}': {e}")
    metric_accumulator = {name: [] for name in metrics_list}
else:
    metrics_models = None
    metric_accumulator = None

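# For reference, pyiqa metrics are callables on image tensors in [0, 1] with shape [N, C, H, W];
# a rough sketch (illustrative names only, the real scoring happens inside compute_metrics):
#   score = metrics_models['psnr'](pred, gt)   # full-reference metrics take prediction and ground truth
#   score = metrics_models['musiq'](pred)      # no-reference metrics take only the prediction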
for video_path in tqdm(video_files, desc="Processing videos"):
    video_name = os.path.basename(video_path)
    prompt = video_prompt_dict.get(video_name, "")
    if os.path.exists(video_path):
        # Read video
        # [F, C, H, W]
        video, pad_f, pad_h, pad_w, original_shape = preprocess_video_match(video_path, is_match=True)
        H_, W_ = video.shape[2], video.shape[3]
        video = torch.nn.functional.interpolate(video, size=(H_ * args.upscale, W_ * args.upscale), mode=args.upscale_mode, align_corners=False)
        __frame_transform = transforms.Compose(
            [transforms.Lambda(lambda x: x / 255.0 * 2.0 - 1.0)]  # -1, 1
        )
        video = torch.stack([__frame_transform(f) for f in video], dim=0)
        video = video.unsqueeze(0)
        # [B, C, F, H, W]
        video = video.permute(0, 2, 1, 3, 4).contiguous()

        _B, _C, _F, _H, _W = video.shape
        time_chunks = make_temporal_chunks(_F, args.chunk_len, overlap_t)
        spatial_tiles = make_spatial_tiles(_H, _W, args.tile_size_hw, overlap_hw)

        output_video = torch.zeros_like(video)
        write_count = torch.zeros_like(video, dtype=torch.int)

        print(f"Process video: {video_name} | Prompt: {prompt} | Frame: {_F} (ori: {original_shape[0]}; pad: {pad_f}) | Target Resolution: {_H}, {_W} (ori: {original_shape[1]*args.upscale}, {original_shape[2]*args.upscale}; pad: {pad_h}, {pad_w}) | Chunk Num: {len(time_chunks)*len(spatial_tiles)}")

        for t_start, t_end in time_chunks:
            for h_start, h_end, w_start, w_end in spatial_tiles:
                video_chunk = video[:, :, t_start:t_end, h_start:h_end, w_start:w_end]
                # print(f"video_chunk: {video_chunk.shape} | t: {t_start}:{t_end} | h: {h_start}:{h_end} | w: {w_start}:{w_end}")

                # [B, C, F, H, W]
                _video_generate = process_video(
                    pipe=pipe,
                    video=video_chunk,
                    prompt=prompt,
                    noise_step=args.noise_step,
                    sr_noise_step=args.sr_noise_step,
                )

                region = get_valid_tile_region(
                    t_start, t_end, h_start, h_end, w_start, w_end,
                    video_shape=video.shape,
                    overlap_t=overlap_t,
                    overlap_h=overlap_hw[0],
                    overlap_w=overlap_hw[1],
                )
                output_video[:, :, region["out_t_start"]:region["out_t_end"],
                             region["out_h_start"]:region["out_h_end"],
                             region["out_w_start"]:region["out_w_end"]] = \
                    _video_generate[:, :, region["valid_t_start"]:region["valid_t_end"],
                                    region["valid_h_start"]:region["valid_h_end"],
                                    region["valid_w_start"]:region["valid_w_end"]]
                write_count[:, :, region["out_t_start"]:region["out_t_end"],
                            region["out_h_start"]:region["out_h_end"],
                            region["out_w_start"]:region["out_w_end"]] += 1

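        # Intuition for the stitching above (illustrative numbers, not taken from the helper
        # implementations): with overlapping tiles, each output location must come from exactly one
        # tile's "valid" interior region, e.g. a 32-pixel overlap could be split so the earlier tile
        # contributes its first half and the later tile the rest. The write_count checks below verify
        # that every location is written exactly once.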
        video_generate = output_video

        if (write_count == 0).any():
            print("Error: some output regions were never written!")
            exit()
        if (write_count > 1).any():
            print("Error: overlapping writes detected (write count > 1)!")
            exit()

        # Padding was applied to the low-resolution input, so it is args.upscale times larger here.
        video_generate = remove_padding_and_extra_frames(video_generate, pad_f, pad_h * args.upscale, pad_w * args.upscale)
        file_name = os.path.basename(video_path)
        output_path = os.path.join(args.output_path, file_name)

        if metrics_models is not None:
            # [1, C, F, H, W] -> [F, C, H, W]
            pred_frames = video_generate[0]
            pred_frames = pred_frames.permute(1, 0, 2, 3).contiguous()
            if args.gt_dir is not None:
                gt_frames = load_sequence(os.path.join(args.gt_dir, file_name))
            else:
                gt_frames = None
            compute_metrics(pred_frames, gt_frames, metrics_models, metric_accumulator, file_name)

        if args.png_save:
            # Save as PNG sequence
            output_dir = output_path.rsplit('.', 1)[0]  # Remove extension
            save_frames_as_png(video_generate, output_dir, fps=args.fps)
        else:
            output_path = output_path.replace('.mkv', '.mp4')
            save_video_with_imageio(video_generate, output_path, fps=args.fps, format=args.save_format)
    else:
        print(f"Warning: {video_name} not found in {args.input_dir}")

if metrics_models is not None:
    print("\n=== Overall Average Metrics ===")
    count = len(next(iter(metric_accumulator.values())))
    overall_avg = {metric: 0 for metric in metrics_list}
    out_name = 'metrics_'
    for metric in metrics_list:
        out_name += f"{metric}_"
        scores = metric_accumulator[metric]
        if scores:
            avg = sum(scores) / len(scores)
            overall_avg[metric] = avg
            print(f"{metric.upper()}: {avg:.4f}")

    out_name = out_name.rstrip('_') + '.json'
    out_path = os.path.join(args.output_path, out_name)
    output = {
        "per_sample": metric_accumulator,
        "average": overall_avg,
        "count": count
    }
    with open(out_path, 'w') as f:
        json.dump(output, f, indent=2)

print("All videos processed.")
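# Example of the resulting file name (derived from the loop above): running with
# --eval_metrics psnr,ssim writes a summary to <output_path>/metrics_psnr_ssim.json with
# "per_sample", "average", and "count" fields.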
pretrained_models/README.md
ADDED
@@ -0,0 +1 @@
Place pretrained models here.
requirements.txt
ADDED
@@ -0,0 +1,20 @@
accelerate>=1.1.1
transformers>=4.46.2
numpy==1.26.0
torch>=2.5.0
torchvision>=0.20.0
sentencepiece>=0.2.0
SwissArmyTransformer>=0.4.12
gradio>=5.5.0
imageio>=2.35.1
imageio-ffmpeg>=0.5.1
openai>=1.54.0
moviepy>=2.0.0
scikit-video>=1.1.11
pydantic>=2.10.3
wandb
peft
opencv-python
decord
av
torchdiffeq
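
The usual setup is `pip install -r requirements.txt` in a fresh environment. Note that the inference script above also uses `diffusers` (`CogVideoXPipeline`, `CogVideoXDPMScheduler`) and `pyiqa` for the optional metrics, which are not pinned in this list and presumably need to be installed separately (an assumption based on the calls in the script, not stated elsewhere in this commit).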