liujie
first commit
4817c60
from datetime import datetime
import pytz
ABOUT_TEXT = """
## Overview
The prompt-video pairs are sourced from [VideoGen-Eval](https://ailab-cvc.github.io/VideoGen-Eval/), a dataset featuring a diverse range of prompts and videos generated by state-of-the-art video diffusion models (VDMs). Our benchmark comprises 26.5k video pairs, each annotated with a corresponding preference label.
<img src="https://i.postimg.cc/J7XhVLTh/image.png" alt="Video Duration and Resolution in VideoGen-RewardBench" style="width:400px;">
We report two accuracy metrics: ties-included accuracy **(w/ Ties)** and ties-excluded accuracy **(w/o Ties)**.
- For ties-excluded accuracy, we exclude all data labeled as ”ties” and use only data labeled as ”A wins” or ”B wins” for calculation. We compute the rewards for each prompt-video pair, convert the relative reward relationships into binary labels, and calculate classification accuracy.
- For ties-included accuracy, we adopt Algorithm 1 proposed by [Ties Matter](https://arxiv.org/pdf/2305.14324). This method traverses all possible tie thresholds, calculates three-class accuracy for each threshold, and selects the highest accuracy as the final metric. See [calc_accuracy](https://github.com/KwaiVGI/VideoAlign/blob/main/calc_accuracy.py#L22) for the implementation of ties-included accuracy.
We include multiple types of reward models in this evaluation:
1. **Sequence Classifiers** (Seq. Classifier): A model that takes in a prompt and a video and outputs a score.
2. **Custom Classifiers**: Research models with different architectures and training objectives.
3. **Random**: Random choice baseline.
4. **Generative**: Prompting fine-tuned models to choose between two answers.
Note: Models with (*) after the Model are independently submitted model scores which have not been verified by the VideoGen-RewardBench team.
## Acknowledgments
Our leaderboard is built on [RewardBench](https://huggingface.co/spaces/allenai/reward-bench). The prompt-video pairs are sourced from [VideoGen-Eval](https://ailab-cvc.github.io/VideoGen-Eval/). We sincerely thank all the contributors!
"""
# Get Pacific time zone (handles PST/PDT automatically)
pacific_tz = pytz.timezone('America/Los_Angeles')
current_time = datetime.now(pacific_tz).strftime("%H:%M %Z, %d %b %Y")
TOP_TEXT = f"""# VideoGen-RewardBench: Evaluating Reward Models for Video Generation
### Evaluating the capabilities of reward models for video generation.
[Code](https://github.com/KwaiVGI/VideoAlign) | [Project](https://gongyeliu.github.io/videoalign/) | [Eval. Dataset](https://huggingface.co/datasets/KwaiVGI/VideoGen-RewardBench) | [Paper](https://arxiv.org/abs/2501.13918) | Total models: {{}} | * Unverified models | ⚠️ Dataset Contamination | Last restart (PST): {current_time}
"""
SUBMIT_TEXT = r"""
## How to Submit Your Results on VideoGen-RewardBench
Please follow the steps below to submit your reward model's results:
### Step 1: Create an Issue
Open an issue in the [VideoAlign GitHub repository](https://github.com/KwaiVGI/VideoAlign/issues).
### Step 2: Calculate Accuracy Metrics
Use our provided scripts to compute your model's accuracy:
- **Ties-Included Accuracy (w/ Ties):** Use [calc_accuracy_with_ties](https://github.com/KwaiVGI/VideoAlign/blob/main/calc_accuracy.py#L22C5-L22C28)
- **Ties-Excluded Accuracy (w/o Ties):** Use [calc_accuracy_without_ties](https://github.com/KwaiVGI/VideoAlign/blob/main/calc_accuracy.py#L87)
### Step 3: Provide Your Results in the Issue
Within the issue, include your reward model's results in JSON format. For example:
```json
{
"with_tie": {
"overall": 61.26,
"vq": 59.68,
"mq": 66.03,
"ta": 53.80
},
"without_tie": {
"overall": 73.59,
"vq": 75.66,
"mq": 74.70,
"ta": 72.20
},
"model": "VideoReward",
"model_link": "https://huggingface.co/KwaiVGI/VideoReward",
"model_type": "Seq. Classifiers"
}
```
Additionally, please include any relevant information about your model (e.g., a brief description, methodology, etc.).
### Step 4: Review and Leaderboard Update
We will review your issue promptly and update the leaderboard accordingly.
"""