Spaces:

KwaiVGI
/

VideoGen-RewardBench

Running

liujie

first commit

4817c60 5 months ago

4.23 kB

	from datetime import datetime
	import pytz

	ABOUT_TEXT = """
	## Overview

	The prompt-video pairs are sourced from [VideoGen-Eval](https://ailab-cvc.github.io/VideoGen-Eval/), a dataset featuring a diverse range of prompts and videos generated by state-of-the-art video diffusion models (VDMs). Our benchmark comprises 26.5k video pairs, each annotated with a corresponding preference label.
	<img src="https://i.postimg.cc/J7XhVLTh/image.png" alt="Video Duration and Resolution in VideoGen-RewardBench" style="width:400px;">


	We report two accuracy metrics: ties-included accuracy (w/ Ties) and ties-excluded accuracy (w/o Ties).

	- For ties-excluded accuracy, we exclude all data labeled as ”ties” and use only data labeled as ”A wins” or ”B wins” for calculation. We compute the rewards for each prompt-video pair, convert the relative reward relationships into binary labels, and calculate classification accuracy.
	- For ties-included accuracy, we adopt Algorithm 1 proposed by [Ties Matter](https://arxiv.org/pdf/2305.14324). This method traverses all possible tie thresholds, calculates three-class accuracy for each threshold, and selects the highest accuracy as the final metric. See [calc_accuracy](https://github.com/KwaiVGI/VideoAlign/blob/main/calc_accuracy.py#L22) for the implementation of ties-included accuracy.

	We include multiple types of reward models in this evaluation:
	1. Sequence Classifiers (Seq. Classifier): A model that takes in a prompt and a video and outputs a score.
	2. Custom Classifiers: Research models with different architectures and training objectives.
	3. Random: Random choice baseline.
	4. Generative: Prompting fine-tuned models to choose between two answers.

	Note: Models with (*) after the Model are independently submitted model scores which have not been verified by the VideoGen-RewardBench team.

	## Acknowledgments

	Our leaderboard is built on [RewardBench](https://huggingface.co/spaces/allenai/reward-bench). The prompt-video pairs are sourced from [VideoGen-Eval](https://ailab-cvc.github.io/VideoGen-Eval/). We sincerely thank all the contributors!
	"""

	# Get Pacific time zone (handles PST/PDT automatically)
	pacific_tz = pytz.timezone('America/Los_Angeles')
	current_time = datetime.now(pacific_tz).strftime("%H:%M %Z, %d %b %Y")

	TOP_TEXT = f"""# VideoGen-RewardBench: Evaluating Reward Models for Video Generation
	### Evaluating the capabilities of reward models for video generation.
	[Code](https://github.com/KwaiVGI/VideoAlign) \| [Project](https://gongyeliu.github.io/videoalign/) \| [Eval. Dataset](https://huggingface.co/datasets/KwaiVGI/VideoGen-RewardBench) \| [Paper](https://arxiv.org/abs/2501.13918) \| Total models: {{}} \| * Unverified models \| ⚠️ Dataset Contamination \| Last restart (PST): {current_time}
	"""

	SUBMIT_TEXT = r"""
	## How to Submit Your Results on VideoGen-RewardBench

	Please follow the steps below to submit your reward model's results:

	### Step 1: Create an Issue
	Open an issue in the [VideoAlign GitHub repository](https://github.com/KwaiVGI/VideoAlign/issues).

	### Step 2: Calculate Accuracy Metrics
	Use our provided scripts to compute your model's accuracy:
	- Ties-Included Accuracy (w/ Ties): Use [calc_accuracy_with_ties](https://github.com/KwaiVGI/VideoAlign/blob/main/calc_accuracy.py#L22C5-L22C28)
	- Ties-Excluded Accuracy (w/o Ties): Use [calc_accuracy_without_ties](https://github.com/KwaiVGI/VideoAlign/blob/main/calc_accuracy.py#L87)

	### Step 3: Provide Your Results in the Issue
	Within the issue, include your reward model's results in JSON format. For example:

	```json
	{
	"with_tie": {
	"overall": 61.26,
	"vq": 59.68,
	"mq": 66.03,
	"ta": 53.80
	},
	"without_tie": {
	"overall": 73.59,
	"vq": 75.66,
	"mq": 74.70,
	"ta": 72.20
	},
	"model": "VideoReward",
	"model_link": "https://huggingface.co/KwaiVGI/VideoReward",
	"model_type": "Seq. Classifiers"
	}
	```

	Additionally, please include any relevant information about your model (e.g., a brief description, methodology, etc.).

	### Step 4: Review and Leaderboard Update
	We will review your issue promptly and update the leaderboard accordingly.

	"""