from datetime import datetime

import pytz

ABOUT_TEXT = """
## Overview
The prompt-video pairs are sourced from [VideoGen-Eval](https://ailab-cvc.github.io/VideoGen-Eval/), a dataset featuring a diverse range of prompts and videos generated by state-of-the-art video diffusion models (VDMs). Our benchmark comprises 26.5k video pairs, each annotated with a corresponding preference label.

*Video Duration and Resolution in VideoGen-RewardBench*

We report two accuracy metrics: ties-included accuracy **(w/ Ties)** and ties-excluded accuracy **(w/o Ties)**.
- For ties-excluded accuracy, we exclude all data labeled as "ties" and use only data labeled as "A wins" or "B wins" for the calculation. We compute the rewards for each prompt-video pair, convert the relative reward relationship into a binary prediction, and calculate classification accuracy.
- For ties-included accuracy, we adopt Algorithm 1 proposed by [Ties Matter](https://arxiv.org/pdf/2305.14324). This method traverses all possible tie thresholds, calculates three-class accuracy at each threshold, and reports the highest accuracy as the final metric.

See [calc_accuracy](https://github.com/KwaiVGI/VideoAlign/blob/main/calc_accuracy.py#L22) for the implementation of ties-included accuracy.

We include multiple types of reward models in this evaluation:
1. **Sequence Classifiers** (Seq. Classifier): A model that takes in a prompt and a video and outputs a score.
2. **Custom Classifiers**: Research models with different architectures and training objectives.
3. **Random**: Random choice baseline.
4. **Generative**: Prompting fine-tuned models to choose between the two videos.

Note: Models marked with (*) after the model name are independently submitted scores that have not been verified by the VideoGen-RewardBench team.

## Acknowledgments
Our leaderboard is built on [RewardBench](https://huggingface.co/spaces/allenai/reward-bench). The prompt-video pairs are sourced from [VideoGen-Eval](https://ailab-cvc.github.io/VideoGen-Eval/). We sincerely thank all the contributors!
"""

# Get Pacific time zone (handles PST/PDT automatically)
pacific_tz = pytz.timezone('America/Los_Angeles')
current_time = datetime.now(pacific_tz).strftime("%H:%M %Z, %d %b %Y")

TOP_TEXT = f"""# VideoGen-RewardBench: Evaluating Reward Models for Video Generation
### Evaluating the capabilities of reward models for video generation.
[Code](https://github.com/KwaiVGI/VideoAlign) | [Project](https://gongyeliu.github.io/videoalign/) | [Eval. Dataset](https://huggingface.co/datasets/KwaiVGI/VideoGen-RewardBench) | [Paper](https://arxiv.org/abs/2501.13918) | Total models: {{}} | * Unverified models | ⚠️ Dataset Contamination | Last restart (PST): {current_time}
"""

SUBMIT_TEXT = r"""
## How to Submit Your Results on VideoGen-RewardBench
Please follow the steps below to submit your reward model's results:

### Step 1: Create an Issue
Open an issue in the [VideoAlign GitHub repository](https://github.com/KwaiVGI/VideoAlign/issues).

### Step 2: Calculate Accuracy Metrics
Use our provided scripts to compute your model's accuracy (an illustrative sketch of both metrics is shown after this list):
- **Ties-Included Accuracy (w/ Ties):** Use [calc_accuracy_with_ties](https://github.com/KwaiVGI/VideoAlign/blob/main/calc_accuracy.py#L22C5-L22C28)
- **Ties-Excluded Accuracy (w/o Ties):** Use [calc_accuracy_without_ties](https://github.com/KwaiVGI/VideoAlign/blob/main/calc_accuracy.py#L87)
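
If you want to sanity-check your numbers before submitting, the following is a minimal, self-contained sketch of the two metrics as described in the Overview. It is illustrative only, not the reference implementation (use the scripts linked above for official numbers), and the variable names are placeholders.

```python
import numpy as np

def accuracy_without_ties(reward_a, reward_b, labels):
    # Ties-excluded accuracy: drop pairs labeled "ties", then check whether
    # the sign of (reward_a - reward_b) matches the human preference.
    reward_a, reward_b, labels = map(np.asarray, (reward_a, reward_b, labels))
    mask = labels != "ties"
    pred_a_wins = reward_a[mask] > reward_b[mask]
    true_a_wins = labels[mask] == "A wins"
    return float(np.mean(pred_a_wins == true_a_wins))

def accuracy_with_ties(reward_a, reward_b, labels):
    # Ties-included accuracy (Algorithm 1 in "Ties Matter"): sweep every
    # candidate tie threshold on |reward_a - reward_b|, compute three-class
    # accuracy at each threshold, and report the best value.
    reward_a, reward_b, labels = map(np.asarray, (reward_a, reward_b, labels))
    diff = reward_a - reward_b
    best = 0.0
    for tau in np.concatenate(([0.0], np.abs(diff))):
        pred = np.where(np.abs(diff) <= tau, "ties",
                        np.where(diff > 0, "A wins", "B wins"))
        best = max(best, float(np.mean(pred == labels)))
    return best
```

Sweeping the threshold, rather than fixing one, makes the ties-included metric insensitive to the absolute scale of a model's rewards.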

### Step 3: Provide Your Results in the Issue
Within the issue, include your reward model's results in JSON format. For example:
```json
{
    "with_tie": {
        "overall": 61.26,
        "vq": 59.68,
        "mq": 66.03,
        "ta": 53.80
    },
    "without_tie": {
        "overall": 73.59,
        "vq": 75.66,
        "mq": 74.70,
        "ta": 72.20
    },
    "model": "VideoReward",
    "model_link": "https://huggingface.co/KwaiVGI/VideoReward",
    "model_type": "Seq. Classifiers"
}
```
Additionally, please include any relevant information about your model (e.g., a brief description, methodology, etc.).

### Step 4: Review and Leaderboard Update
We will review your issue promptly and update the leaderboard accordingly.
"""