	---
	license: mit
	datasets:
	- CodeGoat24/HPD
	- CodeGoat24/LiFT-HRA
	- CodeGoat24/OIP
	- CodeGoat24/EvalMuse
	- CodeGoat24/ShareGPTVideo-DPO
	- CodeGoat24/VideoFeedback
	- CodeGoat24/LLaVA-Critic-113k
	- CodeGoat24/VideoDPO
	base_model:
	- Qwen/Qwen2.5-VL-32B-Instruct
	---


	# UnifiedReward-qwen-32B
	We are actively gathering feedback from the community to improve our models. We welcome your input and encourage you to follow our repository for updates.

	## Model Summary

	`UnifiedReward-qwen-32b` is the first unified reward model based on [Qwen/Qwen2.5-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct) for assessing both multimodal understanding and generation. It supports pairwise ranking and pointwise scoring, and can be used for preference alignment of vision models.

	For further details, please refer to the following resources:
	- 📰 Paper: https://arxiv.org/pdf/2503.05236
	- 🪐 Project Page: https://codegoat24.github.io/UnifiedReward/
	- 🤗 Model Collections: https://huggingface.co/collections/CodeGoat24/unifiedreward-models-67c3008148c3a380d15ac63a
	- 🤗 Dataset Collections: https://huggingface.co/collections/CodeGoat24/unifiedreward-training-data-67c300d4fd5eff00fa7f1ede
	- 👋 Point of Contact: [Yibin Wang](https://codegoat24.github.io)


	## 🏁 Compared with Current Reward Models

	| Reward Model | Method | Image Generation | Image Understanding | Video Generation | Video Understanding |
	| :-----: | :-----: | :-----: | :-----: | :-----: | :-----: |
	| [PickScore](https://github.com/yuvalkirstain/PickScore) | Point | √ | | | |
	| [HPS](https://github.com/tgxs002/HPSv2) | Point | √ | | | |
	| [ImageReward](https://github.com/THUDM/ImageReward) | Point | √ | | | |
	| [LLaVA-Critic](https://huggingface.co/lmms-lab/llava-critic-7b) | Pair/Point | | √ | | |
	| [IXC-2.5-Reward](https://github.com/InternLM/InternLM-XComposer) | Pair/Point | | √ | | √ |
	| [VideoScore](https://github.com/TIGER-AI-Lab/VideoScore) | Point | | | √ | |
	| [LiFT](https://github.com/CodeGoat24/LiFT) | Point | | | √ | |
	| [VisionReward](https://github.com/THUDM/VisionReward) | Point | √ | | √ | |
	| [VideoReward](https://github.com/KwaiVGI/VideoAlign) | Point | | | √ | |
	| UnifiedReward (Ours) | Pair/Point | √ | √ | √ | √ |


	## Quick Start
	Inference code for both pairwise ranking and pointwise scoring is available in our [GitHub repository](https://github.com/CodeGoat24/UnifiedReward).

	The example below uses image understanding assessment:
	~~~python
	import warnings

	import requests
	import torch
	from PIL import Image
	from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
	from qwen_vl_utils import process_vision_info

	warnings.filterwarnings("ignore")

	# Load the reward model and its processor.
	model_path = "CodeGoat24/UnifiedReward-qwen-32b"
	model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
	    model_path, torch_dtype="auto", device_map="auto"
	)
	processor = AutoProcessor.from_pretrained(model_path)

	# Download the example image.
	url = "https://github.com/LLaVA-VL/blog/blob/main/2024-10-03-llava-critic/static/images/critic_img_seven.png?raw=True"
	image = Image.open(requests.get(url, stream=True).raw)

	# Pairwise ranking prompt: one question and the two candidate responses to compare.
	prompt_text = 'Given an image and a corresponding question, please serve as an unbiased and fair judge to evaluate the quality of the answers provided by a Large Multimodal Model (LMM). Determine which answer is better and explain your reasoning with specific details. Your task is provided as follows:\nQuestion: [What this image presents?]\nThe first response: [The image is a black and white sketch of a line that appears to be in the shape of a cross. The line is a simple and straightforward representation of the cross shape, with two straight lines intersecting at a point.]\nThe second response: [This is a handwritten number seven.]\nASSISTANT:\n'

	messages = [
	    {
	        "role": "user",
	        "content": [
	            {"type": "image", "image": image},
	            {"type": "text", "text": prompt_text},
	        ],
	    }
	]

	# Render the chat template and collect the vision inputs.
	chat_input = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	image_inputs, video_inputs = process_vision_info(messages)

	inputs = processor(
	    text=[chat_input],
	    images=image_inputs,
	    videos=video_inputs,
	    return_tensors="pt",
	    padding=True,
	).to(model.device)

	with torch.no_grad():
	    generated_ids = model.generate(**inputs, max_new_tokens=4096)

	# Drop the prompt tokens so only the newly generated judgment is decoded.
	generated_trimmed = [
	    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
	]
	output = processor.batch_decode(generated_trimmed, skip_special_tokens=True)[0]

	print(output)
	~~~
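
To compare other question and answer pairs, the prompt used above can be filled in programmatically instead of edited by hand. The following is a minimal sketch: the wording of the template is copied from the example above, while the names `PAIR_TEMPLATE` and `build_pair_prompt` are hypothetical helpers introduced here for illustration and are not part of the UnifiedReward codebase.

```python
# Hypothetical helper (not part of the UnifiedReward repository): fills the
# pairwise-ranking prompt shown in the example above with custom inputs.
PAIR_TEMPLATE = (
    "Given an image and a corresponding question, please serve as an unbiased "
    "and fair judge to evaluate the quality of the answers provided by a Large "
    "Multimodal Model (LMM). Determine which answer is better and explain your "
    "reasoning with specific details. Your task is provided as follows:\n"
    "Question: [{question}]\n"
    "The first response: [{first}]\n"
    "The second response: [{second}]\n"
    "ASSISTANT:\n"
)


def build_pair_prompt(question: str, first: str, second: str) -> str:
    """Return the pairwise-ranking prompt for one question and two responses."""
    return PAIR_TEMPLATE.format(question=question, first=first, second=second)


example_prompt = build_pair_prompt(
    "What this image presents?",
    "The image is a black and white sketch of a cross.",
    "This is a handwritten number seven.",
)
print(example_prompt)
```

The returned string can be used as the `text` entry of the user message in place of the hard-coded `prompt_text`.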


	## Citation

	```
	@article{UnifiedReward,
	  title={Unified Reward Model for Multimodal Understanding and Generation},
	  author={Wang, Yibin and Zang, Yuhang and Li, Hao and Jin, Cheng and Wang, Jiaqi},
	  journal={arXiv preprint arXiv:2503.05236},
	  year={2025}
	}
	```