alibaba-pai
/

EasyAnimateV5-Reward-LoRAs

Model card Files Files and versions Community

EasyAnimateV5-Reward-LoRAs / README.md

hkunzhe

upload README.md

c296301 verified 3 months ago

preview code

raw

history blame contribute delete

13.5 kB

	# EasyAnimateV5-12b-zh-InP-Reward-LoRAs
	## Introduction
	We explore the Reward Backpropagation technique <sup>[1](#ref1) [2](#ref2)</sup> to optimized the generated videos by [EasyAnimateV5](https://github.com/aigc-apps/EasyAnimate/tree/main/easyanimate) for better alignment with human preferences.
	We provide pre-trained models (i.e. LoRAs) along with the training script. You can use these LoRAs to enhance the corresponding base model as a plug-in or train your own reward LoRA.

	For more details, please refer to our [GitHub repo](https://github.com/aigc-apps/EasyAnimate).

	\| Name \| Base Model \| Reward Model \| Hugging Face \| Description \|
	\|--\|--\|--\|--\|--\|
	\| EasyAnimateV5-12b-zh-InP-HPS2.1.safetensors \| EasyAnimateV5-12b-zh-InP \| [HPS v2.1](https://github.com/tgxs002/HPSv2) \| [🤗Link](https://huggingface.co/alibaba-pai/EasyAnimateV5-Reward-LoRAs/resolve/main/EasyAnimateV5-12b-zh-InP-HPS2.1.safetensors) \| Official HPS v2.1 reward LoRA (`rank=128` and `network_alpha=64`) for EasyAnimateV5-12b-zh-InP. It is trained with a batch size of 8 for 2,500 steps.\|
	\| EasyAnimateV5-7b-zh-InP-HPS2.1.safetensors \| EasyAnimateV5-7b-zh-InP \| [HPS v2.1](https://github.com/tgxs002/HPSv2) \| [🤗Link](https://huggingface.co/alibaba-pai/EasyAnimateV5-Reward-LoRAs/resolve/main/EasyAnimateV5-7b-zh-InP-HPS2.1.safetensors) \| Official HPS v2.1 reward LoRA (`rank=128` and `network_alpha=64`) for EasyAnimateV5-7b-zh-InP. It is trained with a batch size of 8 for 3,500 steps.\|
	\| EasyAnimateV5-12b-zh-InP-MPS.safetensors \| EasyAnimateV5-12b-zh-InP \| [MPS](https://github.com/Kwai-Kolors/MPS) \| [🤗Link](https://huggingface.co/alibaba-pai/EasyAnimateV5-Reward-LoRAs/resolve/main/EasyAnimateV5-12b-zh-InP-MPS.safetensors) \| Official MPS reward LoRA (`rank=128` and `network_alpha=64`) for EasyAnimateV5-12b-zh-InP. It is trained with a batch size of 8 for 2,500 steps.\|
	\| EasyAnimateV5-7b-zh-InP-MPS.safetensors \| EasyAnimateV5-7b-zh-InP \| [MPS](https://github.com/Kwai-Kolors/MPS) \| [🤗Link](https://huggingface.co/alibaba-pai/EasyAnimateV5-Reward-LoRAs/resolve/main/EasyAnimateV5-7b-zh-InP-MPS.safetensors) \| Official MPS reward LoRA (`rank=128` and `network_alpha=64`) for EasyAnimateV5-7b-zh-InP. It is trained with a batch size of 8 for 2,000 steps.\|

	## Demo
	### EasyAnimateV5-12b-zh-InP

	<table border="0" style="width: 100%; text-align: center; margin-top: 20px;">
	<thead>
	<tr>
	<th style="text-align: center;" width="10%">Prompt</sup></th>
	<th style="text-align: center;" width="30%">EasyAnimateV5-12b-zh-InP</th>
	<th style="text-align: center;" width="30%">EasyAnimateV5-12b-zh-InP <br> HPSv2.1 Reward LoRA</th>
	<th style="text-align: center;" width="30%">EasyAnimateV5-12b-zh-InP <br> MPS Reward LoRA</th>
	</tr>
	</thead>
	<tr>
	<td>
	Porcelain rabbit hopping by a golden cactus
	</td>
	<td>
	<video src="https://github.com/user-attachments/assets/c7ee83b2-0329-4853-b47d-e8e1550f1164" width="100%" controls autoplay loop></video>
	</td>
	<td>
	<video src="https://github.com/user-attachments/assets/1fea5b95-05dd-44cf-aec2-5c104e3afa8d" width="100%" controls autoplay loop></video>
	</td>
	<td>
	<video src="https://github.com/user-attachments/assets/de14593a-daae-4a3e-8231-7df2108065d5" width="100%" controls autoplay loop></video>
	</td>
	</tr>
	<tr>
	<td>
	Yellow rubber duck floating next to a blue bath towel
	</td>
	<td>
	<video src="https://github.com/user-attachments/assets/c146fe30-ddcc-4e26-8659-885efd48136f" width="100%" controls autoplay loop></video>
	</td>
	<td>
	<video src="https://github.com/user-attachments/assets/bd4a0a5c-cfe0-4a04-835b-1a3613926a6d" width="100%" controls autoplay loop></video>
	</td>
	<td>
	<video src="https://github.com/user-attachments/assets/f5076984-9661-4670-9ca5-abc33b7d66c0" width="100%" controls autoplay loop></video>
	</td>
	</tr>
	<tr>
	<td>
	An elephant sprays water with its trunk, a lion sitting nearby
	</td>
	<td>
	<video src="https://github.com/user-attachments/assets/139bc722-d8bb-42cb-b043-99334f320496" width="100%" controls autoplay loop></video>
	</td>
	<td>
	<video src="https://github.com/user-attachments/assets/87edf580-f1f3-4be2-931e-e53306ca9087" width="100%" controls autoplay loop></video>
	</td>
	<td>
	<video src="https://github.com/user-attachments/assets/a38581c2-f4b3-4905-93af-debb3aec6488" width="100%" controls autoplay loop></video>
	</td>
	</tr>
	<tr>
	<td>
	A fish swims gracefully in a tank as a horse gallops outside
	</td>
	<td>
	<video src="https://github.com/user-attachments/assets/0383cdd5-1d9c-4b62-bde9-7a0423c8f863" width="100%" controls autoplay loop></video>
	</td>
	<td>
	<video src="https://github.com/user-attachments/assets/efaee3eb-c361-4167-8952-92853a13df24" width="100%" controls autoplay loop></video>
	</td>
	<td>
	<video src="https://github.com/user-attachments/assets/4cd406e3-8348-4589-8c07-43379547e1e1" width="100%" controls autoplay loop></video>
	</td>
	</tr>
	</table>

	### EasyAnimateV5-7b-zh-InP

	<table border="0" style="width: 100%; text-align: center; margin-top: 20px;">
	<thead>
	<tr>
	<th style="text-align: center;" width="10%">Prompt</th>
	<th style="text-align: center;" width="30%">EasyAnimateV5-7b-zh-InP</th>
	<th style="text-align: center;" width="30%">EasyAnimateV5-7b-zh-InP <br> HPSv2.1 Reward LoRA</th>
	<th style="text-align: center;" width="30%">EasyAnimateV5-7b-zh-InP <br> MPS Reward LoRA</th>
	</tr>
	</thead>
	<tr>
	<td>
	Crystal cake shimmering beside a metal apple
	</td>
	<td>
	<video src="https://github.com/user-attachments/assets/25ae8abe-2e53-4557-b3f0-a72c247603e2" width="100%" controls autoplay loop></video>
	</td>
	<td>
	<video src="https://github.com/user-attachments/assets/26f47c9b-e8f6-4768-978f-56fb47de4f2f" width="100%" controls autoplay loop></video>
	</td>
	<td>
	<video src="https://github.com/user-attachments/assets/56166d66-4645-409e-b236-48ea25e8400b" width="100%" controls autoplay loop></video>
	</td>
	</tr>
	<tr>
	<td>
	Elderly artist with a white beard painting on a white canvas
	</td>
	<td>
	<video src="https://github.com/user-attachments/assets/7e0d7153-036a-4a40-b726-218760837ce7" width="100%" controls autoplay loop></video>
	</td>
	<td>
	<video src="https://github.com/user-attachments/assets/314a68e8-57e3-437e-9acc-656da5f73853" width="100%" controls autoplay loop></video>
	</td>
	<td>
	<video src="https://github.com/user-attachments/assets/d045e3e8-c9bd-4833-9a00-6decd50047d9" width="100%" controls autoplay loop></video>
	</td>
	</tr>
	<tr>
	<td>
	Porcelain rabbit hopping by a golden cactus
	</td>
	<td>
	<video src="https://github.com/user-attachments/assets/93890751-2ae7-4d55-82dc-7f992c8ad9b4" width="100%" controls autoplay loop></video>
	</td>
	<td>
	<video src="https://github.com/user-attachments/assets/932ef7e4-c8a9-4153-94a8-8975d872701e" width="100%" controls autoplay loop></video>
	</td>
	<td>
	<video src="https://github.com/user-attachments/assets/be0a01aa-a0c7-45a1-9db2-3b718c0be272" width="100%" controls autoplay loop></video>
	</td>
	</tr>
	<tr>
	<td>
	Green parrot perching on a brown chair
	</td>
	<td>
	<video src="https://github.com/user-attachments/assets/74a41dd4-8375-44be-8242-11287037c484" width="100%" controls autoplay loop></video>
	</td>
	<td>
	<video src="https://github.com/user-attachments/assets/fd76e645-4ae3-427f-ac7b-9712e6dae4dd" width="100%" controls autoplay loop></video>
	</td>
	<td>
	<video src="https://github.com/user-attachments/assets/6a7a0c11-1a78-4d51-90c4-814d1f4fb338" width="100%" controls autoplay loop></video>
	</td>
	</tr>
	</table>

	> [!NOTE]
	> The above test prompts are from <a href="https://github.com/KaiyueSun98/T2V-CompBench">T2V-CompBench</a>. All videos are generated with lora weight 0.7.

	## Quick Start
	We provide an example inference code to run EasyAnimateV5-12b-zh-InP with its HPS2.1 reward LoRA.

	```python
	import torch
	from diffusers import DDIMScheduler
	from omegaconf import OmegaConf
	from transformers import BertModel, BertTokenizer, T5EncoderModel, T5Tokenizer

	from easyanimate.models import AutoencoderKLMagvit, EasyAnimateTransformer3DModel
	from easyanimate.pipeline.pipeline_easyanimate_multi_text_encoder_inpaint import EasyAnimatePipeline_Multi_Text_Encoder_Inpaint
	from easyanimate.utils.lora_utils import merge_lora
	from easyanimate.utils.utils import get_image_to_video_latent, save_videos_grid
	from easyanimate.utils.fp8_optimization import convert_weight_dtype_wrapper

	# GPU memory mode, which can be choosen in [model_cpu_offload, model_cpu_offload_and_qfloat8, sequential_cpu_offload].
	GPU_memory_mode = "model_cpu_offload"
	# Download from https://raw.githubusercontent.com/aigc-apps/EasyAnimate/refs/heads/main/config/easyanimate_video_v5_magvit_multi_text_encoder.yaml
	config_path = "config/easyanimate_video_v5_magvit_multi_text_encoder.yaml"
	model_path = "alibaba-pai/EasyAnimateV5-12b-zh-InP"
	lora_path = "alibaba-pai/EasyAnimateV5-Reward-LoRAs/EasyAnimateV5-12b-zh-InP-HPS2.1.safetensors"
	weight_dtype = torch.bfloat16
	lora_weight = 0.7

	prompt = "A panda eats bamboo while a monkey swings from branch to branch"
	sample_size = [512, 512]
	video_length = 49

	config = OmegaConf.load(config_path)
	transformer_additional_kwargs = OmegaConf.to_container(config['transformer_additional_kwargs'])
	if weight_dtype == torch.float16:
	transformer_additional_kwargs["upcast_attention"] = True
	transformer = EasyAnimateTransformer3DModel.from_pretrained_2d(
	model_path,
	subfolder="transformer",
	transformer_additional_kwargs=transformer_additional_kwargs,
	torch_dtype=torch.float8_e4m3fn if GPU_memory_mode == "model_cpu_offload_and_qfloat8" else weight_dtype,
	low_cpu_mem_usage=True,
	)
	vae = AutoencoderKLMagvit.from_pretrained(
	model_path, subfolder="vae", vae_additional_kwargs=OmegaConf.to_container(config['vae_kwargs'])
	).to(weight_dtype)
	if config['vae_kwargs'].get('vae_type', 'AutoencoderKL') == 'AutoencoderKLMagvit' and weight_dtype == torch.float16:
	vae.upcast_vae = True

	pipeline = EasyAnimatePipeline_Multi_Text_Encoder_Inpaint.from_pretrained(
	model_path,
	text_encoder=BertModel.from_pretrained(model_path, subfolder="text_encoder").to(weight_dtype),
	text_encoder_2=T5EncoderModel.from_pretrained(model_path, subfolder="text_encoder_2").to(weight_dtype),
	tokenizer=BertTokenizer.from_pretrained(model_path, subfolder="tokenizer"),
	tokenizer_2=T5Tokenizer.from_pretrained(model_path, subfolder="tokenizer_2"),
	vae=vae,
	transformer=transformer,
	scheduler=DDIMScheduler.from_pretrained(model_path, subfolder="scheduler"),
	torch_dtype=weight_dtype
	)
	if GPU_memory_mode == "sequential_cpu_offload":
	pipeline.enable_sequential_cpu_offload()
	elif GPU_memory_mode == "model_cpu_offload_and_qfloat8":
	pipeline.enable_model_cpu_offload()
	convert_weight_dtype_wrapper(pipeline.transformer, weight_dtype)
	else:
	pipeline.enable_model_cpu_offload()
	pipeline = merge_lora(pipeline, lora_path, lora_weight)

	generator = torch.Generator(device="cuda").manual_seed(42)
	input_video, input_video_mask, _ = get_image_to_video_latent(None, None, video_length=video_length, sample_size=sample_size)
	sample = pipeline(
	prompt,
	video_length = video_length,
	negative_prompt = "bad detailed",
	height = sample_size[0],
	width = sample_size[1],
	generator = generator,
	guidance_scale = 7.0,
	num_inference_steps = 50,
	video = input_video,
	mask_video = input_video_mask,
	).videos

	save_videos_grid(sample, "samples/output.mp4", fps=8)
	```

	## Limitations
	1. We observe after training to a certain extent, the reward continues to increase, but the quality of the generated videos does not further improve.
	The model trickly learns some shortcuts (by adding artifacts in the background, i.e., adversarial patches) to increase the reward.
	2. Currently, there is still a lack of suitable preference models for video generation. Directly using image preference models cannot
	evaluate preferences along the temporal dimension (such as dynamism and consistency). Further more, We find using image preference models leads to a decrease
	in the dynamism of generated videos. Although this can be mitigated by computing the reward using only the first frame of the decoded video, the impact still persists.

	## References
	<ol>
	<li id="ref1">Clark, Kevin, et al. "Directly fine-tuning diffusion models on differentiable rewards.". In ICLR 2024.</li>
	<li id="ref2">Prabhudesai, Mihir, et al. "Aligning text-to-image diffusion models with reward backpropagation." arXiv preprint arXiv:2310.03739 (2023).</li>
	</ol>