# Video Caption

English | [็ฎไฝไธญๆ](./README_zh-CN.md)

This folder contains the code for dataset preprocessing (i.e., video splitting, filtering, and recaptioning) and for the beautiful prompt used by CogVideoX-Fun.
The entire pipeline supports distributed parallel processing and can handle large-scale datasets.
Meanwhile, we are collaborating with [Data-Juicer](https://github.com/modelscope/data-juicer/blob/main/docs/DJ_SORA.md),
allowing you to easily perform video data processing on [Aliyun PAI-DLC](https://help.aliyun.com/zh/pai/user-guide/video-preprocessing/).
# Table of Contents

- [Video Caption](#video-caption)
- [Table of Contents](#table-of-contents)
- [Quick Start](#quick-start)
  - [Setup](#setup)
  - [Data Preprocessing](#data-preprocessing)
    - [Data Preparation](#data-preparation)
    - [Video Splitting](#video-splitting)
    - [Video Filtering](#video-filtering)
    - [Video Recaptioning](#video-recaptioning)
  - [Beautiful Prompt (For CogVideoX-Fun Inference)](#beautiful-prompt-for-cogvideox-fun-inference)
    - [Batched Inference](#batched-inference)
    - [OpenAI Server](#openai-server)
## Quick Start

### Setup

AliyunDSW or Docker is recommended to set up the environment; please refer to [Quick Start](../../README.md#quick-start).
You can also refer to the image build process in the [Dockerfile](../../Dockerfile.ds) to configure the conda environment and other dependencies locally.
Since video recaptioning depends on [llm-awq](https://github.com/mit-han-lab/llm-awq) for faster and more memory-efficient inference,
the minimum GPU requirement is an RTX 3060 or A2 (CUDA Compute Capability >= 8.0).
```shell
# pull the image
docker pull mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easycv/torch_cuda:cogvideox_fun
# launch and enter the container
docker run -it -p 7860:7860 --network host --gpus all --security-opt seccomp:unconfined --shm-size 200g mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easycv/torch_cuda:cogvideox_fun
# clone the code
git clone https://github.com/aigc-apps/CogVideoX-Fun.git
# enter video_caption
cd CogVideoX-Fun/cogvideox/video_caption
```
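Since the llm-awq kernels require CUDA Compute Capability >= 8.0, you can quickly confirm that your GPU qualifies with a small PyTorch check (a minimal sketch; it only assumes PyTorch is available inside the container):

```python
# Minimal sanity check for the llm-awq requirement (CUDA Compute Capability >= 8.0).
import torch

assert torch.cuda.is_available(), "No CUDA device detected."
major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
assert (major, minor) >= (8, 0), "llm-awq kernels require compute capability >= 8.0."
```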
### Data Preprocessing

#### Data Preparation

Place the downloaded videos into a folder under [datasets](./datasets/) (preferably without nested structures, as the video names are used as unique IDs in subsequent processes).
Taking Panda-70M as an example, the entire dataset directory structure is shown as follows:
```
๐ฆ datasets/
โโโ ๐ panda_70m/
โ   โโโ ๐ videos/
โ   โ   โโโ ๐ data/
โ   โ   โ   โโโ ๐ --C66yU3LjM_2.mp4
โ   โ   โ   โโโ ๐ ...
```
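Because the video file names serve as unique IDs downstream, it can be worth checking for duplicated names before processing. The snippet below is a minimal sketch; the `datasets/panda_70m/videos` path simply mirrors the layout above:

```python
# Hedged sketch: video file names act as unique IDs downstream, so flag duplicated
# basenames (e.g., the same name appearing in nested folders) before processing.
from collections import Counter
from pathlib import Path

video_dir = Path("datasets/panda_70m/videos")  # mirrors the layout above
names = Counter(p.name for p in video_dir.rglob("*.mp4"))
duplicates = {name: count for name, count in names.items() if count > 1}

print(f"{sum(names.values())} videos found, {len(duplicates)} duplicated names")
for name, count in duplicates.items():
    print(f"duplicate ID: {name} (x{count})")
```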
#### Video Splitting

CogVideoX-Fun utilizes [PySceneDetect](https://github.com/Breakthrough/PySceneDetect) to identify scene changes within the video
and performs video splitting via FFmpeg based on certain threshold values to ensure the consistency of each video clip.
Video clips shorter than 3 seconds are discarded, and those longer than 10 seconds are split recursively.
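Conceptually, the splitting step behaves like the sketch below. It is illustrative only: the detector threshold, FFmpeg arguments, and output naming are assumptions, and the real logic (including the recursive split of long clips) lives in the stage-1 script.

```python
# Illustrative sketch of scene-based splitting: detect scene changes with PySceneDetect,
# then cut each scene into its own clip with FFmpeg. Thresholds and paths are assumptions.
import subprocess
from scenedetect import detect, ContentDetector

video_path = "datasets/panda_70m/videos/data/--C66yU3LjM_2.mp4"
scenes = detect(video_path, ContentDetector(threshold=27.0))  # threshold is an assumption

for i, (start, end) in enumerate(scenes):
    duration = end.get_seconds() - start.get_seconds()
    if duration < 3:  # clips shorter than 3 seconds are discarded
        continue
    # Clips longer than 10 seconds would be split further by the real pipeline.
    output_path = f"datasets/panda_70m/videos_clips/data/clip_{i:04d}.mp4"
    subprocess.run([
        "ffmpeg", "-y", "-i", video_path,
        "-ss", str(start.get_seconds()), "-to", str(end.get_seconds()),
        "-c:v", "libx264", output_path,
    ], check=True)
```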
The entire workflow of video splitting is in the [stage_1_video_splitting.sh](./scripts/stage_1_video_splitting.sh).
After running
```shell
sh scripts/stage_1_video_splitting.sh
```
the video clips are obtained in `cogvideox/video_caption/datasets/panda_70m/videos_clips/data/`.
#### Video Filtering

Based on the video clips obtained in the previous step, CogVideoX-Fun provides a simple yet effective pipeline to select high-quality videos for recaptioning.
The overall process is as follows:
- Aesthetic filtering: Filter out videos with poor content (blurry, dim, etc.) by calculating the average aesthetic score of 4 uniformly sampled frames via [aesthetic-predictor-v2-5](https://github.com/discus0434/aesthetic-predictor-v2-5).
- Text filtering: Use [EasyOCR](https://github.com/JaidedAI/EasyOCR) to calculate the text area proportion of the middle frame and filter out videos with a large area of text.
- Motion filtering: Calculate inter-frame optical flow differences to filter out videos that move too slowly or too quickly (a minimal sketch of this filter follows the list).
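As a rough illustration of the motion filter, the sketch below scores a clip by the mean Farneback optical-flow magnitude between sampled frames. It is a simplified stand-in for the actual implementation; the frame stride and any accept/reject thresholds are assumptions.

```python
# Simplified motion score: mean optical-flow magnitude between sampled frames.
# Very low scores suggest near-static clips; very high scores suggest erratic motion.
import cv2
import numpy as np

def motion_score(video_path: str, stride: int = 8) -> float:
    cap = cv2.VideoCapture(video_path)
    prev_gray, magnitudes, idx = None, [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev_gray is not None:
                flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                                    0.5, 3, 15, 3, 5, 1.2, 0)
                magnitudes.append(np.linalg.norm(flow, axis=-1).mean())
            prev_gray = gray
        idx += 1
    cap.release()
    return float(np.mean(magnitudes)) if magnitudes else 0.0

print(motion_score("datasets/panda_70m/videos_clips/data/clip_0000.mp4"))
```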
The entire workflow of video filtering is in the [stage_2_video_filtering.sh](./scripts/stage_2_video_filtering.sh).
After running
```shell
sh scripts/stage_2_video_filtering.sh
```
the aesthetic score, text score, and motion score of the videos will be saved in the corresponding meta files in the folder `cogvideox/video_caption/datasets/panda_70m/videos_clips/`.
> [!NOTE]
> The computation of the aesthetic score depends on the [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) model.
> Please run `HF_ENDPOINT=https://hf-mirror.com sh scripts/stage_2_video_filtering.sh` if you cannot access huggingface.co.
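If you want to inspect the scores or re-filter the clips with your own thresholds, the meta files can be combined along the lines of the sketch below. The file names, field names, and threshold values here are all assumptions and may differ from what the stage-2 script actually writes.

```python
# Hedged sketch: merge per-metric meta files and keep clips that pass all thresholds.
# File names, field names, and thresholds are assumptions, not the scripts' real output.
import json
from pathlib import Path

meta_dir = Path("datasets/panda_70m/videos_clips")
metrics = ["aesthetic_score", "text_score", "motion_score"]

scores = {}
for metric in metrics:
    with open(meta_dir / f"meta_{metric}.jsonl") as f:  # assumed file naming
        for line in f:
            item = json.loads(line)
            scores.setdefault(item["video_path"], {})[metric] = item[metric]

aesthetic_min, text_max = 4.0, 0.02   # assumed thresholds
motion_min, motion_max = 2.0, 14.0    # keep clips within a reasonable motion range

kept = [
    path for path, s in scores.items()
    if len(s) == len(metrics)
    and s["aesthetic_score"] >= aesthetic_min
    and s["text_score"] <= text_max
    and motion_min <= s["motion_score"] <= motion_max
]
print(f"{len(kept)} / {len(scores)} clips pass all filters")
```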
#### Video Recaptioning

After obtaining the filtered high-quality videos above, CogVideoX-Fun utilizes [VILA1.5](https://github.com/NVlabs/VILA) to perform video recaptioning.
Subsequently, the recaptioning results are rewritten by LLMs to better meet the requirements of video generation tasks.
Finally, an advanced VideoCLIPXL model is developed to filter out video-caption pairs with poor alignment, resulting in the final training dataset.

Please download a video caption model of the appropriate size from [VILA1.5](https://huggingface.co/collections/Efficient-Large-Model/vila-on-pre-training-for-visual-language-models-65d8022a3a52cd9bcd62698e) based on the GPU memory of your machine.
For an A100 with 40 GB of VRAM, you can download [VILA1.5-40b-AWQ](https://huggingface.co/Efficient-Large-Model/VILA1.5-40b-AWQ) by running
```shell
# Add HF_ENDPOINT=https://hf-mirror.com before the command if you cannot access huggingface.co
huggingface-cli download Efficient-Large-Model/VILA1.5-40b-AWQ --local-dir-use-symlinks False --local-dir /PATH/TO/VILA_MODEL
```
Optionally, you can prepare a local LLM to rewrite the recaptioning results.
For example, you can download [Meta-Llama-3-8B-Instruct](https://huggingface.co/NousResearch/Meta-Llama-3-8B-Instruct) by running
```shell
# Add HF_ENDPOINT=https://hf-mirror.com before the command if you cannot access huggingface.co
huggingface-cli download NousResearch/Meta-Llama-3-8B-Instruct --local-dir-use-symlinks False --local-dir /PATH/TO/REWRITE_MODEL
```
The entire workflow of video recaptioning is in the [stage_3_video_recaptioning.sh](./scripts/stage_3_video_recaptioning.sh).
After running
```shell
VILA_MODEL_PATH=/PATH/TO/VILA_MODEL REWRITE_MODEL_PATH=/PATH/TO/REWRITE_MODEL sh scripts/stage_3_video_recaptioning.sh
```
the final training file is obtained at `cogvideox/video_caption/datasets/panda_70m/videos_clips/meta_train_info.json`.
### Beautiful Prompt (For CogVideoX-Fun Inference)

Beautiful Prompt aims to rewrite and beautify the user-uploaded prompt via LLMs, mapping it to the style of CogVideoX-Fun's training captions
and making it more suitable as an inference prompt, thus improving the quality of the generated videos.
We support batched inference with local LLMs or an OpenAI-compatible server based on [vLLM](https://github.com/vllm-project/vllm) for beautiful prompt.
#### Batched Inference

1. Prepare original prompts in a jsonl file `cogvideox/video_caption/datasets/original_prompt.jsonl` with the following format:

    ```json
    {"prompt": "A stylish woman in a black leather jacket, red dress, and boots walks confidently down a damp Tokyo street."}
    {"prompt": "An underwater world with realistic fish and other creatures of the sea."}
    {"prompt": "a monarch butterfly perched on a tree trunk in the forest."}
    {"prompt": "a child in a room with a bottle of wine and a lamp."}
    {"prompt": "two men in suits walking down a hallway."}
    ```
2. Then you can perform beautiful prompt by running

    ```shell
    # Meta-Llama-3-8B-Instruct is sufficient for this task.
    # Download it from https://huggingface.co/NousResearch/Meta-Llama-3-8B-Instruct or https://www.modelscope.cn/models/LLM-Research/Meta-Llama-3-8B-Instruct to /path/to/your_llm
    python caption_rewrite.py \
        --video_metadata_path datasets/original_prompt.jsonl \
        --caption_column "prompt" \
        --batch_size 1 \
        --model_name /path/to/your_llm \
        --prompt prompt/beautiful_prompt.txt \
        --prefix '"detailed description": ' \
        --saved_path datasets/beautiful_prompt.jsonl \
        --saved_freq 1
    ```
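Conceptually, the batched rewriting is similar to the vLLM sketch below. It is a simplified stand-in for `caption_rewrite.py`: the prompt template, sampling parameters, and output handling are illustrative assumptions (the real template lives in `prompt/beautiful_prompt.txt`).

```python
# Hedged sketch of batched prompt beautification with vLLM; the template, sampling
# parameters, and output handling are illustrative, not caption_rewrite.py itself.
import json
from vllm import LLM, SamplingParams

with open("datasets/original_prompt.jsonl") as f:
    prompts = [json.loads(line)["prompt"] for line in f]

template = (
    "Rewrite the following prompt into a detailed description suitable for video generation.\n"
    'Prompt: {prompt}\n"detailed description": '
)

llm = LLM(model="/path/to/your_llm", dtype="auto")
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate([template.format(prompt=p) for p in prompts], params)

with open("datasets/beautiful_prompt.jsonl", "w") as f:
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "beautiful_prompt": output.outputs[0].text.strip()}
        f.write(json.dumps(record) + "\n")
```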
#### OpenAI Server

+ You can request an OpenAI-compatible server to perform beautiful prompt by running

  ```shell
  OPENAI_API_KEY="your_openai_api_key" OPENAI_BASE_URL="your_openai_base_url" python beautiful_prompt.py \
      --model "your_model_name" \
      --prompt "your_prompt"
  ```
+ You can also deploy an OpenAI-compatible server locally using vLLM. For example:

  ```shell
  # Meta-Llama-3-8B-Instruct is sufficient for this task.
  # Download it from https://huggingface.co/NousResearch/Meta-Llama-3-8B-Instruct or https://www.modelscope.cn/models/LLM-Research/Meta-Llama-3-8B-Instruct to /path/to/your_llm
  # deploy the OpenAI-compatible server
  python -m vllm.entrypoints.openai.api_server --model /path/to/your_llm --dtype auto --api-key "your_api_key"
  ```
  Then you can perform beautiful prompt by running

  ```shell
  python beautiful_prompt.py \
      --model /path/to/your_llm \
      --prompt "your_prompt" \
      --base_url "http://localhost:8000/v1" \
      --api_key "your_api_key"
  ```
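For reference, the request made by `beautiful_prompt.py` against such a server is essentially an OpenAI-style chat completion, roughly like the sketch below (the system prompt is an illustrative assumption, not the exact one used by the script):

```python
# Hedged sketch of the OpenAI-compatible request; the system prompt is an
# illustrative assumption, not the exact one used by beautiful_prompt.py.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="your_api_key")
response = client.chat.completions.create(
    model="/path/to/your_llm",
    messages=[
        {"role": "system", "content": "Rewrite the user's prompt into a detailed video description."},
        {"role": "user", "content": "a monarch butterfly perched on a tree trunk in the forest."},
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)
```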