# VideoCLIP and VLM

You have just found the toolkit for multimodal video understanding! It contains implementations of two recent multimodal video understanding papers, [VideoCLIP](https://arxiv.org/pdf/2109.14084.pdf) (EMNLP, 2021) and [VLM](https://aclanthology.org/2021.findings-acl.370.pdf) (ACL Findings, 2021), along with high-performance toolkits that are typically lacking in existing codebases. The toolkit is designed around generic, performance-tuned components that can potentially be adapted to other frameworks (we initially use fairseq).
VideoCLIP is a contrastive-learning model for zero-shot transfer to retrieval/classification/sequence-labeling style tasks.

<img src="videoclip.png" width="350" class="center">

VLM is a masked-language-model style pre-training model that uses a single encoder with a masked modality model (MMM) for retrieval/generation/sequence-labeling style tasks.

<img src="vlm.png" width="350" class="center">
### News

[Oct. 2021] Initial release of implementations for the following papers:

- [VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding](https://arxiv.org/pdf/2109.14084.pdf) (Xu et al., EMNLP 2021)
- [VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding](https://aclanthology.org/2021.findings-acl.370.pdf) (Xu et al., ACL Findings 2021)
### Installation

We aim to minimize this repo's dependencies on other packages.

We use fairseq as the main trainer (the models and datasets have no dependency on fairseq; we plan to support other trainers in the future):
```
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install -e .  # also optionally follow the fairseq README for apex installation for fp16 training.
export MKL_THREADING_LAYER=GNU  # fairseq may need this for numpy.
```
Then install this toolkit:

```
cd examples/MMPT  # MMPT can be in any folder, not necessarily under fairseq/examples.
pip install -e .
```
The code was developed under Python 3.8.8, PyTorch 1.8, CUDA 11.0 with fairseq 1.0.0a0+af0389f, and tested under Python 3.8.8, PyTorch 1.9, CUDA 11.0, fairseq 1.0.0a0+8e7bc73 during code release.

Most models require `transformers==3.4` for API compatibility: `pip install transformers==3.4`.

In addition, some downstream tasks may need `conda install pandas`.
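Since the pinned versions above matter for API compatibility, a quick environment sanity check may help (a minimal sketch; it assumes the packages import under the names below and expose `__version__`):

```python
# Sanity-check the environment before training (expected versions are from
# this README; exact patch versions may differ on your machine).
import torch
import fairseq
import transformers

print("torch:", torch.__version__)                 # expect 1.8-1.9
print("fairseq:", fairseq.__version__)             # expect 1.0.0a0+...
print("transformers:", transformers.__version__)   # expect 3.4.x
print("cuda available:", torch.cuda.is_available())
```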
### Usage

#### Download Checkpoints

We use pre-trained [S3D](https://github.com/antoine77340/S3D_HowTo100M) for video feature extraction. Please place the models as `pretrained_models/s3d_dict.npy` and `pretrained_models/s3d_howto100m.pth`.

Download the VideoCLIP checkpoint `https://dl.fbaipublicfiles.com/MMPT/retri/videoclip/checkpoint_best.pt` to `runs/retri/videoclip`, or the VLM checkpoint `https://dl.fbaipublicfiles.com/MMPT/mtm/vlm/checkpoint_best.pt` to `runs/mtm/vlm`.
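A minimal sketch for fetching the released checkpoints into the expected layout (URLs and destination paths are taken verbatim from above; the S3D files must still be obtained from the linked repo):

```python
# Fetch the released checkpoints into the paths the configs expect.
import os
import urllib.request

CHECKPOINTS = {
    "runs/retri/videoclip/checkpoint_best.pt":
        "https://dl.fbaipublicfiles.com/MMPT/retri/videoclip/checkpoint_best.pt",
    "runs/mtm/vlm/checkpoint_best.pt":
        "https://dl.fbaipublicfiles.com/MMPT/mtm/vlm/checkpoint_best.pt",
}

for dest, url in CHECKPOINTS.items():
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    if not os.path.exists(dest):
        print("downloading", url)
        urllib.request.urlretrieve(url, dest)
```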
#### Demo of Inference

Run `python locallaunch.py projects/retri/videoclip.yaml --dryrun` to generate all the `.yaml` config files for VideoCLIP, then:
```python
import torch

from mmpt.models import MMPTModel

model, tokenizer, aligner = MMPTModel.from_pretrained(
    "projects/retri/videoclip/how2.yaml")
model.eval()

# B, T, FPS, H, W, C (VideoCLIP is trained on 30 fps of S3D)
video_frames = torch.randn(1, 2, 30, 224, 224, 3)
caps, cmasks = aligner._build_text_seq(
    tokenizer("some text", add_special_tokens=False)["input_ids"]
)
caps, cmasks = caps[None, :], cmasks[None, :]  # bsz=1

with torch.no_grad():
    output = model(video_frames, caps, cmasks, return_score=True)
print(output["score"])  # dot-product
```
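Building on the demo, here is a small sketch that ranks several candidate captions against one clip; it reuses the `model`, `tokenizer`, and `aligner` objects and the forward signature shown above (the candidate strings are illustrative):

```python
# Rank candidate captions for a single clip by their similarity score.
# Reuses `model`, `tokenizer`, `aligner`, and `video_frames` from the demo above.
candidates = ["someone slices a tomato", "a person rides a bike"]

scores = []
with torch.no_grad():
    for text in candidates:
        caps, cmasks = aligner._build_text_seq(
            tokenizer(text, add_special_tokens=False)["input_ids"]
        )
        caps, cmasks = caps[None, :], cmasks[None, :]  # bsz=1
        output = model(video_frames, caps, cmasks, return_score=True)
        scores.append(output["score"].item())

best = max(range(len(candidates)), key=lambda i: scores[i])
print("best caption:", candidates[best])
```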
#### Data Preparation

See [dataset](DATASET.md) for each dataset.
#### Global Config for Training Pipeline

We organize the global config file for a training/testing pipeline under `projects` (see the detailed [explanation](CONFIG.md)). For example, VideoCLIP is in `projects/retri/videoclip.yaml` and VLM is in `projects/mtm/vlm.yaml`.

We wrap all commands in `locallaunch.py` and `mmpt_cli/localjob.py`. You can inspect the concrete commands with `--dryrun`, then drop the flag for an actual run.

First, running `python locallaunch.py projects/retri/videoclip.yaml --dryrun` generates the config files for all stages (pre-training, zero-shot evaluation, fine-tuning, and testing) of VideoCLIP under `projects/retri/videoclip`.

Each training or evaluation process is then configured by a concrete config file (we save all complex arguments, including the fairseq args, into the concrete config file for reproducibility). For example, to run zero-shot evaluation on YouCook:
```
python locallaunch.py projects/retri/videoclip/test_youcook_zs.yaml --jobtype local_predict  # zero-shot evaluation.
python locallaunch.py projects/retri/videoclip/youcook_videoclip.yaml --jobtype local_single --dryrun  # fine-tuning: use --dryrun to check the commands, then drop it for an actual run; local_small runs on two GPUs (as in the paper).
python locallaunch.py projects/retri/videoclip/test_youcook_videoclip.yaml --jobtype local_predict  # testing on the fine-tuned model.
```
Pre-training can be run as:

```
python locallaunch.py projects/retri/videoclip/how2.yaml --jobtype local_single --dryrun  # check the commands, then drop --dryrun; the paper uses local_big with 8 GPUs.
```
You may need to change `--jobtype`; check/extend `LocalJob` in `mmpt_cli/localjob.py` for multi-GPU/multi-node pre-training.

Detailed instructions for pre-training and fine-tuning can be found in the [pretraining instruction](pretraining.md) and [finetuning instruction](endtask.md).
### Development

Several components of this toolkit can be re-used in future research (including our ongoing research).

#### Framework Wrapper

We currently only support fairseq, but most components can easily fit into other frameworks such as huggingface. This repo is a `--user-dir` of fairseq with a fairseq wrapper. For example, `mmpt/tasks` includes a `FairseqMMTTask`, which manages `mmpt/datasets` with `FairseqDataset`, `mmpt/models` with `FairseqModel`, and `mmpt/losses` with `FairseqCriterion`.
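For orientation, this is the general fairseq `--user-dir` plugin pattern (a generic sketch of fairseq's task-registration API; it is not the toolkit's actual `FairseqMMTTask` code, and the task name is hypothetical):

```python
# Generic fairseq plugin pattern: registering a custom task from a --user-dir.
# Illustrative only; see mmpt/tasks for the toolkit's real FairseqMMTTask.
from fairseq.tasks import LegacyFairseqTask, register_task

@register_task("mmpt_demo_task")  # hypothetical task name for illustration
class MMPTDemoTask(LegacyFairseqTask):
    @staticmethod
    def add_args(parser):
        # Expose task-specific CLI flags to fairseq's argument parser.
        parser.add_argument("--config", type=str, help="path to a project .yaml")

    @classmethod
    def setup_task(cls, args, **kwargs):
        # fairseq calls this factory to construct the task from parsed args.
        return cls(args)
```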
#### Processors

**Multi**modal research introduces complexity in modality alignment, from the different input sources all the way to the losses. Inspired by [MMF](https://github.com/facebookresearch/mmf), this toolkit uses `mmpt/processors` to handle the various needs of data preprocessing and loading, **alleviating** the need for multiple `torch.utils.data.Dataset` classes (which can make ablation studies tricky).

Processors can also be decoupled from `torch.utils.data.Dataset` for offline preprocessing instead of on-the-fly data preprocessing.

We decouple `mmpt.MMDataset` into four types of processors: `MetaProcessor`, `VideoProcessor`, `TextProcessor` and `Aligner`. They can be configured in the `dataset` field of a config file (e.g., see `projects/task/how2.yaml`):
- `MetaProcessor` loads the metadata of a dataset, e.g., all video ids of the how2 dataset.
- `VideoProcessor` loads the video features of a dataset, e.g., S3D features for each second of a video.
- `TextProcessor` loads the text (features), e.g., BERT pre-tokenized text clips for the how2 dataset (with `start`/`end` timestamps and `cap` for `token_ids`).
- `Aligner` is the core class that prepares the training data for a given baseline, e.g., sampling a clip, masking tokens for MLM, etc. A rough sketch of the processor roles follows this list.
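As a rough illustration of this division of labor, here is a hypothetical sketch (the real base classes live in `mmpt/processors`, and their exact signatures may differ):

```python
# Hypothetical processor sketch illustrating the division of labor.
# The real base classes live in mmpt/processors; signatures may differ.
import json

class JsonMetaProcessor:
    """Loads the list of video ids for a dataset (the MetaProcessor role)."""
    def __init__(self, meta_path):
        with open(meta_path) as f:
            self.video_ids = json.load(f)

    def __getitem__(self, idx):
        return self.video_ids[idx]

    def __len__(self):
        return len(self.video_ids)

class WhitespaceTextProcessor:
    """Turns raw text into token ids (the TextProcessor role)."""
    def __init__(self, vocab):
        self.vocab = vocab  # token -> id mapping

    def __call__(self, text):
        return [self.vocab.get(tok, 0) for tok in text.split()]
```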
#### Performance-tuned Components

To speed up pre-training, this toolkit uses sharded features stored in memory-mapped numpy arrays, backed by `ShardedTensor` in `mmpt/utils/shardedtensor.py` (adopted from the MARGE paper). This reduces the I/O load of multi-GPU training, since a video's features need not all be loaded into memory each time, and `ShardedTensor` ensures features are stored in contiguous disk space for near-random access. This is used for both How2 video features and texts in `mmpt/processors/how2processor.py`.
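The underlying idea can be illustrated with plain memory-mapped numpy (an illustrative sketch of the access pattern, not the actual `ShardedTensor` API; file names and shapes are made up):

```python
# Illustrative memory-mapped access pattern (not the actual ShardedTensor API):
# features for many videos are stored contiguously in one on-disk array, plus
# an offset index, so each worker reads only the rows it needs.
import numpy as np

# Write a shard once, offline.
feats = np.random.rand(1000, 512).astype(np.float32)  # e.g., per-second features
offsets = np.array([0, 300, 650, 1000])               # row ranges per video
np.save("shard0.npy", feats)
np.save("shard0_offsets.npy", offsets)

# At training time, mmap the shard: nothing is fully loaded into memory.
shard = np.load("shard0.npy", mmap_mode="r")
start, end = offsets[1], offsets[2]
video_feats = np.asarray(shard[start:end])  # reads only these rows from disk
```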
### Citation

If this codebase is useful for your work, please cite the following papers:
```BibTeX
@inproceedings{xu-etal-2021-videoclip,
    title = "{VideoCLIP}: Contrastive Pre-training for Zero-shot Video-Text Understanding",
    author = "Xu, Hu and
      Ghosh, Gargi and
      Huang, Po-Yao and
      Okhonko, Dmytro and
      Aghajanyan, Armen and
      Metze, Florian and
      Zettlemoyer, Luke and
      Feichtenhofer, Christoph",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
}
@inproceedings{xu-etal-2021-vlm,
    title = "{VLM}: Task-agnostic Video-Language Model Pre-training for Video Understanding",
    author = "Xu, Hu and
      Ghosh, Gargi and
      Huang, Po-Yao and
      Arora, Prahal and
      Aminzadeh, Masoumeh and
      Feichtenhofer, Christoph and
      Metze, Florian and
      Zettlemoyer, Luke",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.370",
    doi = "10.18653/v1/2021.findings-acl.370",
    pages = "4227--4239",
}
```
### Bug Reports

This repo is in its initial stage; bug reports are welcome at [email protected].
### Copyright

The majority of Multimodal Pre-training (MMPT) is licensed under CC-BY-NC; however, portions of the project are available under separate license terms. Evaluation codes/models: HowTo100M and HuggingFace Transformers are licensed under the Apache 2.0 license; COIN and NLG-eval are licensed under the MIT license; CrossTask is licensed under the BSD-3 license; DiDeMo is licensed under the BSD-2 license.