Spaces:

mazpie
/

genrl

Sleeping

App Files Files Community

genrl / third_party /InternVideo /InternVideo2 /README.md

mazpie

Initial commit

2d9a728 10 months ago

preview code

raw

history blame contribute delete

9.96 kB

	# InternVideo2 \[[Paper\]](https://arxiv.org/abs/2403.15377)

	<!-- [中文 README](README_cn.md) -->

	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/action-classification-on-kinetics-400)](https://paperswithcode.com/sota/action-classification-on-kinetics-400?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/action-classification-on-kinetics-600)](https://paperswithcode.com/sota/action-classification-on-kinetics-600?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/action-classification-on-kinetics-700)](https://paperswithcode.com/sota/action-classification-on-kinetics-700?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/action-recognition-in-videos-on-something)](https://paperswithcode.com/sota/action-recognition-in-videos-on-something?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/action-recognition-in-videos-on-activitynet)](https://paperswithcode.com/sota/action-recognition-in-videos-on-activitynet?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/action-classification-on-moments-in-time)](https://paperswithcode.com/sota/action-classification-on-moments-in-time?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/action-recognition-on-hacs)](https://paperswithcode.com/sota/action-recognition-on-hacs?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/zero-shot-video-retrieval-on-msr-vtt)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-msr-vtt?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/zero-shot-video-retrieval-on-msvd)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-msvd?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/zero-shot-video-retrieval-on-lsmdc)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-lsmdc?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/zero-shot-video-retrieval-on-didemo)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-didemo?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/zero-shot-video-retrieval-on-vatex)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-vatex?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/zero-shot-video-retrieval-on-activitynet)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-activitynet?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/video-retrieval-on-msr-vtt)](https://paperswithcode.com/sota/video-retrieval-on-msr-vtt?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/video-retrieval-on-didemo)](https://paperswithcode.com/sota/video-retrieval-on-didemo?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/video-retrieval-on-msvd)](https://paperswithcode.com/sota/video-retrieval-on-msvd?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/video-retrieval-on-lsmdc)](https://paperswithcode.com/sota/video-retrieval-on-lsmdc?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/video-retrieval-on-activitynet)](https://paperswithcode.com/sota/video-retrieval-on-activitynet?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/video-retrieval-on-vatex)](https://paperswithcode.com/sota/video-retrieval-on-vatex?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/text-to-audio-retrieval-on-audiocaps)](https://paperswithcode.com/sota/text-to-audio-retrieval-on-audiocaps?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/text-to-audio-retrieval-on-clotho)](https://paperswithcode.com/sota/text-to-audio-retrieval-on-clotho?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/zero-shot-text-to-audio-retrieval-on)](https://paperswithcode.com/sota/zero-shot-text-to-audio-retrieval-on?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/zero-shot-text-to-audio-retrieval-on-clotho)](https://paperswithcode.com/sota/zero-shot-text-to-audio-retrieval-on-clotho?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/audio-classification-on-esc-50)](https://paperswithcode.com/sota/audio-classification-on-esc-50?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/video-grounding-on-qvhighlights)](https://paperswithcode.com/sota/video-grounding-on-qvhighlights?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/temporal-action-localization-on-fineaction)](https://paperswithcode.com/sota/temporal-action-localization-on-fineaction?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/temporal-action-localization-on-hacs)](https://paperswithcode.com/sota/temporal-action-localization-on-hacs?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/temporal-action-localization-on-thumos14)](https://paperswithcode.com/sota/temporal-action-localization-on-thumos14?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/temporal-action-localization-on-activitynet)](https://paperswithcode.com/sota/temporal-action-localization-on-activitynet?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/zero-shot-video-question-answer-on-egoschema-1)](https://paperswithcode.com/sota/zero-shot-video-question-answer-on-egoschema-1?p=internvideo2-scaling-video-foundation-models)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/video-instance-segmentation-on-youtube-vis-1)](https://paperswithcode.com/sota/video-instance-segmentation-on-youtube-vis-1?p=internvideo2-scaling-video-foundation-models)

	This repo will give the code and models of '[InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding](https://arxiv.org/abs/2403.15377)' soon.

	- Achieved `92.1%` Top1 accuracy in Kinetics 400.
	- Achieved `SOTA` performance on over `60` video/audio-related tasks (including action recognition, temporal localization, retrieval, etc) when released.

	## Updates
	- `2024/04/15`: Update the code and scripts for InternVideo2 CLIP.
	- `2024/04/13`: Update the code and scripts for InternVideo2 Stage1 & 2.
	- `2024/03/22`: The technical report of InternVideo2 is released.

	## Citation

	If this work is helpful for your research, please consider citing InternVideo.

	```
	@article{wang2024internvideo2,
	title={InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding},
	author={Wang, Yi and Li, Kunchang and Li, Xinhao and Yu, Jiashuo and He, Yinan and Chen, Guo and Pei, Baoqi and Zheng, Rongkun and Xu, Jilan and Wang, Zun and Shi, Yansong and Jiang, Tianxiang and Li, Songze and Zhang, Hongjie and Huang, Yifei and Qiao, Yu and Wang, Yali and Wang, Limin},
	journal={arXiv preprint arXiv:2403.15377},
	year={2024}
	}
	```