arxiv:2209.04372

Pre-training image-language transformers for open-vocabulary tasks

Published on Sep 9, 2022

Abstract

We present a pre-training approach for vision and language transformer models that is based on a mixture of diverse tasks. We explore both the use of image-text captioning data in pre-training, which requires no additional supervision, and object-aware strategies for pre-training the model. We evaluate the method on a number of text-generative vision+language tasks, such as Visual Question Answering, visual entailment, and captioning, and demonstrate large gains over standard pre-training methods.
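The abstract describes training a single image-language model on a mixture of text-generative tasks, e.g., captioning over image-text pairs alongside object-aware objectives. The sketch below is only a rough, hypothetical illustration of that idea under assumed details: the toy model, the task names, the dummy data loader, and all hyperparameters are placeholders, not the paper's architecture or code.

import random
import torch
import torch.nn as nn

class ToyImageLanguageModel(nn.Module):
    # Minimal stand-in for an image-language encoder-decoder:
    # pooled image features condition a tiny text decoder.
    def __init__(self, vocab_size=1000, dim=128, image_dim=2048):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, dim)
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, image_feats, text_tokens):
        img = self.image_proj(image_feats).unsqueeze(1)      # (B, 1, D)
        txt = self.text_embed(text_tokens)                   # (B, T, D)
        hidden, _ = self.decoder(torch.cat([img, txt], dim=1))
        # Teacher forcing: the output at position i (image + tokens < i)
        # predicts token i, so drop the last position.
        return self.lm_head(hidden[:, :-1])                  # (B, T, vocab)

# Each pre-training task is cast as text generation; only the target text
# differs (a caption vs. object names). Task names here are illustrative.
TASKS = ["captioning", "object_naming"]

def make_batch(task, batch_size=4, seq_len=8, vocab_size=1000):
    # Dummy loader: a real one would return image features paired with
    # captions (captioning task) or object labels (object-aware task).
    image_feats = torch.randn(batch_size, 2048)
    targets = torch.randint(0, vocab_size, (batch_size, seq_len))
    return image_feats, targets

model = ToyImageLanguageModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(10):
    task = random.choice(TASKS)               # sample a task per batch: the "mixture"
    image_feats, targets = make_batch(task)
    logits = model(image_feats, targets)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Sampling a task per batch keeps one model, one loss, and one set of weights across the whole mixture, which is what lets such pre-training add tasks without task-specific heads.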

Models citing this paper 164

Datasets citing this paper 0

Spaces citing this paper 62

Collections including this paper 0
