<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Hybrid Vision Transformer (ViT Hybrid)

## Overview
The hybrid Vision Transformer (ViT) model was proposed in [An Image is Worth 16x16 Words: Transformers for Image Recognition
at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk
Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob
Uszkoreit, Neil Houlsby. It's the first paper that successfully trains a Transformer encoder on ImageNet, attaining
very good results compared to familiar convolutional architectures. ViT Hybrid is a slight variant of the [plain Vision Transformer](vit)
that leverages a convolutional backbone (specifically, [BiT](bit)) whose features serve as the initial "tokens" for the Transformer.
The abstract from the paper is the following:

*While the Transformer architecture has become the de-facto standard for natural language processing tasks, its
applications to computer vision remain limited. In vision, attention is either applied in conjunction with
convolutional networks, or used to replace certain components of convolutional networks while keeping their overall
structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to
sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of
data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.),
Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring
substantially fewer computational resources to train.*
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code (written in JAX) can be
found [here](https://github.com/google-research/vision_transformer).
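
For a quick end-to-end check, the image-classification pipeline can run a ViT Hybrid checkpoint directly. This is a minimal sketch: the checkpoint name `google/vit-hybrid-base-bit-384` and the image URL are assumptions, so substitute any ViT Hybrid checkpoint from the Hub.

```python
from transformers import pipeline

# Assumed checkpoint name; replace with any ViT Hybrid image-classification checkpoint
classifier = pipeline("image-classification", model="google/vit-hybrid-base-bit-384")

# Run inference on a remote image (URL is just an example)
predictions = classifier("http://images.cocodataset.org/val2017/000000039769.jpg")
print(predictions[:2])  # top predicted labels with scores
```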
## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ViT Hybrid.

<PipelineTag pipeline="image-classification"/>

- [`ViTHybridForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).
- See also: [Image classification task guide](../tasks/image_classification)

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
## ViTHybridConfig

[[autodoc]] ViTHybridConfig
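
As a minimal sketch of how the configuration ties into the model classes below, a `ViTHybridConfig` can be instantiated with default values and used to build a randomly initialized model (attribute names follow the usual Transformers configuration API):

```python
from transformers import ViTHybridConfig, ViTHybridModel

# Build a configuration with default values and a randomly initialized model from it
configuration = ViTHybridConfig()
model = ViTHybridModel(configuration)

# The configuration is accessible from the model afterwards
print(model.config.hidden_size)
```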
## ViTHybridImageProcessor

[[autodoc]] ViTHybridImageProcessor
    - preprocess
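
A minimal sketch of preparing an image for the model; the checkpoint name and image URL are assumptions:

```python
import requests
from PIL import Image
from transformers import ViTHybridImageProcessor

# Example image (URL is just an example)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Assumed checkpoint name; the processor resizes, rescales and normalizes the image
processor = ViTHybridImageProcessor.from_pretrained("google/vit-hybrid-base-bit-384")
inputs = processor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)
```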
## ViTHybridModel

[[autodoc]] ViTHybridModel
    - forward
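
A minimal sketch of a bare forward pass that returns hidden states (checkpoint name and image URL are assumptions):

```python
import torch
import requests
from PIL import Image
from transformers import ViTHybridImageProcessor, ViTHybridModel

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Assumed checkpoint name
processor = ViTHybridImageProcessor.from_pretrained("google/vit-hybrid-base-bit-384")
model = ViTHybridModel.from_pretrained("google/vit-hybrid-base-bit-384")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Sequence of hidden states: a [CLS] token plus one token per BiT feature-map position
print(outputs.last_hidden_state.shape)
```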
## ViTHybridForImageClassification

[[autodoc]] ViTHybridForImageClassification
    - forward
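
A minimal sketch of image classification with the model's classification head (checkpoint name and image URL are assumptions; the assumed checkpoint is fine-tuned on ImageNet-1k):

```python
import torch
import requests
from PIL import Image
from transformers import ViTHybridImageProcessor, ViTHybridForImageClassification

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Assumed checkpoint name
processor = ViTHybridImageProcessor.from_pretrained("google/vit-hybrid-base-bit-384")
model = ViTHybridForImageClassification.from_pretrained("google/vit-hybrid-base-bit-384")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring logit to its label
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```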