LeViT: Image Classification
LeViT is a hybrid vision model that combines convolutional and Transformer components, released by Meta (then Facebook AI Research) in 2021 to improve the trade-off between inference speed and accuracy. It uses a staged architecture: the early layers are convolutions that extract local features and reduce spatial resolution, while the later stages apply lightweight attention blocks to capture global context. Training uses DeiT-style knowledge distillation, in which a separate distillation head learns from a larger teacher model, improving the accuracy of the compact student. With low computational cost (e.g., LeViT-128 at about 0.4G FLOPs), it achieves competitive accuracy on ImageNet and is suited to mobile and edge-device deployment for tasks such as real-time image classification and object detection, where speed must be balanced against resource constraints.
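The distillation idea described above can be illustrated with a minimal sketch. It assumes DeiT-style hard-label distillation, where the student produces a classification output and a distillation output, and the teacher's top prediction serves as the target for the latter. The function names and the equal 0.5/0.5 weighting are illustrative assumptions, not taken from the LeViT training code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Cross-entropy of one sample against an integer class index."""
    return -math.log(softmax(logits)[target])


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label):
    """Average of two cross-entropy terms (weights are an assumption).

    cls_logits:     student classification output, supervised by the true label
    dist_logits:    student distillation output, supervised by the teacher
    teacher_logits: teacher output; only its argmax is used (hard label)
    """
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    loss_cls = cross_entropy(cls_logits, label)
    loss_dist = cross_entropy(dist_logits, teacher_label)
    return 0.5 * loss_cls + 0.5 * loss_dist


def predict(cls_logits, dist_logits):
    """At inference, average the two outputs and take the top class."""
    avg = [(a + b) / 2 for a, b in zip(cls_logits, dist_logits)]
    return max(range(len(avg)), key=avg.__getitem__)
```

In this sketch the teacher is only needed during training; at inference the student runs alone and its two outputs are averaged.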
Source model
- Input shape: 1x3x224x224
- Number of parameters: 7.82M
- Model size: 30.17 MB
- Output shape: 1x1000
The source model can be found here
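As a rough illustration of what the shapes and sizes listed above imply, the sketch below computes the element counts for the input and output tensors, estimates the weight storage under the assumption of 32-bit floats, and turns a 1000-element logit vector into top-k class probabilities. The helper names are hypothetical, and the class indices are assumed to follow the usual ImageNet-1k ordering.

```python
import math

INPUT_SHAPE = (1, 3, 224, 224)   # batch, channels, height, width
OUTPUT_SHAPE = (1, 1000)         # batch, class logits
NUM_PARAMS = 7.82e6              # parameter count listed above


def num_elements(shape):
    """Total number of scalar values in a tensor of the given shape."""
    return math.prod(shape)


def fp32_megabytes(count):
    """Storage in MB for `count` values, assuming 4 bytes per value."""
    return count * 4 / 1e6


def top_k(logits, k=5):
    """Return the k most likely (class_index, probability) pairs."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


if __name__ == "__main__":
    print(num_elements(INPUT_SHAPE))    # 150528 values per input image
    print(fp32_megabytes(NUM_PARAMS))   # about 31 MB of fp32 weights
```

Under the fp32 assumption the weights alone come to roughly 31 MB, which is in the same range as the model size listed above.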
Performance Reference
Please search for the model by name in Model Farm.
Inference & Model Conversion
Please search for the model by name in Model Farm.
License
Source Model: APACHE-2.0
Deployable Model: APLUX-MODEL-FARM-LICENSE