LeViT: Image Classification

LeViT is a hybrid vision model that combines convolutional and Transformer components, introduced by Meta (then Facebook AI Research) in 2021 to improve the trade-off between inference speed and accuracy. It uses a staged architecture: the shallow layers are convolutions that extract local features, while the deeper stages apply lightweight Transformer blocks whose attention captures global context. Training follows DeiT-style distillation, in which a second classification head is supervised by a larger teacher model, improving the accuracy of the compact model. At low computational cost (e.g., LeViT-128 at about 0.4G FLOPs), it achieves competitive accuracy on ImageNet, which makes it suitable for mobile and edge-device deployment in tasks such as real-time image classification, or as a backbone for object detection, where speed must be balanced against resource constraints.
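The distillation step described above can be illustrated with a minimal sketch of DeiT-style hard distillation. It assumes two student heads (a classification head and a distillation head), equal weighting of the two loss terms, and averaging of the two heads at inference time. The function names are hypothetical and are not taken from the source model's code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def cross_entropy(logits, target_index):
    """Cross-entropy of one sample against a single class index."""
    return -log_softmax(logits)[target_index]

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label):
    """Assumed training loss: the classification head is supervised by the
    ground-truth label, the distillation head by the teacher's top-1 class."""
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    loss_cls = cross_entropy(cls_logits, label)
    loss_dist = cross_entropy(dist_logits, teacher_label)
    return 0.5 * loss_cls + 0.5 * loss_dist  # equal weights are an assumption

def predict(cls_logits, dist_logits):
    """Assumed inference rule: average the logits of the two heads."""
    return [(a + b) / 2.0 for a, b in zip(cls_logits, dist_logits)]

# Toy example with 3 classes instead of 1000.
loss = hard_distillation_loss(
    cls_logits=[2.0, 0.0, 0.0],
    dist_logits=[0.0, 2.0, 0.0],
    teacher_logits=[0.1, 3.0, 0.2],
    label=0,
)
averaged = predict([2.0, 0.0, 0.0], [0.0, 2.0, 0.0])
```

Using the teacher's top-1 class as a hard label keeps the extra loss term a plain cross-entropy; a soft (temperature-scaled) variant is also possible but is not shown here.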

Source model

Input shape: 1x3x224x224
Number of parameters: 7.82M
Model size: 30.17 MB
Output shape: 1x1000
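
As a hedged illustration of the shapes listed above, the sketch below post-processes a 1x1000 output into top-k predictions. It assumes the output holds unnormalized logits over 1000 classes (for example ImageNet-1k) and represents tensors as nested Python lists. The helper names are hypothetical, and running the deployable model itself is not shown.

```python
import math

INPUT_SHAPE = (1, 3, 224, 224)   # batch, channels, height, width (from the card)
NUM_CLASSES = 1000               # from the output shape 1x1000

# Number of values the model expects per inference call.
input_elements = math.prod(INPUT_SHAPE)

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(output, k=5):
    """Return the k most probable (class_index, probability) pairs.

    `output` is assumed to be a nested list of shape [1][1000] holding logits.
    """
    if len(output) != 1 or len(output[0]) != NUM_CLASSES:
        raise ValueError("expected an output of shape 1x1000")
    probs = softmax(output[0])
    ranked = sorted(range(NUM_CLASSES), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]

# Dummy logits standing in for a real model output.
dummy_output = [[0.0] * NUM_CLASSES]
dummy_output[0][285] = 10.0
predictions = top_k(dummy_output, k=3)
```

Mapping the returned class indices to human-readable labels requires a label file that matches the source model's training set, which is not part of this sketch.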

The source model can be found here

Performance Reference

Please search for the model by name in Model Farm.

Inference & Model Conversion

Please search for the model by name in Model Farm.

License

Source Model: APACHE-2.0
Deployable Model: APLUX-MODEL-FARM-LICENSE