SigLIP-base: Image Captioning
SigLIP-base is a medium-sized multimodal model developed by Google, built on the Vision Transformer (ViT) architecture and trained with a pairwise sigmoid loss instead of the softmax contrastive loss used in CLIP. Because each image-text pair is scored independently, this training approach improves performance at small batch sizes and is more robust to how negative pairs are sampled. SigLIP-base achieves strong results on tasks such as image-text retrieval and zero-shot image classification, and its solid inference efficiency and scalability make it well suited to multilingual and multitask vision-language applications.
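As a quick illustration, the sketch below runs zero-shot image classification with the Hugging Face transformers implementation of SigLIP. The checkpoint id `google/siglip-base-patch16-384`, the image path, and the candidate labels are assumptions for the example, not part of this card.

```python
# Minimal zero-shot classification sketch (assumes the Hugging Face
# checkpoint "google/siglip-base-patch16-384"; labels and image path
# are illustrative placeholders).
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("google/siglip-base-patch16-384")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-384")

image = Image.open("example.jpg").convert("RGB")  # resized to 384x384 by the processor
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# SigLIP was trained with text padded to a fixed length, so pad to max_length.
inputs = processor(text=labels, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Unlike CLIP's softmax over the batch, SigLIP scores each image-text pair
# independently with a sigmoid, so the probabilities need not sum to 1.
probs = torch.sigmoid(outputs.logits_per_image)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{p:.3f}  {label}")
```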
Source Model
- Input shape: [1x3x384x384], [1x64]
- Number of parameters: 88.86M, 105.16M
- Model size: 359.10 MB, 424.01 MB
- Output shape: [1x768], [1x768]
The source model can be found here. The sketch below shows how the listed input and output shapes are used in practice.
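To make the shapes concrete, this sketch extracts the two [1x768] embeddings separately and scores an image-text pair by cosine similarity, as used for retrieval. It again assumes the Hugging Face checkpoint `google/siglip-base-patch16-384` rather than the converted deployable model, and the image path and query text are placeholders.

```python
# Sketch of separate image/text embedding extraction matching the shapes above
# ([1x3x384x384] image, [1x64] token ids -> [1x768] embedding each).
# Assumes the Hugging Face checkpoint "google/siglip-base-patch16-384".
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "google/siglip-base-patch16-384"
model = AutoModel.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

image_inputs = processor(images=Image.open("query.jpg").convert("RGB"), return_tensors="pt")
text_inputs = processor(text=["a photo of a mountain lake"], padding="max_length", return_tensors="pt")

with torch.no_grad():
    img_emb = model.get_image_features(**image_inputs)  # shape [1, 768]
    txt_emb = model.get_text_features(**text_inputs)    # shape [1, 768]

# L2-normalize and take the dot product (cosine similarity) to rank
# candidates in image-text retrieval.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
similarity = (img_emb @ txt_emb.T).item()
print(f"cosine similarity: {similarity:.3f}")
```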
Performance Reference
Please search for the model by name in Model Farm.
Inference & Model Conversion
Please search for the model by name in Model Farm.
License
Source Model: APACHE-2.0
Deployable Model: APLUX-MODEL-FARM-LICENSE