SigLIP-base: Image Captioning

SigLIP-base is a medium-sized multimodal model developed by Google, built on a ViT-B/16 (Vision Transformer) image encoder and trained with a sigmoid loss in place of the softmax-based contrastive loss used in CLIP. Because the sigmoid loss scores each image-text pair independently, it does not require normalization over the full batch, which improves performance at small batch sizes and makes training more robust to the balance of negative pairs. SigLIP-base achieves strong results in tasks such as image-text retrieval and zero-shot image classification. With solid inference efficiency and scalability, it is well-suited for multilingual and multitask vision-language applications.
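
As an illustration of how the sigmoid scoring is used at inference time, here is a minimal zero-shot classification sketch. It assumes the Hugging Face transformers library and the google/siglip-base-patch16-384 checkpoint (a plausible match for the 384x384 input shape listed below; the exact source checkpoint is not specified in this card):

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Assumed checkpoint; substitute the actual source model if it differs.
CKPT = "google/siglip-base-patch16-384"

model = AutoModel.from_pretrained(CKPT)
processor = AutoProcessor.from_pretrained(CKPT)

image = Image.open("example.jpg")  # placeholder image path
candidate_labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# SigLIP is trained with text padded to a fixed length (64 tokens),
# so pad to max_length rather than to the longest sequence in the batch.
inputs = processor(text=candidate_labels, images=image,
                   padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Unlike CLIP, each image-text score is independent: apply a sigmoid, not a softmax.
probs = torch.sigmoid(outputs.logits_per_image)
for label, p in zip(candidate_labels, probs[0].tolist()):
    print(f"{p:.3f}  {label}")
```

Because the scores are not normalized against each other, the per-label probabilities need not sum to 1, and a single pair can be scored without any other candidates in the batch.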

Source model

  • Input shape: [1x3x384x384], [1x64]
  • Number of parameters: 88.86M, 105.16M
  • Model size: 359.10 MB, 424.01 MB
  • Output shape: [1x768], [1x768]

The source model can be found here.
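
To make the listed input and output shapes concrete, the following sketch (again assuming the transformers library and the same hypothetical checkpoint) runs the image and text encoders separately and checks that each produces a [1, 768] embedding:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

CKPT = "google/siglip-base-patch16-384"  # assumed checkpoint name
model = AutoModel.from_pretrained(CKPT)
processor = AutoProcessor.from_pretrained(CKPT)

# Image branch: [1, 3, 384, 384] pixels -> [1, 768] embedding
image_inputs = processor(images=Image.open("example.jpg"), return_tensors="pt")
with torch.no_grad():
    image_emb = model.get_image_features(**image_inputs)

# Text branch: [1, 64] token ids -> [1, 768] embedding
text_inputs = processor(text=["a photo of a cat"],
                        padding="max_length", return_tensors="pt")
with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)

print(image_emb.shape, text_emb.shape)  # torch.Size([1, 768]) torch.Size([1, 768])
```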

Performance Reference

Please search for the model by name in Model Farm.

Inference & Model Conversion

Please search for the model by name in Model Farm.

License
