SigLIP-base: Image Captioning

SigLIP-base is a medium-sized multimodal model developed by Google, built on a ViT-B/16 (Vision Transformer) image encoder and trained with a sigmoid loss in place of the softmax-based contrastive loss used in CLIP. Because the sigmoid loss scores each image-text pair independently, it does not require normalization over the full batch, which improves performance at small batch sizes and makes training more robust to the balance of negative pairs. SigLIP-base achieves strong results in tasks such as image-text retrieval and zero-shot image classification. With solid inference efficiency and scalability, it is well-suited for multilingual and multitask vision-language applications.
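
As an illustration of how the sigmoid scoring is used at inference time, here is a minimal zero-shot classification sketch. It assumes the Hugging Face transformers library and the google/siglip-base-patch16-384 checkpoint (a plausible match for the 384x384 input shape listed below; the exact source checkpoint is not specified in this card):

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Assumed checkpoint; substitute the actual source model if it differs.
CKPT = "google/siglip-base-patch16-384"

model = AutoModel.from_pretrained(CKPT)
processor = AutoProcessor.from_pretrained(CKPT)

image = Image.open("example.jpg")  # placeholder image path
candidate_labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# SigLIP is trained with text padded to a fixed length (64 tokens),
# so pad to max_length rather than to the longest sequence in the batch.
inputs = processor(text=candidate_labels, images=image,
                   padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Unlike CLIP, each image-text score is independent: apply a sigmoid, not a softmax.
probs = torch.sigmoid(outputs.logits_per_image)
for label, p in zip(candidate_labels, probs[0].tolist()):
    print(f"{p:.3f}  {label}")
```

Because the scores are not normalized against each other, the per-label probabilities need not sum to 1, and a single pair can be scored without any other candidates in the batch.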

Source model

  • Input shape: [1x3x384x384], [1x64]
  • Number of parameters: 88.86M, 105.16M
  • Model size: 359.10 MB, 424.01 MB
  • Output shape: [1x768], [1x768]

The source model can be found here.
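
To make the listed input and output shapes concrete, the following sketch (again assuming the transformers library and the same hypothetical checkpoint) runs the image and text encoders separately and checks that each produces a [1, 768] embedding:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

CKPT = "google/siglip-base-patch16-384"  # assumed checkpoint name
model = AutoModel.from_pretrained(CKPT)
processor = AutoProcessor.from_pretrained(CKPT)

# Image branch: [1, 3, 384, 384] pixels -> [1, 768] embedding
image_inputs = processor(images=Image.open("example.jpg"), return_tensors="pt")
with torch.no_grad():
    image_emb = model.get_image_features(**image_inputs)

# Text branch: [1, 64] token ids -> [1, 768] embedding
text_inputs = processor(text=["a photo of a cat"],
                        padding="max_length", return_tensors="pt")
with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)

print(image_emb.shape, text_emb.shape)  # torch.Size([1, 768]) torch.Size([1, 768])
```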

Performance Reference

Please search for the model by name in Model Farm.

Inference & Model Conversion

Please search for the model by name in Model Farm.

License
