nielsr (HF staff) committed
Commit 6cffdc6 · 1 Parent(s): 415aa2d

Update README.md

Files changed (1)
  1. README.md +5 -5
README.md CHANGED
@@ -4,13 +4,13 @@ license: apache-2.0

# SigLIP (base-sized model)

- SigLIP model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 224x224. It was introduced in the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Dosovitskiy et al. and first released in [this repository](https://github.com/google-research/vision_transformer). However, the weights were converted from the [timm repository](https://github.com/rwightman/pytorch-image-models) by Ross Wightman, who already converted the weights from JAX to PyTorch. Credits go to him.
+ SigLIP model pre-trained on WebLi at resolution 256x256. It was introduced in the paper [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) by Zhai et al. and first released in [this repository](https://github.com/google-research/big_vision).

- Disclaimer: The team releasing ViT did not write a model card for this model so this model card has been written by the Hugging Face team.
+ Disclaimer: The team releasing SigLIP did not write a model card for this model so this model card has been written by the Hugging Face team.

## Model description

- SigLIP is [CLIP](https://huggingface.co/docs/transformers/model_doc/clip) with a better loss function. The sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. This allows further scaling up the batch size, while also performing better at smaller batch sizes.
+ SigLIP is [CLIP](https://huggingface.co/docs/transformers/model_doc/clip), a multimodal model, with a better loss function. The sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. This allows further scaling up the batch size, while also performing better at smaller batch sizes.

## Intended uses & limitations

@@ -27,8 +27,8 @@ import requests
from transformers import AutoProcessor, AutoModel
import torch

- model = AutoModel.from_pretrained("nielsr/siglip-base-patch16-224")
- processor = AutoProcessor.from_pretrained("nielsr/siglip-base-patch16-224")
+ model = AutoModel.from_pretrained("google/siglip-base-patch16-256")
+ processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-256")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
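The usage snippet shown in the diff stops right after the image is downloaded. For context, a minimal sketch of how such a zero-shot classification example typically continues with the `transformers` SigLIP integration; the candidate captions, the `padding="max_length"` setting, and the final sigmoid step are assumptions based on the library's documented SigLIP usage, not part of this commit:

```python
from PIL import Image
import requests
import torch
from transformers import AutoProcessor, AutoModel

# checkpoint name taken from the updated README
model = AutoModel.from_pretrained("google/siglip-base-patch16-256")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-256")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# illustrative candidate captions; SigLIP checkpoints expect "max_length" padding
texts = ["a photo of 2 cats", "a photo of 2 dogs"]
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# unlike CLIP, probabilities come from an element-wise sigmoid, not a softmax over texts
logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)
print(f"{probs[0][0]:.1%} that image 0 is '{texts[0]}'")
```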
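The updated model description says the sigmoid loss "operates solely on image-text pairs" and needs no batch-wide normalization. A minimal PyTorch sketch of that idea follows; it is an illustration of the pairwise loss described in the paper, not the actual big_vision or transformers implementation, and the function name, variable names, and the learnable scalars `t` and `b` are assumptions:

```python
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(image_emb, text_emb, t, b):
    """Pairwise sigmoid loss sketch: each (image, text) pair in the batch is an
    independent binary classification problem, so no softmax over the full
    similarity matrix is required."""
    # image_emb, text_emb: (N, D) L2-normalized embeddings; t, b: learnable scalars
    logits = image_emb @ text_emb.t() * t + b
    # +1 on the diagonal (matching pairs), -1 everywhere else (non-matching pairs)
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)
```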