huggingface/HuggingDiscussions · [ICML 2025] Exploring CLIP Geometry: Likelihood & Conformity (Demo Inside)

Hi everyone 👋
We’re excited to share two ICML 2025 papers exploring the geometry of CLIP’s latent space. To make the ideas tangible, we built a Hugging Face demo that lets you explore them interactively 🚀.

📄 The Double Ellipsoid Geometry of CLIP (https://huggingface.co/papers/2411.14517)
Shows that CLIP’s image and text embeddings do not live on a single hypersphere, but rather in a distinct double ellipsoid structure.
We introduce a conformity measure that reveals how common an image or caption is (mathematically proven!).
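
For readers who want a feel for the idea, here is a rough sketch. This post does not give the formula, so as an assumption for illustration only, the sketch treats conformity as the average cosine similarity between one embedding and a reference set of embeddings. The function name `conformity` and the array shapes are hypothetical and are not taken from the paper or the demo code.

```python
import numpy as np

def conformity(x, reference):
    """Average cosine similarity between x (shape (d,)) and each row of
    reference (shape (n, d)). Assumed definition, for illustration only."""
    x = x / np.linalg.norm(x)
    ref = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    # ref @ x gives one cosine similarity per reference row
    return float(np.mean(ref @ x))

# Toy usage with random vectors standing in for CLIP embeddings
rng = np.random.default_rng(0)
reference = rng.normal(size=(1000, 512))
query = rng.normal(size=512)
score = conformity(query, reference)
```

Under this reading, a higher score means the sample points in a direction shared by many other samples, i.e. it is more "common".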

📄 Whitened CLIP as a Likelihood Surrogate of Images and Captions (https://huggingface.co/papers/2505.06934)
By applying a simple linear transformation (whitening) to CLIP's latent space, one can measure the likelihood of a sample simply from its norm (mathematically motivated).
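
To make the whitening idea concrete, here is a minimal sketch, not the paper's implementation. It assumes the embeddings are rows of a NumPy array, uses PCA whitening from the sample covariance, and scores a sample with a standard Gaussian log-density in the whitened space, which depends only on the squared norm. The names `fit_whitening` and `log_likelihood` and the `eps` parameter are hypothetical.

```python
import numpy as np

def fit_whitening(embeddings, eps=1e-6):
    """Estimate the mean and a whitening matrix W from an (n, d) array."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    cov = centered.T @ centered / (len(embeddings) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # PCA whitening: rescale each principal direction to unit variance
    W = eigvecs / np.sqrt(eigvals + eps)
    return mean, W

def log_likelihood(x, mean, W):
    """Standard Gaussian log-density of the whitened sample (assumed model)."""
    z = (x - mean) @ W
    d = z.shape[-1]
    return -0.5 * (np.sum(z ** 2, axis=-1) + d * np.log(2 * np.pi))

# Toy usage with random vectors standing in for CLIP embeddings
rng = np.random.default_rng(0)
train = rng.normal(size=(2000, 16)) * np.linspace(0.5, 2.0, 16)
mean, W = fit_whitening(train)
scores = log_likelihood(train[:5], mean, W)
```

In this sketch a smaller whitened norm gives a higher log-likelihood, so ranking samples by likelihood is the same as ranking them by norm after whitening.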

🙌 Try it out
https://huggingface.co/spaces/Yossilevii100/CLIPLatent
Upload an image or text → see its conformity and likelihood scores.
Try comparing the likelihood of your image with a friend's and see which one CLIP considers more likely.

🌟 Why it matters
Provides a geometric perspective on CLIP embeddings.
Bridges theory (double ellipsoids) with practice: reveals which samples are likely and which are not, which may be a source of image synthesis failure cases.
Opens the door to using CLIP embeddings as likelihood estimators and for robustness analysis.

We’d love your feedback, ideas, and extensions!