gheinrich committed on
Commit
1c5979a
·
verified ·
1 Parent(s): 9853756

Update README.md

Files changed (1)
  1. README.md +35 -2
README.md CHANGED
@@ -4,7 +4,6 @@ license_name: nvidia-open-model-license
4
  license_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
5
  ---
6
 
7
-
8
  # Model Overview
9
 
10
  ## Description
@@ -68,6 +67,41 @@ Huggingface: 03/26/2025 via [RADIO Collection of Models](https://huggingface.co/
68
  **Output Parameters:** 2D <br>
69
  **Other Properties Related to Output:** Downstream model required to leverage image features <br>
70
 
71
  ## Software Integration
72
 
73
  **Runtime Engine(s):**
@@ -192,4 +226,3 @@ Model Application(s): | Generation of visual embe
192
  Describe the life critical impact (if present). | Not Applicable
193
  Use Case Restrictions: | Abide by NVIDIA Open Model License Agreement
194
  Model and dataset restrictions: | The principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to.
195
-
 
4
  license_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
5
  ---
6
 
 
7
  # Model Overview
8
 
9
  ## Description
 
67
  **Output Parameters:** 2D <br>
68
  **Other Properties Related to Output:** Downstream model required to leverage image features <br>
69
 
70
+ ## Usage
71
+
72
+ RADIO returns a tuple of two tensors: `summary` and `spatial_features`.
73
+ The `summary` is similar to the `cls_token` in ViT and is meant to represent the general concept of the entire image.
74
+ It has shape `(B,C)` with `B` being the batch dimension, and `C` being some number of channels.
75
+ The `spatial_features` (assigned to `features` in the example below) represent more localized content, which should be suitable for dense tasks such as semantic segmentation or for integration into an LLM.
76
+
77
+ ```python
78
+ import torch
79
+ from PIL import Image
80
+ from transformers import AutoModel, CLIPImageProcessor
81
+
82
+ hf_repo = "nvidia/C-RADIOv2-B"
83
+
84
+ image_processor = CLIPImageProcessor.from_pretrained(hf_repo)
85
+ model = AutoModel.from_pretrained(hf_repo, trust_remote_code=True)
86
+ model.eval().cuda()
87
+
88
+ image = Image.open('./assets/radio.png').convert('RGB')
89
+ pixel_values = image_processor(images=image, return_tensors='pt', do_resize=True).pixel_values
90
+ pixel_values = pixel_values.cuda()
91
+
92
+ summary, features = model(pixel_values)
93
+ ```
94
+
95
+ Spatial features have shape `(B,T,D)`, with `T` being the number of flattened spatial tokens and `D` being the number of channels for spatial features. Note that `C!=D` in general.
96
+ Converting to a spatial tensor format can be done using the downsampling size of the model, combined with the input tensor shape. For RADIO, the patch size is 16.
97
+
98
+ ```python
99
+ from einops import rearrange
100
+ patch_size = 16  # RADIO patch size, as stated above
+ spatial_features = rearrange(features, 'b (h w) d -> b d h w', h=pixel_values.shape[-2] // patch_size, w=pixel_values.shape[-1] // patch_size)
101
+ ```
102
+
103
+ The resulting tensor will have shape `(B,D,H,W)`, as is typically seen with computer vision models.
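To make the shape bookkeeping above concrete, here is a minimal sketch of the arithmetic in plain Python. The helper `radio_spatial_shapes` is hypothetical (it is not part of RADIO or `transformers`), and the 432x768 input size and `D = 768` are assumed values chosen only for illustration. It mirrors the floor division by the patch size used in the `rearrange` call.

```python
def radio_spatial_shapes(batch, height, width, channels, patch_size=16):
    """Return the (B, T, D) and (B, D, H, W) shapes for a given input size.

    Hypothetical helper for illustration only; not part of RADIO.
    """
    h = height // patch_size  # number of patch rows
    w = width // patch_size   # number of patch columns
    token_shape = (batch, h * w, channels)  # (B, T, D), with T = h * w
    grid_shape = (batch, channels, h, w)    # (B, D, H, W) after the rearrange
    return token_shape, grid_shape


# Assumed example: one 432x768 image and D = 768 spatial channels.
tokens, grid = radio_spatial_shapes(batch=1, height=432, width=768, channels=768)
print(tokens)  # (1, 1296, 768)
print(grid)    # (1, 768, 27, 48)
```

Comparing these expected shapes against `features.shape` and `spatial_features.shape` is a quick way to confirm that the `h` and `w` passed to `rearrange` match the input resolution.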
104
+
105
  ## Software Integration
106
 
107
  **Runtime Engine(s):**
 
226
  Describe the life critical impact (if present). | Not Applicable
227
  Use Case Restrictions: | Abide by NVIDIA Open Model License Agreement
228
  Model and dataset restrictions: | The principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to.