---

comments: true
description: Explore and analyze CV datasets with Ultralytics Explorer API, offering SQL, vector similarity, and semantic searches for efficient dataset insights.
keywords: Ultralytics Explorer API, Dataset Exploration, SQL Queries, Vector Similarity Search, Semantic Search, Embeddings Table, Image Similarity, Python API for Datasets, CV Dataset Analysis, LanceDB Integration
---


# Ultralytics Explorer API

## Introduction

<a href="https://colab.research.google.com/github/ultralytics/ultralytics/blob/main/docs/en/datasets/explorer/explorer.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>
The Explorer API is a Python API for exploring your datasets. It supports filtering and searching your dataset using SQL queries, vector similarity search and semantic search.

<p align="center">
  <br>
  <iframe loading="lazy" width="720" height="405" src="https://www.youtube.com/embed/3VryynorQeo?start=279"
    title="YouTube video player" frameborder="0"
    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
    allowfullscreen>
  </iframe>
  <br>
  <strong>Watch:</strong> Ultralytics Explorer API Overview
</p>

## Installation

Explorer depends on external libraries for some of its functionality. These are automatically installed on usage. To manually install these dependencies, use the following command:

```bash
pip install ultralytics[explorer]
```

## Usage

```python
from ultralytics import Explorer

# Create an Explorer object
explorer = Explorer(data='coco128.yaml', model='yolov8n.pt')

# Create embeddings for your dataset
explorer.create_embeddings_table()

# Search for similar images to a given image/images
dataframe = explorer.get_similar(img='path/to/image.jpg')

# Or search for similar images to a given index/indices
dataframe = explorer.get_similar(idx=0)
```

!!! Tip "Note"

    The embeddings table for a given dataset and model pair is created only once and then reused. It uses [LanceDB](https://lancedb.github.io/lancedb/) under the hood, which scales on-disk, so you can create and reuse embeddings for large datasets like COCO without running out of memory.


To force an update of the embeddings table, pass `force=True` to the `create_embeddings_table` method.

You can also access the LanceDB table object directly to perform advanced analysis. Learn more in the [Working with Embeddings Table](#4-advanced---working-with-embeddings-table) section.
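
The create-once-and-reuse behavior can be pictured with a small sketch. This is not the Explorer implementation: `get_or_build_table`, `build_embeddings`, the `_tables` dict, and the `(data, model)` cache key are hypothetical names chosen for illustration, and a plain dict stands in for the on-disk LanceDB storage.

```python
# Hypothetical stand-in for on-disk storage: maps (data, model) -> embeddings table.
_tables = {}
build_calls = 0  # counts how often the expensive step actually runs


def build_embeddings(data, model):
    """Pretend to embed a dataset; in practice this is the expensive step."""
    global build_calls
    build_calls += 1
    return {"data": data, "model": model, "rows": []}


def get_or_build_table(data, model, force=False):
    """Return the cached table for (data, model), rebuilding only when forced."""
    key = (data, model)
    if force or key not in _tables:
        _tables[key] = build_embeddings(data, model)
    return _tables[key]


first = get_or_build_table("coco128.yaml", "yolov8n.pt")
second = get_or_build_table("coco128.yaml", "yolov8n.pt")  # reused, no rebuild
third = get_or_build_table("coco128.yaml", "yolov8n.pt", force=True)  # rebuilt
print(build_calls)
```

The second call returns the same object as the first, while `force=True` triggers a rebuild, which mirrors how `force=True` is described for `create_embeddings_table`.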

## 1. Similarity Search

Similarity search is a technique for finding images similar to a given image. It is based on the idea that similar images have similar embeddings. Once the embeddings table is built, you can run semantic search in any of the following ways:

- On a given index or list of indices in the dataset: `exp.get_similar(idx=[1,10], limit=10)`
- On any image or list of images not in the dataset: `exp.get_similar(img=["path/to/img1", "path/to/img2"], limit=10)`

In case of multiple inputs, the aggregate of their embeddings is used.

You get a pandas dataframe with the `limit` most similar data points to the input, along with their distance in the embedding space. You can use this dataframe to perform further filtering.
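
To make the idea concrete, here is a minimal pure-Python sketch of similarity search over toy 2D embeddings. It assumes the aggregate of multiple inputs is their element-wise mean and that distance is Euclidean; both are assumptions for illustration, and `mean_vector`, `l2`, and `most_similar` are hypothetical helpers, not part of the Explorer API.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors (assumed aggregation)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def l2(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(query_vectors, dataset, limit=10):
    """Return (index, distance) pairs for the `limit` closest dataset vectors."""
    query = mean_vector(query_vectors)
    ranked = sorted((l2(query, vec), idx) for idx, vec in enumerate(dataset))
    return [(idx, dist) for dist, idx in ranked[:limit]]


# Toy "embeddings table" with four images
toy_dataset = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

# Two query embeddings are averaged into a single query vector
result = most_similar([[1.0, 0.0], [1.0, 0.2]], toy_dataset, limit=2)
print(result)
```

The returned pairs play the role of the rows in the dataframe: the closest items first, each with its distance to the query.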

!!! Example "Semantic Search"

    === "Using Images"


        ```python
        from ultralytics import Explorer

        # create an Explorer object
        exp = Explorer(data='coco128.yaml', model='yolov8n.pt')
        exp.create_embeddings_table()

        similar = exp.get_similar(img='https://ultralytics.com/images/bus.jpg', limit=10)
        print(similar.head())

        # Search using multiple images
        similar = exp.get_similar(
            img=['https://ultralytics.com/images/bus.jpg',
                 'https://ultralytics.com/images/bus.jpg'],
            limit=10,
        )
        print(similar.head())
        ```


    === "Using Dataset Indices"


        ```python
        from ultralytics import Explorer

        # create an Explorer object
        exp = Explorer(data='coco128.yaml', model='yolov8n.pt')
        exp.create_embeddings_table()

        similar = exp.get_similar(idx=1, limit=10)
        print(similar.head())

        # Search using multiple indices
        similar = exp.get_similar(idx=[1, 10], limit=10)
        print(similar.head())
        ```


### Plotting Similar Images

You can also plot the similar images using the `plot_similar` method. This method takes the same arguments as `get_similar` and plots the similar images in a grid.

!!! Example "Plotting Similar Images"

    === "Using Images"


        ```python
        from ultralytics import Explorer

        # create an Explorer object
        exp = Explorer(data='coco128.yaml', model='yolov8n.pt')
        exp.create_embeddings_table()

        plt = exp.plot_similar(img='https://ultralytics.com/images/bus.jpg', limit=10)
        plt.show()
        ```


    === "Using Dataset Indices"


        ```python
        from ultralytics import Explorer

        # create an Explorer object
        exp = Explorer(data='coco128.yaml', model='yolov8n.pt')
        exp.create_embeddings_table()

        plt = exp.plot_similar(idx=1, limit=10)
        plt.show()
        ```


## 2. Ask AI (Natural Language Querying)

This allows you to describe in natural language how you want to filter your dataset, so you don't have to be proficient in writing SQL queries. Our AI-powered query generator does that automatically under the hood. For example, you can say "show me 100 images with exactly one person and 2 dogs. There can be other objects too" and it will internally generate the query and show you the results.

Note: This uses LLMs under the hood, so the results are probabilistic and may sometimes be wrong.

!!! Example "Ask AI"

    ```python
    from ultralytics import Explorer
    from ultralytics.data.explorer import plot_query_result

    # create an Explorer object
    exp = Explorer(data='coco128.yaml', model='yolov8n.pt')
    exp.create_embeddings_table()

    df = exp.ask_ai("show me 100 images with exactly one person and 2 dogs. There can be other objects too")
    print(df.head())

    # plot the results
    plt = plot_query_result(df)
    plt.show()
    ```


## 3. SQL Querying

You can run SQL queries on your dataset using the `sql_query` method. This method takes a SQL query as input and returns a pandas dataframe with the results.

!!! Example "SQL Query"

    ```python
    from ultralytics import Explorer

    # create an Explorer object
    exp = Explorer(data='coco128.yaml', model='yolov8n.pt')
    exp.create_embeddings_table()

    df = exp.sql_query("WHERE labels LIKE '%person%' AND labels LIKE '%dog%'")
    print(df.head())
    ```
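
To get a feel for what a `WHERE ... LIKE` filter of this shape selects, here is a stand-in that uses Python's built-in `sqlite3` instead of the Explorer table. The `images` table, its two columns, and the comma-joined `labels` strings are assumptions made up for this illustration; the real table is stored in LanceDB and its schema may differ.

```python
import sqlite3

# In-memory toy table standing in for the embeddings table (hypothetical schema)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (im_file TEXT, labels TEXT)")
conn.executemany(
    "INSERT INTO images VALUES (?, ?)",
    [
        ("a.jpg", "person,dog"),
        ("b.jpg", "person,car"),
        ("c.jpg", "dog,frisbee,person"),
        ("d.jpg", "cat"),
    ],
)

# Keep only rows whose labels mention both 'person' and 'dog'
rows = conn.execute(
    "SELECT im_file FROM images "
    "WHERE labels LIKE '%person%' AND labels LIKE '%dog%' "
    "ORDER BY im_file"
).fetchall()
matches = [row[0] for row in rows]
print(matches)
conn.close()
```

Only the images that contain both labels survive the filter; images with just one of the two labels are dropped.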


### Plotting SQL Query Results

You can also plot the results of a SQL query using the `plot_sql_query` method. This method takes the same arguments as `sql_query` and plots the results in a grid.

!!! Example "Plotting SQL Query Results"

    ```python
    from ultralytics import Explorer

    # create an Explorer object
    exp = Explorer(data='coco128.yaml', model='yolov8n.pt')
    exp.create_embeddings_table()

    # plot the SQL Query
    exp.plot_sql_query("WHERE labels LIKE '%person%' AND labels LIKE '%dog%' LIMIT 10")
    ```


## 4. Advanced - Working with Embeddings Table

You can also work with the embeddings table directly. Once the embeddings table is created, you can access it through the `Explorer.table` attribute.

!!! Tip "Explorer works on [LanceDB](https://lancedb.github.io/lancedb/) tables internally. You can access this table directly, using `Explorer.table` object and run raw queries, push down pre- and post-filters, etc."

    ```python
    from ultralytics import Explorer

    exp = Explorer()
    exp.create_embeddings_table()
    table = exp.table
    ```


Here are some examples of what you can do with the table:

### Get raw Embeddings

!!! Example

    ```python
    from ultralytics import Explorer

    exp = Explorer()
    exp.create_embeddings_table()
    table = exp.table

    embeddings = table.to_pandas()["vector"]
    print(embeddings)
    ```


### Advanced Querying with pre- and post-filters

!!! Example

    ```python
    from ultralytics import Explorer

    exp = Explorer(model="yolov8n.pt")
    exp.create_embeddings_table()
    table = exp.table

    # Dummy embedding
    embedding = list(range(256))

    # Replace the empty string in where() with a SQL filter on the table's columns
    rs = table.search(embedding).metric("cosine").where("").limit(10)
    ```


### Create Vector Index

When using large datasets, you can also create a dedicated vector index for faster querying. This is done using the `create_index` method on the LanceDB table.

```python
table.create_index(num_partitions=..., num_sub_vectors=...)
```

Find more details on the types of vector indices available and their parameters [here](https://lancedb.github.io/lancedb/ann_indexes/#types-of-index). In the future, we will add support for creating vector indices directly from the Explorer API.
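
As a rough intuition for `num_sub_vectors`, assuming the index is of the IVF-PQ family (an assumption here; check the LanceDB documentation for the index you actually create): product quantization splits each vector into that many equally sized chunks and compresses each chunk separately, so the embedding dimension should divide evenly. The helper `split_into_sub_vectors` below is hypothetical and only illustrates the split, not the compression.

```python
def split_into_sub_vectors(vector, num_sub_vectors):
    """Split a vector into `num_sub_vectors` equally sized chunks."""
    dim = len(vector)
    if dim % num_sub_vectors != 0:
        raise ValueError(
            f"dimension {dim} is not divisible by num_sub_vectors={num_sub_vectors}"
        )
    size = dim // num_sub_vectors
    return [vector[i:i + size] for i in range(0, dim, size)]


# A 256-dimensional dummy embedding split into 16 chunks of 16 values each
chunks = split_into_sub_vectors(list(range(256)), 16)
print(len(chunks), len(chunks[0]))
```

More sub-vectors generally means finer-grained compression at the cost of a larger index, so the value is a trade-off rather than a fixed rule.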

## 5. Embeddings Applications

You can use the embeddings table to perform a variety of exploratory analyses. Here are some examples:

### Similarity Index

Explorer comes with a `similarity_index` operation:

- It tries to estimate how similar each data point is to the rest of the dataset.
- It does that by counting how many image embeddings lie closer than `max_dist` to the current image in the generated embedding space, considering `top_k` similar images at a time.

It returns a pandas dataframe with the following columns:

- `idx`: Index of the image in the dataset
- `im_file`: Path to the image file
- `count`: Number of images in the dataset that are closer than `max_dist` to the current image
- `sim_im_files`: List of paths to the `count` similar images
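
The counting step described above can be sketched in pure Python on toy embeddings. This is an illustration under assumptions, not the Explorer implementation: the function `toy_similarity_index` is hypothetical, it uses Euclidean distance, it excludes each image from its own neighbour list, and it treats `top_k` as an integer cap on the neighbours considered.

```python
import math


def toy_similarity_index(embeddings, max_dist=0.2, top_k=None):
    """For each embedding, count neighbours that lie within `max_dist`."""
    rows = []
    for idx, query in enumerate(embeddings):
        # Distances to every other embedding, closest first
        neighbours = sorted(
            (math.dist(query, other), j)
            for j, other in enumerate(embeddings)
            if j != idx
        )
        if top_k is not None:
            neighbours = neighbours[:top_k]  # only consider the closest top_k
        close = [j for dist, j in neighbours if dist <= max_dist]
        rows.append({"idx": idx, "count": len(close), "sim_idx": close})
    return rows


# Three tightly clustered points and one outlier
toy_embeddings = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [3.0, 3.0]]
rows = toy_similarity_index(toy_embeddings, max_dist=0.2)
counts = [row["count"] for row in rows]
print(counts)
```

Images with a count of 0, like the last one here, have no close neighbours under the chosen `max_dist`, while high counts point to dense clusters of near-duplicates.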

!!! Tip

    For a given dataset, model, `max_dist`, and `top_k`, the similarity index is reused once generated. If your dataset has changed, or you simply need to regenerate the similarity index, pass `force=True`.


!!! Example "Similarity Index"

    ```python
    from ultralytics import Explorer

    exp = Explorer()
    exp.create_embeddings_table()

    sim_idx = exp.similarity_index()
    ```


You can use the similarity index to build custom conditions for filtering the dataset. For example, you can select the images that have more than 30 similar images in the dataset using the following code:

```python
import numpy as np

sim_count = np.array(sim_idx["count"])
sim_idx['im_file'][sim_count > 30]
```

### Visualize Embedding Space

You can also visualize the embedding space using the plotting tool of your choice. Here is a simple example using matplotlib:

```python
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection on older Matplotlib)
from sklearn.decomposition import PCA

# `embeddings` is the pandas Series of vectors from the "Get raw Embeddings" example.
# Stack it into a 2D array of shape (n_images, embedding_dim) for scikit-learn.
embedding_matrix = np.stack(embeddings.to_list())

# Reduce dimensions using PCA to 3 components for visualization in 3D
pca = PCA(n_components=3)
reduced_data = pca.fit_transform(embedding_matrix)

# Create a 3D scatter plot using Matplotlib Axes3D
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')

# Scatter plot
ax.scatter(reduced_data[:, 0], reduced_data[:, 1], reduced_data[:, 2], alpha=0.5)
ax.set_title('3D Scatter Plot of Reduced 256-Dimensional Data (PCA)')
ax.set_xlabel('Component 1')
ax.set_ylabel('Component 2')
ax.set_zlabel('Component 3')

plt.show()
```

Start creating your own CV dataset exploration reports using the Explorer API. For inspiration, check out the

## Apps Built Using Ultralytics Explorer

Try our GUI Demo based on Explorer API

## Coming Soon

- [ ] Merge specific labels from datasets. Example - Import all `person` labels from COCO and `car` labels from Cityscapes
- [ ] Remove images that have a higher similarity index than the given threshold
- [ ] Automatically persist new datasets after merging/removing entries
- [ ] Advanced Dataset Visualizations