---
license: agpl-3.0
sdk: gradio
---
# Vector Search Demo App

This is a Gradio web application that demonstrates vector search capabilities using MongoDB Atlas and OpenAI embeddings.

## Prerequisites

1. MongoDB Atlas account with vector search enabled
2. OpenAI API key
3. Python 3.8+
4. Sample movie data loaded in MongoDB Atlas (sample_mflix database)

## Setup

1. Clone this repository

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. Set up environment variables:
```bash
export OPENAI_API_KEY="your-openai-api-key"
export ATLAS_URI="your-mongodb-atlas-connection-string"
```

4. Ensure your MongoDB Atlas setup:
   - Database name: sample_mflix
   - Collection: embedded_movies
   - Vector search index: idx_plot_embedding
   - Index configuration:
   ```json
   {
     "fields": [
       {
         "type": "vector",
         "path": "plot_embedding",
         "numDimensions": 1536,
         "similarity": "dotProduct"
       }
     ]
   }
   ```
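The values from steps 3 and 4 can be checked once at startup so that a missing variable fails early with a clear message. The following is a minimal sketch, not the code in `app.py`: the function `load_settings` and the dictionary keys it returns are hypothetical names chosen for illustration, while the variable names and Atlas identifiers come from the steps above.

```python
import os

# Environment variables named in step 3 of the setup.
REQUIRED_VARS = ("OPENAI_API_KEY", "ATLAS_URI")


def load_settings(env=os.environ):
    """Return the settings the app needs, or raise if a variable is missing.

    Hypothetical helper: the real app may read its configuration differently.
    """
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "openai_api_key": env["OPENAI_API_KEY"],
        "atlas_uri": env["ATLAS_URI"],
        # Atlas identifiers from step 4 of the setup.
        "database": "sample_mflix",
        "collection": "embedded_movies",
        "index_name": "idx_plot_embedding",
        "num_dimensions": 1536,
    }
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise in tests.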

## Running the App

Start the application:
```bash
python app.py
```

The app will be available at http://localhost:7860

## Usage

### Generating Embeddings
1. Select your database and collection from the dropdowns
2. Choose the field to generate embeddings for
3. Specify the embedding field name (defaults to "embedding")
4. Set a document limit (0 for all documents)
5. Click "Generate Embeddings" to start processing

The app uses memory-efficient cursor-based batch processing that can handle large collections:
- Documents are processed in batches (default 20 documents per batch)
- Memory usage is optimized through cursor-based iteration
- Real-time progress tracking shows completed/total documents
- Supports processing of large collections (100,000+ documents)
- Automatically resumes from where it left off if embeddings already exist
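
The batching and resume behavior described above can be sketched as follows. This is an illustration under assumptions rather than the app's actual implementation: `resume_filter` and `iter_batches` are hypothetical helper names, and the resume logic is assumed to work by skipping documents that already contain the embedding field.

```python
from itertools import islice


def resume_filter(embedding_field="embedding"):
    """MongoDB query that matches only documents without an embedding yet."""
    return {embedding_field: {"$exists": False}}


def iter_batches(cursor, batch_size=20):
    """Yield lists of up to batch_size documents from any iterable.

    Only one batch is held in memory at a time, so the full result set
    is never materialized.
    """
    iterator = iter(cursor)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

With PyMongo, a call such as `iter_batches(collection.find(resume_filter("plot_embedding")), 20)` would stream the remaining documents 20 at a time, since `find` returns a cursor that is itself an iterable.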

### Searching
1. Enter a natural language query in the text box (e.g., "humans fighting aliens")
2. Click "Submit" to search
3. View the results showing matching documents with their similarity scores
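
Behind the search box, a query of this kind is typically embedded and then passed to an Atlas `$vectorSearch` aggregation stage. The sketch below only builds the pipeline. The function name `build_search_pipeline`, the `numCandidates` value of 100, and the projected `title` and `plot` fields are assumptions for illustration; the index name, path, and limit come from this README.

```python
def build_search_pipeline(query_vector, index_name="idx_plot_embedding",
                          path="plot_embedding", limit=5, num_candidates=100):
    """Build an aggregation pipeline for an Atlas Vector Search query.

    query_vector is the embedding of the user's query text (a list of floats).
    num_candidates is an assumed value; it must be at least as large as limit.
    """
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": query_vector,
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {
            # Return a few readable fields plus the similarity score.
            "$project": {
                "_id": 0,
                "title": 1,
                "plot": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
```

The resulting list can be passed to `collection.aggregate(...)` on the `embedded_movies` collection.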

## Example Queries

- "humans fighting aliens"
- "relationship drama between two good friends"
- "comedy about family vacation"
- "detective solving mysterious murder"

## Performance Notes

The application is optimized for handling large datasets:
- Uses cursor-based batch processing to avoid memory issues
- Processes documents in configurable batch sizes (default: 20)
- Implements parallel processing with ThreadPoolExecutor
- Provides real-time progress tracking
- Automatically handles memory cleanup during processing
- Supports resuming interrupted operations
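
One way to combine batching with `ThreadPoolExecutor` is shown below. It is a sketch under assumptions, not the app's code: `embed_batch`, `process_batches`, the `text` key, and the `max_workers` value are hypothetical, and `embed_fn` stands in for the call to the embedding API.

```python
from concurrent.futures import ThreadPoolExecutor


def embed_batch(batch, embed_fn):
    """Embed every document in one batch and return (id, vector) pairs."""
    return [(doc["_id"], embed_fn(doc["text"])) for doc in batch]


def process_batches(batches, embed_fn, max_workers=4):
    """Embed batches concurrently and yield (id, vector) pairs in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input batches.
        for result in pool.map(lambda b: embed_batch(b, embed_fn), batches):
            yield from result
```

Threads suit this workload because the time is spent waiting on network calls. Note that `pool.map` submits every batch up front, so for very large collections the batches would need to be fed in bounded groups to keep memory usage flat.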

## Notes

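- The search uses OpenAI's text-embedding-ada-002 model to create embeddings; the model returns 1536-dimensional vectors, which matches `numDimensions` in the index configuration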
- Results are limited to top 5 matches
- Similarity scores range from 0 to 1, with higher scores indicating better matches
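
The 0 to 1 range follows from how Atlas Vector Search reports scores for the `dotProduct` similarity: to the best of my understanding of the Atlas documentation, the raw dot product of two unit-length vectors (which lies between -1 and 1) is normalized as `(1 + dot) / 2`. The sketch below reproduces that arithmetic locally; the function names are hypothetical and the formula should be checked against the current Atlas documentation.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalized_score(a, b):
    """Map a dot product in [-1, 1] to a score in [0, 1].

    Assumes both vectors are unit length, as dotProduct similarity requires.
    """
    return (1 + dot(a, b)) / 2
```

Under this formula, identical unit vectors score 1.0, orthogonal vectors score 0.5, and opposite vectors score 0.0.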