---
license: agpl-3.0
sdk: gradio
---

# Vector Search Demo App

This is a Gradio web application that demonstrates vector search capabilities using MongoDB Atlas and OpenAI embeddings.

## Prerequisites

1. MongoDB Atlas account with vector search enabled
2. OpenAI API key
3. Python 3.8+
4. Sample movie data loaded in MongoDB Atlas (sample_mflix database)

## Setup

1. Clone this repository
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Set up environment variables:
   ```bash
   export OPENAI_API_KEY="your-openai-api-key"
   export ATLAS_URI="your-mongodb-atlas-connection-string"
   ```
4. Ensure your MongoDB Atlas setup matches the following:
   - Database name: sample_mflix
   - Collection: embedded_movies
   - Vector search index: idx_plot_embedding
   - Index configuration:
   ```json
   {
     "fields": [
       {
         "type": "vector",
         "path": "plot_embedding",
         "numDimensions": 1536,
         "similarity": "dotProduct"
       }
     ]
   }
   ```

## Running the App

Start the application:

```bash
python app.py
```

The app will be available at http://localhost:7860

## Usage

### Generating Embeddings

1. Select your database and collection from the dropdowns
2. Choose the field to generate embeddings for
3. Specify the embedding field name (defaults to "embedding")
4. Set a document limit (0 for all documents)
5. Click "Generate Embeddings" to start processing

The app uses memory-efficient, cursor-based batch processing that can handle large collections:

- Documents are processed in batches (default 20 documents per batch)
- Memory usage is optimized through cursor-based iteration
- Real-time progress tracking shows completed/total documents
- Supports processing of large collections (100,000+ documents)
- Automatically resumes from where it left off if embeddings already exist

### Searching

1. Enter a natural language query in the text box (e.g., "humans fighting aliens")
2. Click "Submit" to search
3. View the results showing matching documents with their similarity scores

## Example Queries

- "humans fighting aliens"
- "relationship drama between two good friends"
- "comedy about family vacation"
- "detective solving mysterious murder"

## Performance Notes

The application is optimized for handling large datasets:

- Uses cursor-based batch processing to avoid memory issues
- Processes documents in configurable batch sizes (default: 20)
- Implements parallel processing with ThreadPoolExecutor
- Provides real-time progress tracking
- Automatically handles memory cleanup during processing
- Supports resuming interrupted operations

## Notes

- The search uses OpenAI's text-embedding-ada-002 model to create embeddings
- Results are limited to the top 5 matches
- Similarity scores range from 0 to 1, with higher scores indicating better matches
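
To make the search step more concrete, here is a minimal sketch of how a query against the `idx_plot_embedding` index might be expressed as an Atlas `$vectorSearch` aggregation pipeline. It is not taken from `app.py`: the helper name `build_vector_search_pipeline`, the `num_candidates` value of 100, and the projected `title` and `plot` fields are assumptions for illustration. The index name, path, and limit of 5 come from the setup and notes above.

```python
from typing import Any, Dict, List


def build_vector_search_pipeline(
    query_vector: List[float],
    index_name: str = "idx_plot_embedding",
    path: str = "plot_embedding",
    limit: int = 5,
    num_candidates: int = 100,  # assumed value, tune for your data
) -> List[Dict[str, Any]]:
    """Build an Atlas $vectorSearch pipeline (hypothetical helper)."""
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": query_vector,
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {
            # Keep a few readable fields plus the similarity score.
            # "title" and "plot" are assumed field names.
            "$project": {
                "_id": 0,
                "title": 1,
                "plot": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


# A 1536-dimensional placeholder stands in for a real query embedding.
pipeline = build_vector_search_pipeline([0.0] * 1536)
```

With a real embedding of the query text (for example from the OpenAI embeddings endpoint using `text-embedding-ada-002`), the pipeline would typically be passed to PyMongo's `collection.aggregate(pipeline)` on `sample_mflix.embedded_movies`.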
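
The batch processing described under Generating Embeddings and Performance Notes can be sketched roughly as follows. This is an illustration of the general pattern (lazy batching over an iterator plus a `ThreadPoolExecutor`), not the app's actual code: `batched`, `process_in_batches`, `max_workers`, and the `on_progress` callback are hypothetical names, and `worker` stands in for whatever function embeds and stores one document.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int = 20) -> Iterator[List[T]]:
    """Yield lists of up to batch_size items without loading everything."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_in_batches(
    items: Iterable[T],
    worker: Callable[[T], object],
    batch_size: int = 20,
    max_workers: int = 4,  # assumed default
    on_progress: Optional[Callable[[int], None]] = None,
) -> int:
    """Run worker on every item, one batch at a time; return the count."""
    done = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch in batched(items, batch_size):
            # list() waits for the whole batch and surfaces worker errors.
            results = list(pool.map(worker, batch))
            done += len(results)
            if on_progress is not None:
                on_progress(done)
    return done


# Example with plain integers standing in for documents from a cursor.
progress: List[int] = []
total = process_in_batches(range(45), lambda doc: doc * 2, on_progress=progress.append)
```

Because only one batch is held in memory at a time, the same loop works when `items` is a database cursor over a large collection, and the progress callback can drive a completed/total display.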