satishjasthij committed on
Commit
d1df841
·
1 Parent(s): 767749c
.gitignore ADDED
@@ -0,0 +1,2 @@
+ data/
+ .DS_Store
README.md CHANGED
@@ -1,13 +1,162 @@
  ---
- title: PicMatch
- emoji: 📉
- colorFrom: gray
- colorTo: blue
  sdk: gradio
  sdk_version: 4.39.0
  app_file: app.py
- pinned: false
- license: mit
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: 'PicMatch: Your Visual Search Companion'
+ emoji: 📷🔍
+ colorFrom: blue
+ colorTo: green
  sdk: gradio
+ python_version: 3.9
  sdk_version: 4.39.0
+ suggested_hardware: t4-small
+ suggested_storage: medium
  app_file: app.py
+ fullWidth: true
+ header: mini
+ short_description: Search images using text or other images as queries.
+ models:
+ - wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M
+ - Salesforce/blip-image-captioning-base
+
+ tags:
+ - image search
+ - visual search
+ - image processing
+ - CLIP
+ - image captioning
+ thumbnail: https://example.com/thumbnail.png
+ pinned: true
+ hf_oauth: false
+ disable_embedding: false
+ startup_duration_timeout: 30m
+ custom_headers:
+   cross-origin-embedder-policy: require-corp
+   cross-origin-opener-policy: same-origin
+   cross-origin-resource-policy: cross-origin
+
  ---
 
+ # 📸 PicMatch: Your Visual Search Companion 🔍
+
+ PicMatch lets you effortlessly search through your image archive using either a text description or another image as your query. Find those needle-in-a-haystack photos in a flash! ✨
+
+ ## 🚀 Getting Started: Let the Fun Begin!
+
+ 1. **Prerequisites:** Ensure you have Python 3.9 or higher installed on your system. 🐍
+
+ 2. **Create a Virtual Environment:**
+ ```bash
+ python -m venv env
+ ```
+
+ 3. **Activate the Environment:**
+ ```bash
+ source ./env/bin/activate
+ ```
+
+ 4. **Install Dependencies:**
+ ```bash
+ python -m pip install -r requirements.txt
+ ```
+
+ 5. **Start the App (with Sample Data):**
+ ```bash
+ python app.py
+ ```
+
+ 6. **Open Your Browser:** Head to `http://localhost:7860` to access the PicMatch interface. 🌐
+
+ ## 📂 Data: Organize Your Visual Treasures
+
+ Make sure you have the following folders in your project's root directory:
+
+ ```
+ data
+ ├── images
+ └── features
+ ```
+
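+ If they don't exist yet, a quick way to create them (a minimal sketch; run it from the project root and adjust the paths if your layout differs):
+
+ ```python
+ from pathlib import Path
+
+ # Create the folders the app expects for images and their embeddings.
+ for sub in ("images", "features"):
+     (Path("data") / sub).mkdir(parents=True, exist_ok=True)
+ ```
+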
+ ## 🛠️ Image Pipeline: Download & Process with Speed ⚡
+
+ The `engine/download_data.py` script streamlines downloading and processing images from a list of URLs. It's designed for performance and reliability:
+
+ - **Async Operations:** Uses `asyncio` for concurrent image downloading and processing. ⏩
+ - **Rate Limiting:** Uses a `RateLimiter` to respect API usage limits and avoid getting blocked. 🚦
+ - **Parallel Resizing:** Employs a `ProcessPoolExecutor` for fast image resizing. ⚙️
+ - **State Management:** Saves progress in a JSON file so you can resume later. 💾
+
+ ### 🏗️ Key Components:
+
+ - **`ImagePipeline` Class:** Manages the entire pipeline, its state, and rate limiting. 🎛️
+ - **Functions:** Handle URL feeding (`url_feeder`), downloading (`image_downloader`), and processing (`image_processor`). 📥
+ - **`ImageSaver` Class:** Defines how images are processed and saved. 🖼️
+ - **`resize_image` Function:** Ensures image resizing maintains the correct aspect ratio. 📏
+
+ ### 🏃 How it Works:
+
+ 1. **Start:** Configure the pipeline with your URL list, download limits, and rate settings.
+ 2. **Feed URLs:** Asynchronously read URLs from your file.
+ 3. **Download:** Download images concurrently while respecting rate limits.
+ 4. **Process:** Save the original images and resize them in parallel.
+ 5. **Save State:** Regularly save progress to avoid starting over if interrupted.
+
+ To get the sample data, run:
+ ```bash
+ cd engine && python download_data.py
+ ```
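+
+ For intuition, here is a condensed, self-contained sketch of the token-bucket rate limiting the pipeline relies on (the full `RateLimiter` lives in `engine/download_data.py`; the rates below are illustrative):
+
+ ```python
+ import asyncio
+ import time
+
+ class TokenBucket:
+     """Allow `rate` requests per `per` seconds, with bursts of up to `burst`."""
+
+     def __init__(self, rate: float, per: float = 1.0, burst: int = 1):
+         self.rate, self.per, self.burst = rate, per, burst
+         self.tokens = float(burst)
+         self.updated_at = time.monotonic()
+
+     async def wait(self) -> None:
+         while True:
+             now = time.monotonic()
+             # Refill tokens for the elapsed time, capped at the burst size.
+             self.tokens = min(self.burst, self.tokens + (now - self.updated_at) * (self.rate / self.per))
+             self.updated_at = now
+             if self.tokens >= 1:
+                 self.tokens -= 1  # spend one token and let the request proceed
+                 return
+             # Sleep just long enough for the next token to become available.
+             await asyncio.sleep((1 - self.tokens) / (self.rate / self.per))
+
+ async def demo() -> None:
+     limiter = TokenBucket(rate=5, per=1.0, burst=5)
+     for i in range(8):  # the first 5 pass immediately, the rest are throttled
+         await limiter.wait()
+         print(f"request {i} at t={time.monotonic():.2f}s")
+
+ if __name__ == "__main__":
+     asyncio.run(demo())
+ ```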
+
+ ## ✨ Feature Creation: Making Your Images Searchable ✨
+
+ This step prepares your images for searching. We generate two types of embeddings:
+
+ - **Visual Embeddings (CLIP):** Capture the visual content of your images. 👁️‍🗨️
+ - **Textual Embeddings:** Create embeddings from image captions for text-based search. 💬
+
+ To generate these features, run:
+ ```bash
+ cd engine && python generate_features.py
+ ```
+
+ This process uses these awesome models from Hugging Face:
+
+ - TinyCLIP: `wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M`
+ - BLIP Image Captioning: `Salesforce/blip-image-captioning-base`
+ - SentenceTransformer: `all-MiniLM-L6-v2`
+
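+ For reference, a minimal synchronous sketch of what the feature generator computes per image (the real script runs this asynchronously with checkpointing; the image path below is hypothetical):
+
+ ```python
+ from PIL import Image
+ from sentence_transformers import SentenceTransformer
+ from transformers import (BlipForConditionalGeneration, BlipProcessor,
+                           CLIPModel, CLIPProcessor)
+
+ clip_model = CLIPModel.from_pretrained("wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M")
+ clip_processor = CLIPProcessor.from_pretrained("wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M")
+ blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
+ blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
+ text_encoder = SentenceTransformer("all-MiniLM-L6-v2")
+
+ image = Image.open("data/images/resized_example.jpg")  # hypothetical sample image
+
+ # 512-dim visual embedding from TinyCLIP.
+ clip_inputs = clip_processor(images=image, return_tensors="pt")
+ clip_embedding = clip_model.get_image_features(**clip_inputs).detach().numpy().flatten()
+
+ # Caption the image with BLIP, then embed the caption (384-dim) with MiniLM.
+ blip_inputs = blip_processor(images=image, return_tensors="pt")
+ caption = blip_processor.decode(blip_model.generate(**blip_inputs)[0], skip_special_tokens=True)
+ caption_embedding = text_encoder.encode(caption)
+
+ print(caption, clip_embedding.shape, caption_embedding.shape)
+ ```
+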
+ ## ⚡ Asynchronous Feature Extraction: Supercharge Your Process ⚡
+
+ This script extracts image features (both visual and textual) efficiently:
+
+ - **Asynchronous:** Loads images, extracts features, and saves them concurrently. ⚡
+ - **Dual Embeddings:** Creates both CLIP (visual) and caption (textual) embeddings. 🖼️📝
+ - **Checkpoints:** Keeps track of progress and allows resuming from interruptions. 🔄
+ - **Parallel:** Uses multiple CPU cores for feature extraction. ⚙️
+
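+ Conceptually, the extractor is a producer/consumer pipeline: an async loader fills a queue, workers hand the CPU-heavy work to a process pool, and a sentinel value signals completion. A toy sketch of that pattern (the function names here are illustrative, not the script's API):
+
+ ```python
+ import asyncio
+ from concurrent.futures import ProcessPoolExecutor
+
+ def extract_features(item: int) -> int:
+     # Stand-in for the CPU-heavy per-image work (CLIP + captioning).
+     return item * item
+
+ async def worker(queue: asyncio.Queue, pool: ProcessPoolExecutor) -> None:
+     loop = asyncio.get_running_loop()
+     while True:
+         item = await queue.get()
+         if item is None:           # sentinel: no more work
+             await queue.put(None)  # propagate so sibling workers also stop
+             return
+         result = await loop.run_in_executor(pool, extract_features, item)
+         print(f"processed {item} -> {result}")
+
+ async def main() -> None:
+     queue: asyncio.Queue = asyncio.Queue()
+     for i in range(5):             # producer: enqueue work items
+         await queue.put(i)
+     await queue.put(None)          # end-of-input sentinel
+     with ProcessPoolExecutor() as pool:
+         await asyncio.gather(*(worker(queue, pool) for _ in range(2)))
+
+ if __name__ == "__main__":
+     asyncio.run(main())
+ ```
+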
+ ## 📊 Vector Database Module: Milvus for Fast Search 🚀
+
+ This module connects to the Milvus vector database to store and search your image embeddings:
+
+ - **Milvus:** A high-performance database built for handling vector data. 📊
+ - **Easy Interface:** Provides a simple way to manage embeddings and perform searches. 🔍
+ - **Single Server:** Ensures only one Milvus server is running for efficiency.
+ - **Indexing:** Automatically creates an index to speed up your searches. 🚀
+ - **Similarity Search:** Find the most similar images using cosine similarity. 💯
+
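+ Under the hood this uses `pymilvus`'s `MilvusClient` with Milvus Lite (a local, file-backed database). A minimal sketch of the flow implemented in `engine/vector_database.py`, with random vectors standing in for real embeddings:
+
+ ```python
+ import numpy as np
+ from pymilvus import MilvusClient
+
+ client = MilvusClient(uri="milvus.db")  # local Milvus Lite database file
+
+ if not client.has_collection("image_embeddings"):
+     client.create_collection(
+         collection_name="image_embeddings",
+         vector_field_name="embedding",
+         dimension=512,          # TinyCLIP image embeddings are 512-dim
+         auto_id=True,
+         enable_dynamic_field=True,
+         metric_type="COSINE",
+     )
+
+ # Insert one record: the embedding plus the image filename as a dynamic field.
+ client.insert(
+     collection_name="image_embeddings",
+     data={"embedding": np.random.rand(512).astype(np.float32), "filename": "example.jpg"},
+ )
+
+ # Cosine-similarity search for the top 5 most similar images.
+ hits = client.search(
+     collection_name="image_embeddings",
+     data=[np.random.rand(512).astype(np.float32)],
+     output_fields=["filename"],
+     search_params={"metric_type": "COSINE"},
+     limit=5,
+ )[0]
+ for hit in hits:
+     print(hit["entity"]["filename"], hit["distance"])
+ ```
+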
+ ## 📚 References: The Brains Behind PicMatch 🧠
+
+ PicMatch leverages these incredible open-source projects:
+
+ - **TinyCLIP:** The visual powerhouse for understanding your images.
+   - 👉 [https://huggingface.co/wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M](https://huggingface.co/wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M)
+
+ - **Image Captioning:** The wordsmith that describes your photos in detail.
+   - 👉 [https://huggingface.co/Salesforce/blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base)
+
+ - **Sentence Transformers:** Turns captions into embeddings for text-based search.
+   - 👉 [https://sbert.net](https://sbert.net)
+
+ - **Unsplash:** Images were taken from Unsplash's open dataset.
+   - 👉 [https://github.com/unsplash/datasets](https://github.com/unsplash/datasets)
+
+ Let's give credit where credit is due! 🙌 These projects make PicMatch smarter and more capable.
app.py ADDED
@@ -0,0 +1,101 @@
1
+ import gradio as gr
2
+ import numpy as np
3
+ from PIL import Image
4
+ from engine.search import ImageSearchModule
5
+ import os
6
+ from pathlib import Path
7
+
8
+ PROJECT_ROOT = Path(__file__).resolve().parent
9
+
10
+ def check_dirs():
11
+ dirs = {
12
+ "Data": (PROJECT_ROOT / "data"),
13
+ "Images": (PROJECT_ROOT / "data" / "images"),
14
+ "Features": (PROJECT_ROOT / "data" / "features")
15
+ }
16
+ for dir_name, dir_path in dirs.items():
17
+ if not dir_path.exists():
18
+ raise FileNotFoundError(f"{dir_name} directory not found: {dir_path}")
19
+
20
+ print("All data directories exist βœ…")
21
+
22
+
23
+ check_dirs()
24
+
25
+ # Initialize the ImageSearchModule
26
+ search = ImageSearchModule(
27
+ image_embeddings_dir=str(PROJECT_ROOT / "data/features"),
28
+ original_images_dir=str(PROJECT_ROOT / "data/images"),
29
+ )
30
+ print("Add image embeddings and caption embeddings to vector database")
31
+ search.add_images()
32
+
33
+
34
+ def search_images(input_data, search_type):
35
+ if search_type == "image" and input_data is not None:
36
+ # Fix: Get the file path directly from the input data
37
+ results = search.search_by_image(input_data, top_k=10, similarity_threshold=0)
38
+ elif search_type == "text" and input_data.strip():
39
+ results = search.search_by_text(input_data, top_k=10, similarity_threshold=0)
40
+ else:
41
+ return [(Image.new("RGB", (100, 100), color="gray"), "No results")] * 10
42
+
43
+ images_with_captions = []
44
+ for image_name, similarity in results:
45
+ image_path = os.path.join(search.original_images_dir, f"resized_{image_name}")
46
+ matching_files = [
47
+ f
48
+ for f in os.listdir(search.original_images_dir)
49
+ if f.startswith(f"resized_{image_name}")
50
+ ]
51
+ if matching_files:
52
+ img = Image.open(
53
+ os.path.join(search.original_images_dir, matching_files[0])
54
+ )
55
+ images_with_captions.append((img, f"Similarity: {similarity:.2f}"))
56
+ else:
57
+ images_with_captions.append(
58
+ (Image.new("RGB", (100, 100), color="gray"), "Image not found")
59
+ )
60
+
61
+ # Pad the results if less than 10 images are found
62
+ while len(images_with_captions) < 10:
63
+ images_with_captions.append(
64
+ (Image.new("RGB", (100, 100), color="gray"), "No result")
65
+ )
66
+
67
+ return images_with_captions
68
+
69
+
70
+ with gr.Blocks() as demo:
71
+ gr.Markdown("# Image Search App")
72
+ with gr.Tab("Image Search"):
73
+ # Fix: Change input type to 'filepath'
74
+ image_input = gr.Image(type="filepath", label="Upload an image")
75
+ image_button = gr.Button("Search by Image")
76
+
77
+ with gr.Tab("Text Search"):
78
+ text_input = gr.Textbox(label="Enter text query")
79
+ text_button = gr.Button("Search by Text")
80
+
81
+ gallery = gr.Gallery(
82
+ label="Search Results",
83
+ show_label=False,
84
+ elem_id="gallery",
85
+ columns=2,
86
+ height="auto",
87
+ )
88
+
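+ # The hidden Textbox passed alongside each input tells search_images whether the query is an "image" or "text" search.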
89
+ image_button.click(
90
+ fn=search_images,
91
+ inputs=[image_input, gr.Textbox(value="image", visible=False)],
92
+ outputs=[gallery],
93
+ )
94
+
95
+ text_button.click(
96
+ fn=search_images,
97
+ inputs=[text_input, gr.Textbox(value="text", visible=False)],
98
+ outputs=[gallery],
99
+ )
100
+
101
+ demo.launch()
copy_images_features.py ADDED
@@ -0,0 +1,64 @@
1
+ import os
2
+ import random
3
+ import shutil
4
+ import glob
5
+
6
+ from tqdm import tqdm
7
+
8
+ def sample_images_and_features(image_folder, feature_folder, sample_size, dest_image_folder, dest_feature_folder):
9
+ """
10
+ Randomly samples a specified number of resized images along with their corresponding
11
+ CLIP and caption features, and copies them to new folders.
12
+
13
+ Args:
14
+ image_folder (str): Path to the folder containing resized images.
15
+ feature_folder (str): Path to the folder containing feature files.
16
+ sample_size (int): Number of images to sample.
17
+ dest_image_folder (str): Destination folder for sampled images.
18
+ dest_feature_folder (str): Destination folder for sampled feature files.
19
+ """
20
+
21
+ # Ensure destination folders exist
22
+ os.makedirs(dest_image_folder, exist_ok=True)
23
+ os.makedirs(dest_feature_folder, exist_ok=True)
24
+
25
+ # Get all resized image file names
26
+ image_files = glob.glob(os.path.join(image_folder, "resized_*.jpg"))
27
+ image_files.extend(glob.glob(os.path.join(image_folder, "resized_*.png")))
28
+ image_files.extend(glob.glob(os.path.join(image_folder, "resized_*.jpeg")))
29
+
30
+ # Check if there are enough images
31
+ if len(image_files) < sample_size:
32
+ raise ValueError("Not enough resized images in the source folder.")
33
+
34
+ # Sample a subset of image files
35
+ sampled_images = random.sample(image_files, sample_size)
36
+
37
+ # Copy images and corresponding feature files
38
+ for image_path in tqdm(sampled_images):
39
+ image_name = os.path.basename(image_path)
40
+ base_name, _ = os.path.splitext(image_name)
41
+
42
+ # Construct paths for CLIP and caption feature files
43
+ clip_feature_path = os.path.join(feature_folder, f"{base_name}_clip.npy")
44
+ caption_feature_path = os.path.join(feature_folder, f"{base_name}_caption.npy")
45
+
46
+ # Copy image file
47
+ shutil.copy2(image_path, dest_image_folder) # copy2 preserves metadata
48
+
49
+ # Copy feature files (if they exist)
50
+ if os.path.exists(clip_feature_path):
51
+ shutil.copy2(clip_feature_path, dest_feature_folder)
52
+ if os.path.exists(caption_feature_path):
53
+ shutil.copy2(caption_feature_path, dest_feature_folder)
54
+
55
+ if __name__ == "__main__":
56
+ from pathlib import Path
57
+
58
+ PROJECT_ROOT = Path(__file__).resolve().parent
59
+ image_folder = str(PROJECT_ROOT / "data/images")
60
+ feature_folder = str(PROJECT_ROOT / "data/features")
61
+ sample_size = 10
62
+ dest_image_folder = str(PROJECT_ROOT / "data_temp/images")
63
+ dest_feature_folder = str(PROJECT_ROOT / "data_temp/features")
64
+ sample_images_and_features(image_folder, feature_folder, sample_size, dest_image_folder, dest_feature_folder)
engine/__init__.py ADDED
File without changes
engine/download_data.py ADDED
@@ -0,0 +1,296 @@
1
+ import csv
2
+ from pathlib import Path
3
+ import time
4
+ import json
5
+ import os, io
6
+
7
+ import aiofiles
8
+ import aiohttp
9
+ import asyncio
10
+ from PIL import Image
11
+ from abc import ABC, abstractmethod
12
+ from concurrent.futures import ProcessPoolExecutor
13
+ from dataclasses import asdict, dataclass
14
+
15
+
16
+ @dataclass
17
+ class ProcessState:
18
+ urls_processed: int = 0
19
+ images_downloaded: int = 0
20
+ images_saved: int = 0
21
+ images_resized: int = 0
22
+
23
+
24
+ class ImageProcessor(ABC):
25
+ @abstractmethod
26
+ def process(self, image: bytes, filename: str) -> None:
27
+ pass
28
+
29
+
30
+ class ImageSaver(ImageProcessor):
31
+ async def process(self, image: bytes, filename: str) -> None:
32
+ async with aiofiles.open(filename, "wb") as f:
33
+ await f.write(image)
34
+
35
+
36
+ def resize_image(image: bytes, filename: str, max_size: int = 300) -> None:
37
+ with Image.open(io.BytesIO(image)) as img:
38
+ img.thumbnail((max_size, max_size))
39
+ img.save(filename, optimize=True, quality=85)
40
+
41
+
42
+ class RateLimiter:
43
+ """
44
+ High-Level Concept: The Token Bucket Algorithm
45
+ ==============================================
46
+ The Rate_Limiter class implements what's known as the "Token Bucket" algorithm. Imagine you have a bucket that can hold a certain number of tokens. Here's how it works:
47
+
48
+ The bucket is filled with tokens at a constant rate.
49
+ When you want to perform an action (in our case, make an API request), you need to take a token from the bucket.
50
+ If there's a token available, you can perform the action immediately.
51
+ If there are no tokens, you have to wait until a new token is added to the bucket.
52
+ The bucket has a maximum capacity, so tokens don't accumulate indefinitely when not used.
53
+
54
+ This mechanism allows for both steady-state rate limiting and handling short bursts of activity.
55
+
56
+ In the constructor:
57
+ ===================
58
+ rate: is how many tokens we add per time period (e.g., 10 tokens per second)
59
+
60
+ per: is the time period (usually 1 second)
61
+
62
+ burst: is the bucket size (maximum number of tokens)
63
+
64
+ We start with a full bucket (self.tokens = burst)
65
+ We note the current time (self.updated_at)
66
+
67
+ Logic:
68
+ ======
69
+ 1. Calculate how much time has passed since we last updated the token count.
70
+
71
+ 2. Add tokens based on the time passed and our rate:
72
+ self.tokens += time_passed * (self.rate / self.per)
73
+
74
+ 3. If we've added too many tokens, cap it at our maximum (burst size).
75
+
76
+ 4. Update our "last updated" time.
77
+
78
+ 5. If we have at least one token:
79
+ Remove a token (self.tokens -= 1)
80
+ Return immediately, allowing the API call to proceed
81
+
82
+ 6. If we don't have a token:
83
+ Calculate how long we need to wait for the next token
84
+ Sleep for that duration
85
+
86
+ Let's walk through an example:
87
+ ==============================
88
+ Suppose we set up our RateLimiter like this:
89
+
90
+ limiter = RateLimiter(rate=10, per=1, burst=10)
91
+
92
+ This means:
93
+ - We allow 10 requests per second on average
94
+ - We can burst up to 10 requests at once
95
+ - After the burst, we'll be limited to 1 request every 0.1 seconds
96
+
97
+ Now, imagine a sequence of API calls:
98
+
99
+ 1. The first 10 calls will happen immediately (burst capacity)
100
+ 2. The 11th call will wait for 0.1 seconds (time to generate 1 token)
101
+ 3. Subsequent calls will each wait about 0.1 seconds
102
+
103
+ If there's a pause in API calls, tokens will accumulate (up to the burst limit), allowing for another burst of activity.
104
+
105
+ This mechanism ensures that:
106
+ 1. We respect the average rate limit (10 per second in this example)
107
+ 2. We can handle short bursts of activity (up to 10 at once)
108
+ 3. We smoothly regulate requests when operating at capacity
109
+ """
110
+
111
+ def __init__(self, rate: float, per: float = 1.0, burst: int = 1):
112
+ self.rate = rate
113
+ self.per = per
114
+ self.burst = burst
115
+ self.tokens = burst
116
+ self.updated_at = time.monotonic()
117
+
118
+ async def wait(self):
119
+ while True:
120
+ now = time.monotonic()
121
+ time_passed = now - self.updated_at
122
+ self.tokens += time_passed * (self.rate / self.per)
123
+ if self.tokens > self.burst:
124
+ self.tokens = self.burst
125
+ self.updated_at = now
126
+
127
+ if self.tokens >= 1:
128
+ self.tokens -= 1
129
+ return
130
+ else:
131
+ await asyncio.sleep((1 - self.tokens) / (self.rate / self.per))
132
+
133
+
134
+ class ImagePipeline:
135
+ def __init__(
136
+ self,
137
+ txt_file: str,
138
+ loop: asyncio.AbstractEventLoop,
139
+ max_concurrent_downloads: int = 10,
140
+ max_workers: int = max(os.cpu_count() - 4, 4),
141
+ rate_limit: float = 10,
142
+ rate_limit_period: float = 1,
143
+ downloaded_images_dir: str = "",
144
+ ):
145
+ self.txt_file = txt_file
146
+ self.loop = loop
147
+ self.url_queue = asyncio.Queue(maxsize=1000)
148
+ self.image_queue = asyncio.Queue(maxsize=100)
149
+ self.semaphore = asyncio.Semaphore(max_concurrent_downloads)
150
+ self.state = ProcessState()
151
+ self.state_file = "pipeline_state.json"
152
+ self.saver = ImageSaver()
153
+ self.process_pool = ProcessPoolExecutor(max_workers=max_workers)
154
+ self.rate_limiter = RateLimiter(
155
+ rate=rate_limit, per=rate_limit_period, burst=max_concurrent_downloads
156
+ )
157
+ self.downloaded_images_dir = Path(downloaded_images_dir)
158
+
159
+ async def url_feeder(self):
160
+ try:
161
+ print(f"Starting to read URLs from {self.txt_file}")
162
+ async with aiofiles.open(self.txt_file, mode="r") as f:
163
+ line_number = 0
164
+ async for line in f:
165
+ line_number += 1
166
+ if line_number <= self.state.urls_processed:
167
+ continue
168
+
169
+ url = line.strip()
170
+ if url: # Skip empty lines
171
+ await self.url_queue.put(url)
172
+ self.state.urls_processed += 1
173
+
174
+ # Check if we need to wait for the queue to have space
175
+ if self.url_queue.qsize() >= self.url_queue.maxsize - 1:
176
+ await asyncio.sleep(0.1)
177
+ except Exception as e:
178
+ print(f"Error in url_feeder: {e}")
179
+ finally:
180
+ await self.url_queue.put(None)
181
+
182
+ async def image_downloader(self):
183
+ print("Starting image downloader")
184
+ async with aiohttp.ClientSession() as session:
185
+ while True:
186
+ url = await self.url_queue.get()
187
+ if url is None:
188
+ print("Finished downloading images")
189
+ await self.image_queue.put(None)
190
+ break
191
+ try:
192
+ await self.rate_limiter.wait() # Wait for rate limit
193
+ async with self.semaphore:
194
+ async with session.get(url) as response:
195
+ if response.status == 200:
196
+ image = await response.read()
197
+ await self.image_queue.put((image, url))
198
+ self.state.images_downloaded += 1
199
+ if self.state.images_downloaded % 100 == 0:
200
+ print(
201
+ f"Downloaded {self.state.images_downloaded} images"
202
+ )
203
+ except Exception as e:
204
+ print(f"Error downloading {url}: {e}")
205
+ finally:
206
+ self.url_queue.task_done()
207
+
208
+ async def image_processor(self):
209
+ print("Starting image processor")
210
+ while True:
211
+ item = await self.image_queue.get()
212
+ if item is None:
213
+ print("Finished processing images")
214
+ break
215
+ image, url = item
216
+ filename = os.path.basename(url)
217
+ if not filename.lower().endswith((".png", ".jpg", ".jpeg")):
218
+ filename += ".png"
219
+ try:
220
+ # Save the original image
221
+ await self.saver.process(
222
+ image, str(self.downloaded_images_dir / f"original_{filename}")
223
+ )
224
+ self.state.images_saved += 1
225
+ if self.state.images_resized % 100 == 0:
226
+ print(f"Processed {self.state.images_resized} images")
227
+
228
+ # Resize the image using the process pool
229
+ # loop = asyncio.get_running_loop()
230
+ await self.loop.run_in_executor(
231
+ self.process_pool,
232
+ resize_image,
233
+ image,
234
+ str(self.downloaded_images_dir / f"resized_{filename}"),
235
+ )
236
+ self.state.images_resized += 1
237
+ except Exception as e:
238
+ print(f"Error processing {url}: {e}")
239
+ finally:
240
+ self.image_queue.task_done()
241
+
242
+ def save_state(self):
243
+ with open(self.state_file, "w") as f:
244
+ json.dump(asdict(self.state), f)
245
+
246
+ def load_state(self):
247
+ if os.path.exists(self.state_file):
248
+ with open(self.state_file, "r") as f:
249
+ self.state = ProcessState(**json.load(f))
250
+
251
+ async def run(self):
252
+ print("Starting pipeline")
253
+ self.load_state()
254
+ print(f"Loaded state: {self.state}")
255
+ tasks = [
256
+ asyncio.create_task(self.url_feeder()),
257
+ asyncio.create_task(self.image_downloader()),
258
+ asyncio.create_task(self.image_processor()),
259
+ ]
260
+ try:
261
+ await asyncio.gather(*tasks)
262
+ except Exception as e:
263
+ print(f"Pipeline error: {e}")
264
+ finally:
265
+ self.save_state()
266
+ print(f"Final state: {self.state}")
267
+ self.process_pool.shutdown()
268
+ print("Pipeline finished")
269
+
270
+
271
+ if __name__ == "__main__":
272
+ from pathlib import Path
273
+
274
+ PROJECT_ROOT = Path(__file__).resolve().parent
275
+ loop = asyncio.get_event_loop()
276
+ text_file = PROJECT_ROOT / "data/image_urls.txt"
277
+ if not text_file.exists():
278
+ import pandas as pd
279
+
280
+ dataframe = pd.read_csv(PROJECT_ROOT / "data/photos.tsv000", sep="\t")
281
+ num_image_urls = len(dataframe)
282
+ print(f"Number of image urls: {num_image_urls}")
283
+ with open(text_file, "w") as f:
284
+ for url in dataframe["photo_image_url"]:
285
+ f.write(url + "\n")
286
+ print("Started downloading images")
287
+ pipeline = ImagePipeline(
288
+ txt_file=text_file,
289
+ loop=loop,
290
+ rate_limit=100,
291
+ rate_limit_period=1,
292
+ downloaded_images_dir=str(PROJECT_ROOT / "data/data/images"),
293
+ )
294
+ # asyncio.run(pipeline.run())
295
+ loop.run_until_complete(pipeline.run())
296
+ print("Finished downloading images")
engine/generate_features.py ADDED
@@ -0,0 +1,251 @@
1
+ import asyncio
2
+ import os
3
+ import logging
4
+ from PIL import Image
5
+ import torch
6
+ from transformers import (
7
+ CLIPProcessor,
8
+ CLIPModel,
9
+ BlipProcessor,
10
+ BlipForConditionalGeneration,
11
+ )
12
+ from sentence_transformers import SentenceTransformer
13
+ import numpy as np
14
+ import aiofiles
15
+ import json
16
+ from abc import ABC, abstractmethod
17
+ from typing import Set, Tuple
18
+ from concurrent.futures import ProcessPoolExecutor
19
+ from dataclasses import dataclass, field
20
+
21
+ logging.basicConfig(level=logging.INFO)
22
+ logger = logging.getLogger(__name__)
23
+
24
+ device = "cpu"
25
+
26
+
27
+ @dataclass
28
+ class State:
29
+ processed_files: Set[str] = field(default_factory=set)
30
+
31
+ def to_dict(self) -> dict:
32
+ return {"processed_files": list(self.processed_files)}
33
+
34
+ @staticmethod
35
+ def from_dict(state_dict: dict) -> "State":
36
+ return State(processed_files=set(state_dict.get("processed_files", [])))
37
+
38
+
39
+ class ImageProcessor(ABC):
40
+ @abstractmethod
41
+ def process(self, image: Image.Image) -> np.ndarray:
42
+ pass
43
+
44
+
45
+ class CLIPImageProcessor(ImageProcessor):
46
+ def __init__(self):
47
+ self.model = CLIPModel.from_pretrained(
48
+ "wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M"
49
+ ).to(device)
50
+ self.processor = CLIPProcessor.from_pretrained(
51
+ "wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M"
52
+ )
53
+ print("Initialized CLIP model and processor")
54
+
55
+ def process(self, image: Image.Image) -> np.ndarray:
56
+ inputs = self.processor(images=image, return_tensors="pt").to(device)
57
+ outputs = self.model.get_image_features(**inputs)
58
+ return outputs.detach().cpu().numpy()
59
+
60
+
61
+ class ImageCaptioningProcessor(ImageProcessor):
62
+ def __init__(self):
63
+ self.image_caption_model = BlipForConditionalGeneration.from_pretrained(
64
+ "Salesforce/blip-image-captioning-base"
65
+ ).to(device)
66
+ self.image_caption_processor = BlipProcessor.from_pretrained(
67
+ "Salesforce/blip-image-captioning-base"
68
+ )
69
+ self.text_embedding_model = SentenceTransformer(
70
+ "all-MiniLM-L6-v2", device=device
71
+ )
72
+ print("Initialized BLIP model and processor")
73
+
74
+ def process(self, image: Image.Image) -> np.ndarray:
75
+ inputs = self.image_caption_processor(images=image, return_tensors="pt").to(
76
+ device
77
+ )
78
+ output = self.image_caption_model.generate(**inputs)
79
+ caption = self.image_caption_processor.decode(
80
+ output[0], skip_special_tokens=True
81
+ )
82
+ # embedding dim 384
83
+ return self.text_embedding_model.encode(caption).flatten()
84
+
85
+
86
+ class ImageFeatureExtractor:
87
+ def __init__(
88
+ self,
89
+ clip_processor: CLIPImageProcessor,
90
+ caption_processor: ImageCaptioningProcessor,
91
+ max_queue_size: int = 100,
92
+ checkpoint_file: str = "checkpoint.json",
93
+ ):
94
+ self.clip_processor = clip_processor
95
+ self.caption_processor = caption_processor
96
+ self.image_queue = asyncio.Queue(maxsize=max_queue_size)
97
+ self.processed_images_queue = asyncio.Queue()
98
+ self.checkpoint_file = checkpoint_file
99
+ self.state = self.load_state()
100
+ self.executor = ProcessPoolExecutor()
101
+ self.total_images = 0
102
+ self.processed_count = 0
103
+ print(
104
+ "Initialized ImageFeatureExtractor with checkpoint file:", checkpoint_file
105
+ )
106
+
107
+ async def image_loader(self, input_folder: str):
108
+ print(f"Loading images from {input_folder}")
109
+ for filename in os.listdir(input_folder):
110
+ if "resized_" in filename and filename not in self.state.processed_files:
111
+ try:
112
+ file_path = os.path.join(input_folder, filename)
113
+ await self.image_queue.put((filename, file_path))
114
+ self.total_images += 1
115
+ print(f"Loaded image {filename} into queue")
116
+ except Exception as e:
117
+ logger.error(f"Error loading image {filename}: {e}")
118
+ await self.image_queue.put(None) # Sentinel to signal end of images
119
+ print(f"Total images to process: {self.total_images}")
120
+
121
+ async def image_processor_worker(self, loop: asyncio.AbstractEventLoop):
122
+ while True:
123
+ item = await self.image_queue.get()
124
+ if item is None:
125
+ await self.image_queue.put(None) # Propagate sentinel
126
+ break
127
+ filename, file_path = item
128
+ try:
129
+ print(f"Processing image {filename}")
130
+ image = Image.open(file_path)
131
+ clip_embedding, caption_embedding = await asyncio.gather(
132
+ loop.run_in_executor(
133
+ self.executor, self.clip_processor.process, image
134
+ ),
135
+ loop.run_in_executor(
136
+ self.executor, self.caption_processor.process, image
137
+ ),
138
+ )
139
+ await self.processed_images_queue.put(
140
+ (filename, clip_embedding, caption_embedding)
141
+ )
142
+ print(f"Processed image {filename}")
143
+ except Exception as e:
144
+ logger.error(f"Error processing image {filename}: {e}")
145
+ finally:
146
+ self.image_queue.task_done()
147
+
148
+ async def save_processed_images(self, output_folder: str):
149
+ while self.processed_count < self.total_images:
150
+ filename, clip_embedding, caption_embedding = (
151
+ await self.processed_images_queue.get()
152
+ )
153
+ try:
154
+ clip_output_path = os.path.join(
155
+ output_folder, f"{os.path.splitext(filename)[0]}_clip.npy"
156
+ )
157
+ caption_output_path = os.path.join(
158
+ output_folder, f"{os.path.splitext(filename)[0]}_caption.npy"
159
+ )
160
+
161
+ await asyncio.gather(
162
+ self.save_embedding(clip_output_path, clip_embedding),
163
+ self.save_embedding(caption_output_path, caption_embedding),
164
+ )
165
+
166
+ self.state.processed_files.add(filename)
167
+ self.save_state()
168
+ self.processed_count += 1
169
+ print(f"Saved processed embeddings for {filename}")
170
+ except Exception as e:
171
+ logger.error(f"Error saving processed image {filename}: {e}")
172
+ finally:
173
+ self.processed_images_queue.task_done()
174
+
175
+ async def save_embedding(self, output_path: str, embedding: np.ndarray):
176
+ async with aiofiles.open(output_path, "wb") as f:
177
+ await f.write(embedding.tobytes())
178
+
179
+ def load_state(self) -> State:
180
+ try:
181
+ with open(self.checkpoint_file, "r") as f:
182
+ state_dict = json.load(f)
183
+ print("Loaded state from checkpoint")
184
+ return State.from_dict(state_dict)
185
+ except (FileNotFoundError, json.JSONDecodeError):
186
+ print("No checkpoint found, starting with empty state")
187
+ return State()
188
+
189
+ def save_state(self):
190
+ with open(self.checkpoint_file, "w") as f:
191
+ json.dump(self.state.to_dict(), f)
192
+ print("Saved state to checkpoint")
193
+
194
+ async def run(
195
+ self,
196
+ input_folder: str,
197
+ output_folder: str,
198
+ loop: asyncio.AbstractEventLoop,
199
+ num_workers: int = 2,
200
+ ):
201
+ os.makedirs(output_folder, exist_ok=True)
202
+ print(f"Output folder {output_folder} created")
203
+
204
+ tasks = [
205
+ loop.create_task(self.image_loader(input_folder)),
206
+ loop.create_task(self.save_processed_images(output_folder)),
207
+ ]
208
+ tasks.extend(
209
+ [
210
+ loop.create_task(self.image_processor_worker(loop))
211
+ for _ in range(num_workers)
212
+ ]
213
+ )
214
+
215
+ await asyncio.gather(*tasks)
216
+
217
+
218
+ class ImageFeatureExtractorFactory:
219
+ @staticmethod
220
+ def create() -> ImageFeatureExtractor:
221
+ print(
222
+ "Creating ImageFeatureExtractor with CLIPImageProcessor and ImageCaptioningProcessor"
223
+ )
224
+ return ImageFeatureExtractor(CLIPImageProcessor(), ImageCaptioningProcessor())
225
+
226
+
227
+ async def main(loop: asyncio.AbstractEventLoop, input_folder: str, output_folder: str):
228
+ print("Starting main function")
229
+
230
+ extractor = ImageFeatureExtractorFactory.create()
231
+
232
+ try:
233
+ await extractor.run(input_folder, output_folder, loop)
234
+ except Exception as e:
235
+ logger.error(f"An error occurred during execution: {e}")
236
+ finally:
237
+ logger.info("Image processing completed.")
238
+
239
+
240
+ if __name__ == "__main__":
241
+ from pathlib import Path
242
+
243
+ PROJECT_ROOT = Path(__file__).resolve().parent.parent
244
+ loop = asyncio.new_event_loop()
245
+ asyncio.set_event_loop(loop)
246
+ print("Event loop created and set")
247
+ input_folder = str(PROJECT_ROOT / "data/images")
248
+ output_folder = str(PROJECT_ROOT / "data/features")
249
+ loop.run_until_complete(main(loop, input_folder, output_folder))
250
+ loop.close()
251
+ print("Event loop closed")
engine/search.py ADDED
@@ -0,0 +1,216 @@
1
+ import os
2
+ import numpy as np
3
+ from typing import List, Tuple
4
+ import torch
5
+ from glob import glob
6
+ from PIL import Image
7
+ from tqdm import tqdm
8
+ import matplotlib.pyplot as plt
9
+ from transformers import CLIPProcessor, CLIPModel
10
+ from sentence_transformers import SentenceTransformer
11
+ import sqlite3
12
+ from .vector_database import (
13
+ VectorDB,
14
+ ImageEmbeddingCollectionSchema,
15
+ TextEmbeddingCollectionSchema,
16
+ )
17
+
18
+
19
+ class ImageSearchModule:
20
+ def __init__(
21
+ self,
22
+ image_embeddings_dir: str,
23
+ original_images_dir: str,
24
+ sqlite_db_path: str = "image_tracker.db",
25
+ ):
26
+ self.image_embeddings_dir = image_embeddings_dir
27
+ self.original_images_dir = original_images_dir
28
+ self.vector_db = VectorDB()
29
+ self.vector_db.create_collection(ImageEmbeddingCollectionSchema)
30
+ self.vector_db.create_collection(TextEmbeddingCollectionSchema)
31
+
32
+ self.clip_model = CLIPModel.from_pretrained(
33
+ "wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M"
34
+ )
35
+ self.clip_preprocess = CLIPProcessor.from_pretrained(
36
+ "wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M"
37
+ )
38
+ self.text_embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
39
+
40
+ self.sqlite_conn = sqlite3.connect(sqlite_db_path)
41
+ self._create_sqlite_table()
42
+
43
+ def _create_sqlite_table(self):
44
+ cursor = self.sqlite_conn.cursor()
45
+ cursor.execute(
46
+ """
47
+ CREATE TABLE IF NOT EXISTS added_images (
48
+ image_name TEXT PRIMARY KEY
49
+ )
50
+ """
51
+ )
52
+ self.sqlite_conn.commit()
53
+
54
+ def add_images(self):
55
+ print("Adding images to vector databases")
56
+ cursor = self.sqlite_conn.cursor()
57
+
58
+ for filename in tqdm(os.listdir(self.image_embeddings_dir)):
59
+ if filename.startswith("resized_") and filename.endswith("_clip.npy"):
60
+ image_name = filename[
61
+ 8:-9
62
+ ] # Remove "resized_" prefix and "_clip.npy" suffix
63
+
64
+ cursor.execute(
65
+ "SELECT 1 FROM added_images WHERE image_name = ?", (image_name,)
66
+ )
67
+ if cursor.fetchone() is None:
68
+ clip_embedding_path = os.path.join(
69
+ self.image_embeddings_dir, filename
70
+ )
71
+ caption_embedding_path = os.path.join(
72
+ self.image_embeddings_dir, f"resized_{image_name}_caption.npy"
73
+ )
74
+
75
+ if os.path.exists(clip_embedding_path) and os.path.exists(
76
+ caption_embedding_path
77
+ ):
78
+ with open(clip_embedding_path, "rb") as buffer:
79
+ image_embedding = np.frombuffer(
80
+ buffer.read(), dtype=np.float32
81
+ ).reshape(512)
82
+ with open(caption_embedding_path, "rb") as buffer:
83
+ text_embedding = np.frombuffer(
84
+ buffer.read(), dtype=np.float32
85
+ ).reshape(384)
86
+
87
+ if self.vector_db.insert_record(
88
+ ImageEmbeddingCollectionSchema.collection_name,
89
+ image_embedding,
90
+ image_name,
91
+ ):
92
+ self.vector_db.insert_record(
93
+ TextEmbeddingCollectionSchema.collection_name,
94
+ text_embedding,
95
+ image_name,
96
+ )
97
+ cursor.execute(
98
+ "INSERT INTO added_images (image_name) VALUES (?)",
99
+ (image_name,),
100
+ )
101
+ self.sqlite_conn.commit()
102
+
103
+ print("Finished adding images to vector databases")
104
+
105
+ def search_by_image(
106
+ self, query_image_path: str, top_k: int = 5, similarity_threshold: float = 0.5
107
+ ) -> List[Tuple[str, float]]:
108
+ if not os.path.exists(query_image_path):
109
+ print(f"Image file not found: {query_image_path}")
110
+ return []
111
+ try:
112
+ query_image = Image.open(query_image_path)
113
+ query_embedding = self._get_image_embedding(query_image)
114
+ results = self.vector_db.client.search(
115
+ collection_name=ImageEmbeddingCollectionSchema.collection_name,
116
+ data=[query_embedding],
117
+ output_fields=["filename"],
118
+ search_params={"metric_type": "COSINE"},
119
+ limit=top_k,
120
+ ).pop()
121
+ return [(item["entity"]["filename"], item["distance"]) for item in results if item["distance"] >= similarity_threshold]
122
+ except Exception as e:
123
+ print(f"Error processing image: {e}")
124
+ return []
125
+
126
+ def search_by_text(
127
+ self, query_text: str, top_k: int = 5,similarity_threshold: float = 0.5
128
+ ) -> List[Tuple[str, float]]:
129
+ if not query_text.strip():
130
+ print("Empty text query")
131
+ return []
132
+ try:
133
+ query_embedding = self._get_text_embedding(query_text)
134
+ results = self.vector_db.client.search(
135
+ collection_name=TextEmbeddingCollectionSchema.collection_name,
136
+ data=[query_embedding],
137
+ search_params={"metric_type": "COSINE"},
138
+ output_fields=["filename"],
139
+ limit=top_k,
140
+ ).pop()
141
+ return [(item["entity"]["filename"], item["distance"]) for item in results if item["distance"] >= similarity_threshold]
142
+ except Exception as e:
143
+ print(f"Error processing text: {e}")
144
+ return []
145
+
146
+ def _get_image_embedding(self, image: Image.Image) -> np.ndarray:
147
+ with torch.no_grad():
148
+ image_input = self.clip_preprocess(images=image, return_tensors="pt")[
149
+ "pixel_values"
150
+ ].to(self.clip_model.device)
151
+ image_features = self.clip_model.get_image_features(image_input)
152
+ return image_features.cpu().numpy().flatten()
153
+
154
+ def _get_text_embedding(self, text: str) -> np.ndarray:
155
+ with torch.no_grad():
156
+ embedding = self.text_embedding_model.encode(text).flatten()
157
+ return embedding
158
+
159
+ def display_results(self, results: List[Tuple[str, float]]):
160
+ if not results:
161
+ print("No results to display.")
162
+ return
163
+
164
+ num_images = min(5, len(results))
165
+ fig, axes = plt.subplots(1, num_images, figsize=(20, 4))
166
+ axes = [axes] if num_images == 1 else axes
167
+
168
+ for i, (image_name, similarity) in enumerate(results[:num_images]):
169
+ pattern = os.path.join(
170
+ self.original_images_dir, f"resized_{image_name}" + "*"
171
+ )
172
+ matching_files = glob(pattern)
173
+ if matching_files:
174
+ image_path = matching_files[0]
175
+ img = Image.open(image_path)
176
+ axes[i].imshow(img)
177
+ axes[i].set_title(f"Similarity: {similarity:.2f}")
178
+ axes[i].axis("off")
179
+ else:
180
+ print(f"No matching image found for {image_name}")
181
+ axes[i].text(0.5, 0.5, "Image not found", ha="center", va="center")
182
+ axes[i].axis("off")
183
+
184
+ plt.tight_layout()
185
+ plt.show()
186
+
187
+ def __del__(self):
188
+ if hasattr(self, "sqlite_conn"):
189
+ self.sqlite_conn.close()
190
+
191
+
192
+ if __name__ == "__main__":
193
+ from pathlib import Path
194
+ import requests
195
+
196
+ PROJECT_ROOT = Path(__file__).resolve().parent.parent
197
+ search = ImageSearchModule(
198
+ image_embeddings_dir=str(PROJECT_ROOT / "data/features"),
199
+ original_images_dir=str(PROJECT_ROOT / "data/images"),
200
+ )
201
+ search.add_images()
202
+
203
+ # Search by image
204
+ img_url = (
205
+ "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"
206
+ )
207
+ raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
208
+ raw_image.save(PROJECT_ROOT / "test.jpg")
209
+ image_results = search.search_by_image(str(PROJECT_ROOT / "test.jpg"))
210
+ print("Image search results:")
211
+ search.display_results(image_results)
212
+
213
+ # Search by text
214
+ text_results = search.search_by_text("Images of Nature")
215
+ print("Text search results:")
216
+ search.display_results(text_results)
engine/upload_data_to_hf.py ADDED
@@ -0,0 +1,16 @@
1
+ import os
2
+ from huggingface_hub import HfApi
3
+ from pathlib import Path
4
+
5
+ PROJECT_ROOT = Path(__file__).resolve().parent.parent
6
+ api = HfApi()
7
+ print("Uploading data.....")
8
+ api.upload_folder(
9
+ folder_path=str(PROJECT_ROOT / "data"),
10
+ repo_id="satishjasthij/Unsplash-Visual-Semantic",
11
+ repo_type="space",
12
+ token=os.getenv("HUGGINGFACE_TOKEN"),
13
+ commit_message="add dataset",
14
+ create_pr=True,
15
+ )
16
+ print("Finished uploading data")
engine/vector_database.py ADDED
@@ -0,0 +1,68 @@
1
+ from dataclasses import dataclass, asdict
2
+ from pathlib import Path
3
+ import random
4
+ import numpy as np
5
+ from pymilvus import MilvusClient
6
+
7
+
8
+ @dataclass
9
+ class MilvusServer:
10
+ uri: str = "milvus.db"
11
+
12
+
13
+ @dataclass
14
+ class EmbeddingCollectionSchema:
15
+ collection_name: str
16
+ vector_field_name: str
17
+ dimension: int
18
+ auto_id: bool
19
+ enable_dynamic_field: bool
20
+ metric_type: str
21
+
22
+
23
+ ImageEmbeddingCollectionSchema = EmbeddingCollectionSchema(
24
+ collection_name="image_embeddings",
25
+ vector_field_name="embedding",
26
+ dimension=512,
27
+ auto_id=True,
28
+ enable_dynamic_field=True,
29
+ metric_type="COSINE",
30
+ )
31
+
32
+ TextEmbeddingCollectionSchema = EmbeddingCollectionSchema(
33
+ collection_name="text_embeddings",
34
+ vector_field_name="embedding",
35
+ dimension=384,
36
+ auto_id=True,
37
+ enable_dynamic_field=True,
38
+ metric_type="COSINE",
39
+ )
40
+
41
+
42
+ class VectorDB:
43
+
44
+ def __init__(self, client: MilvusClient = MilvusClient(uri=MilvusServer.uri)):
45
+ self.client = client
46
+
47
+ def create_collection(self, schema: EmbeddingCollectionSchema):
48
+ if self.client.has_collection(collection_name=schema.collection_name):
49
+ print(f"Collection {schema.collection_name} already exists")
50
+ return True
51
+ # self.client.drop_collection(collection_name=schema.collection_name)
52
+ print(f"Creating collection {schema.collection_name}")
53
+ self.client.create_collection(**asdict(schema))
54
+ print(f"Collection {schema.collection_name} created")
55
+ return True
56
+
57
+ def insert_record(
58
+ self, collection_name: str, embedding: np.ndarray, file_path: str
59
+ ) -> bool:
60
+ try:
61
+ self.client.insert(
62
+ collection_name=collection_name,
63
+ data={"embedding": embedding, "filename": file_path},
64
+ )
65
+ except Exception as e:
66
+ print(f"Error inserting record: {e}")
67
+ return False
68
+ return True
requirements.txt ADDED
@@ -0,0 +1,115 @@
1
+ aiofiles==23.2.1
2
+ aiohttp==3.9.5
3
+ aiosignal==1.3.1
4
+ annotated-types==0.7.0
5
+ anyio==4.4.0
6
+ asttokens==2.4.1
7
+ async-timeout==4.0.3
8
+ attrs==23.2.0
9
+ black==24.4.2
10
+ certifi==2024.7.4
11
+ charset-normalizer==3.3.2
12
+ click==8.1.7
13
+ contourpy==1.2.1
14
+ cycler==0.12.1
15
+ decorator==5.1.1
16
+ dnspython==2.6.1
17
+ email-validator==2.2.0
18
+ environs==9.5.0
19
+ exceptiongroup==1.2.2
20
+ executing==2.0.1
21
+ fastapi==0.111.1
22
+ fastapi-cli==0.0.4
23
+ ffmpy==0.3.2
24
+ filelock==3.15.4
25
+ fonttools==4.53.1
26
+ frozenlist==1.4.1
27
+ fsspec==2024.6.1
28
+ gradio==4.39.0
29
+ gradio-client==1.1.1
30
+ grpcio==1.63.0
31
+ h11==0.14.0
32
+ httpcore==1.0.5
33
+ httptools==0.6.1
34
+ httpx==0.27.0
35
+ huggingface-hub==0.24.0
36
+ idna==3.7
37
+ importlib-resources==6.4.0
38
+ ipython==8.18.1
39
+ jedi==0.19.1
40
+ jinja2==3.1.4
41
+ joblib==1.4.2
42
+ kiwisolver==1.4.5
43
+ markdown-it-py==3.0.0
44
+ markupsafe==2.1.5
45
+ marshmallow==3.21.3
46
+ matplotlib==3.9.1
47
+ matplotlib-inline==0.1.7
48
+ mdurl==0.1.2
49
+ milvus-lite==2.4.8
50
+ mpmath==1.3.0
51
+ multidict==6.0.5
52
+ mypy-extensions==1.0.0
53
+ networkx==3.2.1
54
+ numpy==1.26.4
55
+ orjson==3.10.6
56
+ packaging==24.1
57
+ pandas==2.2.2
58
+ parso==0.8.4
59
+ pathspec==0.12.1
60
+ pexpect==4.9.0
61
+ pillow==10.4.0
62
+ platformdirs==4.2.2
63
+ prompt-toolkit==3.0.47
64
+ protobuf==5.27.2
65
+ psutil==6.0.0
66
+ ptyprocess==0.7.0
67
+ pure-eval==0.2.3
68
+ pydantic==2.8.2
69
+ pydantic-core==2.20.1
70
+ pydub==0.25.1
71
+ pygments==2.18.0
72
+ pymilvus==2.4.4
73
+ pyparsing==3.1.2
74
+ python-dateutil==2.9.0.post0
75
+ python-dotenv==1.0.1
76
+ python-multipart==0.0.9
77
+ pytz==2024.1
78
+ pyyaml==6.0.1
79
+ regex==2024.5.15
80
+ requests==2.32.3
81
+ rich==13.7.1
82
+ ruff==0.5.5
83
+ safetensors==0.4.3
84
+ scikit-learn==1.5.1
85
+ scipy==1.13.1
86
+ semantic-version==2.10.0
87
+ sentence-transformers==3.0.1
88
+ setuptools==71.1.0
89
+ shellingham==1.5.4
90
+ six==1.16.0
91
+ sniffio==1.3.1
92
+ stack-data==0.6.3
93
+ starlette==0.37.2
94
+ sympy==1.13.1
95
+ threadpoolctl==3.5.0
96
+ tokenizers==0.19.1
97
+ tomli==2.0.1
98
+ tomlkit==0.12.0
99
+ torch==2.3.1
100
+ torchvision==0.18.1
101
+ tqdm==4.66.4
102
+ traitlets==5.14.3
103
+ transformers==4.42.4
104
+ typer==0.12.3
105
+ typing-extensions==4.12.2
106
+ tzdata==2024.1
107
+ ujson==5.10.0
108
+ urllib3==2.2.2
109
+ uvicorn==0.30.3
110
+ uvloop==0.19.0
111
+ watchfiles==0.22.0
112
+ wcwidth==0.2.13
113
+ websockets==11.0.3
114
+ yarl==1.9.4
115
+ zipp==3.19.2