# ColPali 🤝 Vespa - Visual Retrieval System
A visual document retrieval system that combines **ColPali** (a ColBERT-style late-interaction retriever built on the PaliGemma vision-language model) with **Vespa** for scalable document search and question answering.
## Features
### **Visual Document Search**
- **Multi-modal retrieval**: Search through PDF documents using natural language queries
- **Visual understanding**: ColPali embeds each page as an image, so text, layout, and figures are captured together without a separate OCR step
- **Token-level similarity maps**: Visualize exactly which parts of documents match your query
- **Multiple ranking algorithms**: Choose between hybrid, semantic, BM25, and ColPali ranking
### 🧠 **AI-Powered Chat**
- **Intelligent Q&A**: Ask questions about retrieved documents using Google Gemini 2.0
- **Context-aware responses**: AI analyzes document images to provide accurate answers
- **Real-time streaming**: Get responses as they're generated
### ⚡ **Scalable Infrastructure**
- **Vespa integration**: Enterprise-grade search platform for large document collections
- **Real-time processing**: Instant search results and similarity map generation
- **Cloud-ready**: Supports Vespa Cloud deployment with secure authentication
## 🏗️ Architecture
```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Frontend     │    │     Backend     │    │   Vespa Cloud   │
│    (Browser)    │    │   (Your Local   │    │    (Remote)     │
│                 │    │    Computer)    │    │                 │
│ • Search UI     │◄──►│ • ColPali Model │◄──►│ • Document Store│
│ • Similarity    │    │ • Query Proc.   │    │ • Vector Search │
│   Maps          │    │ • Sim Map Gen.  │    │ • Ranking       │
│ • Chat Interface│    │ • Gemini Int.   │    │                 │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
    Web Browser             LOCAL AI            REMOTE Storage
```
### **LOCAL Processing (Your Computer)**
**All ColPali inference happens on your local machine:**
- **ColPali Model**: Runs locally on your GPU/CPU (~7GB model)
- **Document Processing**: PDF → Images → Embeddings (local)
- **Query Processing**: Text → Embeddings (local)
- **Similarity Maps**: Visual attention generation (local)
- **Gemini Chat**: Retrieved page images are prepared locally, then sent to the Google Gemini API, where the chat model itself runs
**Device Detection:**
```python
device = get_torch_device("auto") # Detects: CUDA, MPS (Apple), or CPU
print(f"Using device: {device}") # Shows YOUR hardware
```
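For readers without the helper at hand, the selection order named in the comment (CUDA, then MPS, then CPU) can be sketched as follows. `pick_device` and `detect_device` are hypothetical names, not this project's API. The sketch assumes PyTorch's standard availability checks and reports CPU when PyTorch is not installed.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device name: CUDA first, then Apple MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available backends; report CPU if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    mps = getattr(torch.backends, "mps", None)
    mps_available = bool(mps is not None and mps.is_available())
    return pick_device(torch.cuda.is_available(), mps_available)


print(f"Using device: {detect_device()}")
```

Keeping the decision in a pure function makes the preference order easy to test without any particular hardware.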
### ☁️ **REMOTE Processing (Vespa Cloud)**
**Only storage and search index operations happen remotely:**
- **Document Storage**: Stores processed embeddings (not raw models)
- **Vector Search**: Fast similarity search across document collection
- **Query Routing**: Handles search requests and ranking
- **Metadata Storage**: Document titles, URLs, page numbers
### **Complete Data Flow**
#### **Document Upload Process:**
1. **LOCAL**: Your computer downloads PDF from URL
2. **LOCAL**: PDF pages are rendered to images (one image per page)
3. **LOCAL**: ColPali generates visual embeddings (1024 patches × 128 dims)
4. **LOCAL**: Embeddings converted to binary format for efficiency
5. **REMOTE**: Binary embeddings uploaded to Vespa Cloud
6. **REMOTE**: Vespa indexes embeddings for fast search
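Steps 3 and 4 can be illustrated with a small, dependency-free sketch. It assumes binarization by sign (a positive value becomes a 1 bit), which is a common choice but is not confirmed by this README. `binarize_patch` is a hypothetical helper, and the random vector stands in for a real patch embedding.

```python
import random

DIMS = 128  # embedding dimensions per patch, as stated in step 3


def binarize_patch(vector: list[float]) -> bytes:
    """Pack one float vector into bits: 1 where the value is positive, else 0."""
    if len(vector) % 8 != 0:
        raise ValueError("vector length must be a multiple of 8")
    packed = bytearray()
    for start in range(0, len(vector), 8):
        byte = 0
        for value in vector[start:start + 8]:
            byte = (byte << 1) | (1 if value > 0 else 0)
        packed.append(byte)
    return bytes(packed)


rng = random.Random(0)
patch = [rng.uniform(-1.0, 1.0) for _ in range(DIMS)]
encoded = binarize_patch(patch)
print(len(encoded))  # 16 bytes: 128 dimensions at 1 bit each
```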
#### **Search Query Process:**
1. **LOCAL**: You enter search query in browser
2. **LOCAL**: ColPali processes query → generates query embeddings
3. **REMOTE**: Query embeddings sent to Vespa Cloud
4. **REMOTE**: Vespa searches document index, returns matches
5. **LOCAL**: ColPali generates similarity maps for results
6. **BROWSER**: Results displayed with visual attention maps
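ColPali-style retrieval scores a page by late interaction: each query-token embedding is matched to its best page patch, and the per-token maxima are summed (MaxSim). The sketch below shows that computation on tiny float vectors. It illustrates the scoring idea only and does not reproduce the Vespa rank profiles this project uses; the function names are hypothetical.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens, page_patches) -> float:
    """Sum, over query tokens, of the best dot product against any page patch."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


def best_patch_per_token(query_tokens, page_patches) -> list[int]:
    """Index of the best-matching patch per query token (the basis of a similarity map)."""
    return [
        max(range(len(page_patches)), key=lambda i: dot(q, page_patches[i]))
        for q in query_tokens
    ]


query = [[1.0, 0.0], [0.0, 1.0]]             # two query-token embeddings
page = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]  # three page-patch embeddings
print(maxsim_score(query, page))
print(best_patch_per_token(query, page))     # [0, 1]
```

The per-token best patch is also what makes token-level similarity maps possible: each query token points at the region of the page it matched most strongly.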
#### **AI Chat Process:**
1. **LOCAL**: Retrieved document images processed by your machine
2. **REMOTE**: Images + query sent to Google Gemini API
3. **REMOTE**: Gemini generates response based on visual content
4. **BROWSER**: Streaming response displayed in real-time
### Core Components
- **ColPali Model**: Visual-language model for document understanding (LOCAL)
- **Vespa Search**: Distributed search and storage engine (REMOTE)
- **FastHTML Frontend**: Modern, responsive web interface (BROWSER)
- **Gemini Integration**: AI-powered question answering (REMOTE API)
- **Similarity Map Generator**: Visual attention visualization (LOCAL)
## 💻 **System Requirements**
### **LOCAL Machine Requirements (For AI Processing)**
**Minimum:**
- **CPU**: Modern multi-core processor (Intel/AMD/Apple Silicon)
- **RAM**: 8GB+ (16GB recommended)
- **Storage**: 10GB free space (for model cache)
- **Python**: 3.10+ (< 3.13)
**Recommended:**
- **GPU**: NVIDIA GPU with 8GB+ VRAM (RTX 3070/4060 or better)
- **Apple**: M1/M2/M3 Mac (uses Metal Performance Shaders)
- **RAM**: 16GB+ for smoother processing
- **Storage**: SSD for faster model loading
**Performance Examples:**
- **RTX 4090**: ~1-2 seconds per query
- **RTX 3070**: ~3-5 seconds per query
- **Apple M2**: ~4-6 seconds per query
- **CPU Only**: ~15-30 seconds per query
### **REMOTE Requirements (Vespa Cloud)**
**What you need:**
- **Vespa Cloud account** (handles all remote processing)
- **Internet connection** (for uploading embeddings and search queries)
- **Authentication tokens** (provided by Vespa Cloud)
**What Vespa Cloud provides:**
- **Scalable storage** for any number of documents
- **Sub-second search** across millions of embeddings
- **High availability** with automatic failover
- **Multi-region deployment** so the index can run close to your users
## 💰 **Cost Breakdown**
### **FREE Components**
- **ColPali Model**: Open source, runs locally (no per-query costs)
- **Python Application**: Apache-2.0 licensed, completely free
- **Local Processing**: Uses your own hardware (no cloud AI fees)
### **PAID Components**
- **Vespa Cloud**: Pay for storage and search operations
- ~$0.001 per 1000 searches
- ~$0.10 per GB storage per month
- **Google Gemini API**: Optional, for chat features only
- ~$0.01 per 1000 image tokens
- Only used when you ask questions about documents
### **Cost Examples (Monthly)**
- **Personal Use** (100 documents, 1000 searches): ~$5-10/month
- **Small Business** (1000 documents, 10k searches): ~$20-50/month
- **Enterprise** (10k+ documents, 100k+ searches): $200+/month
**💡 Cost Optimization Tips:**
- Use local Vespa installation to avoid cloud costs
- Disable Gemini chat if not needed (saves API costs)
- Process documents in batches to minimize upload time
## Quick Start
### Prerequisites
- Python 3.10+ (< 3.13)
- **8GB+ RAM** for ColPali model
- **Vespa Cloud account** or local Vespa installation
- **Google Gemini API key** (optional, for chat features)
- **GPU recommended** but not required
### 1. Installation
```bash
# Clone the repository
git clone <repository-url>
cd colpali-vespa-visual-retrieval
# Install dependencies
pip install -e .
# For development
pip install -e ".[dev]"
# For document feeding capabilities
pip install -e ".[feed]"
```
### 2. Environment Configuration
Create a `.env` file with your configuration:
```bash
# Vespa Configuration
VESPA_APP_TOKEN_URL=https://your-app.vespa-cloud.com
VESPA_CLOUD_SECRET_TOKEN=your_secret_token
# Alternative: mTLS Authentication
USE_MTLS=false
VESPA_APP_MTLS_URL=https://your-app.vespa-cloud.com
VESPA_CLOUD_MTLS_KEY="-----BEGIN PRIVATE KEY-----..."
VESPA_CLOUD_MTLS_CERT="-----BEGIN CERTIFICATE-----..."
# Optional: Gemini AI (for chat features)
GEMINI_API_KEY=your_gemini_api_key
# Optional: Logging
LOG_LEVEL=INFO
HOT_RELOAD=false
```
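Which variables are required depends on `USE_MTLS`. A minimal sketch of that selection logic, assuming the variable names from the `.env` above; `load_vespa_settings` is a hypothetical helper, not a function from this repository.

```python
import os
from collections.abc import Mapping


def load_vespa_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the Vespa connection settings for the selected auth mode."""
    use_mtls = env.get("USE_MTLS", "false").strip().lower() == "true"
    if use_mtls:
        required = ["VESPA_APP_MTLS_URL", "VESPA_CLOUD_MTLS_KEY", "VESPA_CLOUD_MTLS_CERT"]
    else:
        required = ["VESPA_APP_TOKEN_URL", "VESPA_CLOUD_SECRET_TOKEN"]
    # Fail early with a clear message instead of failing on the first request
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    settings = {name: env[name] for name in required}
    settings["auth"] = "mtls" if use_mtls else "token"
    return settings
```

Calling this once at startup surfaces a misconfigured `.env` before any model is loaded.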
### 3. Deploy Vespa Application
```bash
# Deploy the Vespa schema and configuration
python deploy_vespa_app.py \
  --tenant_name your_tenant \
  --vespa_application_name colpalidemo \
  --token_id_write colpalidemo_write \
  --token_id_read colpalidemo_read
```
### 4. Run the Application
```bash
python main.py
```
The application will be available at `http://localhost:7860`
## Document Management
### Uploading Documents
Use the feeding script to process and upload PDF documents:
```bash
python feed_vespa.py \
  --application_name colpalidemo \
  --vespa_schema_name pdf_page
```
**Document Processing Pipeline (LOCAL → REMOTE):**
1. **PDF Download** (LOCAL): Your computer downloads PDFs from URLs
2. **PDF Conversion** (LOCAL): PDFs converted to images (one per page)
3. **ColPali Processing** (LOCAL): Each page processed by ColPali model on YOUR GPU/CPU
4. **Embedding Generation** (LOCAL): Visual embeddings created (1024 patches × 128 dimensions)
5. **Binary Encoding** (LOCAL): Embeddings converted to efficient binary format
6. **Vespa Upload** (REMOTE): Binary embeddings uploaded to Vespa Cloud
7. **Search Indexing** (REMOTE): Vespa indexes embeddings for fast retrieval
**⚠️ Important Notes:**
- **Processing Time**: Expect 5-30 seconds per page depending on your hardware
- **Network Usage**: Only the binary embeddings are uploaded. Assuming 1 bit per dimension, 1024 patches × 128 dimensions is about 16 KB per page, versus roughly 1 MB for the original page image
- **Privacy**: Original PDFs and page images are not uploaded to Vespa; page images leave your machine only when the chat feature sends them to the Gemini API
- **Storage**: Raw images cached locally for similarity map generation
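The size of one page's embedding follows directly from the figures in step 4. The arithmetic below assumes 1 bit per dimension after binary encoding and 4 bytes per dimension (float32) before it; both are assumptions about the encoding rather than values stated in this README.

```python
PATCHES_PER_PAGE = 1024    # from the pipeline description
DIMS_PER_PATCH = 128       # from the pipeline description
BITS_PER_DIM_BINARY = 1    # assumption: one bit per dimension after binarization
BYTES_PER_DIM_FLOAT32 = 4  # assumption: float32 before binarization

binary_bytes = PATCHES_PER_PAGE * DIMS_PER_PATCH * BITS_PER_DIM_BINARY // 8
float_bytes = PATCHES_PER_PAGE * DIMS_PER_PATCH * BYTES_PER_DIM_FLOAT32

print(f"binary:  {binary_bytes / 1024:.0f} KiB per page")  # 16 KiB
print(f"float32: {float_bytes / 1024:.0f} KiB per page")   # 512 KiB
print(f"reduction: {float_bytes // binary_bytes}x")        # 32x
```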
### Supported Operations
- ✅ **Upload Documents**: Add new PDFs to the system
- ✅ **Search Documents**: Query existing documents
- ✅ **View Documents**: Browse stored documents
- ❌ **Remove Documents**: _Not currently implemented_
- ❌ **Update Documents**: _Not currently implemented_
## Authentication & Security
### 🛡️ **Current Security Implementation**
#### **SECURE Components:**
**Vespa Authentication (REMOTE)**
- **Token Authentication**: Bearer tokens for Vespa Cloud API access
- **mTLS Certificates**: Mutual TLS for enterprise security
- **Encrypted Communication**: HTTPS/TLS for all Vespa connections
**API Key Management (LOCAL)**
- **Environment Variables**: Sensitive keys stored in `.env` files
- **API Key Rotation**: Google Gemini supports key rotation
- **Local Storage**: Keys never transmitted except to authorized APIs
#### **LIMITED Security Components:**
**Session Management**
```
# FastHTML (Python): basic UUID session tracking
session["session_id"] = str(uuid.uuid4())

// Next.js (TypeScript): HTTP-only session cookie
cookieStore.set(SESSION_KEY, newSessionId, {
  httpOnly: true,
  secure: process.env.NODE_ENV === "production",
  sameSite: "lax",
  maxAge: 60 * 60 * 24 * 30, // 30 days
});
```
**Basic Request Validation**
```python
# HTMX request validation
if "hx-request" not in request.headers:
    return RedirectResponse("/search")

# Parameter validation
if not query:
    return JSONResponse({"error": "Query is required"}, status_code=400)
```
### ⚠️ **Security Limitations & Risks**
#### **MISSING Security Features:**
**❌ No API Authentication**
- Local API endpoints are **completely open**
- No rate limiting or abuse protection
- No user authentication or authorization
- Anyone can access `/fetch_results`, `/get_sim_map` endpoints
**❌ No Input Sanitization**
```python
# Raw user input passed directly to models
query = searchParams.get("query") # No validation/sanitization
ranking = searchParams.get("ranking") # No input filtering
```
**❌ No Security Headers**
- No CORS configuration
- No Content Security Policy (CSP)
- No X-Frame-Options protection
- No X-Content-Type-Options validation
**❌ No Rate Limiting**
- Unlimited API requests
- No protection against DoS attacks
- No query throttling or user limits
**❌ No CSRF Protection**
- No token validation for state-changing operations
- Cross-site request forgery possible
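One common way to close the CSRF gap is a token derived from the session ID with a server-side secret. The following is a minimal sketch using only the standard library. The function names are hypothetical, and the secret is assumed to come from an environment variable such as the `CSRF_SECRET` listed under Configuration Options.

```python
import hashlib
import hmac


def make_csrf_token(session_id: str, secret: str) -> str:
    """Derive a per-session token as HMAC-SHA256(secret, session_id)."""
    return hmac.new(secret.encode(), session_id.encode(), hashlib.sha256).hexdigest()


def verify_csrf_token(session_id: str, token: str, secret: str) -> bool:
    """Compare in constant time so the check does not leak how much of the token matched."""
    expected = make_csrf_token(session_id, secret)
    return hmac.compare_digest(expected.encode(), token.encode())


token = make_csrf_token("session-123", "example-secret")
print(verify_csrf_token("session-123", token, "example-secret"))  # True
print(verify_csrf_token("session-456", token, "example-secret"))  # False
```

The server would embed the token in forms or a request header and reject state-changing requests whose token fails verification.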
### 🎯 **Security Recommendations**
#### **IMMEDIATE (High Priority)**
**1. Add API Authentication**
```typescript
// middleware.ts - Add API key validation
import type { NextRequest } from "next/server";

export function middleware(request: NextRequest) {
  const apiKey = request.headers.get("X-API-Key");
  if (!apiKey || apiKey !== process.env.COLPALI_API_KEY) {
    return new Response("Unauthorized", { status: 401 });
  }
}
```
**2. Implement Rate Limiting**
```typescript
// Use next-rate-limit or similar
import rateLimit from "@/lib/rate-limit";

const limiter = rateLimit({
  interval: 60 * 1000, // 1 minute
  uniqueTokenPerInterval: 500, // Track at most 500 distinct clients per interval
});

await limiter.check(10, getClientIP(request)); // 10 requests per minute per client
```
**3. Add Security Headers**
```typescript
// next.config.js
const securityHeaders = [
  { key: "X-Frame-Options", value: "DENY" },
  { key: "X-Content-Type-Options", value: "nosniff" },
  { key: "Referrer-Policy", value: "strict-origin-when-cross-origin" },
  {
    key: "Content-Security-Policy",
    value: "default-src 'self'; script-src 'self' 'unsafe-inline'",
  },
];
```
**4. Input Validation & Sanitization**
```typescript
import { z } from "zod";

const SearchSchema = z.object({
  query: z
    .string()
    .min(1)
    .max(500)
    .regex(/^[a-zA-Z0-9\s\.\?\!]*$/),
  ranking: z.enum(["hybrid", "colpali", "bm25"]),
});
```
#### **MEDIUM Priority**
**5. CORS Configuration**
```typescript
// Restrict origins to known domains
const corsHeaders = {
  "Access-Control-Allow-Origin": "https://yourdomain.com",
  "Access-Control-Allow-Methods": "GET, POST, OPTIONS",
  "Access-Control-Allow-Headers": "Content-Type, Authorization",
};
```
**6. Request Size Limits**
```typescript
// Limit request payload sizes
export const config = {
  api: {
    bodyParser: {
      sizeLimit: "1mb",
    },
  },
};
```
**7. Audit Logging**
```python
# Log all API access with IP, timestamp, and queries
logger.info(f"API_ACCESS: {client_ip} - {endpoint} - {query[:100]}")
```
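A slightly fuller sketch of such an audit logger, using only the standard library. `format_audit_line` and `log_api_access` are hypothetical helpers. They truncate the query as in the line above and also replace line breaks, so a crafted query cannot forge extra log entries.

```python
import logging

logger = logging.getLogger("colpali.audit")


def format_audit_line(client_ip: str, endpoint: str, query: str, max_len: int = 100) -> str:
    """Build one single-line audit record with a truncated, newline-free query."""
    safe_query = query[:max_len].replace("\r", " ").replace("\n", " ")
    return f"API_ACCESS: {client_ip} - {endpoint} - {safe_query}"


def log_api_access(client_ip: str, endpoint: str, query: str) -> None:
    logger.info(format_audit_line(client_ip, endpoint, query))


logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log_api_access("127.0.0.1", "/fetch_results", "quarterly revenue\nAPI_ACCESS: forged")
```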
#### **LONG-TERM (Production Ready)**
**8. User Authentication (Optional)**
```typescript
// Add NextAuth.js or similar for user accounts
// Implement role-based access control
// Add document ownership and permissions
```
**9. Network Security**
```bash
# Deploy behind reverse proxy (nginx/cloudflare)
# Enable DDoS protection
# Use Web Application Firewall (WAF)
```
**10. Data Privacy Controls**
```typescript
// Implement data retention policies
// Add user data deletion capabilities
// GDPR compliance features
```
### **Security Best Practices**
#### **For LOCAL Development:**
- **Never commit API keys** to version control
- **Use specific environment variable names** (avoid generic ones such as `API_KEY`)
- **Rotate API keys regularly** (monthly)
- **Enable firewall** on development machines
- **Use HTTPS even locally** when testing production-like setups
#### **For PRODUCTION Deployment:**
- **Deploy behind CDN/WAF** (Cloudflare, AWS Shield)
- **Enable rate limiting** at infrastructure level
- **Use container security scanning**
- **Implement monitoring and alerting**
- **Regular security audits and penetration testing**
#### **For REMOTE Services:**
- **Vespa Cloud**: Follows enterprise security standards
- **Gemini API**: Google-managed security and compliance
- **Environment Isolation**: Separate dev/staging/prod credentials
### 🚨 **Current Risk Level: MEDIUM**
**Suitable for:**
- ✅ **Personal projects and demos**
- ✅ **Internal company tools** (behind firewall)
- ✅ **Research and development** environments
**NOT suitable for:**
- ❌ **Public internet deployment**
- ❌ **Customer-facing applications**
- ❌ **Production environments** with sensitive data
- ❌ **Commercial applications** without security hardening
## 🎯 Usage Guide
### Basic Search
1. Navigate to the homepage
2. Enter your search query in natural language
3. Select ranking method (hybrid, semantic, etc.)
4. View results with similarity maps
### Similarity Maps
- Click on token buttons to see which parts of documents match specific query terms
- Visual heatmaps show attention patterns
- Reset button returns to original document view
### AI Chat
- Ask questions about retrieved documents
- Chat responses are based on document content
- Streaming responses for real-time interaction
### Search Rankings
- **Hybrid**: Combines multiple ranking signals
- **Semantic**: Pure semantic similarity
- **BM25**: Traditional text-based ranking
- **ColPali**: Visual-first ranking
## 🛠️ Development
### Project Structure
```
├── main.py                 # Application entry point
├── backend/
│   ├── colpali.py          # ColPali model integration
│   ├── vespa_app.py        # Vespa client and queries
│   └── modelmanager.py     # Model management utilities
├── frontend/
│   ├── app.py              # UI components
│   └── layout.py           # Layout templates
├── feed_vespa.py           # Document upload script
├── deploy_vespa_app.py     # Vespa deployment script
├── colpali-with-snippets/  # Vespa schema definitions
└── static/                 # Static assets and generated files
```
### Running in Development
```bash
# Enable hot reload
export HOT_RELOAD=true
python main.py
# Or set in .env
echo "HOT_RELOAD=true" >> .env
```
### Code Quality
```bash
# Format code
ruff format .
# Lint code
ruff check .
```
## API Endpoints
### **Current API Routes (⚠️ UNSECURED)**
| Endpoint         | Method | Description             | Security Status  |
| ---------------- | ------ | ----------------------- | ---------------- |
| `/`              | GET    | Homepage                | ✅ Public (safe) |
| `/search`        | GET    | Search interface        | ✅ Public (safe) |
| `/fetch_results` | GET    | Fetch search results    | ⚠️ **OPEN API**  |
| `/get_sim_map`   | GET    | Get similarity maps     | ⚠️ **OPEN API**  |
| `/get-message`   | GET    | Chat with AI (SSE)      | ⚠️ **OPEN API**  |
| `/full_image`    | GET    | Get full document image | ⚠️ **OPEN API**  |
| `/suggestions`   | GET    | Query autocomplete      | ⚠️ **OPEN API**  |
| `/static/*`      | GET    | Static file serving     | ✅ Public (safe) |
### **Security Analysis by Endpoint**
#### **SECURE Endpoints**
- **`/`** and **`/search`**: Static HTML pages, no sensitive data
- **`/static/*`**: Public assets (CSS, JS, images)
#### **⚠️ UNSECURED Endpoints (Risk)**
**`/fetch_results`** - **HIGH RISK**
```bash
# Anyone can perform unlimited searches
curl "http://localhost:7860/fetch_results?query=secret&ranking=hybrid"
```
- **Risks**: Resource abuse, server overload, competitive intelligence gathering
- **Exposes**: Search capabilities, document metadata, processing times
**`/get_sim_map`** - **MEDIUM RISK**
```bash
# Access similarity maps without authentication
curl "http://localhost:7860/get_sim_map?query_id=123&idx=0&token=word&token_idx=5"
```
- **Risks**: Unauthorized access to visual analysis
- **Exposes**: Document visual patterns, query insights
**`/get-message`** - **HIGH RISK**
```bash
# Trigger AI processing without limits
curl "http://localhost:7860/get-message?query_id=123&query=question&doc_ids=doc1,doc2"
```
- **Risks**: Gemini API abuse, cost exploitation, resource exhaustion
- **Exposes**: AI-generated insights, document content analysis
**`/full_image`** - **HIGH RISK**
```bash
# Download any document image
curl "http://localhost:7860/full_image?doc_id=any_document_id"
```
- **Risks**: Unauthorized document access, data leakage
- **Exposes**: Full document images, potentially sensitive content
### **Immediate Security Fixes**
#### **1. Add API Key Authentication**
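```python
# Python FastHTML middleware
import hmac
import os

from starlette.responses import JSONResponse

@app.middleware("http")
async def verify_api_key(request, call_next):
    if request.url.path.startswith("/fetch_results"):
        api_key = request.headers.get("X-API-Key", "")
        expected = os.getenv("COLPALI_API_KEY", "")
        # Constant-time comparison; an unset key rejects every request
        if not expected or not hmac.compare_digest(api_key.encode(), expected.encode()):
            return JSONResponse({"error": "Unauthorized"}, status_code=401)
    return await call_next(request)
```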
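The middleware needs a key that is hard to guess. A minimal sketch of generating one with Python's `secrets` module; the 32-byte length is a reasonable default rather than a project requirement, and `generate_api_key` is a hypothetical helper.

```python
import secrets


def generate_api_key(num_bytes: int = 32) -> str:
    """Return a URL-safe random key built from num_bytes of randomness."""
    return secrets.token_urlsafe(num_bytes)


# Paste the printed value into .env as COLPALI_API_KEY=...
print(generate_api_key())
```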
#### **2. Implement Rate Limiting**
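```python
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter  # slowapi looks the limiter up on app.state
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@rt("/fetch_results")
@limiter.limit("10/minute")  # 10 requests per minute per IP
async def get_results(request, query: str, ranking: str):
    ...  # existing code
```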
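To make the behaviour of a `10/minute` limit concrete, here is a dependency-free sliding-window limiter. It illustrates the idea and is not how slowapi is implemented. `SlidingWindowLimiter` is a hypothetical class that keeps its state in process memory only, so it would not be shared across multiple workers.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each client key."""

    def __init__(self, limit: int = 10, window: float = 60.0):
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)  # client key -> timestamps of accepted requests

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Drop timestamps that have left the window
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(limit=10, window=60.0)
results = [limiter.allow("203.0.113.7", now=float(i)) for i in range(15)]
print(results.count(True), results.count(False))  # 10 5
```

Passing `now` explicitly keeps the example deterministic; in a request handler the default monotonic clock would be used.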
#### **3. Input Validation**
```python
from pydantic import BaseModel, field_validator  # Pydantic v2; v1 uses `validator`

class SearchRequest(BaseModel):
    query: str
    ranking: str

    @field_validator("query")
    @classmethod
    def query_must_be_safe(cls, v: str) -> str:
        v = v.strip()
        if not v or len(v) > 500:
            raise ValueError("Query must be 1-500 characters")
        # Add further sanitization logic here
        return v
```
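The model above checks only `query`. The `ranking` parameter can be restricted to an allowlist in the same spirit. This standard-library sketch uses the ranking names listed under Search Rankings in lowercase, which is an assumption about the values the endpoint actually accepts; `validate_search_params` is a hypothetical helper.

```python
ALLOWED_RANKINGS = {"hybrid", "semantic", "bm25", "colpali"}  # assumed identifiers
MAX_QUERY_LENGTH = 500


def validate_search_params(query: str, ranking: str) -> tuple[str, str]:
    """Return cleaned (query, ranking) or raise ValueError."""
    cleaned = " ".join(query.split())  # trim and collapse whitespace
    if not cleaned:
        raise ValueError("Query is required")
    if len(cleaned) > MAX_QUERY_LENGTH:
        raise ValueError("Query too long")
    normalized = ranking.strip().lower()
    if normalized not in ALLOWED_RANKINGS:
        raise ValueError(f"Unknown ranking: {ranking!r}")
    return cleaned, normalized


print(validate_search_params("  annual   report 2023 ", "Hybrid"))
# ('annual report 2023', 'hybrid')
```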
#### **4. Request Origin Validation**
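```python
ALLOWED_ORIGINS = ["http://localhost:3000", "https://yourdomain.com"]

@app.middleware("http")
async def cors_middleware(request, call_next):
    origin = request.headers.get("origin")
    # Same-origin navigations and non-browser clients send no Origin header,
    # so only reject requests whose Origin is present but not allowed.
    if origin is not None and origin not in ALLOWED_ORIGINS:
        return JSONResponse({"error": "Forbidden"}, status_code=403)
    return await call_next(request)
```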
### **Recommended API Security Architecture**
```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Frontend     │    │  Rate Limiter   │    │   Backend API   │
│                 │    │                 │    │                 │
│ • API Key       │◄──►│ • IP Limiting   │◄──►│ • Input Valid.  │
│ • CORS Headers  │    │ • User Quotas   │    │ • Auth Checks   │
│ • Request Valid.│    │ • DoS Protection│    │ • Audit Logs    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```
**Benefits:**
- **Layer 1**: Frontend validates requests before sending
- **Layer 2**: Rate limiter prevents abuse and DoS attacks
- **Layer 3**: Backend performs final validation and authorization
### **Security Implementation Checklist**
#### **Before Production Deployment:**
**CRITICAL (Must Do):**
- [ ] **Generate API Key**: Create strong API key for endpoint authentication
- [ ] **Enable Rate Limiting**: Implement per-IP request limits
- [ ] **Add Security Headers**: X-Frame-Options, CSP, X-Content-Type-Options
- [ ] **Input Validation**: Sanitize all user inputs (query, ranking)
- [ ] **CORS Configuration**: Restrict origins to known domains only
- [ ] **Environment Security**: Never commit API keys, use secure .env
- [ ] **HTTPS Only**: Force TLS in production (no HTTP)
**HIGH Priority:**
- [ ] **Audit Logging**: Log all API requests with IP and timestamp
- [ ] **Request Size Limits**: Prevent large payload attacks
- [ ] **Error Handling**: Don't expose stack traces or internal details
- [ ] **Session Security**: HTTP-only, secure, SameSite cookies
- [ ] **API Documentation**: Document authentication requirements
**MEDIUM Priority:**
- [ ] **User Authentication**: Consider adding user accounts for access control
- [ ] **Request Timeout**: Prevent long-running request abuse
- [ ] **Content Validation**: Verify response content types
- [ ] **Monitoring**: Set up alerts for unusual API usage patterns
- [ ] **Backup Strategy**: Secure backup of environment variables
#### **Security Testing Commands:**
**Test API Authentication:**
```bash
# Should fail without API key
curl "http://localhost:7860/fetch_results?query=test&ranking=hybrid"
# Should succeed with API key
curl -H "X-API-Key: your_api_key" "http://localhost:7860/fetch_results?query=test&ranking=hybrid"
```
**Test Rate Limiting:**
```bash
# Run multiple requests to trigger rate limit
for i in {1..15}; do
  curl -H "X-API-Key: your_api_key" "http://localhost:7860/fetch_results?query=test$i&ranking=hybrid"
  echo "Request $i"
done
```
**Test Input Validation:**
```bash
# Should reject invalid/malicious inputs
curl -H "X-API-Key: your_api_key" "http://localhost:7860/fetch_results?query=<script>alert('xss')</script>&ranking=invalid"
```
**Test Security Headers:**
```bash
# Check security headers in response
curl -I "http://localhost:7860/"
# Should see: X-Frame-Options, X-Content-Type-Options, etc.
```
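The same check can be scripted. The sketch below takes a mapping of response headers (for example the `headers` attribute of a `urllib.request.urlopen` response) and reports which of the headers recommended earlier are absent. `missing_security_headers` is a hypothetical helper.

```python
EXPECTED_HEADERS = (
    "X-Frame-Options",
    "X-Content-Type-Options",
    "Referrer-Policy",
    "Content-Security-Policy",
)


def missing_security_headers(headers) -> list[str]:
    """Return expected header names that are absent (case-insensitive match)."""
    present = {name.lower() for name in headers.keys()}
    return [name for name in EXPECTED_HEADERS if name.lower() not in present]


sample = {"x-frame-options": "DENY", "Content-Type": "text/html"}
print(missing_security_headers(sample))
# ['X-Content-Type-Options', 'Referrer-Policy', 'Content-Security-Policy']
```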
#### **Security Monitoring:**
**Log Analysis Queries:**
```bash
# Monitor API usage patterns
grep "API_ACCESS" /var/log/colpali.log | tail -100
# Detect potential abuse
grep "RATE_LIMIT_EXCEEDED" /var/log/colpali.log
# Check authentication failures
grep "UNAUTHORIZED" /var/log/colpali.log
```
**Alerting Setup:**
- **Rate Limit Violations**: Alert when >50 requests/minute from single IP
- **Authentication Failures**: Alert on repeated unauthorized attempts
- **Unusual Queries**: Alert on suspicious query patterns or injection attempts
- **Resource Usage**: Alert on high CPU/memory usage (potential DoS)
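The first alert rule can be prototyped offline against the audit log. The sketch assumes log lines of the form `<timestamp> ... API_ACCESS: <ip> - <endpoint> - <query>`, that is, the audit line shown earlier prefixed with a timestamp. The pattern and the helper name `noisy_clients` are illustrative and need adjusting to your actual log format.

```python
import re
from collections import Counter

# Captures the timestamp truncated to the minute, and the client IP
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}):\d{2}.*?API_ACCESS: (\S+) - ")


def noisy_clients(lines, threshold: int = 50) -> dict:
    """Return {(minute, ip): count} for clients exceeding threshold requests in a minute."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return {key: n for key, n in counts.items() if n > threshold}


sample = [
    f"2024-05-01 12:00:{i:02d} INFO API_ACCESS: 198.51.100.9 - /fetch_results - q{i}"
    for i in range(55)
]
sample.append("2024-05-01 12:00:30 INFO API_ACCESS: 127.0.0.1 - /fetch_results - hello")
print(noisy_clients(sample))
# {('2024-05-01 12:00', '198.51.100.9'): 55}
```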
## 🧪 Models Used
- **ColPali v1.2**: Visual document understanding
- **ColPaliGemma 3B**: Base visual-language model
- **Google Gemini 2.0**: AI chat and question answering
## 🔧 Configuration Options
### Environment Variables
| Variable | Required | Description | Security Impact |
| -------------------------- | -------- | ------------------------------------------- | ----------------------------------- |
| `VESPA_APP_TOKEN_URL` | Yes\* | Vespa application URL (token auth) | **HIGH** - Remote access |
| `VESPA_CLOUD_SECRET_TOKEN` | Yes\* | Vespa secret token | **CRITICAL** - Full database access |
| `USE_MTLS` | No | Use mTLS instead of token auth | **MEDIUM** - Auth method |
| `VESPA_APP_MTLS_URL` | Yes\*\* | Vespa application URL (mTLS) | **HIGH** - Remote access |
| `VESPA_CLOUD_MTLS_KEY` | Yes\*\* | mTLS private key | **CRITICAL** - TLS credentials |
| `VESPA_CLOUD_MTLS_CERT` | Yes\*\* | mTLS certificate | **HIGH** - TLS credentials |
| `GEMINI_API_KEY` | No | Google Gemini API key | **HIGH** - AI access/costs |
| `LOG_LEVEL` | No | Logging level (DEBUG, INFO, WARNING, ERROR) | **LOW** - Debug info |
| `HOT_RELOAD` | No | Enable hot reload in development | **LOW** - Dev convenience |
#### **Security-Related Environment Variables (Recommended)**
| Variable                   | Required        | Description                          | Default |
| -------------------------- | --------------- | ------------------------------------ | ------- |
| `COLPALI_API_KEY`          | **YES\*\*\***   | API key for endpoint authentication  | None    |
| `ALLOWED_ORIGINS`          | **YES\*\*\***   | Comma-separated allowed CORS origins | None    |
| `RATE_LIMIT_REQUESTS`      | No              | Max requests per minute per IP       | `10`    |
| `RATE_LIMIT_WINDOW`        | No              | Rate limit window in seconds         | `60`    |
| `MAX_QUERY_LENGTH`         | No              | Maximum query string length          | `500`   |
| `ENABLE_AUDIT_LOGGING`     | No              | Log all API requests for security    | `false` |
| `SECURITY_HEADERS_ENABLED` | No              | Enable security headers              | `true`  |
| `CSRF_SECRET`              | **YES\*\*\***   | Secret for CSRF token generation     | None    |
**Example Security-Enhanced `.env`:**
```bash
# Existing configuration
VESPA_APP_TOKEN_URL=https://your-app.vespa-cloud.com
VESPA_CLOUD_SECRET_TOKEN=your_vespa_secret_token
GEMINI_API_KEY=your_gemini_api_key
# NEW: Security configuration
COLPALI_API_KEY=your_strong_random_api_key_here
ALLOWED_ORIGINS=http://localhost:3000,https://yourdomain.com
RATE_LIMIT_REQUESTS=10
RATE_LIMIT_WINDOW=60
MAX_QUERY_LENGTH=500
ENABLE_AUDIT_LOGGING=true
SECURITY_HEADERS_ENABLED=true
CSRF_SECRET=your_random_csrf_secret_here
# Development vs Production
NODE_ENV=production # Enable secure cookies
LOG_LEVEL=INFO # Don't expose debug info in production
```
\*Required for token authentication
\*\*Required for mTLS authentication
\*\*\*Required for production security
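These settings arrive as plain strings in the environment, so they need parsing before use. A minimal sketch, assuming the variable names and defaults from the security table above; `load_security_settings` is a hypothetical helper, not part of this repository.

```python
import os


def _as_bool(value: str) -> bool:
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_security_settings(env=os.environ) -> dict:
    """Parse the recommended security variables, applying the documented defaults."""
    origins = [o.strip() for o in env.get("ALLOWED_ORIGINS", "").split(",") if o.strip()]
    return {
        "api_key": env.get("COLPALI_API_KEY"),
        "allowed_origins": origins,
        "rate_limit_requests": int(env.get("RATE_LIMIT_REQUESTS", "10")),
        "rate_limit_window": int(env.get("RATE_LIMIT_WINDOW", "60")),
        "max_query_length": int(env.get("MAX_QUERY_LENGTH", "500")),
        "audit_logging": _as_bool(env.get("ENABLE_AUDIT_LOGGING", "false")),
        "security_headers": _as_bool(env.get("SECURITY_HEADERS_ENABLED", "true")),
    }
```

Parsing once at startup means a malformed value such as `RATE_LIMIT_REQUESTS=ten` fails immediately with a `ValueError` instead of surfacing mid-request.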
## 🚨 Troubleshooting
### **LOCAL Processing Issues**
**ColPali model fails to load:**
```bash
# Check GPU memory
nvidia-smi # For NVIDIA GPUs
# or
system_profiler SPDisplaysDataType # For Apple Silicon
# Clear model cache if corrupted
rm -rf ~/.cache/huggingface/hub/models--vidore--colpali-v1.2
```
**Out of memory errors:**
- Reduce batch size in `feed_vespa.py` (try `batch_size=1`)
- Close other applications to free RAM/VRAM
- Use CPU processing if GPU memory insufficient: `CUDA_VISIBLE_DEVICES="" python main.py`
**Slow processing on CPU:**
- Expected behavior - ColPali requires significant computation
- Consider upgrading to GPU or Apple Silicon for 5-10x speedup
- Process documents overnight for large collections
### **REMOTE Processing Issues**
**Connection to Vespa fails:**
- Verify your Vespa URL and credentials in `.env`
- Check if the Vespa application is deployed and running
- Ensure network connectivity: `ping your-app.vespa-cloud.com`
- Validate authentication tokens haven't expired
**Document upload fails:**
- Check Vespa Cloud storage quota and billing
- Verify embedding format matches Vespa schema
- Ensure stable internet connection for large uploads
**Search returns no results:**
- Confirm documents were successfully uploaded to Vespa
- Check if embeddings were properly indexed
- Verify query processing isn't failing locally
### **MIXED (Local + Remote) Issues**
**Chat features don't work:**
- **LOCAL**: Verify document images are being generated locally
- **REMOTE**: Check `GEMINI_API_KEY` is set correctly
- **REMOTE**: Verify Gemini API quota and billing
- **NETWORK**: Ensure images can be sent to Gemini API
**Similarity maps missing:**
- **LOCAL**: Confirm ColPali model loaded successfully
- **LOCAL**: Check if similarity map generation completed
- **REMOTE**: Verify Vespa returned similarity data
- **BROWSER**: Clear browser cache for static files
### Performance Tips
**LOCAL Optimization:**
- Use GPU acceleration for 5-10x faster model inference
- Optimize batch sizes based on available memory
- Use SSD storage for faster model loading
- Consider quantized models for lower memory usage
**REMOTE Optimization:**
- Use Vespa's HNSW indexing for faster search
- Optimize embedding dimensions vs accuracy tradeoff
- Enable compression for faster network transfer
- Use multiple Vespa instances for high availability
**NETWORK Optimization:**
- Process documents in batches to reduce upload overhead
- Use compression for embedding transfer
- Consider regional Vespa deployment for lower latency
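Batching itself is simple to sketch. `batched` below is a generic helper (Python 3.12 ships a similar `itertools.batched`). How `feed_vespa.py` batches pages internally is not shown in this README, so the page names here are placeholders.

```python
from collections.abc import Iterable, Iterator
from itertools import islice


def batched(items: Iterable, batch_size: int) -> Iterator[list]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


pages = [f"page-{i}" for i in range(1, 8)]
for batch in batched(pages, 3):
    print(batch)
```

Smaller batches lower peak memory during embedding, while larger ones reduce per-request overhead when uploading.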
## License
Apache-2.0
## 🤝 Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Run tests and linting
5. Submit a pull request
## Support
For issues and questions:
- Check the troubleshooting section
- Review Vespa and ColPali documentation
- Open an issue on the repository