High-Performance Neural Search Engine

A semantic search system built with NVIDIA cuVS, featuring vector quantization and optimized memory access patterns for sub-100ms query latency on 35M high-dimensional embeddings.

🚀 Project Overview

This project builds a complete end-to-end neural search pipeline capable of handling large-scale semantic search across 35 million 768-dimensional Wikipedia embeddings. The system achieves 13x index size reduction and <100ms average query latency through IVF-PQ indexing and memory layout optimization for sequential I/O access patterns.

Key Technical Achievements

1. Vector Indexing & GPU Optimization

  • Problem: 35M vectors × 768 dims × 2 bytes = 53.76 GB - too large to fit in GPU memory
  • Solution: Used NVIDIA cuVS to build an IVF-PQ index: IVF clustering prunes the search space, while product quantization compresses each vector
  • Result: Compressed to 3.36 GB (96 bytes/vector) - fits entirely in GPU memory for faster search

Technical Details:

  • 32,768 IVF clusters for coarse quantization
  • 96 PQ sub-quantizers (8 dimensions each, 8 bits per code)
  • Trained on 2M representative vectors, then populated the index with the remaining 33M in 500K batches
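
A minimal sketch of this build-then-extend flow, assuming the `cuvs.neighbors.ivf_pq` Python API; the loader helpers are hypothetical, and the representative sample is assumed to be the first 2M vectors so IDs stay contiguous.

```python
import cupy as cp
from cuvs.neighbors import ivf_pq

# Index configuration matching the values above: 32,768 coarse clusters,
# 96 PQ sub-quantizers at 8 bits each -> 96 bytes per vector.
index_params = ivf_pq.IndexParams(
    n_lists=32768,
    pq_dim=96,
    pq_bits=8,
    metric="inner_product",  # assumption: cosine via pre-normalized vectors
)

# Train IVF centroids and PQ codebooks on the 2M representative sample;
# build() also inserts the training vectors into the index by default.
train_sample = cp.asarray(load_first_vectors(2_000_000))  # hypothetical loader
index = ivf_pq.build(index_params, train_sample)

# Add the remaining 33M vectors in 500K batches.
BATCH = 500_000
for start, batch in iter_embedding_batches(first=2_000_000, size=BATCH):  # hypothetical loader
    ids = cp.arange(start, start + len(batch), dtype=cp.int64)
    index = ivf_pq.extend(index, cp.asarray(batch), ids)
```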

2. Recall Optimization Through Oversampling

  • Problem: Direct IVF-PQ search yielded only 60% recall@10
  • Solution: Built a two-stage retrieval pipeline with oversampling + exact reranking to recover accuracy
  • Result: Achieved 89% recall@10 while maintaining performance

Implementation:

Stage 1: Retrieve 1000 candidates using IVF-PQ approximation
Stage 2: Exact cosine similarity reranking on original embeddings
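
A sketch of this two-stage pipeline under the same assumed cuVS API; `get_embeddings_by_id` is a hypothetical helper that reads the original embeddings back for exact scoring.

```python
import cupy as cp
from cuvs.neighbors import ivf_pq

def two_stage_search(index, query, k=10, oversample=1000, n_probes=40):
    # Stage 1: approximate IVF-PQ search, oversampled to 1000 candidates.
    params = ivf_pq.SearchParams(n_probes=n_probes)
    _, neighbors = ivf_pq.search(params, index, query[None, :], oversample)
    candidate_ids = cp.asarray(neighbors)[0]

    # Stage 2: exact cosine similarity against the original embeddings.
    originals = cp.asarray(get_embeddings_by_id(candidate_ids))  # hypothetical helper
    q = query / cp.linalg.norm(query)
    v = originals / cp.linalg.norm(originals, axis=1, keepdims=True)
    scores = v @ q

    # Keep the k best candidates after exact rescoring.
    top = cp.argsort(scores)[::-1][:k]
    return candidate_ids[top], scores[top]
```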

3. Memory Access Pattern Optimization

  • Problem: Fetching metadata for the top-k results in random order took 100+ ms (unacceptable for real-time search)
  • Solution: Reorganized embeddings and metadata by IVF list structure to enable sequential I/O access patterns on SSD/RAM
  • Result: 1-10ms average retrieval, 100ms worst-case (10x improvement)

Key Insight:

With n_probes=40, results come from at most 40 IVF lists. Organizing the data layout by list ID enables sequential I/O reads instead of random access, dramatically improving SSD/RAM performance.
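
A sketch of the reorganization step, assuming per-vector IVF list assignments were recorded at build time; paths, array names, and the in-memory metadata list are illustrative.

```python
import numpy as np

# Hypothetical inputs: the original embeddings, their metadata records, and
# list_ids[i] = the IVF list that vector i was assigned to at build time.
embeddings = np.load("embeddings.npy", mmap_mode="r")  # illustrative paths
list_ids = np.load("list_ids.npy")
n_lists = 32768

# A stable sort by list ID groups each list's vectors contiguously.
order = np.argsort(list_ids, kind="stable")
embeddings_sorted = embeddings[order]

# Offset table: where each IVF list's contiguous block starts.
list_offsets = np.searchsorted(list_ids[order], np.arange(n_lists))

# Map original vector IDs to their positions in the sorted layout,
# so search results can be resolved to sequential reads.
new_position = np.empty_like(order)
new_position[order] = np.arange(len(order))

# With n_probes=40, a query touches at most 40 of these contiguous
# blocks, so SSD/RAM reads are sequential rather than random.
```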

📊 Performance Metrics

| Metric | Value | Baseline Comparison |
| --- | --- | --- |
| Index Size Reduction | 13.0x | 53.76 GB → 3.36 GB (fits in GPU) |
| Query Latency (avg) | <100ms | - |
| Recall@10 | 89% | 60% (direct IVF-PQ) |
| Metadata Access Speed | 1-10ms | 100+ms (random access) |

🏗️ System Architecture

Query Input → Cohere API (~70ms) → GPU Index Search (1-10ms) → Sequential I/O Retrieval (1-5ms) → Results (<100ms total)

Core Components:

1. Index Training Pipeline

Data loading with multiprocessing, GPU memory monitoring, and configurable cuVS training parameters

2. Two-Stage Search Engine

IVF-PQ approximation followed by exact reranking, with list-organized data layout for sequential I/O

3. API Server

FastAPI with async request processing, LMDB for metadata storage, and Cohere API integration (see the sketch after this list)

4. Data Layout Management

IVF list-organized storage, offset-based indexing for sequential access patterns
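
A condensed sketch of how these components could fit together, reusing `two_stage_search` and `index` from the sketches above; the endpoint shape, LMDB key scheme, Cohere model name, and client setup are assumptions, not the project's actual code.

```python
import json

import cupy as cp
import lmdb
import cohere
from fastapi import FastAPI

app = FastAPI()
co = cohere.Client("COHERE_API_KEY")  # assumption: Cohere Python SDK client
env = lmdb.open("metadata.lmdb", readonly=True, lock=False)

@app.get("/search")
async def search(q: str, k: int = 10):
    # 1. Embed the query text via the Cohere API (~70ms of the latency budget).
    #    Model name is an assumption for a 768-dim embedding.
    emb = co.embed(texts=[q], model="embed-multilingual-v2.0").embeddings[0]

    # 2. Two-stage GPU search (IVF-PQ oversample + exact rerank), ~1-10ms.
    ids, scores = two_stage_search(index, cp.asarray(emb, dtype=cp.float32), k=k)

    # 3. Metadata lookups from LMDB; keys follow the list-organized layout,
    #    so the probed lists map to contiguous, sequentially read key ranges.
    with env.begin() as txn:
        docs = [json.loads(txn.get(int(i).to_bytes(8, "big"))) for i in ids.get()]

    return {"results": [{"doc": d, "score": float(s)}
                        for d, s in zip(docs, scores.get())]}
```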

💡 Key Learnings & Problem-Solving

1. Memory Access Patterns Matter

Achieved 10x speedup through data layout optimization

2. Quantization Trade-offs

Balanced compression vs. accuracy through two-stage retrieval

3. System Integration

Built complete pipeline from training to production deployment