High-Performance Neural Search Engine

A semantic search system built with NVIDIA cuVS, featuring vector quantization and optimized memory access patterns for sub-100ms query latency on 35M high-dimensional embeddings.

🚀 Project Overview

This project builds a complete end-to-end neural search pipeline capable of handling large-scale semantic search across 35 million 768-dimensional Wikipedia embeddings. The system achieves 13x index size reduction and <100ms average query latency through IVF-PQ indexing and memory layout optimization for sequential I/O access patterns.

Key Technical Achievements

1. Vector Indexing & GPU Optimization

  • Problem: 35M vectors × 768 dims × 2 bytes = 53.76 GB - too large to fit in GPU memory
  • Solution: Used NVIDIA cuVS to build an IVF-PQ index: IVF clustering prunes the search space, while product quantization compresses each vector
  • Result: Compressed to 3.36 GB (96 bytes/vector) - fits entirely in GPU memory for faster search

Technical Details:

  • 32,768 IVF clusters for coarse quantization
  • 96 PQ sub-quantizers (8 dimensions each, 8 bits per code)
  • Trained on 2M representative vectors, then populated the index with the remaining 33M in 500K batches
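
A minimal sketch of this build-then-extend flow, assuming the `cuvs.neighbors.ivf_pq` Python API; the loader helpers are hypothetical, and the representative sample is assumed to be the first 2M vectors so IDs stay contiguous.

```python
import cupy as cp
from cuvs.neighbors import ivf_pq

# Index configuration matching the values above: 32,768 coarse clusters,
# 96 PQ sub-quantizers at 8 bits each -> 96 bytes per vector.
index_params = ivf_pq.IndexParams(
    n_lists=32768,
    pq_dim=96,
    pq_bits=8,
    metric="inner_product",  # assumption: cosine via pre-normalized vectors
)

# Train IVF centroids and PQ codebooks on the 2M representative sample;
# build() also inserts the training vectors into the index by default.
train_sample = cp.asarray(load_first_vectors(2_000_000))  # hypothetical loader
index = ivf_pq.build(index_params, train_sample)

# Add the remaining 33M vectors in 500K batches.
BATCH = 500_000
for start, batch in iter_embedding_batches(first=2_000_000, size=BATCH):  # hypothetical loader
    ids = cp.arange(start, start + len(batch), dtype=cp.int64)
    index = ivf_pq.extend(index, cp.asarray(batch), ids)
```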

2. Recall Optimization Through Oversampling

  • Problem: Direct IVF-PQ search yielded only 60% recall@10
  • Solution: Built a two-stage retrieval pipeline with oversampling + exact reranking to recover accuracy
  • Result: Achieved 89% recall@10 while maintaining performance

Implementation:

Stage 1: Retrieve 1000 candidates using IVF-PQ approximation
Stage 2: Exact cosine similarity reranking on original embeddings
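
A sketch of this two-stage pipeline under the same assumed cuVS API; `get_embeddings_by_id` is a hypothetical helper that reads the original embeddings back for exact scoring.

```python
import cupy as cp
from cuvs.neighbors import ivf_pq

def two_stage_search(index, query, k=10, oversample=1000, n_probes=40):
    # Stage 1: approximate IVF-PQ search, oversampled to 1000 candidates.
    params = ivf_pq.SearchParams(n_probes=n_probes)
    _, neighbors = ivf_pq.search(params, index, query[None, :], oversample)
    candidate_ids = cp.asarray(neighbors)[0]

    # Stage 2: exact cosine similarity against the original embeddings.
    originals = cp.asarray(get_embeddings_by_id(candidate_ids))  # hypothetical helper
    q = query / cp.linalg.norm(query)
    v = originals / cp.linalg.norm(originals, axis=1, keepdims=True)
    scores = v @ q

    # Keep the k best candidates after exact rescoring.
    top = cp.argsort(scores)[::-1][:k]
    return candidate_ids[top], scores[top]
```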

3. Memory Access Pattern Optimization

  • Problem: Fetching metadata for the top-k results in random order took 100+ ms (unacceptable for real-time search)
  • Solution: Reorganized embeddings and metadata by IVF list structure to enable sequential I/O access patterns on SSD/RAM
  • Result: 1-10ms average retrieval, 100ms worst-case (10x improvement)

Key Insight:

With n_probes=40, results come from at most 40 IVF lists. Organizing the data layout by list ID enables sequential I/O reads instead of random access, dramatically improving SSD/RAM performance.
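
A sketch of the reorganization step, assuming per-vector IVF list assignments were recorded at build time; paths, array names, and the in-memory metadata list are illustrative.

```python
import numpy as np

# Hypothetical inputs: the original embeddings, their metadata records, and
# list_ids[i] = the IVF list that vector i was assigned to at build time.
embeddings = np.load("embeddings.npy", mmap_mode="r")  # illustrative paths
list_ids = np.load("list_ids.npy")
n_lists = 32768

# A stable sort by list ID groups each list's vectors contiguously.
order = np.argsort(list_ids, kind="stable")
embeddings_sorted = embeddings[order]

# Offset table: where each IVF list's contiguous block starts.
list_offsets = np.searchsorted(list_ids[order], np.arange(n_lists))

# Map original vector IDs to their positions in the sorted layout,
# so search results can be resolved to sequential reads.
new_position = np.empty_like(order)
new_position[order] = np.arange(len(order))

# With n_probes=40, a query touches at most 40 of these contiguous
# blocks, so SSD/RAM reads are sequential rather than random.
```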

📊 Performance Metrics

| Metric | Value | Baseline Comparison |
| --- | --- | --- |
| Index Size Reduction | 13.0x | 53.76 GB → 3.36 GB (fits in GPU) |
| Query Latency (avg) | <100ms | - |
| Recall@10 | 89% | 60% (direct IVF-PQ) |
| Metadata Access Speed | 1-10ms | 100+ms (random access) |

🏗️ System Architecture

Query Input → Cohere API (~70ms) → GPU Index Search (1-10ms) → Sequential I/O Retrieval (1-5ms) → Results (<100ms total)

Core Components:

1. Index Training Pipeline

Data loading with multiprocessing, GPU memory monitoring, and configurable cuVS training parameters

2. Two-Stage Search Engine

IVF-PQ approximation followed by exact reranking, with list-organized data layout for sequential I/O

3. API Server

FastAPI with async request processing, LMDB for metadata storage, and Cohere API integration (see the sketch after this list)

4. Data Layout Management

IVF list-organized storage, offset-based indexing for sequential access patterns
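
A condensed sketch of how these components could fit together, reusing `two_stage_search` and `index` from the sketches above; the endpoint shape, LMDB key scheme, Cohere model name, and client setup are assumptions, not the project's actual code.

```python
import json

import cupy as cp
import lmdb
import cohere
from fastapi import FastAPI

app = FastAPI()
co = cohere.Client("COHERE_API_KEY")  # assumption: Cohere Python SDK client
env = lmdb.open("metadata.lmdb", readonly=True, lock=False)

@app.get("/search")
async def search(q: str, k: int = 10):
    # 1. Embed the query text via the Cohere API (~70ms of the latency budget).
    #    Model name is an assumption for a 768-dim embedding.
    emb = co.embed(texts=[q], model="embed-multilingual-v2.0").embeddings[0]

    # 2. Two-stage GPU search (IVF-PQ oversample + exact rerank), ~1-10ms.
    ids, scores = two_stage_search(index, cp.asarray(emb, dtype=cp.float32), k=k)

    # 3. Metadata lookups from LMDB; keys follow the list-organized layout,
    #    so the probed lists map to contiguous, sequentially read key ranges.
    with env.begin() as txn:
        docs = [json.loads(txn.get(int(i).to_bytes(8, "big"))) for i in ids.get()]

    return {"results": [{"doc": d, "score": float(s)}
                        for d, s in zip(docs, scores.get())]}
```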

💡 Key Learnings & Problem-Solving

1. Memory Access Patterns Matter

Achieved 10x speedup through data layout optimization

2. Quantization Trade-offs

Balanced compression vs. accuracy through two-stage retrieval

3. System Integration

Built complete pipeline from training to production deployment