A semantic search system built with NVIDIA cuVS, featuring vector quantization and optimized memory access patterns for sub-100ms query latency on 35M high-dimensional embeddings.
This project implements an end-to-end neural search pipeline for large-scale semantic search over 35 million 768-dimensional Wikipedia embeddings. Through IVF-PQ indexing and a memory layout optimized for sequential I/O, the system achieves a 13x index size reduction and sub-100ms average query latency.
## Technical Details

### Key Insight
With n_probes=40, each query's results come from at most 40 IVF lists. Organizing the data layout by list ID turns candidate and metadata lookups into sequential reads instead of random access, dramatically improving performance on both SSD and RAM.
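As a rough illustration of what list-organized storage looks like (array names, file names, and the record layout below are illustrative, not this project's actual on-disk format):

```python
import numpy as np

n_lists = 16384                              # number of IVF lists (illustrative)
list_ids = np.load("list_ids.npy")           # (N,) IVF list assignment per vector
metadata = np.load("metadata.npy")           # (N, record_bytes) fixed-size records

# Group records so each IVF list occupies one contiguous block on disk.
order = np.argsort(list_ids, kind="stable")
metadata_by_list = metadata[order]

# Offset table: rows offsets[k] .. offsets[k+1] belong to list k.
counts = np.bincount(list_ids, minlength=n_lists)
offsets = np.concatenate(([0], np.cumsum(counts))).astype(np.int64)

np.save("metadata_by_list.npy", metadata_by_list)
np.save("offsets.npy", offsets)

# A query probing n_probes=40 lists now reads at most 40 contiguous slices
# (sequential I/O) instead of thousands of scattered rows.
probed_lists = [3, 17, 512]                  # list IDs probed for one query
blocks = [metadata_by_list[offsets[k]:offsets[k + 1]] for k in probed_lists]
```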
### Performance

| Metric | Value | Baseline Comparison |
|---|---|---|
| Index Size Reduction | 13.0x | 53.76 GB → 3.36 GB (fits in GPU memory) |
| Query Latency (avg) | <100 ms | - |
| Recall@10 | 89% | 60% (direct IVF-PQ) |
| Metadata Access Speed | 1-10 ms | 100+ ms (random access) |
### Implementation

- **Index building:** Data loading with multiprocessing, GPU memory monitoring, and configurable cuVS training parameters (see the build sketch below)
- **Retrieval:** IVF-PQ approximation followed by exact reranking, with list-organized data layout for sequential I/O (sketched below)
- **Serving:** FastAPI with async processing, LMDB for metadata storage, and Cohere API integration (sketched below)
- **Storage:** IVF list-organized storage with offset-based indexing for sequential access patterns (see the layout sketch under Key Insight)
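For the index-building step, a minimal sketch using the cuVS Python API. Parameter values here are illustrative rather than this project's tuned settings; pq_dim=96 at 8 bits is one configuration consistent with the ~96 bytes/vector implied by the 3.36 GB index size.

```python
import numpy as np
from cuvs.neighbors import ivf_pq

# Hypothetical host array of all embeddings; cuVS can build IVF-PQ from
# host memory, so the full matrix need not fit on the GPU.
embeddings = np.load("embeddings.npy")       # (35_000_000, 768) float32

build_params = ivf_pq.IndexParams(
    n_lists=16384,                  # coarse k-means clusters (IVF lists)
    pq_dim=96,                      # 96 sub-quantizers -> 96 bytes/vector at 8 bits
    pq_bits=8,                      # bits per PQ code
    metric="sqeuclidean",
    kmeans_trainset_fraction=0.1,   # train k-means on a 10% sample
)
index = ivf_pq.build(build_params, embeddings)
```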
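The two-stage retrieval could look roughly like this (the helper name, the 10x oversampling factor, and keeping the exact vectors in a host array are assumptions). Reranking approximate candidates against full-precision vectors is what lifts Recall@10 from 60% to 89% in the table above.

```python
import cupy as cp
import numpy as np
from cuvs.neighbors import ivf_pq

def search_two_stage(index, query, exact_vectors, k=10, oversample=10):
    """query: (768,) cupy float32; exact_vectors: (N, 768) host array/memmap."""
    # Stage 1: approximate IVF-PQ search over at most n_probes=40 lists,
    # retrieving k * oversample candidates.
    search_params = ivf_pq.SearchParams(n_probes=40)
    _, candidates = ivf_pq.search(
        search_params, index, query.reshape(1, -1), k * oversample
    )
    candidates = cp.asnumpy(cp.asarray(candidates)).ravel()

    # Stage 2: exact squared-L2 reranking against full-precision vectors
    # on the host (the full matrix need not fit in GPU memory).
    q = cp.asnumpy(query)
    cand = exact_vectors[candidates].astype(np.float32)
    exact_dist = np.square(cand - q).sum(axis=1)
    top = np.argsort(exact_dist)[:k]
    return candidates[top], exact_dist[top]
```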
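And a minimal sketch of the serving layer. The endpoint shape, LMDB key encoding, and embedding model name are assumptions, not this project's exact API; `index` and `exact_vectors` are taken to be loaded at startup, and `search_two_stage` is the sketch above.

```python
import json
import os

import cohere
import cupy as cp
import lmdb
from fastapi import FastAPI

app = FastAPI()
co = cohere.Client(os.environ["COHERE_API_KEY"])
env = lmdb.open("metadata.lmdb", readonly=True, lock=False)

@app.get("/search")
async def search(q: str, k: int = 10):
    # Embed the query text (a 768-dim Cohere model, matching the corpus).
    query_vec = co.embed(texts=[q], model="embed-multilingual-v2.0").embeddings[0]
    query = cp.asarray(query_vec, dtype=cp.float32)

    # Two-stage ANN search; index and exact_vectors are loaded at startup.
    ids, scores = search_two_stage(index, query, exact_vectors, k=k)

    # Fetch metadata records from LMDB, keyed by little-endian vector ID.
    with env.begin() as txn:
        docs = [json.loads(txn.get(int(i).to_bytes(8, "little"))) for i in ids]
    return {"results": docs, "scores": [float(s) for s in scores]}
```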
### Highlights

- Achieved a 10x speedup in metadata access through data layout optimization
- Balanced compression against accuracy with two-stage retrieval
- Built the complete pipeline from index training to production deployment