Hybrid Search Recommendation API
A low-latency hybrid retrieval API combining lexical matching, semantic embeddings, and user segmentation.
The Strategic Problem
The Context
"A traditional keyword system was failing users on typos and conceptual queries (e.g., 'breakfast' vs 'cereal'). We needed intelligence that didn't break latency budgets or require cloud LLMs."
Designed to reduce the 'zero results' problem in large retail catalogs, the system needed to understand user intent instead of relying only on exact keyword overlap.
- The Trade-off Map: Operational overhead was the binding constraint, so practical reliability took priority over peak theoretical accuracy.
- Constraint 01: VM/CPU Training Limits: Training on CPU-only VMs pushed the architecture toward highly efficient Bi-Encoders rather than massive LLMs.
- Constraint 02: Decoupled Deployment: Runtime artifacts had to be decoupled from heavy vector databases so inference didn't choke.
Technical Rationale
Core Approach
Implemented a two-stage model design that separates representation learning from query understanding, composing loosely coupled Lexical, Semantic, and Personalization layers behind a decoupled FastAPI service.
Outcome
End-to-end ML engineering: combining typo recovery, cosine-similarity embedding search, and cohort-level personalization in a sub-100ms inference path.
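The layered design above implies some form of late score fusion across the three layers. A minimal sketch of one way that fusion could look (the weights, function name, and score dictionaries are illustrative assumptions, not project code):

```python
def fuse_scores(lexical, semantic, personal, w=(0.4, 0.4, 0.2)):
    """Weighted late fusion of per-document scores from three layers.

    Each argument maps doc_id -> normalized score; a missing doc scores 0,
    so any single layer can still rank candidates on its own.
    """
    docs = set(lexical) | set(semantic) | set(personal)
    fused = (
        (d, w[0] * lexical.get(d, 0.0)
            + w[1] * semantic.get(d, 0.0)
            + w[2] * personal.get(d, 0.0))
        for d in docs
    )
    return sorted(fused, key=lambda pair: pair[1], reverse=True)
```

With this shape, each layer can be tuned or replaced independently, which matches the loose-coupling goal stated above.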
Lexical
BM25 matching paired with SymSpell recovery for typo-tolerance and precision.
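The BM25 side of that layer can be sketched in a few lines of pure Python; this toy scorer stands in for a tuned library implementation (k1 and b below are common defaults, and the sample documents are invented):

```python
import math

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    scores = []
    for doc in docs:
        s = 0.0
        for t in query_terms:
            df = sum(1 for d in docs if t in d)  # document frequency of term
            if df == 0:
                continue
            idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
            f = doc.count(t)  # term frequency in this document
            # Length normalization: shorter docs get a boost via the b term.
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores
```

SymSpell would sit in front of this scorer, correcting query tokens (e.g. "cereall" to "cereal") before they reach BM25.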
Semantic
IndoBERT embeddings served from an in-memory FAISS (Meta AI) index.
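Conceptually, the semantic layer ranks catalog items by cosine similarity between a query embedding and precomputed item embeddings. The brute-force scan below is a dependency-free sketch of that lookup (the toy 2-d vectors are illustrative; production would use IndoBERT embeddings behind a FAISS index instead of this linear scan):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k(query_vec, index, k=2):
    """Return the k doc ids most similar to query_vec.

    `index` maps doc_id -> embedding; this exhaustive scan stands in for
    an approximate-nearest-neighbor search such as FAISS HNSW.
    """
    return sorted(index, key=lambda d: cosine(query_vec, index[d]), reverse=True)[:k]
```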
Quantitative Validation
Stage 1 Retrieval Quality: Achieved NDCG@10 of 0.798 for candidate selection.
Stage 2 Personalization: Reached combined MRR of 0.373 and Recall@5 of 0.558.
Latency Stability: Maintained an average end-to-end latency of 98.9ms across the hybrid pipeline.
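For readers unfamiliar with the ranking metrics above, MRR and Recall@k can be computed in a few lines (the formulas are standard; the sample rankings in the usage are invented, not the project's evaluation data):

```python
def mrr(ranked_lists, relevant):
    """Mean reciprocal rank of the first relevant item, averaged over queries."""
    total = 0.0
    for ranking, rel in zip(ranked_lists, relevant):
        for i, doc in enumerate(ranking, start=1):
            if doc in rel:
                total += 1.0 / i
                break
    return total / len(ranked_lists)

def recall_at_k(ranked_lists, relevant, k=5):
    """Fraction of relevant items found in the top k, averaged over queries."""
    per_query = (
        len(set(ranking[:k]) & rel) / len(rel)
        for ranking, rel in zip(ranked_lists, relevant)
    )
    return sum(per_query) / len(ranked_lists)
```

NDCG@10 additionally discounts relevant items by their rank position, rewarding rankings that place relevant results near the top.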
Production Delivery & Infra
Microservice Decoupling: Training and Inference are strictly decoupled into two isolated FastAPI layers to prevent heavy vector indexing from choking the API runtime.
Non-GPU Compute Constraints: Deployed entirely on CPU instances. This forced the architectural choice of highly optimized Bi-Encoders and FAISS HNSW over heavy Cross-Encoders to maintain sub-100ms speeds.
Asynchronous State Sync: Offline jobs handle embedding updates and push lightweight runtime checkpoints, keeping the serving API stateless and enabling zero-downtime reloads.
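One common way to achieve zero-downtime reloads of this kind is an atomic reference swap: readers grab the current index reference while a background job replaces it whole. A minimal sketch of that pattern (the class name and dict-backed index are assumptions for illustration, not the project's actual implementation):

```python
import threading

class HotSwapIndex:
    """Serve lookups from an index reference that reloads swap atomically.

    Readers never take the lock, so in-flight requests are never blocked
    by a reload; the lock only serializes concurrent writers.
    """
    def __init__(self, index):
        self._lock = threading.Lock()
        self._index = index

    def search(self, query):
        index = self._index  # single reference read; atomic in CPython
        return index.get(query, [])

    def reload(self, new_index):
        with self._lock:
            self._index = new_index  # swap the whole index in one assignment
```

The offline training service would build `new_index` from fresh embeddings and call `reload`, leaving the FastAPI serving layer untouched.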