Alvin Hans
Case Study 01

Hybrid Search Recommendation API

A low-latency hybrid retrieval API combining lexical matching, semantic embeddings, and user segmentation.

Pipeline Latency: <100ms
NDCG@10: 0.798
MRR Score: 0.373

The Strategic Problem

The Context

"A traditional keyword system was failing users on typos and conceptual queries (e.g., 'breakfast' vs 'cereal'). We needed intelligence that didn't break latency budgets or require cloud LLMs."

Designed to reduce the 'zero results' problem in large retail catalogs, the system needed to understand user intent instead of relying only on exact keyword overlap.

  • The Trade-off Map: Constrained by operational overhead, so practical reliability outweighed pure theoretical accuracy.
  • Constraint 01: VM/CPU Training Limits: Forced the architecture to prioritize highly efficient Bi-Encoders over massive LLMs.
  • Constraint 02: Decoupled Deployment: Runtime artifacts had to be decoupled from heavy vector databases so inference didn't choke.
System Blueprints

The Diamond Centerpiece

CLIENT REQUEST
API GATEWAY
Stage 1A: Lexical
Stage 1B: Semantic
Aggregation Stage: Fusion
Architecture Map: Multi-stage pipeline optimized for throughput and relevance.
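
The fusion algorithm itself isn't specified above; the sketch below assumes Reciprocal Rank Fusion (RRF) as one common way the Aggregation Stage could merge the Stage 1A and 1B candidate lists.

```python
# Illustrative sketch only: the case study doesn't name the fusion method,
# so Reciprocal Rank Fusion (RRF) is assumed here.
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of doc IDs into one fused ranking.

    Each document scores 1 / (k + rank) per list it appears in;
    k=60 is the constant from the original RRF paper.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Stage 1A and 1B each return their own candidate ordering:
lexical = ["sku_42", "sku_07", "sku_99"]
semantic = ["sku_07", "sku_42", "sku_13"]
print(reciprocal_rank_fusion([lexical, semantic])[:3])
```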

Technical Rationale

Core Approach

Implemented a two-stage model design that splits representation learning from query understanding, composing the Lexical, Semantic, and Personalization layers as loosely coupled services within a FastAPI ecosystem.

Outcome

End-to-end ML engineering: bridging typo recovery, cosine-similarity embedding retrieval, and cohort-level personalization into a sub-100ms inference path.

Lexical

BM25 matching paired with SymSpell recovery for typo-tolerance and precision.
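
A minimal sketch of this stage, assuming the open-source symspellpy and rank_bm25 packages as stand-ins for the production components:

```python
# Lexical stage sketch: SymSpell corrects query typos before BM25 scoring.
from rank_bm25 import BM25Okapi
from symspellpy import SymSpell, Verbosity

docs = ["instant oatmeal cereal", "chocolate breakfast cereal", "green tea"]
tokenized = [d.split() for d in docs]
bm25 = BM25Okapi(tokenized)

# Build the typo-recovery dictionary from the catalog vocabulary.
sym = SymSpell(max_dictionary_edit_distance=2)
for doc in tokenized:
    for term in doc:
        sym.create_dictionary_entry(term, 1)


def lexical_search(query: str, top_k: int = 5):
    # Correct each query term before scoring, so 'ceral' still hits 'cereal'.
    corrected = []
    for term in query.lower().split():
        hits = sym.lookup(term, Verbosity.TOP, max_edit_distance=2)
        corrected.append(hits[0].term if hits else term)
    scores = bm25.get_scores(corrected)
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [(docs[i], scores[i]) for i in order[:top_k]]


print(lexical_search("ceral"))
```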

Semantic

IndoBERT encoders compressed into an in-memory FAISS index.
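
A sketch of how this stage could be assembled, assuming the sentence-transformers wrapper with the indobenchmark/indobert-base-p1 checkpoint (the exact model ID isn't given above) and a FAISS HNSW index on CPU:

```python
# Semantic stage sketch: Bi-Encoder embeddings served from an in-memory
# HNSW graph index; model ID and index parameters are assumptions.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("indobenchmark/indobert-base-p1")
docs = ["sereal sarapan coklat", "oatmeal instan", "teh hijau"]

# L2-normalize so inner product equals cosine similarity.
emb = model.encode(docs, normalize_embeddings=True).astype("float32")

# HNSW index: approximate, in-memory, fast on CPU (no GPU needed).
index = faiss.IndexHNSWFlat(emb.shape[1], 32, faiss.METRIC_INNER_PRODUCT)
index.add(emb)

query = model.encode(["sarapan"], normalize_embeddings=True).astype("float32")
scores, ids = index.search(query, 2)
print([(docs[i], float(s)) for i, s in zip(ids[0], scores[0])])
```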

Evaluation Metrics

Quantitative Validation

Observation 01

Stage 1 Retrieval Quality: Achieved NDCG@10 of 0.798 for candidate selection.

Observation 02

Stage 2 Personalization: Reached combined MRR of 0.373 and Recall@5 of 0.558.

Observation 03

Throughput Stability: Maintained an average latency of 98.9ms across the hybrid pipeline.
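
For reference, the three reported relevance metrics can be computed as below; this sketch assumes binary relevance judgments, since the exact evaluation protocol isn't detailed here.

```python
# Metric sketch for a single query; ranked_rels[i] is the relevance
# label of the result at rank i+1.
import math


def ndcg_at_k(ranked_rels, k=10):
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_rels[:k]))
    ideal = sorted(ranked_rels, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0


def mrr(ranked_rels):
    # Reciprocal rank of the first relevant result (0 if none appears).
    for i, rel in enumerate(ranked_rels, start=1):
        if rel > 0:
            return 1.0 / i
    return 0.0


def recall_at_k(ranked_rels, total_relevant, k=5):
    return sum(1 for r in ranked_rels[:k] if r > 0) / total_relevant


rels = [0, 1, 0, 1, 0, 0, 1, 0, 0, 0]  # toy relevance judgments by rank
print(ndcg_at_k(rels), mrr(rels), recall_at_k(rels, total_relevant=3))
```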

Production Delivery & Infra

Microservice Decoupling: Training and Inference are strictly decoupled into two isolated FastAPI layers to prevent heavy vector indexing from choking the API runtime.
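
A hypothetical outline of the inference layer under that constraint: the FastAPI app only loads prebuilt artifacts at startup, never trains or indexes in-process. Paths and the endpoint shape are illustrative, not taken from the repository.

```python
# Inference-layer sketch: read-only artifact loading via FastAPI lifespan.
from contextlib import asynccontextmanager

import faiss
from fastapi import FastAPI

ARTIFACTS = {}


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Read-only load of checkpoints produced by the offline training job.
    ARTIFACTS["index"] = faiss.read_index("artifacts/products.hnsw.faiss")
    yield
    ARTIFACTS.clear()


app = FastAPI(lifespan=lifespan)


@app.get("/search")
def search(q: str, k: int = 10):
    # Encode, retrieve, and fuse would run here; stubbed for brevity.
    return {"query": q, "k": k, "results": []}
```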

Non-GPU Compute Constraints: Deployed entirely on CPU instances. This forced the architectural choice of highly optimized Bi-Encoders and FAISS HNSW over heavy Cross-Encoders to maintain sub-100ms speeds.

Asynchronous State Sync: Offline jobs handle embedding updates and push lightweight runtime checkpoints, keeping the serving API read-only and enabling zero-downtime reloads.
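
One plausible mechanism for those zero-downtime reloads (assumed, not confirmed by the case study): a background watcher loads the fresh checkpoint off the request path, then swaps it in with a single atomic reference assignment.

```python
# Hot-reload sketch: request handlers read _state["index"]; the watcher
# replaces it atomically, so no request ever sees a half-loaded index.
import os
import threading
import time

import faiss

CHECKPOINT = "artifacts/products.hnsw.faiss"  # hypothetical path
_state = {"index": faiss.read_index(CHECKPOINT)}


def watch_and_reload(interval_s: float = 60.0):
    last_mtime = os.path.getmtime(CHECKPOINT)
    while True:
        time.sleep(interval_s)
        mtime = os.path.getmtime(CHECKPOINT)
        if mtime > last_mtime:
            fresh = faiss.read_index(CHECKPOINT)  # slow load, off the request path
            _state["index"] = fresh               # atomic swap under the GIL
            last_mtime = mtime


threading.Thread(target=watch_and_reload, daemon=True).start()
```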

Project Repository & Exploration