Production Retrieval-Augmented Generation

Ask your documentation anything.

Hybrid retrieval fuses keyword precision with semantic understanding. A cross-encoder reranker sharpens relevance. Every answer cites its sources.

BM25 + Vector Search | Cross-Encoder Reranking | Enforced Citations | Fully Local Inference

How It Works

01

Chunking

Documents split into 512-token overlapping chunks via recursive character splitting
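
A minimal sketch of the chunking step, using only the standard library. It counts whitespace-separated words as a stand-in for tokens, and the 64-word overlap and the separator list are assumptions (the page gives only the 512-token size). The names `recursive_split` and `chunk_text` are hypothetical.

```python
from typing import List, Sequence


def recursive_split(text: str, max_tokens: int,
                    separators: Sequence[str] = ("\n\n", "\n", " ")) -> List[str]:
    """Split text into pieces of at most max_tokens words, trying coarse separators first."""
    if len(text.split()) <= max_tokens:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to fixed-size groups of words.
        words = text.split()
        return [" ".join(words[i:i + max_tokens])
                for i in range(0, len(words), max_tokens)]
    pieces: List[str] = []
    for part in text.split(separators[0]):
        pieces.extend(recursive_split(part, max_tokens, separators[1:]))
    return pieces


def chunk_text(text: str, max_tokens: int = 512, overlap: int = 64) -> List[str]:
    """Greedily pack pieces into chunks, carrying `overlap` words into the next chunk."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    # Reserve room for the carried-over words so no chunk exceeds max_tokens.
    pieces = recursive_split(text, max_tokens - overlap)
    chunks: List[str] = []
    current: List[str] = []
    for piece in pieces:
        words = piece.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Whitespace is normalized to single spaces in this sketch. A real 512-token budget would need the embedding model's tokenizer in place of the word count.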

02

Dual Index

BM25Okapi for keyword matching + ChromaDB vectors for semantic search
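
To illustrate what the keyword side of the dual index computes, here is a small pure-Python BM25 scorer. It is not the `BM25Okapi` class the page refers to. The IDF form (a non-negative variant), the `k1` and `b` values, and the lowercase whitespace tokenizer are assumptions made for the sketch. The vector side (ChromaDB) is not shown.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Documents that share no term with the query score exactly zero, which is the "misses synonyms" weakness that the vector index is meant to cover.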

03

Hybrid Fusion

Reciprocal Rank Fusion merges both retrieval signals into unified ranking
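
Reciprocal Rank Fusion can be sketched in a few lines: each document earns 1 / (k + rank) from every ranking it appears in, and the sums decide the fused order. The constant `k = 60` is a commonly used default and an assumption here, since the page does not state the value. The function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one list, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_ranking = ["a", "b", "c"]      # ids ordered by keyword score
vector_ranking = ["b", "c", "d"]    # ids ordered by embedding similarity
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Because only ranks are used, the BM25 and vector scores never need to be put on a common scale.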

04

Reranking

Cross-encoder ms-marco-MiniLM rescores top candidates for precision
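
The reranking step can be sketched as a function that rescores only the first `top_n` fused candidates and keeps the best `top_k`. The scorer is passed in as a callable. The toy word-overlap scorer below is a placeholder so the example runs without a model; in a real setup it would be replaced by a cross-encoder that scores each (query, passage) pair. The `top_n` and `top_k` values and all names are assumptions.

```python
from typing import Callable, List, Tuple


def rerank(query: str, candidates: List[str],
           score_fn: Callable[[str, str], float],
           top_n: int = 20, top_k: int = 5) -> List[Tuple[float, str]]:
    """Rescore the first top_n candidates and return the top_k as (score, passage)."""
    head = candidates[:top_n]  # the expensive scorer only sees the head of the list
    scored = [(score_fn(query, passage), passage) for passage in head]
    # Stable sort: passages with equal scores keep their fused order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def toy_overlap(query: str, passage: str) -> float:
    """Placeholder scorer: number of shared lowercase words (not a cross-encoder)."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))


candidates = [
    "BM25 ranks by keyword overlap",
    "Cross-encoder reranking scores query and passage jointly",
    "Ollama serves local models",
]
top = rerank("how does cross-encoder reranking work", candidates, toy_overlap, top_k=2)
```

Limiting the scorer to `top_n` candidates is what keeps the slower cross-encoder pass affordable.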

05

Generation

Ollama LLM generates cited answers grounded in retrieved context
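
One possible way to ground the answer and check its citations is sketched below. The page does not say how citations are enforced, so the prompt wording, the `[n]` citation format, and the function names are all assumptions. The call to the LLM itself is omitted.

```python
import re
from typing import List


def build_prompt(question: str, chunks: List[str]) -> str:
    """Number the retrieved chunks and instruct the model to cite them as [n]."""
    sources = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using only the sources below. "
        "Cite every claim with its source number in square brackets.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )


def citations_valid(answer: str, n_sources: int) -> bool:
    """True if the answer cites at least one source and every cited number exists."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return bool(cited) and all(1 <= i <= n_sources for i in cited)


prompt = build_prompt(
    "What does RRF do?",
    ["RRF merges ranked lists.", "BM25 scores keyword matches."],
)
```

An answer that fails `citations_valid` could be regenerated or rejected. This check confirms only that citation numbers are present and in range, not that the cited chunk supports the claim.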

Documents Indexed
Total Chunks
Embedding Model
Reranker
LLM Engine
Health Status

Technology Stack

Embeddings: all-MiniLM-L6-v2 (Local)
Vector Store: ChromaDB (Persistent)
Keyword Search: BM25Okapi (In-Memory)
Reranker: ms-marco-MiniLM-L-6-v2 (Cross-Encoder)
LLM: Ollama / phi3:mini (Local)
Framework: FastAPI (Async)

Approach Comparison

Method | Strength | Weakness | Status
BM25 (Keyword) | Exact term matching, zero-model overhead | Misses synonyms and semantic similarity | Active
Vector Search | Semantic understanding, handles paraphrasing | Can miss exact terms, depends on embedding quality | Active
Hybrid RRF | Best of both: keyword precision + semantic recall | Marginally higher latency from dual retrieval | Active
Cross-Encoder Rerank | High-precision reordering of the candidate set | Slower than a bi-encoder, so it runs on the top-N candidates only | Active
Naive RAG | Simple to implement | Lower recall, no precision refinement | Unused