Production RAG Knowledge Pipeline
End-to-end knowledge and retrieval layer for a production RAG assistant — a deterministic, rule-based ingestion and chunking pipeline (no ML, no LLM), structure-aware indexing for AWS Bedrock, two-stage retrieval with reranking, and a four-pillar evaluation harness.
A measurable, operable RAG knowledge layer: Recall@5 at 78% (MRR 0.66) on an expert-validated gold set (n=56), up from roughly 30% in early measurements under the platform's default chunking. Every change is guarded by stratified confidence-interval metrics and a 5-point recall regression gate that fails CI when recall drops beyond that threshold. Built and deployed on AWS Bedrock with EU data residency; currently in user-acceptance testing ahead of production rollout.
This is the part of a RAG system that decides whether it works: turning an enterprise knowledge base into retrieval-optimised chunks, making them findable against a vector backend without hybrid search, and proving retrieval quality with confidence-interval metrics rather than asserting it. The ingestion pipeline is fully deterministic — no ML inference, no heuristic scoring — so its output is reproducible and auditable.
This page covers the system design, the non-trivial engineering decisions, and how the system is evaluated and operated — not the implementation details.
End-to-end flow: ingestion → chunking → indexing → retrieval & reranking → grounded generation, with a four-pillar evaluation layer feeding the iteration loop.
Background
An enterprise knowledge base had to become reliably answerable through a retrieval-augmented assistant. The hard problems sit upstream of the language model: source content is semi-structured HTML where process knowledge is encoded as UI macros rather than headings; chunk boundaries determine retrieval quality and differ by content type; and the vector backend supports neither hybrid search nor metadata embedding. The work was to own the knowledge and retrieval layer end-to-end and make its quality measurable — generation itself is a managed model call.
Design Decisions
The ingestion pipeline is deterministic and rule-based rather than LLM-driven. In a regulated, public-sector context, reproducibility and auditability outweigh convenience: the same input always yields the same chunks, with no inference cost and no hallucination risk in the knowledge base. It runs as ordered, single-responsibility stages with a compute-once architecture — explicit contracts and invariants enforced at stage boundaries.
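The staged, contract-checked shape described above could look roughly like the following. This is a minimal sketch, not the project's code: the `Doc` type, the two stage functions, the `[edit]` residue rule and the specific invariants are all hypothetical stand-ins chosen for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Doc:
    """Hypothetical document record passed between stages."""
    doc_id: str
    body: str


Stage = Callable[[Doc], Doc]


def normalize_whitespace(doc: Doc) -> Doc:
    """Collapse runs of whitespace into single spaces."""
    return Doc(doc.doc_id, " ".join(doc.body.split()))


def strip_edit_links(doc: Doc) -> Doc:
    """Remove a hypothetical '[edit]' widget residue (illustrative rule)."""
    return Doc(doc.doc_id, doc.body.replace("[edit]", "").strip())


def check_contract(before: Doc, after: Doc, stage_name: str) -> None:
    """Invariants enforced at every stage boundary (assumed examples)."""
    if after.doc_id != before.doc_id:
        raise ValueError(f"{stage_name}: doc_id changed")
    if not after.body:
        raise ValueError(f"{stage_name}: produced an empty body")


def run_pipeline(doc: Doc, stages: Sequence[Stage]) -> Doc:
    """Run the stages in order, checking the contract after each one."""
    for stage in stages:
        result = stage(doc)
        check_contract(doc, result, stage.__name__)
        doc = result
    return doc


def fingerprint(doc: Doc) -> str:
    """Content hash used to confirm that reruns are byte-identical."""
    return hashlib.sha256(doc.body.encode("utf-8")).hexdigest()


STAGES = [normalize_whitespace, strip_edit_links]
```

Because every stage is a pure function of its input, running the same document through `run_pipeline` twice yields the same `fingerprint`, which is the property an audit would check.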
Chunking is content-type aware. Process, reference and definition pages are split by different strategies with their own size profiles, and document structure is recovered before chunking (UI macros are promoted to real headings) so that semantic signal survives into the embedding. The chunk is the unit of retrieval, so its quality caps everything downstream.
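As an illustration of content-type-aware chunking, the sketch below promotes a macro title to a heading and then dispatches to a per-type splitter and size limit. The macro markup (`panel-title`), the size numbers and the splitter choices are assumptions made for the example, not the values used in the real pipeline.

```python
import re

# Hypothetical macro markup; the real source format is not shown on this page.
MACRO_TITLE = re.compile(r'<div class="panel-title">(.*?)</div>')


def promote_macros(html: str) -> str:
    """Turn macro titles into real headings so structure survives chunking."""
    return MACRO_TITLE.sub(r"<h2>\1</h2>", html)


def split_on_headings(text: str) -> list[str]:
    """One section per <h2> heading (zero-width split keeps the heading)."""
    return [p.strip() for p in re.split(r"(?=<h2>)", text) if p.strip()]


def split_on_lines(text: str) -> list[str]:
    """One section per non-empty line, e.g. one glossary entry per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def split_by_size(text: str, max_chars: int) -> list[str]:
    """Greedily pack whole sentences; an oversized sentence stays intact."""
    chunks, current = [], ""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Assumed strategy table: (section splitter, maximum characters per chunk).
STRATEGIES = {
    "process": (split_on_headings, 1200),
    "reference": (split_on_headings, 800),
    "definition": (split_on_lines, 400),
}


def chunk(html: str, content_type: str) -> list[str]:
    splitter, max_chars = STRATEGIES[content_type]
    chunks = []
    for section in splitter(promote_macros(html)):
        chunks.extend(split_by_size(section, max_chars))
    return chunks
```

The dispatch table keeps each strategy explicit and testable on its own, which fits a rule-based pipeline better than one splitter with many flags.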
Because the embedding model only sees the chunk body and the backend has no hybrid search, structural and topic signal is enriched directly into the chunk body, and the query side is expanded symmetrically with the same glossary — closing the vocabulary gap between domain terminology and everyday phrasing.
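The symmetric enrichment idea can be sketched as one expansion function applied to both sides. The glossary entries, the `[terms: ...]` suffix format and the naive substring matching below are illustrative assumptions; a real implementation would need proper tokenisation and its own glossary.

```python
# Hypothetical glossary: domain term mapped to everyday phrasings.
GLOSSARY = {
    "leave of absence": ["time off", "vacation"],
    "reimbursement": ["refund", "expenses"],
}


def expand(text: str, glossary: dict[str, list[str]]) -> str:
    """Append glossary variants that are related to the text but absent from it."""
    lowered = text.lower()
    extra: list[str] = []
    for term, synonyms in glossary.items():
        variants = [term, *synonyms]
        if any(v in lowered for v in variants):
            extra.extend(v for v in variants if v not in lowered and v not in extra)
    return text if not extra else f"{text}\n[terms: {', '.join(extra)}]"


def enrich_chunk(body: str, title: str, heading_path: list[str],
                 glossary: dict[str, list[str]]) -> str:
    """Write structural signal and glossary terms into the chunk body itself."""
    header = " > ".join([title, *heading_path])
    return expand(f"{header}\n{body}", glossary)


def expand_query(query: str, glossary: dict[str, list[str]]) -> str:
    """Apply the same glossary on the query side."""
    return expand(query, glossary)
```

With this shape, a query about "time off" and a chunk about "leave of absence" both end up carrying the full set of variants, so the embedding model sees overlapping vocabulary even though the backend cannot index metadata separately.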
Retrieval is two-stage: vector recall followed by a reranker, the most effective lever available on this backend — its contribution measured in isolation against the gold set, not assumed.
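A toy version of the two-stage flow is sketched below: a wide vector-similarity pass followed by a reranker over the shortlisted candidates. The in-memory index, the `recall_k` and `final_k` defaults and the token-overlap scorer are placeholders; in the real system the second stage is a reranking model, which this example only stands in for.

```python
import math
from typing import Callable, Sequence

IndexItem = tuple[str, Sequence[float], str]  # (chunk_id, vector, text)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def overlap_score(query: str, text: str) -> float:
    """Placeholder reranker: shared lowercase tokens (not a real model)."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def retrieve(query_vec: Sequence[float], query_text: str,
             index: Sequence[IndexItem],
             rerank_score: Callable[[str, str], float] = overlap_score,
             recall_k: int = 20, final_k: int = 5) -> list[str]:
    # Stage 1: cheap vector recall over the whole index.
    candidates = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                        reverse=True)[:recall_k]
    # Stage 2: rerank only the shortlisted candidates.
    reranked = sorted(candidates, key=lambda item: rerank_score(query_text, item[2]),
                      reverse=True)
    return [chunk_id for chunk_id, _, _ in reranked[:final_k]]
```

Keeping the reranker as an injected function also makes it straightforward to run the same queries with and without the second stage, which is how its contribution can be measured in isolation.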
How I Build Production RAG
RAG here is treated as a measurable loop, not a linear build: ingest, chunk, index, retrieve, generate — with evaluation running across the whole chain and operations (network constraints, data residency, cost, reproducibility) underneath, both feeding the next iteration.
Those two layers, evaluation and operations, are what separate a proof of concept from production. Most RAG work stops when generation looks right; here, every pipeline change ships as a measurable bet against the evaluation harness, and the operational constraints are treated as architecture, not afterthoughts.
Evaluation
Quality is treated as a measured property, not a feeling. A four-pillar harness covers pipeline structure, retrieval (Recall@k, MRR and nDCG with 95% confidence intervals, stratified by source and content type), a direct knowledge-base-vs-bot answer comparison on an identical query population, and a structured human-in-the-loop review where domain experts rate answers across five quality dimensions.
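For the retrieval pillar, the metrics and intervals could be computed along these lines. This is a sketch under assumptions: the page does not say how the 95% intervals are derived, so a percentile bootstrap is used here as one common choice, relevance is treated as binary, and the row format in `stratified_recall` is hypothetical.

```python
import math
import random


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant ids found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG@k."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


def bootstrap_ci(values: list[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for the mean; seeded for reproducibility."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


def stratified_recall(rows, k: int = 5) -> dict:
    """rows: iterable of (stratum, ranked_ids, relevant_ids) tuples."""
    by_stratum: dict[str, list[float]] = {}
    for stratum, ranked, relevant in rows:
        by_stratum.setdefault(stratum, []).append(recall_at_k(ranked, relevant, k))
    return {s: (sum(v) / len(v), bootstrap_ci(v)) for s, v in by_stratum.items()}
```

On a gold set of a few dozen queries the intervals will be wide, which is the reason to report them next to the point estimate rather than the point estimate alone.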
Every pipeline change is guarded: a 5-point recall regression gate fails the run (and CI) if recall drops by more than that threshold, and a byte-exact snapshot test protects the reporting layer. Results and expert feedback feed the next iteration; the loop is the product. On the expert-validated gold set (n=56), Recall@5 stands at 78% (MRR 0.66), up from roughly 30% in early measurements under the platform's default chunking. The reranker's contribution is measured in isolation: running identical queries before and after reranking, Recall@5 on the gold set rises from 67% to 78% (MRR 0.54 → 0.66), and the corresponding figure on the full 287-query pool rises from 56% to 76%. Deliberately hard edge-case probes are tracked in separate suites and do not count toward those scores.
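The two guards can be sketched as small, deterministic checks. The function names, the choice to pass recall as percentages, and the JSON report format are assumptions for illustration; only the 5-point threshold and the byte-exact comparison come from the description above.

```python
import json


def gate_passes(baseline_recall_pct: float, current_recall_pct: float,
                max_drop_points: float = 5.0) -> bool:
    """True while the drop in recall stays within the allowed points."""
    drop = round(baseline_recall_pct - current_recall_pct, 6)
    return drop <= max_drop_points


def enforce_gate(baseline_recall_pct: float, current_recall_pct: float) -> None:
    """Exit non-zero so that a CI job running the evaluation fails."""
    if not gate_passes(baseline_recall_pct, current_recall_pct):
        raise SystemExit(
            f"recall regression: {baseline_recall_pct:.1f}% -> {current_recall_pct:.1f}%"
        )


def render_report(metrics: dict) -> bytes:
    """Deterministic serialisation: sorted keys, fixed separators, UTF-8."""
    return json.dumps(metrics, sort_keys=True, separators=(",", ":")).encode("utf-8")


def snapshot_matches(metrics: dict, snapshot: bytes) -> bool:
    """Byte-exact comparison against a stored snapshot of the report."""
    return render_report(metrics) == snapshot
```

A byte-exact snapshot only works if the renderer is itself deterministic, which is why key order and separators are pinned in `render_report`.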
The query pool is split into three suites by purpose: a frozen, content-hashed benchmark that delivers the verdict; an append-only regression suite where every fixed bug becomes a permanent query; and an LLM-generated audit suite for coverage diagnosis — deliberately kept out of the headline metric, because diagnostic data is not human-validated. Before the split, one pool served all three purposes and unvalidated queries corrupted the aggregate.
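One way to freeze a benchmark suite by content hash is sketched below. The query record fields (`id`, `text`, `gold`) and the canonicalisation rules are hypothetical; the point is only that any edit to the suite changes the hash and is therefore visible.

```python
import hashlib
import json


def suite_hash(queries: list[dict]) -> str:
    """Order-independent SHA-256 over a canonical JSON form of the suite."""
    canonical = json.dumps(
        sorted(queries, key=lambda q: q["id"]),
        sort_keys=True, ensure_ascii=False, separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_frozen(queries: list[dict], expected_hash: str) -> None:
    """Refuse to report a verdict on a benchmark that has drifted."""
    actual = suite_hash(queries)
    if actual != expected_hash:
        raise RuntimeError(
            f"benchmark changed: expected {expected_hash[:12]}, got {actual[:12]}"
        )
```

Storing the expected hash next to the reported metrics ties every headline number to the exact query set that produced it.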
Synthetic evaluation queries are validated, not trusted: the synthetic cohort reproduces the expert gold set's difficulty within a few points across multiple metrics, and is used strictly as a statistical surrogate. A drastic divergence is read as a ground-truth problem, never as a win.
Operational Considerations
The system spans two isolated networks: the knowledge source and the cloud environment cannot talk to each other directly. This was solved architecturally with a region-agnostic processing pipeline and a region-aware cloud worker, coupled only through idempotent, encrypted file bundles. Reproducibility is enforced, and cross-region state is ruled out as foot-gun protection.
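The idempotency property of the file bundles can be illustrated with content-addressed naming: the same chunks always map to the same file, so re-exporting or re-importing is a no-op. This sketch is an assumption about one way to get that property; the bundle layout and names are hypothetical, and encryption is deliberately left out of the example.

```python
import hashlib
import json
from pathlib import Path


def bundle_bytes(chunks: list[dict]) -> bytes:
    """Canonical serialisation: chunks sorted by id, keys sorted, UTF-8."""
    ordered = sorted(chunks, key=lambda c: c["id"])
    return json.dumps(ordered, sort_keys=True, ensure_ascii=False,
                      separators=(",", ":")).encode("utf-8")


def write_bundle(chunks: list[dict], out_dir: Path) -> Path:
    """Write the bundle under a name derived from its content.

    Writing the same content twice yields the same path and leaves the
    existing file untouched, which makes the hand-over idempotent.
    """
    payload = bundle_bytes(chunks)
    digest = hashlib.sha256(payload).hexdigest()
    path = out_dir / f"bundle-{digest[:16]}.json"
    if not path.exists():
        path.write_bytes(payload)
    return path
```

Because the bundle name depends only on content, neither side needs to keep state about what the other has already processed.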
The assistant is deployed across two AWS regions with EU data residency; the deploy region is fixed deterministically in code, and access is gated. Cost is controlled deliberately: a vector index instead of a managed search cluster, per-run reranker cost counters, and a deterministic replay mode that avoids live API calls during iteration.
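A deterministic replay mode can be approximated with a record-and-replay wrapper keyed by a hash of the request. The class and function names below are hypothetical, and the in-memory dictionary stands in for whatever persistent store the real system uses.

```python
import hashlib
import json
from typing import Callable, Optional


def request_key(request: dict) -> str:
    """Stable key: canonical JSON of the request, hashed."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ReplayClient:
    """Serve recorded responses; call the live API only when allowed."""

    def __init__(self, recordings: dict,
                 live_call: Optional[Callable[[dict], dict]] = None):
        self.recordings = recordings
        self.live_call = live_call  # None means replay-only mode

    def call(self, request: dict) -> dict:
        key = request_key(request)
        if key in self.recordings:
            return self.recordings[key]
        if self.live_call is None:
            raise KeyError(f"no recording for request {key[:12]}")
        response = self.live_call(request)
        self.recordings[key] = response
        return response
```

In replay-only mode a missing recording raises instead of silently calling out, so an iteration run can never incur API cost by accident.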
The managed knowledge base silently re-chunked the pipeline's already semantically chunked output into fixed-size pieces with overlap, destroying the per-content-type splits. The evaluation caught it; the fix was to disable managed chunking on the index. Platform defaults are part of the system, and knowing which ones to disable is operating knowledge a demo never surfaces.
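Assuming the managed knowledge base is Amazon Bedrock Knowledge Bases, the relevant setting is the data source's chunking strategy. The fragment below shows the shape of that setting as I understand it from the service's API, plus a hypothetical guard function; it is a sketch to verify against the current AWS documentation, not the project's deployment code.

```python
# Assumed shape of the data source ingestion settings in Amazon Bedrock
# Knowledge Bases; "NONE" asks the service not to split files any further.
VECTOR_INGESTION_CONFIGURATION = {
    "chunkingConfiguration": {
        "chunkingStrategy": "NONE",
    }
}


def managed_chunking_disabled(config: dict) -> bool:
    """Hypothetical guard a deploy script or test could run before ingestion."""
    strategy = config.get("chunkingConfiguration", {}).get("chunkingStrategy")
    return strategy == "NONE"
```

In boto3 this dictionary would typically be passed as the `vectorIngestionConfiguration` argument when creating the data source with the `bedrock-agent` client. With managed chunking off, the service is expected to treat each uploaded file as a single chunk, so the pipeline would need to emit one file per chunk.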
Scope is deliberately bounded. The ingestion pipeline is rule-based and does not attempt LLM chunking; generation and the chat frontend are a managed model and a separate team's surface. The ownership here is the knowledge, retrieval and evaluation layer plus the multi-region deployment — the parts that separate a production RAG system from a demo.