Production RAG is decided before the model runs
June 2026 · 6 min read
Most RAG demos work. Most RAG systems in production don't — and when they fail, the instinct is to reach for a better model, a bigger context window, a smarter prompt. That's almost always the wrong layer. By the time a question reaches the language model, the outcome is mostly decided. What decides it is the part nobody demos: how the knowledge base was turned into retrievable chunks. This is an argument for making that layer deterministic — and what you get when you do.
The model is the last thing that matters
A RAG answer is only as good as the chunks retrieved for it. If the right passage isn't in the retrieved set, no model — however capable — can ground an answer in it. It will either omit the fact or invent one. So retrieval quality is the ceiling on answer quality, and chunk quality is the ceiling on retrieval. Everything downstream inherits whatever the ingestion layer decided.
That reframes where the engineering effort belongs. Generation is increasingly a managed model call — someone else's API, improving every quarter without your help. The part you own, the part specific to your knowledge base, is everything upstream: extraction, structure recovery, chunking, indexing, retrieval. That's where a production system is won or lost — and it's the part no demo shows you.
Why deterministic, not "AI-powered"
The fashionable move is to put an LLM in the ingestion path too — let it chunk, let it summarize, let it tag. It's seductive, and for a knowledge base it's usually a mistake. A deterministic, rule-based pipeline means the same source document always produces the same chunks. That buys three things that matter more than convenience:
- Reproducibility: you can re-run ingestion and diff the output. A change in the chunks is a change you made, not weather.
- Auditability: in a regulated context, "why does the system know this?" has to have an answer. A deterministic pipeline has one; an LLM that chunked differently last Tuesday does not.
- No hallucination in the corpus itself: the moment an LLM rewrites your content during ingestion, your retrieval corpus contains text no human wrote. You've moved the hallucination risk upstream, where it's harder to see.
None of this is anti-AI. The generation step is still a model call. The point is to keep the model out of the layer that has to be trustworthy and reproducible, and to spend the inference budget where it actually changes the answer.
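To make the reproducibility claim concrete, here is a minimal sketch of a rule-based splitter whose output can be fingerprinted and diffed between runs. The function names, the paragraph-packing rule, and the `max_chars` default are illustrative assumptions, not the pipeline this article is drawn from.

```python
import hashlib


def chunk_paragraphs(text: str, max_chars: int = 400) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars (hypothetical rule).

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def fingerprint(chunks: list[str]) -> list[str]:
    """Short content hash per chunk; identical input gives identical hashes."""
    return [hashlib.sha256(c.encode("utf-8")).hexdigest()[:12] for c in chunks]


def diff_runs(old: list[str], new: list[str]) -> dict[str, list[str]]:
    """Report which chunk hashes appeared or disappeared between two runs."""
    old_set, new_set = set(old), set(new)
    return {
        "added": [h for h in new if h not in old_set],
        "removed": [h for h in old if h not in new_set],
    }
```

Because nothing in the splitter is sampled, a non-empty diff between two ingestion runs can only come from a change in the source or in the rules, which is exactly the property an audit needs.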
The chunk is the unit of retrieval
If the chunk is what gets embedded and retrieved, its boundaries decide everything. Split a procedure in the middle and neither half retrieves well. Merge two unrelated topics and the embedding is a blur of both. So chunking can't be one-size-fits-all: a reference table, a step-by-step procedure, and a definition page have different natural boundaries and need different strategies and size profiles. Content-type-aware chunking — detect the kind of page, split it the way that kind of page wants to be split — is unglamorous, and it's most of the battle.
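As a rough illustration of content-type-aware chunking, the sketch below detects a page type with simple rules and routes it to a matching splitter. The three page types, the detection thresholds, and the pipe-delimited table format are assumptions chosen for the example; a real knowledge base would need its own detectors.

```python
import re

STEP = re.compile(r"^\s*\d+[.)]\s+")  # lines such as "1. Do this" or "2) Do that"


def detect_page_type(text: str) -> str:
    """Classify a page as 'table', 'procedure', or 'prose' using fixed rules."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "prose"
    table_rows = sum(1 for ln in lines if ln.count("|") >= 2)
    steps = sum(1 for ln in lines if STEP.match(ln))
    if table_rows / len(lines) >= 0.5:
        return "table"
    if steps >= 2:
        return "procedure"
    return "prose"


def chunk_table(text: str, rows_per_chunk: int = 20) -> list[str]:
    """Split rows into groups, repeating the header (assumed to be line 1)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header, rows = lines[0], lines[1:]
    groups = [rows[i:i + rows_per_chunk] for i in range(0, len(rows), rows_per_chunk)]
    return ["\n".join([header, *group]) for group in groups] or [header]


def chunk_procedure(text: str) -> list[str]:
    """Keep the procedure whole so no step is separated from its neighbours."""
    return [text.strip()]


def chunk_prose(text: str) -> list[str]:
    """Split prose on blank lines, one paragraph per chunk."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


STRATEGIES = {"table": chunk_table, "procedure": chunk_procedure, "prose": chunk_prose}


def chunk_page(text: str) -> list[str]:
    """Detect the page type, then split it with the strategy for that type."""
    return STRATEGIES[detect_page_type(text)](text)
```

Repeating the header in every table chunk is one way to keep each chunk interpretable on its own; a long procedure would still need a size limit, which this sketch leaves out.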
There's a subtler problem before chunking even starts: structure is often encoded in ways a naive parser doesn't see. A page might carry its real hierarchy in UI widgets or formatting rather than in headings, so a flat extraction throws that signal away. Recovering it — promoting the implicit headings to real ones before the splitter runs — is what keeps semantic structure alive all the way into the embedding.
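One simple case of structure recovery can be sketched as follows, assuming the extractor emits Markdown-like text in which a standalone bold line is acting as a heading. That convention, the `##` target level, and the 80-character cap are assumptions for illustration; hierarchy carried by UI widgets would need a parser specific to the source.

```python
import re

# A line that consists only of one bold run, e.g. "**Resetting the device**".
BOLD_LINE = re.compile(r"^\*\*([^*]+)\*\*$")


def promote_implicit_headings(text: str, level: int = 2, max_len: int = 80) -> str:
    """Rewrite standalone bold lines as real headings so a splitter can see them."""
    out = []
    for line in text.splitlines():
        match = BOLD_LINE.match(line.strip())
        if match and len(match.group(1)) <= max_len:
            out.append("#" * level + " " + match.group(1).strip())
        else:
            out.append(line)  # inline bold inside a sentence is left untouched
    return "\n".join(out)
```

Running this before the splitter means heading-based boundaries and section paths exist for pages that never used heading markup.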
Closing the vocabulary gap
Embedding models see only what you give them. If the chunk body is just prose, and the backend has no hybrid (keyword + vector) search to fall back on, then any structural or topical signal that isn't in the body is simply invisible at retrieval time. The embedding can't retrieve a context it was never shown.
Two moves help, and they have to be symmetric. On the ingestion side, enrich the chunk body with the signal it's missing — the section it belongs to, the domain terms it's about. On the query side, expand the incoming question with the same glossary, so domain terminology and everyday phrasing meet in the middle. Enriching one side without the other just shifts the gap; doing both closes it. The symmetry is the point.
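The symmetry can be sketched with one shared glossary used on both sides. The glossary entries, the `Section:` and `Terms:` header format, and the function names below are hypothetical; the only point being illustrated is that ingestion and query expansion call the same matcher.

```python
import re

# Hypothetical glossary: canonical domain term -> everyday phrasings.
GLOSSARY = {
    "annual leave": ["vacation", "holiday"],
    "remote work": ["work from home"],
}


def matched_terms(text: str, glossary: dict[str, list[str]]) -> list[str]:
    """Canonical terms whose own form or any synonym occurs in text (whole words)."""
    lowered = text.lower()
    hits = []
    for term, synonyms in glossary.items():
        forms = [term, *synonyms]
        if any(re.search(r"\b" + re.escape(f) + r"\b", lowered) for f in forms):
            hits.append(term)
    return sorted(hits)


def enrich_chunk(body: str, section_path: list[str], glossary: dict[str, list[str]]) -> str:
    """Ingestion side: prepend section path and canonical terms to the chunk body."""
    header = "Section: " + " > ".join(section_path)
    terms = matched_terms(body, glossary)
    if terms:
        header += "\nTerms: " + ", ".join(terms)
    return header + "\n\n" + body


def expand_query(query: str, glossary: dict[str, list[str]]) -> str:
    """Query side: append canonical terms the question implies but does not contain."""
    additions = [t for t in matched_terms(query, glossary) if t not in query.lower()]
    return query if not additions else f"{query} ({', '.join(additions)})"
```

With this glossary, `expand_query("How many vacation days do I get?", GLOSSARY)` appends `annual leave`, the same canonical term that `enrich_chunk` writes into any chunk mentioning holidays or vacation, so both sides land on shared vocabulary before embedding.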
Measure the layer, don't assert it
This discipline holds up because retrieval is measurable. It's one of the few parts of an LLM system with honest, classical metrics: Recall@k, MRR, nDCG — computed over a labelled query set, with confidence intervals so you know which differences are real and which are noise.
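For reference, these metrics fit in a few lines. The sketch below assumes binary relevance labels, ranked lists without duplicate ids, and a percentile bootstrap over per-query scores for the confidence interval; the resample count and seed are arbitrary choices, not values from the system described here.

```python
import math
import random


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result; MRR is the mean of this over queries."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-gain nDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


def bootstrap_ci(scores: list[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-query scores."""
    rng = random.Random(seed)  # fixed seed keeps the interval itself reproducible
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high
```

Compute each metric per query, average across the labelled set, and pass the per-query scores to `bootstrap_ci`; if the intervals of two pipeline variants overlap heavily, the difference between them may well be noise.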
Once you have that, every change to the pipeline is a measurable bet. Two-stage retrieval (vector recall followed by a reranker) either moves recall or it doesn't, and you can see by how much. In the system this is drawn from, reranking plus structure enrichment lifted Recall@5 from 53% to 67%, a gain of 14 percentage points: measured, not assumed. Wire a regression gate to those numbers and the pipeline can't silently get worse on the next change.
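A regression gate of this kind can be very small. The sketch below compares a run's metrics with a stored baseline and reports anything that dropped by more than a tolerance; the metric names, the 0.01 tolerance, and the idea of failing a CI job on a non-empty result are assumptions for illustration.

```python
def regression_gate(baseline: dict[str, float], current: dict[str, float],
                    tolerance: float = 0.01) -> list[str]:
    """Return one message per metric that is missing or fell below baseline - tolerance."""
    failures = []
    for name, base in baseline.items():
        value = current.get(name)
        if value is None:
            failures.append(f"{name}: missing from current run")
        elif value < base - tolerance:
            failures.append(
                f"{name}: {value:.3f} is below baseline {base:.3f} (tolerance {tolerance:.3f})"
            )
    return failures


if __name__ == "__main__":
    # Baseline mirrors the Recall@5 figure quoted above; the current value is made up.
    problems = regression_gate({"recall@5": 0.67}, {"recall@5": 0.60})
    for message in problems:
        print(message)
    raise SystemExit(1 if problems else 0)
```

A sensible tolerance is probably tied to the width of the confidence interval on the labelled set, so the gate fails on real regressions rather than on sampling noise.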
The takeaway
"Deterministic" sounds like the opposite of moving fast with AI. It isn't. It's what lets you move fast without the system quietly rotting — because every part of it is reproducible, measurable, and yours to reason about. The model will keep getting better on its own. The ingestion layer only gets better if you engineer it. That asymmetry is the whole argument for where to spend your time.