Sovereign RAG · Evaluation

Measured per step, not asserted

Anyone can claim a RAG system “works.” This page is the proof for the live demo — and it isolates each processing step so a weakness can be located, not just noticed. Three pillars: the ingestion that builds the chunks, the retrieval that finds them, and the rerank step measured on its own. The gold set and the method are disclosed in full below.

These numbers are for this demo only (588 chunks of the EU AI Act + GDPR) and are unrelated to retrieval figures quoted for other projects on this site, which were measured on different systems and corpora.

1 · Ingestion & chunking

GOOD

Measured directly on the chunks, with no retrieval — does the pipeline turn two regulations into clean, complete, embedding-friendly units?

100.0%
Structure coverage — every Article & Recital captured
221.5
Avg words / chunk (embedding-friendly band)
0.575
Lexical density (content-word ratio)
Chunk-size distribution across all 588 chunks (words per chunk):

Words per chunk   Chunks
0-100             172
100-200           181
200-350           122   (embedding-friendly band)
350-500            50
500-700            51
700+               12

The 200–350 word band is where the embedding model is sharpest.
Regulation   Articles   Recitals   Chunks   Split articles
EU AI Act    113/113    180/180    309      11
GDPR         99/99      173/173    279      7

Extraction is structural and lossless (one chunk per recital; one per article, long articles split on paragraph/point boundaries), so there is no noise-filter stage — willma's noise_ratio and filter_loss have no analogue here and are N/A. Zero-chunk provisions: 0 · oversize-chunk rate 0.0% · max 837 words.
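The ingestion checks above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the whitespace tokenisation, the handling of boundary values (a chunk of exactly 100 words) and the idea that every chunk carries the ID of the provision it came from are all assumptions.

```python
from bisect import bisect_right
from collections import Counter

# Bin edges (in words) as read from the distribution above; whether a chunk of
# exactly 100 words falls in "0-100" or "100-200" is an assumption here.
EDGES = [100, 200, 350, 500, 700]
LABELS = ["0-100", "100-200", "200-350", "350-500", "500-700", "700+"]

# Counts reported on this page, used only as a consistency check.
REPORTED = {"0-100": 172, "100-200": 181, "200-350": 122,
            "350-500": 50, "500-700": 51, "700+": 12}


def word_count(text: str) -> int:
    """Count whitespace-separated tokens (assumed; the real tokeniser is not stated)."""
    return len(text.split())


def size_histogram(chunks: list[str]) -> dict[str, int]:
    """Bucket chunks by word count into the bands listed in LABELS."""
    counts = Counter(LABELS[bisect_right(EDGES, word_count(c))] for c in chunks)
    return {label: counts[label] for label in LABELS}


def structure_coverage(expected: set[str], chunk_provisions: list[str]) -> float:
    """Share of expected provisions (articles, recitals) that own at least one chunk."""
    if not expected:
        return 1.0
    return len(expected & set(chunk_provisions)) / len(expected)
```

The reported band counts sum to 588, which matches the 309 + 279 chunks in the per-regulation table.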

2 · Retrieval

The gold set (40 questions, 16 DE / 24 EN) is run through the production path: multilingual embedding, Qdrant top-20, rerank to top-10.

97.5%
Recall@5
0.890
MRR
0.911
nDCG@5
Retrieval quality on the production path (all values 0–1, 40 questions): Recall@5 0.975 · MRR 0.890 · nDCG@5 0.911 · Primary-Hit@5 0.975 · Diversity@5 0.970.
Slice     n    Recall@5   MRR     nDCG@5   Primary@5   Qual. Recall@5   Diversity@5
Overall   40   97.5%      0.890   0.911    97.5%       97.5%            97.0%
German    16   93.8%      0.875   0.891    93.8%       93.8%            97.5%
English   24   100.0%     0.899   0.925    100.0%      100.0%           96.7%
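The three headline metrics can be reproduced from the post-rerank ranks in the gold-set table at the end of this page. The sketch below assumes exactly one target provision per question (as that table lists), binary relevance, and MRR computed over whatever rank the target reached. The function names are illustrative, and Primary@5, Qual. Recall@5 and Diversity@5 are left out because their definitions are not given here.

```python
import math
from typing import Optional, Sequence

Rank = Optional[int]  # 1-based rank of the target provision, None for a miss


def recall_at_k(ranks: Sequence[Rank], k: int = 5) -> float:
    """Fraction of questions whose target appears in the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mrr(ranks: Sequence[Rank]) -> float:
    """Mean reciprocal rank; a miss contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


def ndcg_at_k(ranks: Sequence[Rank], k: int = 5) -> float:
    """nDCG@k with one relevant item per question, so the ideal DCG is 1."""
    gains = (1.0 / math.log2(r + 1) for r in ranks if r is not None and r <= k)
    return sum(gains) / len(ranks)


# Post-rerank ranks tallied from the gold-set table:
# 33 questions at #1, 4 at #2, 1 at #3, 1 at #4, 1 miss.
POST_RERANK = [1] * 33 + [2] * 4 + [3, 4, None]

print(round(recall_at_k(POST_RERANK), 3))  # 0.975
print(round(mrr(POST_RERANK), 3))          # 0.89
print(round(ndcg_at_k(POST_RERANK), 3))    # 0.911
```

Under these assumptions the output matches the Overall row: 0.975, 0.890 and 0.911.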

Negative test — out-of-domain honesty

Six off-topic questions, all of which should be refused. All six (100.0%) scored below the cut-off and were rejected. In-corpus questions average 0.698 at rank 1 versus 0.196 for off-topic questions, a clean separation margin of 0.501.

Average top-1 rerank score: in-corpus (positive) 0.698 · off-topic 0.196 · cut-off 0.3. In-corpus questions sit well above the cut-off, off-topic questions well below.
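The honesty cut-off can be pictured as a simple threshold on the best rerank score. The sketch below is an assumption about how such a check might look, not the demo's code: the function names are hypothetical, and whether a score exactly equal to the cut-off is refused is not stated on this page.

```python
from statistics import fmean
from typing import Sequence

CUT_OFF = 0.3  # cut-off value quoted on this page


def should_refuse(rerank_scores: Sequence[float], cut_off: float = CUT_OFF) -> bool:
    """Refuse when no candidate reaches the cut-off (or nothing was retrieved)."""
    return not rerank_scores or max(rerank_scores) < cut_off


def separation_margin(positive_top1: Sequence[float],
                      negative_top1: Sequence[float]) -> float:
    """Mean top-1 score of in-corpus questions minus that of off-topic ones."""
    return fmean(positive_top1) - fmean(negative_top1)
```

For example, `should_refuse([0.196, 0.12])` returns `True` and `should_refuse([0.698, 0.41])` returns `False`. Subtracting the two rounded averages gives 0.502; the reported 0.501 presumably comes from the unrounded scores.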

3 · Rerank lift

The same metrics computed on the raw vector order (before rerank) and after rerank. The delta is exactly what the rerank step contributes — isolated, not inferred. It moved 11 questions to rank 1 that the vector search alone ranked lower.

Lift from the rerank step (after minus before): Recall@5 +0.075 · MRR +0.106 · nDCG@5 +0.102. This is what the reranker adds on top of plain vector search.
Metric     Before rerank   After rerank   Lift
Recall@5   90.0%           97.5%          +0.075
MRR        0.784           0.890          +0.106
nDCG@5     0.809           0.911          +0.102
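The lift can be checked against the per-question (pre, post) ranks in the gold-set table. The sketch below copies those pairs in order g01 to g40 and treats the single "miss" row (g39) as absent both before and after rerank, which is an assumption. The variable and function names are illustrative.

```python
from typing import Optional

Rank = Optional[int]

# (pre-rerank rank, post-rerank rank) for g01..g40, copied from the gold-set table.
PAIRS: list[tuple[Rank, Rank]] = [
    (1, 1), (1, 1), (1, 1), (1, 1), (2, 1), (1, 1), (11, 1), (1, 1), (1, 1), (2, 1),   # g01-g10
    (1, 1), (1, 1), (1, 1), (2, 1), (3, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1),    # g11-g20
    (1, 2), (1, 2), (2, 1), (1, 2), (2, 1), (1, 1), (9, 4), (1, 1), (2, 1), (16, 1),   # g21-g30
    (1, 1), (1, 1), (1, 2), (1, 1), (1, 3), (4, 1), (2, 1), (1, 1), (None, None), (1, 1),  # g31-g40
]


def recall_at_k(ranks: list[Rank], k: int = 5) -> float:
    """Fraction of questions whose target appears in the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


pre = [p for p, _ in PAIRS]
post = [q for _, q in PAIRS]

# Questions the reranker moved to rank 1 from a lower position.
promoted_to_top = sum(1 for p, q in PAIRS if q == 1 and p is not None and p > 1)
# Questions that were at rank 1 before rerank and ended lower.
demoted_from_top = sum(1 for p, q in PAIRS if p == 1 and q is not None and q > 1)

recall_lift = recall_at_k(post) - recall_at_k(pre)
print(promoted_to_top, demoted_from_top, round(recall_lift, 3))
```

With these pairs the script reports 11 promotions to rank 1 and a Recall@5 lift of 0.075, in line with the figures above. It also counts 5 questions that slipped from rank 1 to rank 2 or 3 after rerank.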

Methodology & the full gold set

Three pillars, each isolating one processing step. Pillar 1 (ingestion) is measured directly on the chunk corpus, with no retrieval. Pillars 2 and 3 run a self-labelled gold set (16 DE / 24 EN) through the production path — multilingual embedding, Qdrant top-20, rerank to top-10 — and score it. Pillar 3 compares the raw vector order against the reranked order, isolating the rerank step's lift. A negative set of out-of-domain questions checks the honesty cut-off. Binary relevance; nothing is tuned on the gold set.

embedding: bge-multilingual-gemma2
rerank: qwen3-embedding-8b
retrieval_top_k: 20 · eval_depth: 10 · cut-off: 0.3
generated: 2026-06-05T08:28:29+00:00

All 40 questions, with the rank of the target provision before and after rerank. Nothing is hidden, including the 1 miss.

# Lang Question Target Pre → Post
g01 en What is the right to erasure? 32016R0679:article:17 #1 #1
g02 de Was umfasst das Auskunftsrecht der betroffenen Person? 32016R0679:article:15 #1 #1
g03 en Do I have a right to data portability? 32016R0679:article:20 #1 #1
g04 de Wie kann ich die Berichtigung falscher Daten verlangen? 32016R0679:article:16 #1 #1
g05 en On what legal bases may personal data be processed lawfully? 32016R0679:article:6 #2 #1
g06 de Welche Bedingungen gelten fuer eine wirksame Einwilligung? 32016R0679:article:7 #1 #1
g07 en How are special categories of sensitive personal data protected? 32016R0679:article:9 #11 #1
g08 de Ab welchem Alter koennen Kinder selbst in die Datenverarbeitung einwilligen? 32016R0679:article:8 #1 #1
g09 en When must a personal data breach be reported to the supervisory authority? 32016R0679:article:33 #1 #1
g10 de Wann muss eine Datenpanne den betroffenen Personen mitgeteilt werden? 32016R0679:article:34 #2 #1
g11 en When is a data protection impact assessment required? 32016R0679:article:35 #1 #1
g12 de Wann muss ein Datenschutzbeauftragter benannt werden? 32016R0679:article:37 #1 #1
g13 en What records of processing activities must a controller keep? 32016R0679:article:30 #1 #1
g14 en What does data protection by design and by default require? 32016R0679:article:25 #2 #1
g15 de Wie kann ich der Verarbeitung meiner Daten widersprechen? 32016R0679:article:21 #3 #1
g16 en Are decisions based solely on automated processing and profiling allowed? 32016R0679:article:22 #1 #1
g17 en How high can administrative fines for GDPR infringements be? 32016R0679:article:83 #1 #1
g18 de Welche technischen Massnahmen sichern die Verarbeitung ab? 32016R0679:article:32 #1 #1
g19 en When can processing be restricted by the data subject? 32016R0679:article:18 #1 #1
g20 en When may personal data be transferred based on an adequacy decision? 32016R0679:article:45 #1 #1
g21 en Which AI practices are prohibited? 32024R1689:article:5 #1 #2
g22 de Welche KI-Praktiken sind verboten? 32024R1689:article:5 #1 #2
g23 en How is an AI system classified as high-risk? 32024R1689:article:6 #2 #1
g24 de Wie ist ein KI-System rechtlich definiert? 32024R1689:article:3 #1 #2
g25 en What transparency obligations apply to chatbots and deepfakes? 32024R1689:article:50 #2 #1
g26 de Welche Pflichten haben Anbieter von KI-Modellen mit allgemeinem Verwendungszweck? 32024R1689:article:53 #1 #1
g27 en What rules apply to general-purpose AI models with systemic risk? 32024R1689:article:55 #9 #4
g28 en What risk management system must high-risk AI systems have? 32024R1689:article:9 #1 #1
g29 de Welche Anforderungen gelten fuer Trainingsdaten und Daten-Governance? 32024R1689:article:10 #2 #1
g30 en What technical documentation is required for high-risk AI? 32024R1689:article:11 #16 #1
g31 de Welche Anforderungen gelten fuer die menschliche Aufsicht? 32024R1689:article:14 #1 #1
g32 en What are the accuracy, robustness and cybersecurity requirements? 32024R1689:article:15 #1 #1
g33 en How is conformity assessment carried out for high-risk AI? 32024R1689:article:43 #1 #2
g34 de Welche Sanktionen und Geldbussen sieht die KI-Verordnung vor? 32024R1689:article:99 #1 #1
g35 en What are AI regulatory sandboxes? 32024R1689:article:57 #1 #3
g36 de Was ist der Anwendungsbereich der KI-Verordnung? 32024R1689:article:2 #4 #1
g37 en What logging and record-keeping must high-risk AI systems provide? 32024R1689:article:12 #2 #1
g38 en What obligations do deployers of high-risk AI systems have? 32024R1689:article:26 #1 #1
g39 de Muessen Hochrisiko-KI-Systeme registriert werden? 32024R1689:article:49 miss
g40 en What post-market monitoring must providers carry out? 32024R1689:article:72 #1 #1
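To recompute the metrics, the rows above first need to be turned into structured records. The parser below is a sketch that assumes the plain-text layout shown here: ID, language, question, target in the form `<CELEX number>:article:<n>`, then either `#pre #post` or the word `miss`. The `GoldRow` type and `parse_row` function are hypothetical names, and reading `miss` as "not found before or after rerank" is an assumption.

```python
import re
from typing import NamedTuple, Optional

ROW = re.compile(
    r"(?P<qid>g\d+) (?P<lang>de|en) (?P<question>.+) "
    r"(?P<target>\d{5}R\d{4}:article:\d+) "
    r"(?:#(?P<pre>\d+) #(?P<post>\d+)|miss)"
)


class GoldRow(NamedTuple):
    qid: str
    lang: str
    question: str
    celex: str            # e.g. 32016R0679 (GDPR) or 32024R1689 (EU AI Act)
    article: int
    pre: Optional[int]    # rank before rerank, None for a miss
    post: Optional[int]   # rank after rerank, None for a miss


def parse_row(line: str) -> GoldRow:
    """Parse one gold-set row; raises ValueError if the layout does not match."""
    m = ROW.fullmatch(line.strip())
    if m is None:
        raise ValueError(f"unrecognised gold-set row: {line!r}")
    celex, _, number = m["target"].split(":")
    pre = int(m["pre"]) if m["pre"] else None
    post = int(m["post"]) if m["post"] else None
    return GoldRow(m["qid"], m["lang"], m["question"], celex, int(number), pre, post)
```

Parsing the g07 row, for instance, yields `pre=11` and `post=1`, while the g39 row yields `None` for both ranks.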