Sovereign RAG · Evaluation

Measured per step, not asserted

Anyone can claim a RAG system “works.” This page is the proof for the live demo — and it isolates each processing step so a weakness can be located, not just noticed. Three pillars: the ingestion that builds the chunks, the retrieval that finds them, and the rerank step measured on its own. The gold set and the method are disclosed in full below.

These numbers are for this demo only (588 chunks of the EU AI Act + GDPR) and are unrelated to retrieval figures quoted for other projects on this site, which were measured on different systems and corpora.

1 · Ingestion & chunking

GOOD

Measured directly on the chunks, with no retrieval — does the pipeline turn two regulations into clean, complete, embedding-friendly units?

100.0%

Structure coverage — every Article & Recital captured

221.5

Avg words / chunk (embedding-friendly band)

0.575

Lexical density (content-word ratio)

Chunk-size distribution across all 588 chunks — the 200–350 word band (accent) is where the embedding model is sharpest.

Regulation	Articles	Recitals	Chunks	Split articles
EU AI Act	113/113	180/180	309	11
GDPR	99/99	173/173	279	7
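The table's figures can be cross-checked with a little arithmetic. The sketch below copies the counts from the table and assumes one chunk per provision before splitting (as the text states), so any chunk beyond that must come from a split article. The names `TABLE`, `structure_coverage` and `extra_chunks` are illustrative, not part of the pipeline.

```python
# Counts copied from the table above: (captured, expected) per provision type.
TABLE = {
    "EU AI Act": {"articles": (113, 113), "recitals": (180, 180),
                  "chunks": 309, "split_articles": 11},
    "GDPR": {"articles": (99, 99), "recitals": (173, 173),
             "chunks": 279, "split_articles": 7},
}


def structure_coverage(table: dict) -> float:
    """Share of expected Articles and Recitals that were captured."""
    kinds = ("articles", "recitals")
    captured = sum(row[k][0] for row in table.values() for k in kinds)
    expected = sum(row[k][1] for row in table.values() for k in kinds)
    return captured / expected


def extra_chunks(row: dict) -> int:
    """Chunks beyond one-per-provision; these can only come from split articles."""
    provisions = row["articles"][0] + row["recitals"][0]
    return row["chunks"] - provisions


coverage = structure_coverage(TABLE)
total_chunks = sum(row["chunks"] for row in TABLE.values())
extras = {name: extra_chunks(row) for name, row in TABLE.items()}
print(f"coverage={coverage:.1%} chunks={total_chunks} extras={extras}")
```

Each split article adds at least one extra chunk, so the extra-chunk count per regulation should never be smaller than the number of split articles.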

Extraction is structural and lossless (one chunk per recital; one per article, long articles split on paragraph/point boundaries), so there is no noise-filter stage — willma's noise_ratio and filter_loss have no analogue here and are N/A. Zero-chunk provisions: 0 · oversize-chunk rate 0.0% · max 837 words.
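Splitting a long article on paragraph boundaries could look roughly like the sketch below. It is not the demo's actual splitter: the `max_words` budget, the blank-line paragraph delimiter and the function name are all assumptions made for illustration, and point-level splitting is left out.

```python
def split_on_paragraphs(text: str, max_words: int = 350) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_words words.

    Paragraphs are never cut, so a single paragraph longer than max_words
    becomes its own (oversize) chunk and no text is lost.
    """
    # Assumption: paragraphs are separated by a blank line.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for paragraph in paragraphs:
        n_words = len(paragraph.split())
        # Close the current chunk if this paragraph would exceed the budget.
        if current and count + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(paragraph)
        count += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because only whole paragraphs are moved, joining the chunks back together yields the original words in order, which is one way to read "lossless" here.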

2 · Retrieval

The gold set (40 questions, 16 DE / 24 EN) is run through the production path: multilingual embedding, Qdrant top-20, rerank to top-10.

97.5%

Recall@5

0.890

MRR

0.911

nDCG@5

Retrieval quality on the production path (all values 0–1, 40 questions).

Slice	n	Recall@5	MRR	nDCG@5	Primary@5	Qual.Recall@5	Diversity@5
Overall	40	97.5%	0.890	0.911	97.5%	97.5%	97.0%
German	16	93.8%	0.875	0.891	93.8%	93.8%	97.5%
English	24	100.0%	0.899	0.925	100.0%	100.0%	96.7%

Negative test — out-of-domain honesty

The negative set has 6 off-topic questions, all of which should be refused. All 6 (100.0%) scored below the cut-off and were rejected. In-corpus questions score 0.698 on average at rank 1 versus 0.196 for off-topic questions, a clean separation margin of 0.501.

Average top-1 rerank score: in-corpus questions sit well above the cut-off, off-topic well below.
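The refusal rule can be sketched as a threshold on the best rerank score. This assumes "rejected below the cut-off" means the top-1 score is strictly below the 0.3 cut-off listed in the methodology. The function names and the sample scores are illustrative only, not measured data.

```python
from statistics import mean

CUTOFF = 0.3  # cut-off value listed in the methodology section


def should_refuse(rerank_scores: list[float], cutoff: float = CUTOFF) -> bool:
    """Refuse when even the best reranked chunk scores below the cut-off."""
    return not rerank_scores or max(rerank_scores) < cutoff


def separation_margin(in_corpus_top1: list[float], off_topic_top1: list[float]) -> float:
    """Difference between the mean top-1 scores of the two question sets."""
    return mean(in_corpus_top1) - mean(off_topic_top1)


# Illustrative top-1 scores (made up for this example, NOT the demo's results).
in_corpus = [0.82, 0.61, 0.74]
off_topic = [0.12, 0.25, 0.19]

refused = [should_refuse([score]) for score in off_topic]
margin = separation_margin(in_corpus, off_topic)
print(refused, round(margin, 3))
```

A wide margin matters because it leaves room for the cut-off: the further apart the two means sit, the less sensitive the refusal rate is to the exact threshold.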

3 · Rerank lift

The same metrics computed on the raw vector order (before rerank) and after rerank. The delta is exactly what the rerank step contributes — isolated, not inferred. It moved 11 questions to rank 1 that the vector search alone ranked lower.

Each metric before and after the rerank step. The line is the lift the reranker adds on top of plain vector search.

Metric	Before rerank	After rerank	Lift
Recall@5	90.0%	97.5%	+7.5 pp
MRR	0.784	0.890	+0.106
nDCG@5	0.809	0.911	+0.102
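The three metrics can be recomputed from the rank histograms. The sketch below assumes exactly one target provision per question with binary relevance, so the ideal DCG is 1 and nDCG reduces to 1 / log2(rank + 1). It also assumes the reciprocal rank is not truncated at the evaluation depth. The histograms are tallied by hand from the Pre→Post column of the gold-set table, and the function names are illustrative.

```python
from math import log2

N_QUESTIONS = 40

# rank -> number of questions whose target landed at that rank.
# The single miss (g39) is simply absent and contributes 0 to every metric.
PRE_RERANK = {1: 27, 2: 7, 3: 1, 4: 1, 9: 1, 11: 1, 16: 1}
POST_RERANK = {1: 33, 2: 4, 3: 1, 4: 1}


def recall_at_k(hist: dict, k: int = 5, n: int = N_QUESTIONS) -> float:
    """Share of questions whose target appears in the top k."""
    return sum(count for rank, count in hist.items() if rank <= k) / n


def mrr(hist: dict, n: int = N_QUESTIONS) -> float:
    """Mean reciprocal rank of the target provision."""
    return sum(count / rank for rank, count in hist.items()) / n


def ndcg_at_k(hist: dict, k: int = 5, n: int = N_QUESTIONS) -> float:
    """nDCG@k with a single relevant item per question (ideal DCG = 1)."""
    return sum(count / log2(rank + 1) for rank, count in hist.items() if rank <= k) / n


for name, metric in (("Recall@5", recall_at_k), ("MRR", mrr), ("nDCG@5", ndcg_at_k)):
    before, after = metric(PRE_RERANK), metric(POST_RERANK)
    print(f"{name}: {before:.3f} -> {after:.3f} (lift {after - before:+.3f})")
```

Under these assumptions the before and after values match the table to three decimals. The nDCG@5 lift comes out as 0.103 from unrounded values, so the table's +0.102 looks like the difference of the two displayed, rounded figures.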

Methodology & the full gold set

Three pillars, each isolating one processing step. Pillar 1 (ingestion) is measured directly on the chunk corpus, with no retrieval. Pillars 2 and 3 run a self-labelled gold set (16 DE / 24 EN) through the production path — multilingual embedding, Qdrant top-20, rerank to top-10 — and score it. Pillar 3 compares the raw vector order against the reranked order, isolating the rerank step's lift. A negative set of out-of-domain questions checks the honesty cut-off. Binary relevance; nothing is tuned on the gold set.

embedding: bge-multilingual-gemma2
rerank: qwen3-embedding-8b
retrieval_top_k: 20 · eval_depth: 10 · cut-off: 0.3
generated: 2026-06-05T08:28:29+00:00
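The production path described above can be sketched as a small function that returns both the raw vector order and the reranked order, which is what Pillar 3 compares. Only the configuration values come from this page. `EvalConfig`, `retrieve` and the three injected callables are hypothetical stand-ins for the embedding model, the Qdrant search and the reranker, not the demo's actual code or any library's API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalConfig:
    embedding: str = "bge-multilingual-gemma2"
    rerank: str = "qwen3-embedding-8b"
    retrieval_top_k: int = 20
    eval_depth: int = 10
    cutoff: float = 0.3


# Hypothetical interfaces for the three stages.
Embed = Callable[[str], Sequence[float]]
Search = Callable[[Sequence[float], int], list[str]]          # chunk ids, best first
Rerank = Callable[[str, list[str]], list[tuple[str, float]]]  # (chunk id, score) pairs


def retrieve(question: str, embed: Embed, search: Search, rerank: Rerank,
             cfg: EvalConfig = EvalConfig()):
    """Return (raw vector order, reranked top results) for one question."""
    vector = embed(question)
    candidates = search(vector, cfg.retrieval_top_k)   # before rerank
    scored = sorted(rerank(question, candidates),
                    key=lambda pair: pair[1], reverse=True)
    return candidates, scored[: cfg.eval_depth]        # after rerank
```

Returning both orders from the same call keeps the before and after measurements on identical candidates, so any metric difference is attributable to the rerank step alone.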

All 40 questions, with the rank of the target provision before → after rerank. Nothing is hidden, including the single miss (g39).

#	Lang	Question	Target	Pre→Post
g01	en	What is the right to erasure?	32016R0679:article:17	#1 → #1
g02	de	Was umfasst das Auskunftsrecht der betroffenen Person?	32016R0679:article:15	#1 → #1
g03	en	Do I have a right to data portability?	32016R0679:article:20	#1 → #1
g04	de	Wie kann ich die Berichtigung falscher Daten verlangen?	32016R0679:article:16	#1 → #1
g05	en	On what legal bases may personal data be processed lawfully?	32016R0679:article:6	#2 → #1
g06	de	Welche Bedingungen gelten fuer eine wirksame Einwilligung?	32016R0679:article:7	#1 → #1
g07	en	How are special categories of sensitive personal data protected?	32016R0679:article:9	#11 → #1
g08	de	Ab welchem Alter koennen Kinder selbst in die Datenverarbeitung einwilligen?	32016R0679:article:8	#1 → #1
g09	en	When must a personal data breach be reported to the supervisory authority?	32016R0679:article:33	#1 → #1
g10	de	Wann muss eine Datenpanne den betroffenen Personen mitgeteilt werden?	32016R0679:article:34	#2 → #1
g11	en	When is a data protection impact assessment required?	32016R0679:article:35	#1 → #1
g12	de	Wann muss ein Datenschutzbeauftragter benannt werden?	32016R0679:article:37	#1 → #1
g13	en	What records of processing activities must a controller keep?	32016R0679:article:30	#1 → #1
g14	en	What does data protection by design and by default require?	32016R0679:article:25	#2 → #1
g15	de	Wie kann ich der Verarbeitung meiner Daten widersprechen?	32016R0679:article:21	#3 → #1
g16	en	Are decisions based solely on automated processing and profiling allowed?	32016R0679:article:22	#1 → #1
g17	en	How high can administrative fines for GDPR infringements be?	32016R0679:article:83	#1 → #1
g18	de	Welche technischen Massnahmen sichern die Verarbeitung ab?	32016R0679:article:32	#1 → #1
g19	en	When can processing be restricted by the data subject?	32016R0679:article:18	#1 → #1
g20	en	When may personal data be transferred based on an adequacy decision?	32016R0679:article:45	#1 → #1
g21	en	Which AI practices are prohibited?	32024R1689:article:5	#1 → #2
g22	de	Welche KI-Praktiken sind verboten?	32024R1689:article:5	#1 → #2
g23	en	How is an AI system classified as high-risk?	32024R1689:article:6	#2 → #1
g24	de	Wie ist ein KI-System rechtlich definiert?	32024R1689:article:3	#1 → #2
g25	en	What transparency obligations apply to chatbots and deepfakes?	32024R1689:article:50	#2 → #1
g26	de	Welche Pflichten haben Anbieter von KI-Modellen mit allgemeinem Verwendungszweck?	32024R1689:article:53	#1 → #1
g27	en	What rules apply to general-purpose AI models with systemic risk?	32024R1689:article:55	#9 → #4
g28	en	What risk management system must high-risk AI systems have?	32024R1689:article:9	#1 → #1
g29	de	Welche Anforderungen gelten fuer Trainingsdaten und Daten-Governance?	32024R1689:article:10	#2 → #1
g30	en	What technical documentation is required for high-risk AI?	32024R1689:article:11	#16 → #1
g31	de	Welche Anforderungen gelten fuer die menschliche Aufsicht?	32024R1689:article:14	#1 → #1
g32	en	What are the accuracy, robustness and cybersecurity requirements?	32024R1689:article:15	#1 → #1
g33	en	How is conformity assessment carried out for high-risk AI?	32024R1689:article:43	#1 → #2
g34	de	Welche Sanktionen und Geldbussen sieht die KI-Verordnung vor?	32024R1689:article:99	#1 → #1
g35	en	What are AI regulatory sandboxes?	32024R1689:article:57	#1 → #3
g36	de	Was ist der Anwendungsbereich der KI-Verordnung?	32024R1689:article:2	#4 → #1
g37	en	What logging and record-keeping must high-risk AI systems provide?	32024R1689:article:12	#2 → #1
g38	en	What obligations do deployers of high-risk AI systems have?	32024R1689:article:26	#1 → #1
g39	de	Muessen Hochrisiko-KI-Systeme registriert werden?	32024R1689:article:49	— → miss
g40	en	What post-market monitoring must providers carry out?	32024R1689:article:72	#1 → #1
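The rank movements in the table can be tallied mechanically. The sketch below transcribes the Pre→Post column as (before, after) pairs in question order and classifies each one. The category names are illustrative, and a question missing from both lists is recorded as `(None, None)`.

```python
from collections import Counter

# (rank before rerank, rank after rerank) for g01..g40; None marks the miss (g39).
PAIRS = [
    (1, 1), (1, 1), (1, 1), (1, 1), (2, 1), (1, 1), (11, 1), (1, 1), (1, 1), (2, 1),       # g01-g10
    (1, 1), (1, 1), (1, 1), (2, 1), (3, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1),        # g11-g20
    (1, 2), (1, 2), (2, 1), (1, 2), (2, 1), (1, 1), (9, 4), (1, 1), (2, 1), (16, 1),       # g21-g30
    (1, 1), (1, 1), (1, 2), (1, 1), (1, 3), (4, 1), (2, 1), (1, 1), (None, None), (1, 1),  # g31-g40
]


def classify(pre, post) -> str:
    """Label how the rerank step moved the target provision."""
    if pre is None or post is None:
        return "miss"
    if post < pre:
        return "promoted_to_1" if post == 1 else "promoted"
    if post > pre:
        return "demoted"
    return "unchanged"


tally = Counter(classify(pre, post) for pre, post in PAIRS)
print(dict(tally))
```

The tally gives 11 promotions to rank 1, matching the figure quoted in the rerank section, alongside 5 demotions from rank 1 (all still within the top 3), 1 partial promotion (g27) and 22 targets that stayed at rank 1.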
