Sovereign RAG · Evaluation
Measured per step, not asserted
Anyone can claim a RAG system “works.” This page is the proof for the live demo — and it isolates each processing step so a weakness can be located, not just noticed. Three pillars: the ingestion that builds the chunks, the retrieval that finds them, and the rerank step measured on its own. The gold set and the method are disclosed in full below.
These numbers are for this demo only (588 chunks of the EU AI Act + GDPR) and are unrelated to retrieval figures quoted for other projects on this site, which were measured on different systems and corpora.
1 · Ingestion & chunking
GOOD · Measured directly on the chunks, with no retrieval: does the pipeline turn two regulations into clean, complete, embedding-friendly units?
| Regulation | Articles | Recitals | Chunks | Split articles |
|---|---|---|---|---|
| EU AI Act | 113/113 | 180/180 | 309 | 11 |
| GDPR | 99/99 | 173/173 | 279 | 7 |
Extraction is structural and lossless (one chunk per recital; one per article, long articles split on paragraph/point boundaries), so there is no noise-filter stage — willma's noise_ratio and filter_loss have no analogue here and are N/A. Zero-chunk provisions: 0 · oversize-chunk rate 0.0% · max 837 words.
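The checks in this pillar can be sketched as a small audit over the chunk corpus. This is a minimal illustration, not the demo's actual code: the `Chunk` record, the `audit` function and the `max_words` threshold are hypothetical names, and the page does not state which word limit defines an "oversize" chunk. Only the provision-id shape (for example `32016R0679:article:17`) is taken from the gold-set table.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    provision_id: str  # e.g. "32016R0679:article:17" (id shape as in the gold-set table)
    text: str


def audit(chunks, expected_ids, max_words):
    """Coverage and size checks on a chunk corpus; no retrieval involved.

    max_words is an assumed oversize threshold, not a value from the page.
    """
    per_provision = Counter(c.provision_id for c in chunks)
    lengths = [len(c.text.split()) for c in chunks]
    oversize = sum(1 for n in lengths if n > max_words)
    return {
        "chunks": len(chunks),
        # Provisions that should exist but produced no chunk at all.
        "zero_chunk_provisions": [p for p in expected_ids if per_provision[p] == 0],
        # Provisions that were split into more than one chunk.
        "split_provisions": sorted(p for p, n in per_provision.items() if n > 1),
        "oversize_rate": oversize / len(chunks) if chunks else 0.0,
        "longest_chunk_words": max(lengths, default=0),
    }


# Toy data, purely illustrative.
toy_expected = ["x:article:1", "x:article:2", "x:recital:1"]
toy_chunks = [
    Chunk("x:article:1", "a b c"),
    Chunk("x:article:1", "d e"),
    Chunk("x:recital:1", "f g h i"),
]
report = audit(toy_chunks, toy_expected, max_words=3)
```

On the toy data the report flags `x:article:2` as a zero-chunk provision and `x:article:1` as split, which are the two conditions the table and the summary line above report on.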
2 · Retrieval
The gold set (40 questions, 16 DE / 24 EN) is run through the production path: multilingual embedding, Qdrant top-20, rerank to top-10.
| Slice | n | Recall@5 | MRR | nDCG@5 | Primary@5 | Qual.Recall@5 | Diversity@5 |
|---|---|---|---|---|---|---|---|
| Overall | 40 | 97.5% | 0.890 | 0.911 | 97.5% | 97.5% | 97.0% |
| German | 16 | 93.8% | 0.875 | 0.891 | 93.8% | 93.8% | 97.5% |
| English | 24 | 100.0% | 0.899 | 0.925 | 100.0% | 100.0% | 96.7% |
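The page does not spell out its formulas, so the following is a sketch under stated assumptions: binary relevance with a single target provision per question, and the textbook definitions of Recall@k, MRR and nDCG@k. Applied to the post-rerank ranks listed in the gold-set table, these definitions reproduce the Overall row for Recall@5, MRR and nDCG@5. Primary@5, Qual.Recall@5 and Diversity@5 are not defined on the page and are not attempted here.

```python
import math


def recall_at_k(ranks, k=5):
    """Share of questions whose target sits at rank <= k (None = miss)."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mrr(ranks):
    """Mean reciprocal rank; a miss contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


def ndcg_at_k(ranks, k=5):
    """nDCG with one relevant item per question: the ideal DCG is 1,
    so each question contributes 1 / log2(1 + rank) if rank <= k, else 0."""
    gains = (1.0 / math.log2(1 + r) for r in ranks if r is not None and r <= k)
    return sum(gains) / len(ranks)


# Post-rerank ranks for g01..g40, copied from the gold-set table (None = miss).
post = [1] * 20 + [2, 2, 1, 2, 1, 1, 4, 1, 1, 1, 1, 1, 2, 1, 3, 1, 1, 1, None, 1]

print(f"{recall_at_k(post):.1%} {mrr(post):.3f} {ndcg_at_k(post):.3f}")
# prints: 97.5% 0.890 0.911
```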
Negative test — out-of-domain honesty
6 off-topic questions, all of which should be refused. All 6 (100.0%) scored below the 0.3 cut-off and were refused. Positive questions score on average 0.698 at rank 1 versus 0.196 for off-topic questions, a clean separation margin of 0.501.
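A minimal sketch of how such a cut-off check might look. It assumes the refusal decision depends only on the rank-1 score and uses the 0.3 cut-off listed in the configuration; whether that score comes from the vector search or from the reranker is not stated on the page. The function names and the toy scores are hypothetical.

```python
CUTOFF = 0.3  # cut-off value listed in the page's configuration


def should_refuse(top_score, cutoff=CUTOFF):
    """Refuse to answer when the best candidate scores below the cut-off."""
    return top_score < cutoff


def rejection_rate(top_scores, cutoff=CUTOFF):
    """Share of questions that would be refused."""
    return sum(1 for s in top_scores if should_refuse(s, cutoff)) / len(top_scores)


def separation_margin(positive_top_scores, negative_top_scores):
    """Mean rank-1 score of in-domain questions minus that of off-topic ones."""
    def mean(xs):
        return sum(xs) / len(xs)
    return mean(positive_top_scores) - mean(negative_top_scores)


# Toy rank-1 scores, purely illustrative (not the demo's measurements).
toy_positive = [0.75, 0.5, 0.625]
toy_negative = [0.25, 0.125]

toy_rate = rejection_rate(toy_negative)
toy_margin = separation_margin(toy_positive, toy_negative)
```

A wide margin matters because it leaves room to place the cut-off between the two groups without refusing legitimate questions.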
3 · Rerank lift
The same metrics computed on the raw vector order (before rerank) and after rerank. The delta is exactly what the rerank step contributes: isolated, not inferred. Reranking moved 11 questions to rank 1 that the vector search alone ranked lower.
| Metric | Before rerank | After rerank | Lift |
|---|---|---|---|
| Recall@5 | 90.0% | 97.5% | +7.5 pp |
| MRR | 0.784 | 0.890 | +0.106 |
| nDCG@5 | 0.809 | 0.911 | +0.102 |
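The lift can be recomputed from the per-question ranks in the gold-set table. The sketch below assumes the same standard definitions of Recall@5 and MRR (one target per question, a miss counts as 0) and works in fractions rather than percentages; the variable names are illustrative.

```python
def recall_at_k(ranks, k=5):
    """Share of questions whose target sits at rank <= k (None = miss)."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mrr(ranks):
    """Mean reciprocal rank; a miss contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Ranks for g01..g40 before and after rerank, copied from the gold-set table.
pre = [1, 1, 1, 1, 2, 1, 11, 1, 1, 2, 1, 1, 1, 2, 3, 1, 1, 1, 1, 1,
       1, 1, 2, 1, 2, 1, 9, 1, 2, 16, 1, 1, 1, 1, 1, 4, 2, 1, None, 1]
post = [1] * 20 + [2, 2, 1, 2, 1, 1, 4, 1, 1, 1, 1, 1, 2, 1, 3, 1, 1, 1, None, 1]

# Lift = metric on the reranked order minus the same metric on the raw order.
lift = {
    "recall@5": recall_at_k(post) - recall_at_k(pre),
    "mrr": mrr(post) - mrr(pre),
}

# Questions the reranker moved to rank 1, and questions it moved off rank 1.
promoted = sum(1 for a, b in zip(pre, post) if b == 1 and a != 1)
demoted = sum(1 for a, b in zip(pre, post) if a == 1 and b != 1)
```

On the table's ranks this gives 11 promotions to rank 1 and 5 demotions from rank 1, so the net lift in the table is the balance of both effects.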
Methodology & the full gold set
Three pillars, each isolating one processing step. Pillar 1 (ingestion) is measured directly on the chunk corpus, with no retrieval. Pillars 2 and 3 run a self-labelled gold set (16 DE / 24 EN) through the production path — multilingual embedding, Qdrant top-20, rerank to top-10 — and score it. Pillar 3 compares the raw vector order against the reranked order, isolating the rerank step's lift. A negative set of out-of-domain questions checks the honesty cut-off. Binary relevance; nothing is tuned on the gold set.
embedding: bge-multilingual-gemma2
rerank: qwen3-embedding-8b
retrieval_top_k: 20 · eval_depth: 10 · cut-off: 0.3
generated: 2026-06-05T08:28:29+00:00
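One way to capture the before and after ranks for a single question is sketched below. The `search` and `rerank` callables are hypothetical stand-ins for the vector search and the rerank step (no real client API is implied), and the sketch assumes candidates are identified by the provision id they belong to. The defaults mirror `retrieval_top_k: 20` and `eval_depth: 10` from the configuration above.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def rank_of(target: str, ids: Sequence[str]) -> Optional[int]:
    """1-based rank of the first candidate matching the target, or None."""
    for i, cid in enumerate(ids, start=1):
        if cid == target:
            return i
    return None


def evaluate_question(
    question: str,
    target: str,
    search: Callable[[str, int], List[str]],        # hypothetical: query, k -> ids
    rerank: Callable[[str, List[str]], List[str]],  # hypothetical: query, ids -> ids
    top_k: int = 20,
    depth: int = 10,
) -> Tuple[Optional[int], Optional[int]]:
    """Return (rank before rerank, rank after rerank) for one gold question."""
    candidates = search(question, top_k)             # raw vector order
    reranked = rerank(question, candidates)[:depth]  # reranked, cut to eval depth
    return rank_of(target, candidates), rank_of(target, reranked)


# Stub components for illustration only.
def stub_search(question, k):
    return ["a", "b", "c"][:k]


def stub_rerank(question, ids):
    return list(reversed(ids))


stub_result = evaluate_question("q", "c", stub_search, stub_rerank)
```

Scoring both orders from the same candidate list is what keeps the comparison isolated: the only thing that differs between the two ranks is the rerank step.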
All 40 questions, with the rank of the target provision before → after rerank. Nothing is hidden, including the one miss (g39).
| # | Lang | Question | Target | Pre→Post |
|---|---|---|---|---|
| g01 | en | What is the right to erasure? | 32016R0679:article:17 | #1 → #1 |
| g02 | de | Was umfasst das Auskunftsrecht der betroffenen Person? | 32016R0679:article:15 | #1 → #1 |
| g03 | en | Do I have a right to data portability? | 32016R0679:article:20 | #1 → #1 |
| g04 | de | Wie kann ich die Berichtigung falscher Daten verlangen? | 32016R0679:article:16 | #1 → #1 |
| g05 | en | On what legal bases may personal data be processed lawfully? | 32016R0679:article:6 | #2 → #1 |
| g06 | de | Welche Bedingungen gelten fuer eine wirksame Einwilligung? | 32016R0679:article:7 | #1 → #1 |
| g07 | en | How are special categories of sensitive personal data protected? | 32016R0679:article:9 | #11 → #1 |
| g08 | de | Ab welchem Alter koennen Kinder selbst in die Datenverarbeitung einwilligen? | 32016R0679:article:8 | #1 → #1 |
| g09 | en | When must a personal data breach be reported to the supervisory authority? | 32016R0679:article:33 | #1 → #1 |
| g10 | de | Wann muss eine Datenpanne den betroffenen Personen mitgeteilt werden? | 32016R0679:article:34 | #2 → #1 |
| g11 | en | When is a data protection impact assessment required? | 32016R0679:article:35 | #1 → #1 |
| g12 | de | Wann muss ein Datenschutzbeauftragter benannt werden? | 32016R0679:article:37 | #1 → #1 |
| g13 | en | What records of processing activities must a controller keep? | 32016R0679:article:30 | #1 → #1 |
| g14 | en | What does data protection by design and by default require? | 32016R0679:article:25 | #2 → #1 |
| g15 | de | Wie kann ich der Verarbeitung meiner Daten widersprechen? | 32016R0679:article:21 | #3 → #1 |
| g16 | en | Are decisions based solely on automated processing and profiling allowed? | 32016R0679:article:22 | #1 → #1 |
| g17 | en | How high can administrative fines for GDPR infringements be? | 32016R0679:article:83 | #1 → #1 |
| g18 | de | Welche technischen Massnahmen sichern die Verarbeitung ab? | 32016R0679:article:32 | #1 → #1 |
| g19 | en | When can processing be restricted by the data subject? | 32016R0679:article:18 | #1 → #1 |
| g20 | en | When may personal data be transferred based on an adequacy decision? | 32016R0679:article:45 | #1 → #1 |
| g21 | en | Which AI practices are prohibited? | 32024R1689:article:5 | #1 → #2 |
| g22 | de | Welche KI-Praktiken sind verboten? | 32024R1689:article:5 | #1 → #2 |
| g23 | en | How is an AI system classified as high-risk? | 32024R1689:article:6 | #2 → #1 |
| g24 | de | Wie ist ein KI-System rechtlich definiert? | 32024R1689:article:3 | #1 → #2 |
| g25 | en | What transparency obligations apply to chatbots and deepfakes? | 32024R1689:article:50 | #2 → #1 |
| g26 | de | Welche Pflichten haben Anbieter von KI-Modellen mit allgemeinem Verwendungszweck? | 32024R1689:article:53 | #1 → #1 |
| g27 | en | What rules apply to general-purpose AI models with systemic risk? | 32024R1689:article:55 | #9 → #4 |
| g28 | en | What risk management system must high-risk AI systems have? | 32024R1689:article:9 | #1 → #1 |
| g29 | de | Welche Anforderungen gelten fuer Trainingsdaten und Daten-Governance? | 32024R1689:article:10 | #2 → #1 |
| g30 | en | What technical documentation is required for high-risk AI? | 32024R1689:article:11 | #16 → #1 |
| g31 | de | Welche Anforderungen gelten fuer die menschliche Aufsicht? | 32024R1689:article:14 | #1 → #1 |
| g32 | en | What are the accuracy, robustness and cybersecurity requirements? | 32024R1689:article:15 | #1 → #1 |
| g33 | en | How is conformity assessment carried out for high-risk AI? | 32024R1689:article:43 | #1 → #2 |
| g34 | de | Welche Sanktionen und Geldbussen sieht die KI-Verordnung vor? | 32024R1689:article:99 | #1 → #1 |
| g35 | en | What are AI regulatory sandboxes? | 32024R1689:article:57 | #1 → #3 |
| g36 | de | Was ist der Anwendungsbereich der KI-Verordnung? | 32024R1689:article:2 | #4 → #1 |
| g37 | en | What logging and record-keeping must high-risk AI systems provide? | 32024R1689:article:12 | #2 → #1 |
| g38 | en | What obligations do deployers of high-risk AI systems have? | 32024R1689:article:26 | #1 → #1 |
| g39 | de | Muessen Hochrisiko-KI-Systeme registriert werden? | 32024R1689:article:49 | — → miss |
| g40 | en | What post-market monitoring must providers carry out? | 32024R1689:article:72 | #1 → #1 |
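For readers who want to recompute the slices themselves, the rows above can be parsed into ranks and grouped by language. This is a sketch that assumes the row layout shown in the table (five pipe-separated cells, no pipe character inside the question text); the helper names are illustrative.

```python
import re
from collections import defaultdict


def parse_rank(cell):
    """'#3' -> 3; anything else (a dash or 'miss') -> None."""
    m = re.fullmatch(r"#(\d+)", cell.strip())
    return int(m.group(1)) if m else None


def parse_row(line):
    """Split one table row into its fields and parse the Pre→Post cell."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    qid, lang, question, target, ranks = cells
    pre, post = (parse_rank(c) for c in ranks.split("→"))
    return {"id": qid, "lang": lang, "question": question,
            "target": target, "pre": pre, "post": post}


# Three rows copied verbatim from the table above.
sample = [
    "| g07 | en | How are special categories of sensitive personal data protected? | 32016R0679:article:9 | #11 → #1 |",
    "| g35 | en | What are AI regulatory sandboxes? | 32024R1689:article:57 | #1 → #3 |",
    "| g39 | de | Muessen Hochrisiko-KI-Systeme registriert werden? | 32024R1689:article:49 | — → miss |",
]
rows = [parse_row(line) for line in sample]

# Group post-rerank ranks by language, ready for per-slice metrics.
post_by_lang = defaultdict(list)
for row in rows:
    post_by_lang[row["lang"]].append(row["post"])
```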