The weekly RAG scorecard to run before blaming the frontier model.
A product squad blamed GPT until telemetry showed retrieval missed the paragraph half the time — hallucination followed retrieval failure.
This scorecard complements Nuqta's five RAG metrics to check before blaming an LLM with a single Monday grid, so operators cannot hide behind anecdotes [6].
What belongs on the worksheet.
Columns: Recall@k (is the gold chunk present in the top k), citation hit-rate on human-reviewed answers, latency p95 to first streamed token, and drift rate when assistants answer without citing corpus anchors [4]. Pair the worksheet with the hybrid lexical + vector retrieval refresher for the underlying retrieval architecture.
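Concretely, all four columns can be computed from one weekly run over the frozen query batch. A minimal sketch in Python, assuming each answer is exported as a JSON row carrying the gold chunk ID, retrieved chunk IDs, a reviewer verdict, time to first token, and an anchor-citation flag; the file and field names are illustrative assumptions, not Nuqta's tooling:

```python
# Weekly scorecard sketch (assumed schema and file names).
import json
import statistics


def recall_at_k(rows, k=5):
    """Fraction of queries whose gold chunk appears in the top-k retrieved chunks."""
    hits = sum(1 for r in rows if r["gold_chunk_id"] in r["retrieved_chunk_ids"][:k])
    return hits / len(rows)


def citation_hit_rate(rows):
    """Fraction of human-reviewed answers whose citations were judged correct."""
    reviewed = [r for r in rows if r.get("reviewer_verdict") is not None]
    if not reviewed:
        return 0.0
    return sum(1 for r in reviewed if r["reviewer_verdict"] == "correct") / len(reviewed)


def latency_p95_ms(rows):
    """95th-percentile time to first streamed token, in milliseconds."""
    return statistics.quantiles([r["first_token_ms"] for r in rows], n=20)[18]


def drift_rate(rows):
    """Fraction of answers produced without citing any corpus anchor."""
    return sum(1 for r in rows if not r["cited_anchor"]) / len(rows)


if __name__ == "__main__":
    with open("weekly_run.json") as f:  # this week's run over the frozen queries
        rows = json.load(f)
    print({
        "recall@5": round(recall_at_k(rows), 3),
        "citation_hit_rate": round(citation_hit_rate(rows), 3),
        "latency_p95_ms": round(latency_p95_ms(rows)),
        "drift_rate": round(drift_rate(rows), 3),
    })
```

Four functions map to the four columns; anything that cannot be derived from this run does not belong on the grid.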
Rules that keep dashboards honest.
- Freeze fifty user pain queries for a minimum of thirty days before swapping embedding families [6].
- Snapshot metrics before and after each release on identical queries; if regressions originate upstream of the model, reroute incident energy there (see the comparison sketch after this list) [6].
- Lock corpus + query embedding revisions while measuring — otherwise false motion appears as progress.
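The snapshot rule is mechanical enough to automate. A minimal comparison sketch, assuming each snapshot is a JSON file carrying query IDs, corpus and embedding revision tags, and the headline metrics; the file names, field names, and 2% tolerance are assumptions, not the scorecard bundle's format:

```python
# Before/after snapshot comparison sketch (assumed file and field names).
import json


def load_snapshot(path):
    with open(path) as f:
        return json.load(f)


def compare(before_path, after_path, tolerance=0.02):
    before, after = load_snapshot(before_path), load_snapshot(after_path)

    # Guard rails: identical query batch, locked corpus and embedding revisions.
    assert before["query_ids"] == after["query_ids"], "query batch changed mid-measurement"
    assert before["corpus_rev"] == after["corpus_rev"], "corpus revision not locked"
    assert before["embedding_rev"] == after["embedding_rev"], "embedding revision not locked"

    regressions = {}
    for metric in ("recall_at_5", "citation_hit_rate"):
        delta = after["metrics"][metric] - before["metrics"][metric]
        if delta < -tolerance:
            regressions[metric] = delta
    return regressions


if __name__ == "__main__":
    drops = compare("snapshot_before.json", "snapshot_after.json")
    if drops:
        # Retrieval regressed upstream of the model: reroute incident energy there first.
        print("Upstream regression:", drops)
    else:
        print("No retrieval regression beyond tolerance.")
```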
Four numbers suffice when the query batch is sacred; thirty vanity metrics waste the room.
Caveat.
LLM versioning shifts outputs even when the corpus is static; annotate provider release IDs on every Monday row [6].
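One way to make that annotation unavoidable is to bake it into the row schema. A sketch of a single Monday row, with assumed field names and placeholder values:

```python
# One Monday scorecard row (assumed field names, placeholder values).
from dataclasses import dataclass, asdict
from datetime import date


@dataclass
class ScorecardRow:
    week_of: str          # the Monday this row was populated
    model_release: str    # provider release ID for the LLM behind the assistant
    corpus_rev: str       # corpus snapshot identifier
    embedding_rev: str    # embedding model / index revision
    recall_at_5: float
    citation_hit_rate: float
    latency_p95_ms: int
    drift_rate: float


row = ScorecardRow(
    week_of=date.today().isoformat(),
    model_release="provider-model-2026-06-01",  # placeholder, not a real release ID
    corpus_rev="corpus-v14",
    embedding_rev="embed-v3",
    recall_at_5=0.82,
    citation_hit_rate=0.91,
    latency_p95_ms=1840,
    drift_rate=0.06,
)
print(asdict(row))
```

A row missing any of the three revision fields is a row that cannot answer "what changed?" later.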
The invitation.
Before noon next Monday, circulate the worksheet populated with the fifty locked queries; delaying that ritual signals political fear of measurement.
Frequently asked questions.
- Does this replace the prior article? No; it adds operational cadence atop the conceptual metrics [5].
- Time cost? A half-day setup in the first week, then fifteen-minute weekly rotations afterward [6].
- Arabic nuance? Measure on your own corpora, not generic leaderboard corpora.
- Blame attribution? Recall failures route to ingestion; drift routes to prompting or grounding policy (see the routing sketch after this list) [2][6].
- Team size? A PM, an infra engineer, and a taxonomy steward suffice for pilots under two hundred WAU.
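The routing sketch referenced in the blame-attribution answer above. The recall-to-ingestion and drift-to-prompting mappings follow that answer; the citation and latency owners are assumptions added for completeness:

```python
# Triage sketch: map a regressed metric to the workstream that owns the fix.
def route_regression(metric: str) -> str:
    owners = {
        "recall_at_5": "ingestion / chunking / index build",
        "citation_hit_rate": "grounding policy / answer formatting",  # assumption
        "drift_rate": "prompting / grounding policy",
        "latency_p95_ms": "serving / infrastructure",                 # assumption
    }
    return owners.get(metric, "triage meeting")


print(route_regression("recall_at_5"))  # -> ingestion / chunking / index build
print(route_regression("drift_rate"))   # -> prompting / grounding policy
```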
Sources.
[1] Karpukhin et al. — Dense Passage Retrieval for Open-Domain Question Answering.
[2] NIST — AI RMF governance hooks.
[3] Elasticsearch / Pinecone docs — fused retrieval references (vendor-neutral concept).
[4] Nuqta — frozen Arabic embedding QA protocol, April 2026.
[5] Nuqta — client scorecard appendix referenced here for governance boards.
[6] Nuqta — operational RAG scorecard bundle, June 2026.
Related posts
- Five RAG metrics to check before you blame the LLM.
Before you raise model spend or switch vendors, measure retrieval, chunks, and escalation. Most production hallucination starts in documents and indexes — not parameter count.
- What is RAG — and why your company bot answers like a stranger.
A practical guide to Retrieval-Augmented Generation: how your bot reads documents before answering, and why it costs 10× less than fine-tuning.
- Hybrid search — combining lexical and vector retrieval.
This is not a vendor badge. It is an architecture decision: when token overlap saves you, when embedding similarity saves you, and how to fuse both without doubling cost while having nothing to measure.
- Hallucinated citations — auditing RAG source links before you trust the UI.
The UI shows a "source" while the paragraph is missing, truncated, or from the wrong page. This article gives a practical audit path before you ship the assistant to staff or customers.
- Enterprise AI agents vs a RAG-first pipeline — when orchestration is theater.
Most "agents" in production are solid retrieval + a few tools + policies — not a self-driving orchestrator making unsupervised decisions. This article gives a blunt product decision before you multiply complexity.