Hernán Pérez Rodal · Engineering · 3 min read

RAG over 10+ databases: what production taught us

Why vector-only RAG doesn't scale in compliance, how we designed hybrid retrieval across multiple stores, and the architectural decisions that worked in production.

TL;DR — In 2026 the consensus is clear: vector-only RAG doesn’t scale in production. At Darwin we built an agentic compliance system with hybrid retrieval across 10+ databases. The best decisions we made had nothing to do with the model — they were about the data layer.

The problem

Our agentic compliance system has to answer questions like:

“Which mango lots from producer X, processed between May and July, met the CTEs required by FSMA 204, and which ones have evidence gaps?”

A single question that combines:

  • Regulatory knowledge (FSMA 204 CTE/KDE definitions — structured text)
  • Traceability events (CTEs recorded with timestamp, geolocation, supplier, lot)
  • Internal business rules (gap analysis, risk scoring — dynamic logic)
  • Relationships between entities (producer → plant → shipment → retailer)

A single vector store with embeddings of everything mixed together can’t answer that well. The right answer requires joining structured data + semantic retrieval + aggregate computation.
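To make that concrete, here is a rough sketch of what answering that question actually involves. The helpers (`sql_lots`, `search_regulation`, `score_gaps`) are hypothetical placeholders for the retrievals described below, not code from our stack:

```python
# Hypothetical sketch: the answer is assembled from three different kinds of
# retrieval, none of which a single embedding search can deliver on its own.
def answer_lot_compliance(producer_id: str, start: str, end: str) -> dict:
    # Structured query (exact filters, joins): which lots fall in the window
    lots = sql_lots(producer_id=producer_id, processed_between=(start, end))

    # Semantic retrieval: which CTEs/KDEs FSMA 204 requires for those events
    required = search_regulation("FSMA 204 CTE/KDE requirements for fresh mango")

    # Aggregate computation: compare recorded events against requirements
    gaps = score_gaps(lots, required)

    return {"lots": lots, "required_ctes": required, "evidence_gaps": gaps}
```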

The architecture

Our retrieval stack:

| Source | Type | Usage |
| --- | --- | --- |
| Qdrant | Vector store | Regulations, doctrine, historical cases (unstructured data) |
| PostgreSQL | Relational | Traceability events (CTE/KDE with timestamps, IDs, geo) |
| Firebase/Firestore | Document | Per-customer config, UI state |
| Cloud Storage | Blob | Original PDFs, audit trails, digital evidence |
| On-chain (Polygon) | Immutable | Critical attestations, digital signatures |

The agent doesn’t know where each thing lives — the orchestrator resolves it.
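A minimal sketch of that resolution layer (the names and structure are illustrative, not our production code):

```python
# Illustrative only: the agent asks for a *kind* of data; the orchestrator maps
# that to the store that owns it. Function bodies are stubbed.
from typing import Callable

def search_regulations(q: str) -> list[dict]: ...   # Qdrant: semantic search
def query_events(q: str) -> list[dict]: ...         # PostgreSQL: filters, joins, aggregates
def load_customer_config(q: str) -> list[dict]: ... # Firestore: per-customer config
def fetch_evidence(q: str) -> list[dict]: ...       # Cloud Storage / on-chain proofs

ROUTES: dict[str, Callable[[str], list[dict]]] = {
    "regulation": search_regulations,
    "events": query_events,
    "config": load_customer_config,
    "evidence": fetch_evidence,
}

def resolve(kind: str, query: str) -> list[dict]:
    # The agent only names what it needs; the registry picks the store.
    return ROUTES[kind](query)
```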

LangGraph as orchestrator

We use LangGraph to route queries in multiple steps:

  1. Classify — what kind of question it is (regulatory / operational / mixed)
  2. Plan — which retrievals are needed (vector + SQL + graph traversal)
  3. Fan-out — execute retrievals in parallel
  4. Synthesize — pass results to the LLM with structured context
  5. Validate — guardrails to prevent hallucinations on numeric data

The key step was #2: giving the LLM a query planner that decides the retrieval strategy before anything is fetched. Without it, the model hallucinates data or pulls in irrelevant context.
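In LangGraph terms, the skeleton looks roughly like this. The state schema and node bodies are simplified stand-ins, not our production graph:

```python
# Simplified LangGraph skeleton of the five steps. Node bodies are stubbed.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class AgentState(TypedDict, total=False):
    question: str
    question_type: str   # "regulatory" | "operational" | "mixed"
    plan: list[str]      # which retrievals to run, e.g. ["vector", "sql"]
    retrieved: dict      # results keyed by retrieval kind
    answer: str

def classify(state: AgentState) -> dict: ...
def plan(state: AgentState) -> dict: ...        # the query planner: decides the strategy
def fan_out(state: AgentState) -> dict: ...     # runs the planned retrievals in parallel
def synthesize(state: AgentState) -> dict: ...
def validate(state: AgentState) -> dict: ...    # numeric guardrails

graph = StateGraph(AgentState)
for name, node in [("classify", classify), ("plan", plan), ("fan_out", fan_out),
                   ("synthesize", synthesize), ("validate", validate)]:
    graph.add_node(name, node)

graph.add_edge(START, "classify")
graph.add_edge("classify", "plan")
graph.add_edge("plan", "fan_out")
graph.add_edge("fan_out", "synthesize")
graph.add_edge("synthesize", "validate")
graph.add_edge("validate", END)

app = graph.compile()
```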

What didn’t work

Vector-only with aggressive chunking — our first attempt. It failed on two fronts:

  • Counting / aggregation queries (how many lots? weekly average?) — the LLM made up numbers when they weren’t explicitly in context
  • Relational joins (producer X + time window Y + certification Z) — impossible without a structured query

The fix wasn’t “better chunking” — it was separating semantic retrieval from structured queries.

What did work

Explicit query routing → the planner decides whether a question requires vector search, SQL, graph traversal, or a mix.
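Here is the rough shape of the plan we ask the model to emit before any retrieval runs (field names are illustrative):

```python
# Illustrative plan schema the planner LLM fills in via structured output.
from typing import Literal
from pydantic import BaseModel, Field

class RetrievalStep(BaseModel):
    kind: Literal["vector", "sql", "graph"]
    target: str                      # collection, table, or relationship to traverse
    query: str                       # natural-language or templated query for that store

class RetrievalPlan(BaseModel):
    steps: list[RetrievalStep] = Field(default_factory=list)
    needs_aggregation: bool = False  # if True, numbers must come from SQL, never the LLM
```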

Numeric guardrails → if the LLM’s answer contains numbers, we verify they match what the structured query returned. If not, fail fast instead of returning wrong data.
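A stripped-down sketch of that check; it ignores units, rounding, and formatted numbers for brevity:

```python
# Numeric guardrail sketch: every number in the generated answer must also
# appear in the structured (SQL) results, otherwise we fail fast.
import re

def extract_numbers(text: str) -> set[str]:
    return set(re.findall(r"\d+(?:\.\d+)?", text))

def check_numeric_grounding(answer: str, structured_results: list[dict]) -> None:
    grounded = extract_numbers(" ".join(str(v) for row in structured_results
                                        for v in row.values()))
    ungrounded = extract_numbers(answer) - grounded
    if ungrounded:
        # Fail fast instead of returning a potentially hallucinated figure.
        raise ValueError(f"Numbers not backed by structured data: {sorted(ungrounded)}")
```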

Semantic caching at the similar-question level → cuts LLM costs by ~40% without hurting quality.
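Conceptually, the cache is just a nearest-neighbour lookup over question embeddings. A toy version (the `embed` call and the 0.92 threshold are placeholders):

```python
# Toy semantic cache: reuse an earlier answer when a new question is close
# enough in embedding space.
import numpy as np

CACHE: list[tuple[np.ndarray, str]] = []   # (question embedding, cached answer)

def embed(text: str) -> np.ndarray: ...    # placeholder for the embedding model call

def cached_answer(question: str, threshold: float = 0.92) -> str | None:
    q = embed(question)
    for vec, answer in CACHE:
        sim = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))
        if sim >= threshold:
            return answer
    return None

def remember(question: str, answer: str) -> None:
    CACHE.append((embed(question), answer))
```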

Full-trace observability with OpenTelemetry → every query is tracked end-to-end (planner → retrieval → LLM → guardrails). Critical for debugging.
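The spans mirror the pipeline stages, so a slow or failing query can be pinned to one stage. A minimal sketch with the OpenTelemetry API (exporter setup omitted; the stage functions are stubs):

```python
# Minimal tracing sketch: one root span per query, a child span per stage.
from opentelemetry import trace

tracer = trace.get_tracer("compliance.rag")

def make_plan(question: str) -> dict: ...
def run_retrievals(plan: dict) -> dict: ...
def generate_answer(question: str, context: dict) -> str: ...
def run_guardrails(answer: str, context: dict) -> None: ...

def answer_question(question: str) -> str:
    with tracer.start_as_current_span("query") as root:
        root.set_attribute("question.chars", len(question))
        with tracer.start_as_current_span("planner"):
            plan = make_plan(question)
        with tracer.start_as_current_span("retrieval"):
            context = run_retrievals(plan)
        with tracer.start_as_current_span("llm"):
            answer = generate_answer(question, context)
        with tracer.start_as_current_span("guardrails"):
            run_guardrails(answer, context)
        return answer
```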

Lessons learned

  1. The bottleneck of RAG in production isn’t retrieval — it’s deciding which retrieval to use
  2. Numeric guardrails save lives when the correctness of an answer drives regulatory decisions
  3. LangGraph beats linear chains for orchestrating conditional retrievals
  4. Multi-store + planner > single vector store with better chunking
  5. LLMs will hallucinate on structured aggregations — no matter how good the model is

What’s next?

The next iteration is to replace some of the planner's rules with a router fine-tuned on real production examples. An LLM planner is flexible but expensive — distilling its decisions into a smaller model is the logical next step.

If you’re building RAG for regulated domains, my advice is: start with the query planner, not the vector store.


Are you building something similar? Let’s talk — we’re open to sharing architecture and learning from other cases.
