In the rush to deploy "Chat with Your Data," enterprise architects are over-indexing on the wrong part of the stack. They are obsessing over vector database selection, embedding dimensions, and chunking overlap.
Yet, when I audit failing RAG systems—the ones that are technically "live" but trusted by no one—the root cause is almost never the vector store. The root cause is upstream. It’s data hygiene.
Garbage in, vectorized garbage out.
If your source documents are stale, contradictory, or owned by no one, the most advanced retrieval algorithm in the world will simply retrieve the confusion faster.
1. Why “Better Chunking” Is a Red Herring
The industry is drowning in tutorials about "Semantic Chunking" vs. "Fixed-Size Chunking." This distracts from the actual failure mode.
Consider a standard HR bot. The RAG system retrieves a document titled "Remote Work Policy 2021" (chunked perfectly) and generates an answer. But the actual policy is a PDF email attachment sent by the CHRO last Tuesday titled "Update to Hybrid Work - Final V2.pdf".
No amount of chunking optimization can fix a source-of-truth problem. Engineers spend weeks tuning `top_k` parameters when those weeks should go into building a document-lifecycle governance pipeline.
2. The Real Causes of Enterprise Hallucinations
In an enterprise context, "hallucination" is rarely the model inventing facts. It is usually the model correctly summarizing incorrect facts that were retrieved from your own database.
Source-of-Truth Ambiguity
Enterprises are rife with duplicated data. The "Pricing Sheet" exists in SharePoint, in Salesforce, and in 50 local hard drives. If your RAG pipeline ingests all of them, the vector search will return conflicting chunks. The LLM, trying to be helpful, will either blend them (creating a Frankenstein policy) or pick the one that "looks" most relevant—often the oldest, simplest one.
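One way to stop the "Frankenstein policy" failure is to resolve conflicts deterministically before the LLM ever sees them. The sketch below is a minimal illustration, not a prescription: the `SOURCE_PRIORITY` ranking and the `Chunk` shape are assumptions you would replace with your own systems of record.

```python
from dataclasses import dataclass
from datetime import date

# Assumption: an explicit ranking of systems of record.
# Lower number = more authoritative for this document type.
SOURCE_PRIORITY = {"salesforce": 0, "sharepoint": 1, "local_drive": 2}


@dataclass
class Chunk:
    text: str
    source_system: str
    last_modified: date


def pick_canonical(chunks: list[Chunk]) -> Chunk:
    """Resolve conflicting retrievals deterministically:
    prefer the system of record, then the freshest copy.
    Never let the LLM 'blend' contradictory chunks."""
    return min(
        chunks,
        key=lambda c: (
            SOURCE_PRIORITY.get(c.source_system, 99),  # unknown sources rank last
            -c.last_modified.toordinal(),              # newer beats older
        ),
    )
```

The point is that conflict resolution is a data-engineering decision encoded in the pipeline, not something delegated to the model's "helpfulness."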
Stale Documents
Documents have lifecycles. A policy document from 2022 is toxic waste in 2025. If your vector index doesn't have a "TTL" (Time To Live) or a robust deletion syncing mechanism, you are building a legacy debt machine.
3. Data Pipelines vs. Prompt Tricks
The solution is not "Prompt Engineering" (e.g., "Only answer from the context"). The solution is Data Engineering.
A production RAG architecture must treat documents like code. They need versioning. They need owners. They need deprecation schedules.
4. The Missing Layer: RAG Evaluation
Ask a typical RAG team: "Why did the answer to 'What is our refund policy?' change between Tuesday and today?" Most cannot answer.
They lack a Golden Dataset—a set of 100+ QA pairs that are ground-truth verified by humans. Without running your RAG pipeline against a Golden Dataset on every commit (CI/CD for Knowledge), you are flying blind. You don't know if your new chunking strategy improved accuracy or just broke the pricing retrieval while fixing the HR retrieval.
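A golden-set gate can start as a few lines in CI. The sketch below uses crude substring matching as a placeholder score; a real gate would use RAGAS- or TruLens-style metrics, and the `GOLDEN_SET` entries and `answer_fn` hook are hypothetical stand-ins for your verified dataset and your RAG pipeline.

```python
from typing import Callable

# Hypothetical golden set: human-verified QA pairs. A real one has 100+.
GOLDEN_SET = [
    {"question": "What is our refund policy?", "expected": "30 days"},
    {"question": "Who approves remote work?", "expected": "your manager"},
]


def evaluate(answer_fn: Callable[[str], str], threshold: float = 0.9) -> float:
    """Run every golden question through the pipeline and fail the build
    on regression. Substring match is a placeholder for a real metric."""
    hits = sum(
        1
        for ex in GOLDEN_SET
        if ex["expected"].lower() in answer_fn(ex["question"]).lower()
    )
    score = hits / len(GOLDEN_SET)
    assert score >= threshold, f"Golden-set regression: {score:.0%} < {threshold:.0%}"
    return score
```

Wired into CI, this is what turns "it feels better" into a number that blocks the merge.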
5. What a Governed RAG Pipeline Looks Like
To move from "Demo RAG" to "Production RAG," you need:
- Explicit Ownership metadata: Every chunk in the vector DB must have an `owner_id` and `last_verified_date`.
- Ingestion Filters: "If a document hasn't been touched in 2 years, do not index it."
- Access Control Lists (ACLs): The vector search results must be filtered by the user's permissions before generation. You cannot rely on the LLM to "ignore" sensitive data it was fed.
- The Citation Link: Every answer must link back to the specific source document version, allowing the user to verify (and the owner to be held accountable).
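The ACL requirement above can be enforced in one filtering step between retrieval and generation. This is a minimal sketch; the chunk dictionary shape and the `allowed_groups` field are assumptions about how your retriever tags results.

```python
# Assumption: each retrieved chunk carries an "allowed_groups" ACL
# populated at ingestion time from the source system's permissions.
def filter_by_acl(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    """Drop chunks the caller may not see BEFORE they reach the prompt.
    Never rely on the LLM to 'ignore' sensitive context it was handed."""
    return [c for c in chunks if user_groups & set(c["allowed_groups"])]
```

The same pre-generation step is the natural place to attach the citation metadata (document version, owner) that each answer must carry.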
6. Three Questions to Ask Your RAG Team
1. Who owns the 'Delete' button? When a policy changes in SharePoint, how many minutes until the old chunks are purged from the vector DB?
2. Show me the Golden Set. Do not accept "it feels better." Demand an automated evaluation score (e.g., RAGAS or TruLens) against a verified dataset.
3. Audit the Source. Pick 5 random documents in the index and ask: "Is this the absolute current truth?" If even one is wrong, your trust score is zero.