Building a RAG System: What Actually Goes Wrong

Retrieval-Augmented Generation (RAG) is now a standard pattern for giving AI models access to a knowledge base. Building one that works in a simple demo is straightforward; building one that works reliably in production is significantly harder. Here is what actually goes wrong and how to fix it.

The Anatomy of a RAG System

A RAG system has five components: document processing (ingesting and chunking documents), embedding (converting text chunks to vector representations), storage (a vector database), retrieval (finding the most relevant chunks at query time), and generation (passing retrieved context to the LLM to generate an answer). Each component is a potential failure point, and failures are often silent — the system returns an answer, but the answer is wrong because a component failed invisibly.
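To make the five stages concrete, here is a minimal sketch of the pipeline in plain Python. Everything in it is a stand-in chosen for illustration: the bag-of-words `embed` function replaces a real embedding model, the in-memory list replaces a vector database, and `build_prompt` stops short of calling an LLM. The function names are hypothetical and do not come from any particular library.

```python
import math
import re
from collections import Counter


def chunk(document: str) -> list[str]:
    """Document processing: one chunk per blank-line-separated paragraph."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def embed(text: str) -> Counter:
    """Embedding: a toy bag-of-words vector standing in for a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents: list[str]) -> list[tuple[str, Counter]]:
    """Storage: an in-memory list standing in for a vector database."""
    return [(c, embed(c)) for d in documents for c in chunk(d)]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Retrieval: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Generation input: the prompt that would be sent to the LLM."""
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"


documents = [
    "Cats purr when content.\n\nDogs bark at strangers.",
    "Paris is the capital of France.",
]
index = build_index(documents)
top = retrieve(index, "What is the capital of France?", k=1)
prompt = build_prompt("What is the capital of France?", top)
```

Note that every stage returns something plausible even when it does a poor job: a bad chunker still yields chunks and a weak embedding still yields a ranking. That is why failures in this kind of pipeline tend to be silent.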

Chunking: The Most Underappreciated Problem

How you split your documents into chunks has an enormous impact on retrieval quality. Three mistakes are common.

Fixed-size chunking without regard for semantic boundaries: splitting a sentence in the middle, or splitting a table across chunks, leaves the retrieved chunk semantically incomplete.

Chunks that are too small (under roughly 100 tokens): they contain too little context to be meaningful.

Chunks that are too large (over roughly 1,000 tokens): they contain too much unrelated information, which dilutes the relevance signal.

The better approach combines three techniques: semantic chunking (split on paragraph or section boundaries, not arbitrary token counts); document-aware chunking for structured documents (PDFs, HTML) that extracts tables, headings, and body text separately; and hierarchical chunking (index small chunks for retrieval, but include the parent section as context in the prompt).

Parent-child chunking is a common form of the hierarchical approach: small chunks are indexed for retrieval precision, and when a small chunk is retrieved, its larger parent chunk is what goes into the context. This gives you high retrieval precision (the small chunk closely matched what was asked) along with enough context for the LLM to generate a good answer.
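The parent-child pattern described above can be sketched in a few lines. This is an illustration under simplifying assumptions, not a production chunker: parents are paragraphs split on blank lines, children are sentences split with a naive regex, and the names `ChildChunk`, `split_parent_child`, and `expand_to_parents` are hypothetical. A real system would index the child texts in a vector store and use a proper sentence or token-based splitter.

```python
import re
from dataclasses import dataclass


@dataclass
class ChildChunk:
    text: str
    parent_id: int  # index into the list of parent sections


def split_parent_child(document: str) -> tuple[list[str], list[ChildChunk]]:
    """Split a document into parent sections (paragraphs) and child chunks (sentences)."""
    parents = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    children = []
    for parent_id, parent in enumerate(parents):
        # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", parent):
            if sentence:
                children.append(ChildChunk(sentence, parent_id))
    return parents, children


def expand_to_parents(hits: list[ChildChunk], parents: list[str]) -> list[str]:
    """Swap retrieved children for their parents, deduplicated, keeping hit order."""
    seen = set()
    context = []
    for hit in hits:
        if hit.parent_id not in seen:
            seen.add(hit.parent_id)
            context.append(parents[hit.parent_id])
    return context


doc = (
    "Refunds take five days. Contact support to start one.\n\n"
    "Shipping is free over $50."
)
parents, children = split_parent_child(doc)
# Pretend retrieval returned these children, best match first.
context = expand_to_parents([children[1], children[0], children[2]], parents)
```

The deduplication step matters in practice: when two retrieved children share a parent, including that parent twice wastes context without adding information.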

Embedding Model Choice and Retrieval Quality

Not all embedding models are equal. General-purpose embedding models (OpenAI text-embedding-3-large, Cohere embed-v3, BGE-M3) perform differently on different document types. A model trained on general web text may perform poorly on technical documentation, legal text, or other domain-specific content. The MTEB (Massive Text Embedding Benchmark) leaderboard provides standardised retrieval quality metrics across domains, so check the results closest to your domain before choosing an embedding model.

Hybrid retrieval: combining dense retrieval (embedding similarity) with sparse retrieval (BM25 keyword matching) significantly improves retrieval quality for most real-world datasets. Dense retrieval is good at semantic similarity; BM25 is good at exact keyword matches such as product names, error codes, and identifiers. Merging the two result lists with reciprocal rank fusion (RRF) is a standard pattern that typically outperforms either method alone.

Re-ranking: after initial retrieval, passing the top-K chunks through a cross-encoder re-ranker (Cohere Rerank, BGE-Reranker, Jina Reranker) significantly improves precision. The initial retrieval is fast but approximate; the re-ranker is slower but more accurate, because it scores each query-chunk pair jointly. A common production pattern: retrieve the top 20 with fast embedding search, re-rank down to the top 5 with a cross-encoder, and include those 5 in the context.
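Reciprocal rank fusion, mentioned above, is simple enough to write out directly: each document scores 1 / (k + rank) in every ranked list it appears in, and the scores are summed. The sketch below assumes each retriever returns a list of document IDs, best first. The constant k = 60 is a commonly used default rather than anything the text prescribes, and the function name is hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one list, best first.

    Each document receives 1 / (k + rank) from every list it appears in,
    with rank starting at 1. Documents missing from a list get nothing from it.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]   # e.g. from embedding similarity search
sparse_hits = ["c", "a", "d"]  # e.g. from BM25 keyword search
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Because RRF uses only ranks, it needs no calibration between cosine similarities and BM25 scores, which are on different scales. In the retrieve-then-re-rank pattern described above, the top of the fused list would be what gets passed to the cross-encoder.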

The Generation Failure Modes

“Context stuffing” failure: retrieving too many chunks fills the context window with tangentially relevant information, and the LLM produces a confused or hallucinated answer. The fix is to be selective: 3–7 highly relevant chunks usually outperform 20 moderately relevant ones.

Lost in the middle: LLMs tend to use information at the beginning and end of the context window more reliably than information in the middle, so highly relevant chunks placed in the middle are often used less than chunks at the edges. Put the most important context first or last.

Answer grounding: instruct the model to answer only from the provided context and to say “I don’t know” if the answer isn’t there. Without this instruction, models will often fill gaps from their parametric knowledge. The result may be correct, but it is no longer grounded in your knowledge base, which isn’t RAG behaviour.

Hallucination about sources: LLMs sometimes cite sources that are in the context but quote them inaccurately, or cite sources that aren’t in the context at all. Source verification, meaning a check that quoted text actually appears in the cited chunk, is necessary for high-stakes applications.
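Two of the fixes above are mechanical enough to sketch. The first function reorders chunks so the strongest ones sit at the start and end of the context; the second checks that each quoted passage really appears in the chunk it cites. Both are illustrations under stated assumptions: the chunks arrive sorted best first, citations have already been extracted from the model output as (chunk ID, quote) pairs, and an exact substring match after whitespace and case normalisation counts as verified. The function names are hypothetical.

```python
import re


def order_for_edges(chunks_best_first: list[str]) -> list[str]:
    """Place the strongest chunks at the start and end, the weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        # Alternate: 1st, 3rd, 5th... go to the front; 2nd, 4th... to the back.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences pass."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_quotes(
    citations: list[tuple[str, str]], chunks: dict[str, str]
) -> list[tuple[str, str]]:
    """Return the (chunk_id, quote) pairs that fail verification.

    A citation fails if the chunk ID was never in the context, or if the
    quote is not a substring of that chunk after normalisation.
    """
    failures = []
    for chunk_id, quote in citations:
        source = chunks.get(chunk_id)
        if source is None or normalize(quote) not in normalize(source):
            failures.append((chunk_id, quote))
    return failures


ordered = order_for_edges(["a", "b", "c", "d", "e"])  # "a" is the best match

context_chunks = {"doc1": "Refunds are processed within  five business days."}
citations = [
    ("doc1", "processed within five business days"),  # accurate quote
    ("doc1", "processed within two days"),            # misquoted
    ("doc9", "anything"),                             # source not in context
]
failed = verify_quotes(citations, context_chunks)
```

Exact matching is deliberately strict. It will flag harmless paraphrases as failures, which is usually the right trade-off for high-stakes use; a fuzzier comparison would need its own threshold and its own evaluation.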
