Retrieval-Augmented Generation (RAG) is one of the most widely deployed patterns for building AI applications that work with specific, up-to-date, or proprietary information. Here is what it actually is and when to use it.
The Problem RAG Solves
Large language models (LLMs) have training cutoffs: they don’t know about events after their training data ends. They also don’t know your company’s internal documents, your product documentation, or your database contents. You could fine-tune a model on your data, but fine-tuning is expensive, slow to update, and unreliable for recalling specific facts. RAG takes a different approach: at query time, it retrieves relevant documents from your knowledge base and includes them in the prompt as context. The model then answers from the retrieved context, not just from its training data.
How RAG Works
The architecture has two phases. Indexing: your documents are split into chunks (typically 500–1000 tokens), each chunk is converted to an embedding (a vector of numbers representing its semantic meaning) using a model such as OpenAI’s text-embedding-3-small or one from sentence-transformers, and the embeddings are stored in a vector database (Pinecone, Chroma, Weaviate, pgvector). Retrieval and generation: when a user asks a question, the question is converted to an embedding with the same model, the vector database finds the most similar chunks (typically by cosine similarity), the top k chunks are included in the prompt as context, and the LLM generates an answer based on that context. The key: the model can now answer questions about documents it was never trained on, and you can update your knowledge base without retraining.
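The two phases can be sketched in a few lines of plain Python. This is a toy illustration, not a production setup: the embed function below is a bag-of-words counter standing in for a real embedding model, a Python list stands in for the vector database, chunks are measured in words and not tokens, and the document names and function names are made up for the example.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=50):
    """Split text into chunks of at most max_words words (a stand-in for token-based chunking)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding: sparse word counts. A real system would call an embedding model here."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(documents, max_words=50):
    """Indexing phase: chunk every document and store each chunk with its vector and metadata."""
    index = []
    for name, text in documents.items():
        for position, chunk in enumerate(chunk_text(text, max_words)):
            index.append({
                "source": name,
                "position": position,
                "text": chunk,
                "vector": embed(chunk),
            })
    return index


def retrieve(index, question, k=2):
    """Retrieval phase: embed the question and return the k most similar chunks."""
    question_vector = embed(question)
    ranked = sorted(
        index,
        key=lambda entry: cosine_similarity(question_vector, entry["vector"]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical knowledge base for illustration only.
documents = {
    "refund_policy.txt": "Refunds are issued within 14 days of purchase. Contact support to request a refund.",
    "shipping.txt": "Orders ship within 2 business days. International shipping takes up to 3 weeks.",
}

index = build_index(documents)
top_chunks = retrieve(index, "How long do refunds take?", k=1)
print(top_chunks[0]["source"], "->", top_chunks[0]["text"])
```

Swapping embed for a call to a real embedding model and the list for a vector store changes the quality of the matches, but not the shape of the pipeline: index once, then embed and rank at query time.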
When RAG Is the Right Choice
Use RAG when: your application requires answers from a specific, regularly updated knowledge base; you need the model to cite or reference specific documents; your data is proprietary and cannot be included in fine-tuning; or the volume of data exceeds what can fit in a single prompt context. Do not use RAG when: the information is already in the model’s training data; you need complex reasoning across many documents simultaneously (RAG retrieves only the top-ranked chunks, not the whole corpus); or you need the model to learn patterns from data rather than retrieve facts.
Building a Simple RAG System
The minimal stack: a document chunker (LangChain’s RecursiveCharacterTextSplitter or simple Python splitting), an embedding model (OpenAI text-embedding-3-small, listed at $0.02 per million tokens at the time of writing), a vector store (Chroma for local development, pgvector for production PostgreSQL integration), and an LLM (Claude or GPT-4 for generation). The prompt template: “Use the following context to answer the question. Context: {retrieved_chunks}. Question: {user_question}.” Adding source citations: store the document name and chunk position in each chunk’s metadata and pass them into the prompt alongside the chunk text, so the model can cite its sources. This basic architecture covers the core requirements of most knowledge base Q&A, documentation search, and internal chatbot use cases.
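As a rough sketch of how the prompt template and the citation metadata fit together, the function below fills the template from a list of retrieved chunks. The dictionary keys (source, position, text), the bracketed label format, and the extra instruction asking the model to cite those labels are assumptions made for this example, not part of any particular library.

```python
def build_prompt(retrieved_chunks, user_question):
    """Fill the RAG prompt template, tagging each chunk with a citable source label."""
    context_lines = []
    for chunk in retrieved_chunks:
        # Label format is hypothetical: document name plus chunk position.
        label = f"[{chunk['source']}#{chunk['position']}]"
        context_lines.append(f"{label} {chunk['text']}")
    context = "\n".join(context_lines)
    return (
        "Use the following context to answer the question. "
        "Cite the bracketed source labels you relied on.\n"
        f"Context:\n{context}\n"
        f"Question: {user_question}"
    )


# Hypothetical retrieved chunks, shaped like the output of a vector store query.
retrieved_chunks = [
    {"source": "refund_policy.txt", "position": 0,
     "text": "Refunds are issued within 14 days of purchase."},
    {"source": "faq.txt", "position": 3,
     "text": "Contact support to request a refund."},
]

prompt = build_prompt(retrieved_chunks, "How long do refunds take?")
print(prompt)
```

The resulting string is what gets sent to the LLM. Because every chunk carries its label, an answer that quotes a label can be traced back to a specific document and position.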




