
RAG Architecture: How to Build Retrieval-Augmented Generation Systems

By Rome Thorndike · February 15, 2026 · 19 min read

Every company with a knowledge base wants a chatbot that answers questions about it. Every team building one hits the same problems: the AI hallucinates answers, retrieval returns irrelevant documents, and the whole thing works great in demos but fails on real user questions.

RAG (Retrieval-Augmented Generation) is the architecture that solves this, when built correctly. It connects a language model to your actual data so it can answer questions grounded in real information instead of making things up.

This guide covers the full pipeline. Not just the theory, but the practical decisions you'll face at every stage and the mistakes that'll cost you weeks if you don't know about them upfront.

What Is RAG and Why Does It Matter?

RAG is a two-step process. First, retrieve relevant documents from a knowledge base. Second, feed those documents to a language model along with the user's question and ask it to generate an answer using only the provided context.

Without RAG, a language model can only answer based on what it learned during training. It can't access your company's documentation, product specs, or internal knowledge. It either admits it doesn't know (best case) or confidently makes up an answer (worst case).

With RAG, the model has access to your specific data at query time. It doesn't need to "know" everything. It just needs to read the right documents and synthesize an answer.

RAG vs Fine-Tuning: When to Use Each

This is the first decision you'll face, and getting it wrong wastes months.

Use RAG when:

  • Your knowledge base changes frequently (product docs, policies, FAQ updates)
  • You need the model to cite specific sources
  • You have a large corpus of documents the model needs to reference
  • Accuracy and factual grounding are critical
  • You want to get started quickly without training infrastructure

Use fine-tuning when:

  • You need the model to adopt a specific style or format consistently
  • The knowledge is stable and doesn't change often
  • You need to reduce per-query costs (embedding the knowledge in weights eliminates retrieval costs)
  • You want the model to learn new behaviors, not just access new information

Use both when: You need a model that writes in your brand voice (fine-tuning) and references current documentation (RAG). This combination is increasingly common in production systems.

The RAG Pipeline

A RAG system has four major components: document processing, embedding, retrieval, and generation. Let's go through each one.

Stage 1: Document Processing (Chunking)

You can't feed entire documents to a language model for two reasons: they won't fit in the context window, and even if they did, the model would struggle to find the relevant information buried in thousands of pages. You need to break documents into smaller chunks.

Chunking strategy is the single most impactful decision in RAG. Get it wrong and nothing downstream can compensate.

Chunk Size

Smaller chunks (100-200 tokens) give you more precise retrieval. The retrieved chunk is more likely to be relevant to the specific question. But small chunks lose context. A sentence fragment might not make sense without the surrounding paragraph.

Larger chunks (500-1,000 tokens) preserve more context. The model has enough information to generate a complete answer. But large chunks reduce retrieval precision. A chunk might contain one relevant sentence buried in nine irrelevant ones.

The sweet spot for most use cases: 300 to 500 tokens per chunk, with 50 to 100 tokens of overlap between consecutive chunks. The overlap ensures you don't split critical information across chunk boundaries.

Chunking Methods

  • Fixed-size chunking: Split every N tokens. Simple but ignores document structure. A chunk might start mid-sentence.
  • Recursive character splitting: Split on paragraphs first, then sentences, then words. Preserves natural boundaries. This is what LangChain's RecursiveCharacterTextSplitter does, and it's the most common approach.
  • Semantic chunking: Use an embedding model to detect topic shifts and split at semantic boundaries. More expensive but produces more coherent chunks. Good for documents that don't have clear structural markers.
  • Document-aware chunking: Use the document's own structure. Split on headings, sections, or chapters. Preserves the author's organizational intent. Best for well-structured documents like documentation, textbooks, and legal contracts.

Chunking Best Practice

Start with recursive character splitting at 400 tokens with 100-token overlap. Test with 20 real user questions. If retrieval quality is poor, try document-aware chunking or adjust chunk size. Don't over-engineer chunking before you have test data showing you need to.
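The mechanics of overlapping chunks can be sketched in a few lines. This is a minimal word-based approximation, using words as a stand-in for tokens; a production system would use a real tokenizer and a splitter such as LangChain's RecursiveCharacterTextSplitter.

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 100) -> list[str]:
    """Split text into chunks of ~chunk_size words, with `overlap` words
    shared between consecutive chunks so information at a boundary appears
    whole in at least one chunk."""
    assert chunk_size > overlap, "chunk_size must exceed overlap"
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap  # each new chunk starts `step` words later
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the document
    return chunks

# A 1,000-word document yields chunks starting at words 0, 300, and 600.
doc = " ".join(f"word{i}" for i in range(1000))
chunks = chunk_text(doc, chunk_size=400, overlap=100)
```

Note that chunk 2 starts at word 300 while chunk 1 ends at word 399: those 100 shared words are the overlap doing its job.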

Metadata Preservation

Every chunk should carry metadata: source document title, section heading, page number, date, and any other attributes relevant to your use case. This metadata serves two purposes: it enables filtered retrieval ("only search the HR handbook") and it lets the model cite sources in its answers.

Stage 2: Embedding

Embedding converts text chunks into numerical vectors that capture semantic meaning. Similar content produces similar vectors. This is what enables semantic search: finding documents that are conceptually related to a query, not just keyword matches.

Choosing an Embedding Model

The embedding model determines the quality ceiling for your retrieval. A mediocre embedding model means mediocre retrieval, regardless of how good your other components are.

Embedding Model Comparison (2026)

OpenAI text-embedding-3-large: Strong general-purpose performance. 3,072 dimensions. Good for most use cases. Pay-per-use pricing.

Cohere embed-v4: Excellent for multilingual content. Competitive with OpenAI on English benchmarks. Offers compressed embeddings for cost savings.

Open source (BGE, E5, GTE): Free to run. Requires your own infrastructure. Performance is competitive with commercial options. Good choice if you process high volumes and want to avoid per-query costs.

Domain-specific models: PubMedBERT for medical, LegalBERT for legal. Better for specialized domains but narrower applicability. Consider these if your corpus is heavily domain-specific.

One critical rule: the same embedding model must be used for both indexing and querying. If you embed your documents with OpenAI's model but embed queries with Cohere's model, the vectors live in different mathematical spaces and similarity search won't work.

Stage 3: Retrieval

Retrieval is where you find the most relevant chunks for a given query. This is the stage where most RAG systems fail or succeed.

Vector Search

The core retrieval mechanism: embed the user's query, then find the K most similar document embeddings using cosine similarity or dot product. This is what vector databases are built for.
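The core operation is simple enough to sketch directly. This toy example uses 3-dimensional vectors purely for illustration (real embeddings have hundreds or thousands of dimensions) and ranks documents by cosine similarity:

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 2) -> np.ndarray:
    """Return the indices of the k document vectors most similar to the
    query, ranked by cosine similarity (dot product of unit vectors)."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                   # one cosine similarity per document
    return np.argsort(-sims)[:k]   # highest similarity first

# Toy 3-dimensional "embeddings" for illustration only.
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.05, 0.0])
nearest = top_k(query, docs, k=2)
```

A vector database performs the same ranking, but with approximate nearest-neighbor indexes so it scales past brute-force comparison.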

Pure vector search has a weakness: it captures semantic similarity but can miss exact keyword matches. If a user asks about "HIPAA compliance" and your document uses that exact phrase, vector search might rank a semantically similar chunk about "healthcare data privacy regulations" higher than the chunk that literally says "HIPAA compliance requirements."

Hybrid Search

Combine vector search (semantic) with BM25 or keyword search (lexical). This catches both semantic matches and exact keyword matches. Most production RAG systems use hybrid search.

The typical approach: run both searches in parallel, then combine results using reciprocal rank fusion (RRF). RRF merges two ranked lists by scoring each result based on its rank in both lists, producing a final ranking that balances semantic and lexical relevance.
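RRF itself is only a few lines. This sketch merges any number of ranked lists; the constant k=60 is the value from the original RRF paper and works well in practice:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists. Each document earns 1 / (k + rank) from
    every list it appears in; summing these scores rewards documents that
    rank well in both the semantic and the lexical search."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_results = ["doc_a", "doc_b", "doc_c"]   # semantic ranking
keyword_results = ["doc_b", "doc_d", "doc_a"]  # BM25 ranking
fused = reciprocal_rank_fusion([vector_results, keyword_results])
```

Here doc_b ranks 2nd and 1st across the two lists while doc_a ranks 1st and 3rd, so doc_b edges out doc_a in the fused ranking.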

Retrieval Parameters

How many chunks to retrieve (K) is a tuning decision. Too few chunks and you miss relevant information. Too many and you flood the model with noise, making it harder to find the answer.

Start with K=5 for simple question-answering. Increase to K=10 or K=15 for complex questions that might require information from multiple sources. If you're consistently retrieving irrelevant chunks, the problem is usually your chunking strategy or embedding model, not K.

Re-Ranking

After initial retrieval, pass the top-K results through a re-ranking model. Re-rankers are cross-encoders that evaluate each query-document pair jointly, producing more accurate relevance scores than embedding similarity alone.

The tradeoff: re-ranking adds latency (50 to 200ms). But it significantly improves the quality of the final context passed to the generator. For production systems where answer quality matters, re-ranking is almost always worth the latency cost.
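The re-ranking step has a simple shape: score every (query, chunk) pair, sort, keep the best. In the sketch below, the token-overlap `score` function is a crude placeholder standing in for a real cross-encoder (such as a sentence-transformers CrossEncoder or Cohere's rerank endpoint); the surrounding logic is the same either way:

```python
def rerank(query: str, chunks: list[str], top_n: int = 3) -> list[str]:
    """Re-rank retrieved chunks by a relevance score computed jointly over
    each (query, chunk) pair, keeping the top_n best."""
    def score(q: str, c: str) -> float:
        # Placeholder: fraction of query tokens present in the chunk.
        # A real cross-encoder model would replace this function.
        q_tokens = set(q.lower().split())
        c_tokens = set(c.lower().split())
        return len(q_tokens & c_tokens) / max(len(q_tokens), 1)
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:top_n]

chunks = [
    "healthcare data privacy regulations overview",
    "HIPAA compliance requirements for providers",
    "office lunch menu for the week",
]
best = rerank("HIPAA compliance requirements", chunks, top_n=1)
```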

Choosing a Vector Database

You need somewhere to store your embeddings and perform similarity search. The options range from simple libraries to managed cloud services.

Vector Database Options

Pinecone: Fully managed, easy to set up, scales automatically. Good for teams that don't want to manage infrastructure. Pay per usage.

Weaviate: Open source with a managed cloud option. Strong hybrid search support built in. Good documentation and active community.

Qdrant: Open source, written in Rust, very fast. Good for self-hosted deployments where you need maximum performance. Excellent filtering capabilities.

pgvector: PostgreSQL extension. If you're already on Postgres, this is the simplest path. Performance is good enough for most use cases. You avoid adding another database to your stack.

Chroma: Lightweight, developer-friendly, good for prototyping. Not recommended for large-scale production without careful benchmarking.

For most teams starting out, pgvector (if you're on Postgres) or Pinecone (if you want managed) are the pragmatic choices. Don't over-optimize your database selection before you've validated that your chunking and embedding strategy actually works.

Stage 4: Generation

This is where the language model takes the retrieved chunks and the user's question and produces an answer. The generation prompt is critical.

Production RAG Generation Prompt

System prompt:
You are a helpful assistant that answers questions based on the provided context. Follow these rules strictly:

1. Only use information from the CONTEXT section below to answer questions.
2. If the context doesn't contain enough information to answer the question fully, say "I don't have enough information to answer that question completely" and explain what's missing.
3. Never make up information that isn't in the context.
4. Cite which source documents you're drawing from.
5. If the question is ambiguous, ask for clarification.

CONTEXT:
{retrieved_chunks_with_source_metadata}

User prompt:
{user_question}

Key decisions in the generation prompt:

  • Faithfulness instruction: "Only use information from the context" is the most important instruction. Without it, the model will fill gaps with training knowledge, which defeats the purpose of RAG.
  • Graceful failure: Tell the model what to do when it doesn't have enough information. "I don't know" is better than a hallucinated answer.
  • Source citation: Include chunk metadata in the context and instruct the model to cite sources. This builds user trust and makes it easy to verify answers.
  • Temperature: Use low temperature (0 to 0.3) for factual Q&A RAG. Higher temperature increases the risk of the model inventing information.
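Filling the `{retrieved_chunks_with_source_metadata}` slot is just string assembly, but doing it with numbered source headers is what makes citation possible. A minimal sketch, assuming each chunk arrives as a dict with `text`, `source`, and `section` keys (the field names here are illustrative, not a standard):

```python
def build_context(chunks: list[dict]) -> str:
    """Format retrieved chunks with numbered source headers so the model
    can cite them (e.g. "according to [1], ...")."""
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        header = f"[{i}] {chunk['source']}, {chunk['section']}"
        blocks.append(f"{header}\n{chunk['text']}")
    return "\n\n".join(blocks)

chunks = [
    {"source": "HR Handbook", "section": "PTO Policy",
     "text": "Employees accrue 1.5 days of PTO per month."},
    {"source": "HR Handbook", "section": "Remote Work",
     "text": "Remote work requires manager approval."},
]
context = build_context(chunks)
```

The resulting string drops straight into the CONTEXT section of the system prompt above.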

Evaluation

RAG evaluation is harder than most people expect because you need to evaluate two components separately: retrieval quality and generation quality.

Retrieval Evaluation

Build a test set of 50 to 100 questions paired with the specific chunks that contain the correct answers. Then measure:

  • Hit rate: How often does the correct chunk appear in the top-K results? If your hit rate at K=5 is below 80%, your chunking or embedding needs work.
  • Mean Reciprocal Rank (MRR): Where does the correct chunk rank? Appearing at position 1 is better than position 5, even though both are "hits."
  • Precision@K: What fraction of the top-K results are actually relevant? Low precision means noise is drowning out signal.
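All three metrics fall out of one pass over the test set. A minimal sketch, assuming each test question has a ranked list of retrieved chunk IDs and a gold set of relevant chunk IDs:

```python
def retrieval_metrics(results: list[list[str]],
                      relevant: list[set[str]],
                      k: int = 5) -> dict[str, float]:
    """Compute hit rate, MRR, and precision@k over a retrieval test set.
    results[i] is the ranked list of retrieved chunk IDs for question i;
    relevant[i] is the set of chunk IDs that actually answer it."""
    hits = rr_total = precision_total = 0.0
    for ranked, gold in zip(results, relevant):
        top = ranked[:k]
        if any(doc in gold for doc in top):
            hits += 1                       # hit: a relevant chunk in top-k
        for rank, doc in enumerate(top, start=1):
            if doc in gold:
                rr_total += 1.0 / rank      # reciprocal rank of first hit
                break
        precision_total += sum(doc in gold for doc in top) / k
    n = len(results)
    return {"hit_rate": hits / n,
            "mrr": rr_total / n,
            "precision_at_k": precision_total / n}

# Question 1 finds its answer at rank 2; question 2 misses entirely.
metrics = retrieval_metrics(
    results=[["a", "b", "c"], ["x", "y", "z"]],
    relevant=[{"b"}, {"q"}],
    k=3,
)
```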

Generation Evaluation

Given perfect retrieval (manually provide the correct chunks), evaluate the generated answers for:

  • Faithfulness: Does the answer only use information from the context? Any claim not supported by the retrieved chunks is a faithfulness failure.
  • Relevance: Does the answer actually address the question? A faithful answer that doesn't answer the question is still useless.
  • Completeness: Does the answer cover all aspects of the question that the context can support?

End-to-End Evaluation

Run real questions through the full pipeline and compare answers to gold-standard responses. This is the metric that matters most to users, but it's the hardest to debug because failures could originate in any stage.

Use frameworks like RAGAS or custom eval scripts. Start with manual evaluation on 50 queries to understand your failure patterns before automating.

Common Pitfalls

Pitfall 1: Chunks That Are Too Small

Tiny chunks (under 100 tokens) retrieve precisely but lack enough context for the model to generate useful answers. A chunk that says "Yes, this is covered under Section 4.2" is useless without the content of Section 4.2. Use the overlap parameter to ensure chunks carry enough surrounding context.

Pitfall 2: Ignoring Document Structure

Tables, headers, lists, and code blocks carry structural meaning that gets lost in naive text splitting. A table split across two chunks is useless in both. Pre-process documents to preserve structural elements. Convert tables to text descriptions. Keep code blocks intact.

Pitfall 3: Not Handling "I Don't Know"

Without explicit instructions, models will answer every question, even when the retrieved context is completely irrelevant. Always include instructions for when the context doesn't contain enough information. Test this specifically with questions your knowledge base can't answer.

Pitfall 4: Retrieval Without Filtering

If your knowledge base covers multiple products, time periods, or departments, unfiltered retrieval pulls in irrelevant chunks from other domains. Use metadata filters. "Only retrieve chunks from the 2026 product manual" is much more effective than retrieving from the entire corpus.
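The filtering logic is trivial once chunks carry metadata, which is why skipping metadata at indexing time hurts later. Vector databases like Qdrant and pgvector apply such filters natively alongside similarity search; this sketch just shows the shape of the operation:

```python
def apply_filters(chunks: list[dict], filters: dict) -> list[dict]:
    """Keep only chunks whose metadata matches every filter key/value.
    In production this runs inside the vector database, restricting the
    candidate set before similarity ranking."""
    return [
        c for c in chunks
        if all(c.get("metadata", {}).get(key) == value
               for key, value in filters.items())
    ]

chunks = [
    {"text": "Feature overview", "metadata": {"manual": "2026", "product": "A"}},
    {"text": "Legacy setup",     "metadata": {"manual": "2024", "product": "A"}},
    {"text": "Billing FAQ",      "metadata": {"manual": "2026", "product": "B"}},
]
hits = apply_filters(chunks, {"manual": "2026", "product": "A"})
```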

Pitfall 5: Testing Only With Easy Questions

Your demo questions will always work. The questions that break your system are the ones real users ask: ambiguous questions, questions that span multiple documents, questions about things that don't exist in your knowledge base, and questions that require synthesizing information from several chunks.

Production Considerations

Latency Budget

A typical RAG query involves: embedding the query (50ms), vector search (20-50ms), re-ranking (50-200ms), and generation (500-2000ms). Total: 600ms to 2.3 seconds. Users expect fast responses. Identify your latency budget and optimize accordingly. Caching, pre-computation, and streaming responses all help.

Cost Management

RAG costs come from three sources: embedding API calls (for new documents and every query), vector database hosting, and LLM generation. At scale, embedding costs dominate. Consider open-source embedding models if you process high volumes. Cache embeddings for repeated queries. Use smaller generation models for simple queries and reserve expensive models for complex ones.

Document Updates

Knowledge bases change. You need a pipeline that re-chunks, re-embeds, and re-indexes updated documents. Partial updates (only re-indexing changed sections) are more efficient than full re-indexing but harder to implement. For most teams, nightly full re-indexing is good enough.

Monitoring

In production, you need visibility into: retrieval quality over time (are relevant chunks being found?), generation quality (are answers correct and grounded?), latency trends, and user satisfaction signals. Log every query, the retrieved chunks, and the generated answer. When quality drops, these logs are your debugging lifeline.

Getting Started

Don't try to build the perfect RAG system on day one. Start simple and iterate.

  1. Week 1: Pick 10 to 20 documents from your knowledge base. Chunk them with recursive character splitting. Embed with OpenAI's text-embedding-3-small. Store in pgvector or Chroma. Write a simple generation prompt. Test with 10 questions.
  2. Week 2: Build a test set of 50 questions with expected answers. Measure retrieval hit rate and answer accuracy. Identify the biggest failure mode and fix it.
  3. Week 3: Add hybrid search. Implement re-ranking. Test with the full document corpus.
  4. Week 4: Add metadata filtering, source citations, and "I don't know" handling. Prepare for production deployment.

Each iteration should be driven by measured failures, not assumptions. Build, test, measure, fix, repeat.

For more on the prompting techniques that make RAG generation work well, check our chain-of-thought tutorial and best practices guide. For career opportunities in this space, browse our job board where RAG experience is one of the most requested skills.

Frequently Asked Questions

How much data do I need to build a useful RAG system?

You can build a useful RAG system with as few as 10 to 20 documents. The value comes from having the right data, not the most data. A 20-page product manual chunked and indexed properly can power an excellent Q&A bot. Start with a focused document set that covers your most common questions. You can always expand later. The complexity of your RAG system should match the complexity of your data, not exceed it.

What's the difference between RAG and just putting documents in the context window?

If your entire knowledge base fits in the model's context window (say, under 100,000 tokens), you could skip RAG and just include everything in the prompt. This is called "stuffing the context." It works for small knowledge bases and is much simpler to implement. RAG becomes necessary when your data exceeds the context window, when you need to search across many documents efficiently, or when you want to control costs (sending 200K tokens per query is expensive). For knowledge bases under 50 pages, try context stuffing first.

How do I handle tables and images in RAG?

Tables are one of the hardest challenges in RAG. Standard text chunking destroys table structure. Options: convert tables to natural language descriptions during preprocessing, use specialized table extraction tools (like Docling or Unstructured.io), or store tables as separate chunks with metadata indicating they're tabular data and include the full table even if it exceeds your normal chunk size. For images, use multimodal embedding models that can embed both text and images, or generate text descriptions of images during preprocessing and embed those descriptions.

How do I know if my RAG system is good enough for production?

Define "good enough" before you start. For most internal tools, 80% answer accuracy with graceful failure on the remaining 20% ("I don't have enough information") is acceptable. For customer-facing applications, aim for 90%+ accuracy. Key benchmarks: retrieval hit rate above 85% at K=5, faithfulness score above 90% (model only uses retrieved context), and user satisfaction above 4/5 in testing. If you're below these thresholds, fix your weakest component (usually chunking or retrieval) before adding complexity.

About the Author

Rome Thorndike is the founder of the Prompt Engineer Collective, a community of over 1,300 prompt engineering professionals, and author of The AI News Digest, a weekly newsletter with 2,700+ subscribers. Rome brings hands-on AI/ML experience from Microsoft, where he worked with Dynamics and Azure AI/ML solutions, and later led sales at Datajoy (acquired by Databricks).
