Every week, someone asks me: "Should I fine-tune a model or use RAG?" And every week, my answer is the same: "It depends on what problem you're solving."
That's not a cop-out. Fine-tuning and RAG solve fundamentally different problems. Using the wrong one wastes time and money. Using the right one can be the difference between an AI system that works and one that doesn't.
This guide gives you a clear decision framework. No hand-waving. Specific criteria, cost comparisons, and real examples.
The 30-Second Distinction
Fine-tuning changes how the model behaves. It modifies the model's weights so it responds differently to inputs. Think of it as training a new employee on your company's specific way of doing things.
RAG changes what the model knows. It gives the model access to external information at query time. Think of it as giving an employee a reference manual they can look up before answering questions.
Different problems. Different solutions. The confusion arises because both can improve AI output quality, but they do it through completely different mechanisms.
When to Use RAG
RAG is the right choice more often than fine-tuning. If you're not sure which to use, start with RAG. Here's when it's the clear winner.
Your data changes frequently
RAG pulls information from an external knowledge base at query time. Update the knowledge base, and the model immediately has access to the new information. No retraining required.
If your product documentation changes weekly, your knowledge base grows daily, or your data has a shelf life, RAG is the only practical option. Fine-tuning a model every time your data changes is prohibitively expensive and slow.
You need source attribution
RAG can tell users where its answers came from. "According to section 3.2 of your employee handbook..." This is critical for compliance-sensitive applications in healthcare, legal, and finance. Fine-tuned models can't point to their sources because the information is baked into the weights.
You need factual accuracy on specific documents
When users ask questions about specific documents, policies, or datasets, RAG retrieves the actual text and uses it to generate answers. Fine-tuning teaches the model patterns, not facts. A fine-tuned model might learn to sound like your documentation, but it can still hallucinate specific details. RAG grounds the response in the actual source material.
You want to start fast and iterate
A basic RAG system can be up and running in a day. Load your documents into a vector database, connect it to an LLM, and you have a working system. Iterate on chunking strategies, retrieval methods, and prompts without touching the model itself.
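The shape of that day-one system is small. Here is a minimal sketch of the pipeline, with a toy word-overlap scorer standing in for a real embedding model and vector database, and the final LLM call left as a prompt you would send to any API (all names and documents below are invented for illustration):

```python
# Minimal RAG pipeline sketch: retrieve the best-matching chunks,
# then build a grounded prompt for the LLM. The scoring function is a
# toy word-overlap measure standing in for vector similarity.

def score(query: str, chunk: str) -> int:
    """Count shared words between query and chunk (toy similarity)."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the top-k chunks by overlap score."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Ground the model's answer in the retrieved text."""
    joined = "\n---\n".join(context)
    return f"Answer using ONLY this context:\n{joined}\n\nQuestion: {query}"

docs = [
    "Refunds are processed within 5 business days of approval.",
    "Our office is closed on public holidays.",
    "Password resets require access to your registered email.",
]
query = "How long do refunds take?"
prompt = build_prompt(query, retrieve(query, docs, k=1))
print(prompt)
```

Swapping the scorer for real embeddings and the print for an API call is the whole upgrade path, which is why iteration on chunking and retrieval is cheap: the model itself never changes.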
What RAG costs
Setup cost: $0-$500. Mostly engineering time. The tools are free or cheap.
Vector database: $0-$100/month for most applications. Free tiers cover development. Production costs scale with data volume.
Embedding generation: $0.01-$0.10 per 1,000 documents. One-time cost per document, re-run only when content changes.
Per-query cost: Standard LLM API costs plus a small retrieval overhead. Roughly $0.005-$0.05 per query depending on model and context size.
Total monthly cost for a typical application: $50-$500/month at moderate usage.
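The figures above can be turned into a quick back-of-the-envelope estimate. This sketch just does the arithmetic with illustrative numbers drawn from the ranges in this section, not quotes from any provider:

```python
# Rough RAG monthly cost model using the illustrative ranges above.
def rag_monthly_cost(queries_per_day: float,
                     cost_per_query: float = 0.01,   # LLM call + retrieval overhead
                     vector_db_monthly: float = 50.0) -> float:
    """Estimate monthly spend: per-query API costs plus vector DB hosting."""
    return queries_per_day * 30 * cost_per_query + vector_db_monthly

# 500 queries/day at $0.01 each is $150/month in API costs, plus $50 hosting.
print(f"${rag_monthly_cost(500):.2f}")  # → $200.00
```

At that volume the spend lands comfortably inside the $50-$500/month range above; per-query cost dominates as volume grows, which is the lever the fine-tuning section comes back to.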
When to Use Fine-Tuning
Fine-tuning is the right choice when you need to change the model's behavior, style, or capabilities. Not what it knows, but how it acts.
You need a specific output style or format
If every response needs to follow a precise format, match a specific tone, or use domain-specific terminology consistently, fine-tuning bakes this into the model. Instead of stuffing style instructions into every prompt (which consumes tokens and sometimes gets ignored), the model just does it by default.
Example: a legal document generator that always uses proper legal citation format, or a customer support bot that must match your brand voice exactly across thousands of different questions.
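Teaching style like this comes down to training pairs that demonstrate the exact output you want. A sketch of building such records in the chat-style JSONL format that OpenAI's fine-tuning API expects (the system prompt and the citation example are invented for illustration; the JSON structure is the documented format):

```python
import json

# Each training record pairs an input with the exact output style we want
# the model to internalize. Structure follows OpenAI's chat fine-tuning JSONL:
# one JSON object per line, each with a "messages" list.
def make_record(user_text: str, ideal_output: str) -> str:
    record = {
        "messages": [
            {"role": "system", "content": "You are a legal citation assistant."},
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": ideal_output},  # demonstrates target format
        ]
    }
    return json.dumps(record)

# Hypothetical training pair: the assistant always answers in citation format.
line = make_record(
    "Cite the 1954 school desegregation case.",
    "Brown v. Board of Education, 347 U.S. 483 (1954).",
)
print(line)
```

Hundreds of lines like this, all demonstrating the same format discipline, are what "baking it into the model" means in practice.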
You need better performance on a specific task
Fine-tuning on high-quality examples of a specific task can dramatically improve performance on that task. If you have a classification problem, a specialized extraction task, or any narrow, well-defined task where you can provide hundreds or thousands of examples, fine-tuning will outperform prompting.
The threshold is roughly: if few-shot prompting with 5-10 examples in the prompt gets you to 80% quality, fine-tuning on 500+ examples can push you to 95%+.
You need to reduce latency or cost per query
A fine-tuned smaller model can match a larger model's performance on specific tasks at a fraction of the cost and latency. Fine-tuning GPT-4o Mini or Claude Haiku on your task might give you GPT-4o-level quality at one-tenth the cost per query. For high-volume applications, this adds up fast.
Fine-tuning also means shorter prompts. You don't need long system prompts, examples, or instructions because the model already knows what to do. Shorter prompts mean fewer input tokens, which means lower cost and faster responses.
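The savings compound because both the per-token price and the prompt length drop at once. A back-of-the-envelope comparison makes the point; all prices and token counts here are made-up round numbers for illustration, not any provider's price sheet:

```python
# Compare a large model with long prompts against a fine-tuned small model
# with short prompts. Numbers are illustrative, not real pricing.
def daily_cost(queries: int, prompt_tokens: int, output_tokens: int,
               in_price: float, out_price: float) -> float:
    """Cost per day; prices are in dollars per 1M tokens."""
    per_query = (prompt_tokens * in_price + output_tokens * out_price) / 1_000_000
    return queries * per_query

# Large model: long system prompt stuffed with instructions and examples.
big = daily_cost(50_000, prompt_tokens=2_000, output_tokens=300,
                 in_price=2.50, out_price=10.00)
# Fine-tuned small model: behavior is baked in, so the prompt is short.
small = daily_cost(50_000, prompt_tokens=300, output_tokens=300,
                   in_price=0.30, out_price=1.20)
print(f"large: ${big:.2f}/day, fine-tuned small: ${small:.2f}/day")
```

With these assumed numbers the gap is roughly 18x per day, which is why the economics only start to matter at real volume.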
You need the model to learn new patterns
If your use case requires understanding domain-specific concepts, jargon, or reasoning patterns that general models handle poorly, fine-tuning can teach these patterns. Medical coding, legal analysis, financial modeling. These domains have specialized knowledge that benefits from training, not just retrieval.
What fine-tuning costs
Data preparation: 10-40 hours of engineering time. The most overlooked cost. You need high-quality, formatted training data. Garbage in, garbage out.
Training cost (OpenAI): $3-$25 per training run for GPT-4o Mini, $25-$200 for GPT-4o. Depends on dataset size and epochs.
Training cost (open source on cloud): $50-$500 per training run on AWS/GCP GPU instances. More control, more complexity.
Evaluation and iteration: Plan for 3-10 training runs to get it right. Multiply the training cost accordingly.
Per-query cost: Same or lower than the base model. Fine-tuned models don't cost more to run. They often cost less because you need shorter prompts.
Total cost for a fine-tuning project: $500-$5,000 for most applications, including engineering time. Ongoing costs are the same as regular API usage.
When to Use Both
Here's the part most guides skip. Fine-tuning and RAG aren't mutually exclusive. Some of the best production AI systems use both.
Pattern 1: Fine-tuned model + RAG for knowledge
Fine-tune the model on your output style and task-specific behavior, then use RAG to provide it with current, domain-specific information at query time. The fine-tuning handles the "how" (format, tone, reasoning patterns) while RAG handles the "what" (facts, data, documents).
This is particularly effective for customer support systems. Fine-tune on your brand voice and resolution patterns. RAG retrieves the specific product documentation and account information needed to answer each query.
Pattern 2: RAG with a fine-tuned retriever
Use a fine-tuned embedding model for the retrieval step of RAG. Standard embedding models work well for general-purpose retrieval, but fine-tuning them on your domain's terminology and query patterns can improve retrieval accuracy by 10-30%. Better retrieval means better final answers.
Pattern 3: Fine-tuned classifier + RAG pipeline
Use a fine-tuned model to classify incoming queries (intent detection, topic classification), then route each query to the appropriate RAG pipeline. Different document collections, different retrieval strategies, different prompts. The classifier is cheap and fast. The RAG pipeline handles the heavy lifting.
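A sketch of that routing layer, with a keyword stub standing in for the fine-tuned classifier (the pipeline names, collections, and keywords are all invented for illustration):

```python
# Route queries to different RAG pipelines based on a classification step.
# In production the classifier would be a fine-tuned model; here a keyword
# stub stands in so the routing logic is visible.

PIPELINES = {
    "billing": {"collection": "billing_docs", "top_k": 3},
    "technical": {"collection": "product_docs", "top_k": 5},
    "general": {"collection": "faq", "top_k": 2},
}

def classify(query: str) -> str:
    """Stub intent classifier; a fine-tuned model would replace this."""
    q = query.lower()
    if any(w in q for w in ("invoice", "refund", "charge")):
        return "billing"
    if any(w in q for w in ("error", "install", "api")):
        return "technical"
    return "general"

def route(query: str) -> dict:
    """Pick the retrieval configuration for the query's detected intent."""
    intent = classify(query)
    return {"intent": intent, **PIPELINES[intent]}

print(route("Why was my card charged twice?"))
```

The design point is the separation: the classifier stays tiny and fast, while each pipeline can tune its own collection, retrieval depth, and prompt independently.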
The Decision Framework
Work through this checklist when evaluating your next AI feature:
1. Can you solve it with prompting alone? Try zero-shot and few-shot prompting first. If you can get to 90%+ quality with good prompts, you might not need either fine-tuning or RAG. Don't over-engineer.
2. Is the problem about WHAT the model knows or HOW it behaves? "What" problems (facts, data, documents) point to RAG. "How" problems (style, format, specialized reasoning) point to fine-tuning.
3. How often does your data change? Changes weekly or more often: RAG. Changes quarterly or less: fine-tuning is viable.
4. Do you need source attribution? Yes: RAG. Fine-tuned models can't cite their sources.
5. Do you have high-quality training data? 500+ examples of ideal input-output pairs: fine-tuning is feasible. Fewer than that: stick with prompting and RAG.
6. Is per-query cost critical? High volume (10K+ queries/day) where cost matters: consider fine-tuning a smaller model. Low-medium volume: the operational simplicity of RAG is worth the per-query premium.
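The six questions above compress into a small decision helper. The thresholds mirror the numbers in the checklist; treat the output as a heuristic starting point, not a verdict:

```python
# Encode the decision checklist as a heuristic function.
# Thresholds mirror the numbers in the checklist above.

def recommend(prompting_quality: float, problem: str, data_changes_weekly: bool,
              needs_citations: bool, training_examples: int,
              queries_per_day: int) -> str:
    if prompting_quality >= 0.90:
        return "prompting only"                      # step 1: don't over-engineer
    if needs_citations or data_changes_weekly:
        return "RAG"                                 # steps 3-4: volatile data / attribution
    if problem == "behavior":                        # step 2: style, format, reasoning
        if training_examples >= 500:
            return "fine-tuning"                     # step 5: enough quality data
        return "prompting + RAG"                     # too little data to fine-tune
    if queries_per_day >= 10_000 and training_examples >= 500:
        return "fine-tune a smaller model"           # step 6: cost-critical volume
    return "RAG"                                     # default: simplest to operate

print(recommend(0.7, "knowledge", True, False, 0, 1_000))  # → RAG
```

Real decisions have exceptions the function can't see, but writing the logic down forces you to answer each question explicitly instead of defaulting to whichever approach is fashionable.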
Common Mistakes
Mistake 1: Using RAG when you need behavior change
RAG can't fix a model that writes in the wrong tone, uses the wrong format, or reasons incorrectly about your domain. If you're stuffing style guidelines into your RAG context and the model still doesn't follow them consistently, you have a fine-tuning problem, not a retrieval problem.
Mistake 2: Fine-tuning on facts that change
If you fine-tune a model on your product's pricing and then change the pricing, the model will confidently state the old prices. Fine-tuning is for stable patterns, not volatile data. If the information might change, it should come through RAG.
Mistake 3: Skipping the prompt-only baseline
Both RAG and fine-tuning add complexity. Before committing to either, spend a day seeing how far you can get with careful prompt engineering alone. You might be surprised. A well-crafted system prompt with good examples can often reach 85-90% of the quality you'd get from RAG or fine-tuning, at zero additional infrastructure cost.
Mistake 4: Poor RAG retrieval quality
"RAG didn't work for us" usually means "our retrieval was bad." If you're pulling irrelevant chunks, the model produces hallucinated answers or generic responses. Before abandoning RAG, check: Are your chunks the right size? Is your embedding model appropriate? Are you using re-ranking? Is your query preprocessing adequate? Fix retrieval before blaming the approach.
Mistake 5: Not enough training data for fine-tuning
Fine-tuning with 50 examples rarely works well. You need at least 200-500 high-quality examples for meaningful improvement, and 1,000+ for strong results. If you can't gather that much quality data, focus on prompting and RAG instead.
Real-World Examples
Example 1: Internal knowledge base Q&A
Approach: RAG. The company wiki has 5,000 pages that change daily. Employees ask questions about policies, processes, and project status. RAG indexes the wiki, retrieves relevant pages, and generates answers with source links. Fine-tuning would be useless here because the information changes too frequently.
Example 2: Medical report summarization
Approach: Fine-tuning + RAG. Fine-tune on 2,000 examples of medical reports paired with ideal summaries to nail the output format and terminology. Use RAG to pull in relevant clinical guidelines and reference ranges at query time. The fine-tuning handles style; RAG handles knowledge.
Example 3: Customer email classification
Approach: Fine-tuning. 15 categories, 10,000 labeled examples, stable category definitions. Pure classification task with well-defined patterns. Fine-tuning a small model gets 97% accuracy at pennies per classification. RAG adds nothing here because the task is about pattern recognition, not knowledge retrieval.
Example 4: Legal contract review
Approach: RAG. Attorneys need AI to check contracts against their firm's clause library and flag deviations. The clause library is the knowledge base. RAG retrieves relevant standard clauses and compares them against the contract under review. Source attribution is critical for attorney trust. Fine-tuning can't provide this.
For a deeper dive into building RAG systems, see our RAG architecture guide. For more on prompt optimization techniques that can delay or eliminate the need for either approach, check the prompt engineering best practices guide.