RAG vs Fine-Tuning: Choosing the Right Approach for Your Business
The question is wrong
“Should we use RAG or fine-tuning?” is the question we hear most from CTOs evaluating how to integrate LLMs into their processes. It is a poorly framed question, because it assumes they are mutually exclusive alternatives. They are not. They are complementary techniques that solve different problems. But since most enterprise use cases fall more naturally into one camp, the comparison has practical value.
Here is what we have measured in real projects, not what academic papers claim.
RAG: when the knowledge changes
Retrieval-Augmented Generation (RAG) consists of retrieving relevant information from a knowledge base and passing it to the LLM alongside the user’s question. The model does not “know” the answer; it finds it in the documents you provide.
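The mechanics fit in a few lines. This is a minimal sketch: the keyword-overlap “retriever” stands in for a real vector search (pgvector, Qdrant), and the document contents and prompt template are illustrative, not a production setup.

```python
# Minimal RAG sketch: retrieve first, then ground the prompt in what was
# retrieved. The toy word-overlap scoring stands in for semantic search.

KNOWLEDGE_BASE = [
    "To cancel your subscription, go to Settings > Billing and click Cancel.",
    "Invoices are issued on the first business day of each month.",
    "Refunds are processed within 14 days of a cancellation request.",
]

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the question (toy scoring)."""
    q_words = set(question.lower().split())
    scored = sorted(docs, key=lambda d: -len(q_words & set(d.lower().split())))
    return scored[:k]

def build_prompt(question: str, docs: list[str]) -> str:
    """The model must answer from the retrieved context, not from memory."""
    context = "\n".join(f"- {d}" for d in docs)
    return (
        "Answer using ONLY the context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

docs = retrieve("How do I cancel my subscription?", KNOWLEDGE_BASE)
prompt = build_prompt("How do I cancel my subscription?", docs)
```

The key property: swapping the documents changes the answers without touching the model.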
When it works well:
- Knowledge base that changes frequently (product documentation, internal policies, regulations, product catalog)
- Answers that must cite specific sources (“according to article 47 of the regulation…”)
- Proprietary knowledge that does not exist in the model’s training data
- Need to strictly control what information is available to the model
Real numbers from our RAG deployments:
| Metric | Typical value |
|---|---|
| End-to-end latency | 1.2 - 3.5 seconds |
| Cost per query | EUR 0.003 - 0.015 |
| Accuracy (correct answer) | 78 - 89% |
| Accuracy with reranking | 84 - 93% |
| Knowledge base update time | Minutes to hours |
RAG latency has two components: the vector store search (50-200ms with pgvector or Qdrant) and LLM generation (1-3 seconds). Cost per query depends on the generation model: GPT-4o mini costs ~EUR 0.003/query, Claude 3.5 Sonnet ~EUR 0.008/query, GPT-4o ~EUR 0.015/query.
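The per-query cost is easy to estimate once you know the token counts. The per-million-token prices below are illustrative assumptions for a small model, not current list prices; plug in your provider's actual rates.

```python
# Back-of-envelope cost per RAG query. In RAG the retrieved context
# dominates input tokens, which is why prompt size drives cost.
# Prices are illustrative assumptions, not current list prices.

def cost_per_query(input_tokens: int, output_tokens: int,
                   eur_per_1m_in: float, eur_per_1m_out: float) -> float:
    return (input_tokens * eur_per_1m_in
            + output_tokens * eur_per_1m_out) / 1_000_000

# A typical RAG query: ~3,000 input tokens (question + retrieved chunks),
# ~500 output tokens, with assumed small-model pricing.
cost = cost_per_query(3_000, 500, eur_per_1m_in=0.50, eur_per_1m_out=1.50)
# Lands near the low end of the EUR 0.003 - 0.015 range above.
```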
Where RAG fails:
RAG quality depends entirely on retrieval quality. If the system retrieves irrelevant documents, the model generates irrelevant answers with full confidence. The classic problem is the “retrieval gap”: the user’s question and the document text use different vocabulary for the same concept. “How do I cancel my subscription?” and “Service termination procedure” are semantically equivalent but lexically distinct.
Solutions that work: quality embeddings (OpenAI’s text-embedding-3-large or Cohere’s models give us the best results), intelligent chunking (splitting documents by logical sections rather than every 500 tokens), metadata filtering (filtering by date, document type, department before semantic search), and reranking (using a reranking model like Cohere Rerank or a cross-encoder to reorder retrieval results).
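Of those solutions, intelligent chunking is the cheapest to implement. A sketch of section-aware chunking, splitting on markdown headings instead of fixed 500-token windows so each chunk covers one coherent topic; a real pipeline would also cap chunk length and attach metadata (date, document type, department) for filtering:

```python
# Section-aware chunking: split on markdown headings so each chunk is one
# logical section, and keep the section title as metadata for filtering.
import re

def chunk_by_sections(document: str) -> list[dict]:
    chunks = []
    current_title, current_lines = "intro", []
    for line in document.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:  # a heading closes the previous section
            if current_lines:
                chunks.append({"section": current_title,
                               "text": "\n".join(current_lines).strip()})
            current_title, current_lines = m.group(2), []
        else:
            current_lines.append(line)
    if current_lines:  # flush the last section
        chunks.append({"section": current_title,
                       "text": "\n".join(current_lines).strip()})
    return [c for c in chunks if c["text"]]

doc = "# Cancellation\nHow to cancel...\n# Refunds\nRefund policy..."
chunks = chunk_by_sections(doc)  # two chunks, one per section
```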
A well-built RAG pipeline has 5-7 components. A RAG pipeline that works in production has 12-15, counting fallbacks, caches, monitoring, and user feedback mechanisms.
Fine-tuning: when you need different behavior
Fine-tuning consists of partially retraining a model with specific data so it behaves in a particular way. You do not provide information at query time; you change how the model generates responses.
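Because fine-tuning changes behavior rather than knowledge, the training data is (input, desired output) pairs, not documents. A sketch of serializing examples to the JSONL chat format used by OpenAI's fine-tuning API; the ticket texts and labels are illustrative:

```python
# Each training example is a full conversation: system instruction,
# user input, and the exact assistant output we want the model to learn.
import json

examples = [
    {"ticket": "The app crashes when I upload a PDF", "label": "bug"},
    {"ticket": "Can you add dark mode?", "label": "feature_request"},
]

lines = []
for ex in examples:
    record = {"messages": [
        {"role": "system", "content": "Classify the support ticket."},
        {"role": "user", "content": ex["ticket"]},
        {"role": "assistant", "content": ex["label"]},
    ]}
    lines.append(json.dumps(record, ensure_ascii=False))

jsonl = "\n".join(lines)  # one JSON object per line, ready to upload
```

Dataset quality matters far more than dataset size here: 500 clean, consistent examples beat 5,000 noisy ones.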
When it works well:
- You need the model to follow a specific style, format, or tone consistently
- Specialized classification tasks where the base model underperforms
- Structured data extraction from texts with an industry-specific format
- Technical vocabulary or jargon the base model handles poorly
Real numbers from our fine-tuned deployments:
| Metric | Typical value |
|---|---|
| Latency | 0.5 - 2.0 seconds |
| Training cost | EUR 50 - 500 per run |
| Cost per query | EUR 0.002 - 0.010 |
| Task-specific accuracy | 88 - 96% |
| Update time | Days (retraining) |
| Minimum dataset | 200 - 1,000 examples |
Fine-tuning has lower latency than RAG because there is no retrieval step. Cost per query also tends to be lower because you can fine-tune a smaller model (fine-tuned GPT-4o mini performs surprisingly well on specific tasks). But the cost of preparing the dataset and training is significant.
Where fine-tuning fails:
Fine-tuning does not reliably add new knowledge. If you train a model on 2023 data and ask about 2025 regulations, it will fabricate the answer. The model internalizes patterns, not verifiable facts. This is critical: if your use case requires answers based on changing data, fine-tuning alone is not the solution.
The other problem is catastrophic forgetting: training on specific data can cause the model to lose general capabilities. A model fine-tuned for invoice classification may become worse at general conversation. This is mitigated with techniques like LoRA (Low-Rank Adaptation), which only modifies a subset of parameters, but it is a real risk.
The practical decision: a decision tree
After implementing both techniques across multiple clients, this is the decision tree we use:
Question 1: Does the information needed to answer change more than once a month?
- Yes -> RAG (or RAG + fine-tuning)
- No -> Continue to question 2
Question 2: Is the primary use case answering questions about specific documents?
- Yes -> RAG
- No -> Continue to question 3
Question 3: Do you need the model to follow a very specific format, style, or behavior?
- Yes -> Fine-tuning (possibly + RAG)
- No -> Continue to question 4
Question 4: Do you have at least 500 quality examples of the task?
- Yes -> Fine-tuning is probably worth it
- No -> Use the base model with a well-designed system prompt
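The four questions above fit naturally in code. A sketch, with the article's thresholds as defaults; the function name and signature are ours, not a standard API:

```python
# The decision tree above, as a function. Questions are evaluated in
# order; the first "yes" decides.

def choose_approach(knowledge_changes_monthly: bool,
                    qa_over_documents: bool,
                    needs_specific_format: bool,
                    quality_examples: int) -> str:
    if knowledge_changes_monthly:          # Q1: changing knowledge
        return "RAG (or RAG + fine-tuning)"
    if qa_over_documents:                  # Q2: Q&A over documents
        return "RAG"
    if needs_specific_format:              # Q3: format/style/behavior
        return "fine-tuning (possibly + RAG)"
    if quality_examples >= 500:            # Q4: enough quality examples
        return "fine-tuning"
    return "base model + well-designed system prompt"
```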
In practice, 70% of the enterprise use cases we see are better served by RAG, 15% by fine-tuning, and 15% by the combination.
RAG + fine-tuning: when both together
The combination is powerful and underused. The typical scenario: you have a knowledge base (RAG) but need responses that follow a very specific format or style (fine-tuning).
Real example: a law firm that needs to search case law (RAG over a database of court decisions) but generate responses in the formal Spanish legal style with opinion-letter structure (fine-tuning the model to generate in that format).
RAG provides the facts. Fine-tuning provides the form. The result is consistently better than either alone.
Another pattern: fine-tuning the embedding model (not the generative LLM) with your domain data. If your industry has technical vocabulary that generic embeddings do not capture well, a fine-tuned embedding model improves retrieval quality. We have measured 12-18% improvements in retrieval accuracy from fine-tuning embeddings with 2,000-5,000 domain-specific (query, relevant document) pairs.
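Claiming a retrieval improvement requires measuring it, which means a labeled evaluation set of (query, relevant document) pairs and a metric like recall@k. A sketch; `rank_fn` is a stand-in for embedding plus nearest-neighbour search with whichever model (base or fine-tuned) you are comparing, and the toy data is illustrative:

```python
# recall@k: fraction of queries whose labeled relevant document appears
# in the top-k retrieval results. Compare it before and after
# fine-tuning the embedding model to quantify the gain.

def recall_at_k(queries, relevant_doc_ids, rank_fn, k=5):
    hits = 0
    for query, doc_id in zip(queries, relevant_doc_ids):
        if doc_id in rank_fn(query)[:k]:
            hits += 1
    return hits / len(queries)

# Toy ranker: retrieval succeeds for the first query, misses the second.
ranked = {"cancel subscription": ["doc_cancel", "doc_refund"],
          "invoice date": ["doc_cancel", "doc_refund"]}
score = recall_at_k(["cancel subscription", "invoice date"],
                    ["doc_cancel", "doc_invoice"],
                    lambda q: ranked[q], k=2)  # 0.5
```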
Total cost of ownership
The cost comparison is not just “cost per query”:
| Concept | RAG | Fine-tuning |
|---|---|---|
| Initial setup | 2-4 weeks | 1-3 weeks |
| Infrastructure | Vector DB + LLM API | LLM API (or self-hosted) |
| Data updates | Continuous, automatable | Periodic retraining |
| Monthly maintenance | Monitoring, re-indexing | Monitoring, drift evaluation |
| Scaling | Linear with queries | Linear with queries |
| Cost/query (mid-scale) | EUR 0.005 - 0.015 | EUR 0.002 - 0.010 |
The hidden cost of RAG is knowledge base maintenance: updating documents, re-indexing, managing versions, cleaning obsolete content. If nobody maintains the base, quality degrades progressively.
The hidden cost of fine-tuning is continuous evaluation. Base models update (GPT-4 to GPT-4o, Claude 3 to Claude 3.5), and each update may require retraining the fine-tune. Additionally, you need an evaluation dataset to detect when performance degrades.
What actually matters
The choice between RAG and fine-tuning is an architecture decision, not a technology decision. And like every architecture decision, it depends on context: what data you have, how often it changes, what answer quality you need, and how much you can invest in ongoing maintenance.
What we have learned is that most enterprise AI projects fail not by choosing wrong between RAG and fine-tuning, but by underestimating the data engineering work both require. A RAG with poorly indexed data or a fine-tune with a low-quality dataset will produce mediocre results regardless of technique. Input data quality is the strongest predictor of success. Everything else is implementation.
For a deeper look at how to take these models to production with the right metrics, see our article on MLOps: from notebook to production pipeline. And if your use case involves document classification, our practical guide on NLP for document classification covers fine-tuning step by step.
About the author
abemon engineering
Engineering team
Multidisciplinary engineering, data and AI team headquartered in the Canary Islands. We build, deploy and operate custom software solutions for companies at any scale.
