RAG Systems: A Practical Guide for Enterprises
Retrieval-Augmented Generation (RAG) has established itself as one of the most important architectures for enterprise use of Large Language Models. Instead of expensively fine-tuning a model on company-specific data, relevant information is retrieved at runtime and passed to the model as context. The result: up-to-date, source-grounded answers built on your own enterprise data, without retraining.
Over the past two years at Intellineers, we have implemented RAG systems across a range of industries — from legal knowledge bases and technical documentation assistants to internal HR chatbots. This guide summarizes the key lessons from that practice.
Why RAG?
The advantages over pure fine-tuning are significant:
- Freshness: New documents are immediately available without retraining. With fine-tuning, an update cycle takes days to weeks.
- Traceability: Sources can be cited directly — a critical factor for compliance and user trust. Hallucinations become visible and verifiable.
- Cost efficiency: No expensive GPU training on proprietary data required. A RAG system for 10,000 documents typically costs €15,000–€40,000 in initial implementation, while a comparable fine-tuning project quickly reaches €50,000–€100,000.
- Privacy: Sensitive data can remain in your own infrastructure while only the query goes to the LLM. More on security in our post on LLM security in the enterprise.
Core Components of a RAG System
1. Document Processing Pipeline
The first and often underestimated step is preparing your documents. The quality of this pipeline determines 60–70% of the overall system performance — a perfect retrieval and generation model cannot compensate for poorly prepared data.
Chunking — The Strategy Matters:
The choice of chunking strategy has a massive impact on answer quality. Here is a comparison of the most common approaches:
| Strategy | Chunk Size | Advantages | Disadvantages | Best For |
|---|---|---|---|---|
| Fixed-Size | 256–512 tokens | Simple to implement, predictable size | Breaks semantic units | Homogeneous text documents |
| Recursive Character | 200–500 tokens | Respects paragraph boundaries, good default | Varies significantly in size | General text documents |
| Semantic Chunking | Variable | Preserves semantic units completely | More compute-intensive, complex | Technical documentation |
| Document Structure | Variable | Leverages headings and sections | Requires well-structured documents | Manuals, policies, legal texts |
| Sliding Window | 256–512 tokens with 50–100 overlap | Prevents information loss at boundaries | More chunks, higher costs | Narrative texts, reports |
Our recommendation: Start with Recursive Character Splitting at 400 tokens and 50-token overlap. Once you have an evaluation pipeline in place, test document-structure-based chunking for your specific document types. In our projects, switching from fixed-size to semantic chunking typically brought 15–25% improvement in answer relevance.
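A character-based approximation of recursive splitting fits in a few lines of pure Python. This is a sketch, not a production splitter: it assumes roughly 4 characters per token (so a 1,600-character limit approximates the recommended 400 tokens) and omits overlap, which libraries such as LangChain's RecursiveCharacterTextSplitter handle for you.

```python
def recursive_split(text, chunk_size=1600, separators=("\n\n", "\n", " ")):
    """Greedily pack pieces split on the coarsest available separator,
    then recurse with finer separators on any piece still too large."""
    if len(text) <= chunk_size:
        return [text]
    for depth, sep in enumerate(separators):
        if sep in text:
            chunks, buf = [], ""
            for part in text.split(sep):
                cand = f"{buf}{sep}{part}" if buf else part
                if len(cand) <= chunk_size:
                    buf = cand
                else:
                    if buf:
                        chunks.append(buf)
                    buf = part
            if buf:
                chunks.append(buf)
            out = []
            for c in chunks:
                out.extend(recursive_split(c, chunk_size, separators[depth + 1:]))
            return out
    # No separator applies: fall back to a hard character cut.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

The separator hierarchy (paragraphs, then lines, then words) is what makes the splitter "respect paragraph boundaries" while still guaranteeing a maximum chunk size.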
Cleaning and Preprocessing: Remove formatting artifacts (headers, footers, page numbers in PDFs), normalize whitespace and special characters, detect and treat tables separately (tables as Markdown or structured text, not as flowing prose), and extract image content via OCR if relevant.
Metadata Extraction: Systematically extract: author, creation date, document type, department, product reference, version number. This metadata enables targeted filtering later — for example, “only documents from the last 12 months” or “only legal department documents.” In practice, good metadata filtering reduces retrieval latency by 40–60% and significantly improves relevance.
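As a minimal sketch of how this metadata enables pre-filtering before the expensive vector search (the field names and records below are hypothetical; real vector stores expose the same idea as payload or metadata filters):

```python
from datetime import date

# Hypothetical chunk records, as they might sit in a vector store payload.
chunks = [
    {"text": "Travel expense policy: flights above 500 EUR ...",
     "doc_type": "policy", "department": "legal", "created": date(2025, 3, 1)},
    {"text": "Onboarding guide for new hires ...",
     "doc_type": "manual", "department": "hr", "created": date(2021, 6, 15)},
]

def metadata_filter(chunks, doc_type=None, department=None, not_older_than=None):
    """Narrow the candidate set before the (expensive) vector search."""
    out = chunks
    if doc_type:
        out = [c for c in out if c["doc_type"] == doc_type]
    if department:
        out = [c for c in out if c["department"] == department]
    if not_older_than:
        out = [c for c in out if c["created"] >= not_older_than]
    return out

recent_policies = metadata_filter(chunks, doc_type="policy",
                                  not_older_than=date(2024, 1, 1))
```

Because the filter runs before similarity scoring, the vector index only searches the surviving candidates, which is where the latency savings come from.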
2. Embedding & Vector Store
The processed chunks are converted into high-dimensional vectors that capture semantic similarity. The choice of embedding model is critical.
Embedding Model Comparison (as of 2025/2026):
| Model | Dimensions | MTEB Score | Cost (per 1M tokens) | Hosting |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3,072 | 64.6 | ~$0.13 | Cloud (API) |
| OpenAI text-embedding-3-small | 1,536 | 62.3 | ~$0.02 | Cloud (API) |
| Cohere embed-v3 | 1,024 | 64.5 | ~$0.10 | Cloud (API) |
| BGE-M3 (BAAI) | 1,024 | 63.5 | Free | Self-hosted |
| E5-Mistral-7B | 4,096 | 66.6 | Free | Self-hosted (GPU required) |
| Jina Embeddings v3 | 1,024 | 65.5 | ~$0.02 | Cloud or self-hosted |
Practical tip: For getting started, we recommend text-embedding-3-small — the price-performance ratio is excellent. For multilingual or European-language texts, BGE-M3 delivers strong results as an open-source alternative and can be run locally, which is crucial for privacy-sensitive projects. If maximum quality is required and you have GPU capacity, E5-Mistral-7B is currently the best choice.
Vector Database Selection: The choice of vector database depends on your scaling needs. For getting started and up to 5 million vectors, Qdrant (open source, simple Docker deployment) is our standard recommendation. For teams already using PostgreSQL, pgvector is a pragmatic choice that requires no new infrastructure. For enterprise-scale with over 100 million vectors, Pinecone or Weaviate Cloud are sensible options.
Indexing: HNSW (Hierarchical Navigable Small World) is the de facto standard for Approximate Nearest Neighbor Search. Configure the ef_construction parameter to 128–256 for good recall values, and adjust ef_search at runtime for the desired trade-off between latency and accuracy.
3. Retrieval Strategy
Retrieval quality largely determines answer quality. This is where the wheat is separated from the chaff, and where good RAG systems are set apart from brilliant ones.
Level 1 — Basic: Semantic Search: Purely vector-based similarity search with Cosine Similarity or Dot Product. Works well for natural language queries but fails with exact terms (product numbers, technical terms with specific spelling).
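The baseline can be illustrated with a brute-force cosine-similarity search in pure Python. A vector database replaces the linear scan with an ANN index, but the scoring is the same:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def semantic_search(query_vec, doc_vecs, top_k=3):
    """Rank all documents by similarity to the query embedding."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:top_k]
```

With normalized embeddings (as most embedding APIs return), cosine similarity and dot product give the same ranking.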
Level 2 — Hybrid Search: Combination of semantic search with classic keyword search (BM25). In practice, a weighted mix of 70% semantic + 30% BM25 delivers the best results for most enterprise use cases. The exact weighting should be optimized through evaluation.
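One simple way to implement the weighted mix is min-max normalization of each score list before blending. This is a sketch of score fusion only; Reciprocal Rank Fusion is a common alternative that sidesteps normalization entirely by combining ranks instead of scores.

```python
def hybrid_scores(semantic, bm25, alpha=0.7):
    """Blend per-document score lists: alpha * semantic + (1 - alpha) * BM25.

    Both lists are min-max normalized first, since cosine similarities
    (roughly 0..1) and raw BM25 scores (unbounded) live on different scales.
    """
    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    s, b = norm(semantic), norm(bm25)
    return [alpha * si + (1 - alpha) * bi for si, bi in zip(s, b)]
```

The `alpha=0.7` default mirrors the 70/30 split above; tune it against your evaluation set rather than trusting the default.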
Level 3 — Advanced Retrieval Techniques:
- HyDE (Hypothetical Document Embeddings): The LLM first generates a hypothetical answer document based on the question. This document is then used for the vector search. Advantage: the search finds documents similar to the desired answer, not just to the question. In our tests, HyDE improved retrieval quality by 10–20% for complex queries but adds 1–2 seconds of latency and increases cost per query.
- Multi-Query Retrieval: The original question is rewritten by the LLM into 3–5 alternative formulations. Each variant is searched separately, and the results are deduplicated and merged. This significantly increases recall, especially for ambiguous or short queries.
- Parent Document Retrieval: Chunks are kept small for precise search (e.g., 200 tokens), but when a match occurs, the parent section (e.g., 1,000 tokens) is passed to the LLM. This gives you precise retrieval with sufficient context.
- Self-Query / Metadata Filtering: The LLM automatically extracts metadata filters from the user query (e.g., “Show me the travel expense policy from 2025” becomes the filter: document type = policy, year ≥ 2025, topic = travel expenses). This dramatically reduces the search space.
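Parent Document Retrieval, for instance, needs little more than a chunk-to-parent mapping. A minimal sketch with hypothetical records:

```python
# Hypothetical index: small, precisely searchable chunks point to parents.
parents = {
    "sec-1": "Full 1,000-token section on travel expenses ...",
    "sec-2": "Full 1,000-token section on remote work ...",
}
small_chunks = [
    {"text": "Flights above 500 EUR need approval.", "parent_id": "sec-1"},
    {"text": "Hotel costs are capped per city.", "parent_id": "sec-1"},
    {"text": "Remote work requires manager sign-off.", "parent_id": "sec-2"},
]

def parent_retrieve(matched_chunks):
    """Search hits land on small chunks, but the LLM receives the
    deduplicated parent sections as context."""
    seen, context = set(), []
    for chunk in matched_chunks:
        pid = chunk["parent_id"]
        if pid not in seen:
            seen.add(pid)
            context.append(parents[pid])
    return context

ctx = parent_retrieve(small_chunks[:2])  # two hits in the same section
```

Note the deduplication: two hits in the same section must not put the parent into the context twice.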
Level 4 — Re-Ranking: After initial retrieval, the top 20 results are re-sorted by a cross-encoder (e.g., Cohere Rerank, BGE-Reranker, or a fine-tuned cross-encoder). Cross-encoders are significantly more accurate than bi-encoders but too slow for initial search. Therefore: bi-encoder for recall, cross-encoder for precision. In practice, re-ranking improves the quality of the top 5 results by 15–30%.
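The two-stage pattern can be sketched as follows. The word-overlap scorers below are trivial stand-ins, not real models; in practice the first stage is your vector search and the second a cross-encoder API or model.

```python
def bi_encoder_score(query, doc):
    # Stand-in for a fast, approximate vector-similarity score (recall stage).
    return len(set(query.lower().split()) & set(doc.lower().split()))

def cross_encoder_score(query, doc):
    # Stand-in for a slow cross-encoder that reads query and doc jointly.
    return sum(doc.lower().split().count(w) for w in query.lower().split())

def retrieve_and_rerank(query, docs, recall_k=20, top_k=5):
    """Stage 1: cheap scorer over everything for recall.
    Stage 2: expensive scorer over the shortlist for precision."""
    shortlist = sorted(docs, key=lambda d: bi_encoder_score(query, d),
                       reverse=True)[:recall_k]
    return sorted(shortlist, key=lambda d: cross_encoder_score(query, d),
                  reverse=True)[:top_k]
```

The structure is the whole point: the expensive model only ever sees `recall_k` candidates, so latency stays bounded regardless of corpus size.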
4. Generation with Context
The LLM receives the retrieved documents as context and generates the answer. Prompt engineering and context management determine quality and cost here.
Prompt Engineering for RAG: A good RAG prompt includes a clear role instruction (“You are an assistant for internal policies at company X”), the explicit instruction to answer only based on the provided context, a fallback rule (“If the context does not answer the question, say so explicitly”), and formatting guidelines (bullet points, length, language). For deeper techniques, see our post on fine-tuning vs. prompt engineering.
Context Window Management: Even with 128k-token context windows, “pack everything in” is not a good strategy. Studies show that LLMs use information at the beginning and end of the context better than in the middle (the “Lost in the Middle” effect). Place the most relevant chunks at the beginning, limit the context to 3,000–5,000 tokens (5–10 chunks), and sort by relevance score in descending order.
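A minimal packing routine along these lines (word counts stand in for a real tokenizer here; use your model's tokenizer in production):

```python
def pack_context(chunks, budget_tokens=4000):
    """Sort chunks by relevance score (descending) and stop before
    the token budget overflows, so the best evidence comes first."""
    picked, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        cost = len(chunk["text"].split())  # crude token estimate
        if used + cost > budget_tokens:
            break
        picked.append(chunk)
        used += cost
    return picked
```

Because the list is sorted before packing, the highest-scoring chunks land at the start of the context, which is exactly where the "Lost in the Middle" findings say the model attends best.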
Citation Handling: Number the source chunks in the prompt ([1], [2], [3]) and instruct the LLM to provide references for its statements. This significantly increases verifiability and user trust.
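Pulling the prompt-engineering pieces together, a template might look like this sketch (the company name is hypothetical; adapt role, rules, and formatting to your use case):

```python
def build_rag_prompt(question, chunks):
    """Assemble a RAG prompt: role, grounding rule, fallback rule,
    and numbered sources for citation."""
    sources = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "You are an assistant for internal policies at Example Corp.\n"
        "Answer ONLY from the numbered sources below and cite them as [n].\n"
        "If the sources do not answer the question, say so explicitly.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "What is the hotel cost cap?",
    ["Hotel costs are capped at 150 EUR per night.",
     "Flights above 500 EUR need approval."],
)
```

The numbered-source convention is what lets you map `[1]`-style citations in the answer back to concrete documents for display.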
Evaluation: The RAGAS Framework
Without systematic evaluation, any optimization is flying blind. We recommend the RAGAS framework (Retrieval-Augmented Generation Assessment), which defines four core metrics:
Faithfulness: Measures whether the generated answer is supported by the retrieved context. A faithfulness score below 0.8 indicates hallucinations.
Answer Relevancy: Measures how relevant the answer is to the question asked. Target value: > 0.85.
Context Precision: Measures whether the retrieved documents are relevant to the question. If this value is low, your retrieval has a problem.
Context Recall: Measures whether all information needed for the answer was retrieved. A low value indicates that relevant documents are not being found.
Practical implementation: Create a ground-truth dataset with 50–100 question-answer pairs validated by your domain experts. Run RAGAS evaluation after every change to chunking, retrieval, or prompt. Automate this process as part of your CI/CD pipeline.
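Once the metrics are computed, the CI gate itself is a few lines. The faithfulness and answer-relevancy thresholds match the targets above; the context precision/recall thresholds are our own illustrative assumptions, not RAGAS defaults.

```python
THRESHOLDS = {
    "faithfulness": 0.80,
    "answer_relevancy": 0.85,
    "context_precision": 0.70,  # assumed target, tune for your corpus
    "context_recall": 0.70,     # assumed target, tune for your corpus
}

def gate(metrics, thresholds=THRESHOLDS):
    """Return the metrics that fail their threshold.
    An empty dict means the change is safe to deploy."""
    return {name: value for name, value in metrics.items()
            if name in thresholds and value < thresholds[name]}

failures = gate({"faithfulness": 0.91, "answer_relevancy": 0.78,
                 "context_precision": 0.82, "context_recall": 0.88})
```

In a CI/CD pipeline, a non-empty `failures` dict would fail the build and block the deployment.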
Common Pitfalls
In our projects, we regularly encounter the same challenges:
1. Chunks Too Large: When chunks exceed 1,000 tokens, irrelevant content is retrieved alongside the relevant passages and dilutes the answer. Answer quality drops and token costs rise. Solution: smaller chunks (300–500 tokens) combined with Parent Document Retrieval.
2. Missing Metadata: Without metadata, filtering by recency or document type is impossible. The LLM sees outdated policies alongside current ones — a recipe for wrong answers. Solution: Establish metadata extraction as a fixed part of the ingestion pipeline.
3. Ignoring Hybrid Search: Purely semantic search fails with exact terms like product names, article numbers, or acronyms. When a user searches for “SAP transaction VA01,” it must be found exactly — not a semantically similar document about order creation. Solution: Always build in BM25 as a fallback.
4. No Evaluation: Without a ground-truth dataset and systematic metrics, optimization is guesswork. Every change to the system can cause unintended degradation in other areas. Solution: RAGAS evaluation as part of the deployment process.
5. Missing Error Handling: What happens when retrieval finds no relevant documents? Without explicit handling, the LLM hallucinates an answer. Solution: Define a relevance threshold (e.g., Cosine Similarity > 0.7) and explicitly communicate when no matching information was found.
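The last guard above is cheap to implement. A sketch, assuming retrieval returns (text, cosine similarity) pairs:

```python
NO_ANSWER = "No matching information was found in the knowledge base."

def answer_or_refuse(hits, threshold=0.7):
    """Refuse explicitly instead of letting the LLM hallucinate
    when retrieval only produced weak matches.

    `hits` are (chunk_text, cosine_similarity) pairs.
    """
    relevant = [(text, score) for text, score in hits if score >= threshold]
    if not relevant:
        return NO_ANSWER
    # In a real system, the relevant chunks would now go to the LLM.
    return f"Answering from {len(relevant)} relevant chunk(s)."
```

The threshold itself (0.7 here, as in the text) should be calibrated per embedding model, since similarity distributions vary between models.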
Cost Estimate: RAG in Production
The costs of a RAG system consist of three components. Here is a realistic calculation for 1 million queries per month:
| Cost Item | Calculation | Monthly |
|---|---|---|
| Embedding (queries) | 1M queries × 100 tokens × $0.02/1M tokens | ~$2 |
| LLM inference (GPT-4o-mini) | 1M × 2,000 tokens input × $0.15/1M + 1M × 500 tokens output × $0.60/1M | ~$600 |
| LLM inference (GPT-4o) | 1M × 2,000 tokens input × $2.50/1M + 1M × 500 tokens output × $10/1M | ~$10,000 |
| Vector DB (Qdrant Cloud) | 5M vectors, 1,536 dimensions | ~$100–200 |
| Compute (API server) | 2 vCPU, 8 GB RAM | ~$50–100 |
| Total (with GPT-4o-mini) | | ~$750–900 |
| Total (with GPT-4o) | | ~$10,150–10,300 |
The difference between models is dramatic. For many use cases — particularly knowledge bases and FAQ systems — GPT-4o-mini delivers sufficient quality at a fraction of the cost. For complex reasoning tasks, legal analysis, or code generation, GPT-4o or Claude Sonnet is worthwhile. Additional strategies for cost optimization of LLM inference can be found in our separate post.
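The LLM line items in the table follow from a one-line formula, which is worth keeping in a script so the numbers can be re-derived whenever providers change their prices:

```python
def monthly_llm_cost(queries, in_tokens, out_tokens, in_price, out_price):
    """Monthly LLM inference cost in USD.
    Prices are USD per 1M tokens; token counts are per query."""
    return queries * (in_tokens * in_price + out_tokens * out_price) / 1_000_000

# Reproduces the table rows: 1M queries, 2,000 input / 500 output tokens each.
mini = monthly_llm_cost(1_000_000, 2_000, 500, 0.15, 0.60)   # GPT-4o-mini
full = monthly_llm_cost(1_000_000, 2_000, 500, 2.50, 10.00)  # GPT-4o
```

At these volumes the input tokens (retrieved context) and output tokens contribute roughly equally, which is why trimming the context budget is such an effective cost lever.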
Our Recommended Stack
Based on our project experience, we recommend for getting started:
- Ingestion: LlamaIndex with unstructured.io for PDF/DOCX parsing
- Chunking: Recursive Character Splitting, 400 tokens, 50-token overlap
- Embedding: OpenAI text-embedding-3-small (or BGE-M3 for privacy requirements)
- Vector Store: Qdrant (open source, Docker deployment, excellent hybrid search support)
- Retrieval: Hybrid Search (Semantic + BM25) with Cohere Rerank
- LLM: GPT-4o-mini for standard queries, GPT-4o or Claude Sonnet for complex cases
- Evaluation: RAGAS with 50+ ground-truth pairs
- Monitoring: LangSmith or Langfuse for tracing and quality control
Production Deployment Checklist
Before a RAG system goes live, verify:
- Ground-truth dataset with at least 50 validated question-answer pairs created
- RAGAS metrics meet target values (Faithfulness > 0.85, Answer Relevancy > 0.85)
- Relevance threshold for “no answer found” calibrated
- Latency under 3 seconds for 95% of queries
- Automated monitoring for answer quality and user feedback established
- Ingestion pipeline for new documents automated
- Access controls implemented (who can see which documents?)
- Fallback strategy for LLM API outages defined
- Cost alerting configured (detect unexpected spikes)
Conclusion
RAG is not a plug-and-play solution, but with the right approach it is a powerful tool that transforms enterprise LLMs from “impressive demo” to “business-critical system.” The key lies in careful data preparation, thoughtful retrieval, and — above all — continuous evaluation.
Start simple, measure everything, and optimize systematically. A good RAG system with simple retrieval beats a complex system without evaluation every time.
Need help implementing a RAG system? Contact us for individual consultation.