RAG Systems: A Practical Guide for Enterprises

Published on March 15, 2025 by Christopher Wittlinger

Retrieval-Augmented Generation (RAG) has established itself as one of the most important architectures for enterprise use of Large Language Models. Instead of expensively fine-tuning a model with company-specific data, relevant information is retrieved at runtime and passed to the model as context. The result: current, source-based answers built on your own enterprise data — without retraining.

Over the past two years at Intellineers, we have implemented RAG systems across a range of industries — from legal knowledge bases and technical documentation assistants to internal HR chatbots. This guide summarizes the key lessons from that practice.

Why RAG?

The advantages over pure fine-tuning are significant:

- Currency: new or changed documents are searchable after re-indexing, with no retraining cycle.
- Traceability: every answer can cite the source documents it was built from.
- Cost: ingestion and retrieval are far cheaper than repeated fine-tuning runs.
- Access control: permissions can be enforced at retrieval time, per user and per document.

Core Components of a RAG System

1. Document Processing Pipeline

The first and often underestimated step is preparing your documents. The quality of this pipeline determines 60–70% of the overall system performance — a perfect retrieval and generation model cannot compensate for poorly prepared data.

Chunking — The Strategy Matters:

The choice of chunking strategy has a massive impact on answer quality. Here is a comparison of the most common approaches:

| Strategy | Chunk Size | Advantages | Disadvantages | Best For |
|---|---|---|---|---|
| Fixed-Size | 256–512 tokens | Simple to implement, predictable size | Breaks semantic units | Homogeneous text documents |
| Recursive Character | 200–500 tokens | Respects paragraph boundaries, good default | Varies significantly in size | General text documents |
| Semantic Chunking | Variable | Preserves semantic units completely | More compute-intensive, complex | Technical documentation |
| Document Structure | Variable | Leverages headings and sections | Requires well-structured documents | Manuals, policies, legal texts |
| Sliding Window | 256–512 tokens, 50–100 overlap | Prevents information loss at boundaries | More chunks, higher costs | Narrative texts, reports |

Our recommendation: Start with Recursive Character Splitting at 400 tokens and 50-token overlap. Once you have an evaluation pipeline in place, test document-structure-based chunking for your specific document types. In our projects, switching from fixed-size to semantic chunking typically brought 15–25% improvement in answer relevance.
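As a minimal illustration, here is a pure-Python sketch of recursive splitting. It approximates token counts by whitespace-separated words (swap in a real tokenizer such as tiktoken in production), and `split_recursive` is an illustrative helper, not a specific library's API:

```python
def split_recursive(text, max_tokens=400, separators=("\n\n", "\n", ". ", " ")):
    """Recursively split text so each chunk stays under max_tokens.

    Tries coarse separators (paragraphs) first, falling back to finer
    ones (lines, sentences, words) only when a piece is still too large.
    Token counts are approximated by word count here.
    """
    def n_tokens(s):
        return len(s.split())

    if n_tokens(text) <= max_tokens:
        return [text] if text.strip() else []

    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, buf = [], ""
            for part in parts:
                candidate = buf + sep + part if buf else part
                if n_tokens(candidate) <= max_tokens:
                    buf = candidate
                else:
                    if buf:
                        chunks.append(buf)
                    if n_tokens(part) > max_tokens:
                        # part alone is still too big: recurse with finer separators
                        chunks.extend(split_recursive(part, max_tokens, separators))
                        buf = ""
                    else:
                        buf = part
            if buf:
                chunks.append(buf)
            return chunks

    # no separator found at all: hard cut by words as a last resort
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]
```

A 50-token overlap, as recommended above, would be added on top by carrying the tail of each emitted chunk into the next one.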

Cleaning and Preprocessing: Remove formatting artifacts (headers, footers, page numbers in PDFs), normalize whitespace and special characters, detect and treat tables separately (tables as Markdown or structured text, not as flowing prose), and extract image content via OCR if relevant.
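A sketch of such a cleaning step; the regex patterns are illustrative examples and will need tuning to your own document formats:

```python
import re

def clean_page(text: str) -> str:
    """Remove common PDF artifacts before chunking.

    Drops bare page-number lines ("3", "Page 3 of 12") and normalizes
    whitespace while preserving paragraph breaks.
    """
    cleaned = []
    for line in text.splitlines():
        stripped = line.strip()
        # drop lines that are only a page number
        if re.fullmatch(r"(page\s+)?\d+(\s+of\s+\d+)?", stripped, re.IGNORECASE):
            continue
        cleaned.append(stripped)
    text = "\n".join(cleaned)
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)   # keep at most one blank line
    return text.strip()
```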

Metadata Extraction: Systematically extract: author, creation date, document type, department, product reference, version number. This metadata enables targeted filtering later — for example, “only documents from the last 12 months” or “only legal department documents.” In practice, good metadata filtering reduces retrieval latency by 40–60% and significantly improves relevance.
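A minimal sketch of metadata prefiltering, assuming each chunk carries a `metadata` dict with `department` and `created` fields; in production, vector databases such as Qdrant apply equivalent filters server-side on the payload, which is where the latency win comes from:

```python
from datetime import date

def metadata_prefilter(chunks, department=None, max_age_days=None, today=None):
    """Narrow the candidate set by metadata before any vector search runs."""
    today = today or date.today()
    selected = []
    for chunk in chunks:
        meta = chunk["metadata"]
        if department and meta.get("department") != department:
            continue
        if max_age_days is not None and (today - meta["created"]).days > max_age_days:
            continue
        selected.append(chunk)
    return selected
```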

2. Embedding & Vector Store

The processed chunks are converted into high-dimensional vectors that capture semantic similarity. The choice of embedding model is critical.

Embedding Model Comparison (as of 2025/2026):

| Model | Dimensions | MTEB Score | Cost (per 1M tokens) | Hosting |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3,072 | 64.6 | ~$0.13 | Cloud (API) |
| OpenAI text-embedding-3-small | 1,536 | 62.3 | ~$0.02 | Cloud (API) |
| Cohere embed-v3 | 1,024 | 64.5 | ~$0.10 | Cloud (API) |
| BGE-M3 (BAAI) | 1,024 | 63.5 | Free | Self-hosted |
| E5-Mistral-7B | 4,096 | 66.6 | Free | Self-hosted (GPU required) |
| Jina Embeddings v3 | 1,024 | 65.5 | ~$0.02 | Cloud or self-hosted |

Practical tip: For getting started, we recommend text-embedding-3-small — the price-performance ratio is excellent. For multilingual or European-language texts, BGE-M3 delivers strong results as an open-source alternative and can be run locally, which is crucial for privacy-sensitive projects. If maximum quality is required and you have GPU capacity, E5-Mistral-7B is currently the best choice.

Vector Database Selection: The choice of vector database depends on your scaling needs. For getting started and up to 5 million vectors, Qdrant (open source, simple Docker deployment) is our standard recommendation. For teams already using PostgreSQL, pgvector is a pragmatic choice that requires no new infrastructure. For enterprise-scale with over 100 million vectors, Pinecone or Weaviate Cloud are sensible options.

Indexing: HNSW (Hierarchical Navigable Small World) is the de facto standard for Approximate Nearest Neighbor Search. Configure the ef_construction parameter to 128–256 for good recall values, and adjust ef_search at runtime for the desired trade-off between latency and accuracy.

3. Retrieval Strategy

Retrieval quality largely determines answer quality. This is where good RAG systems separate themselves from mediocre ones.

Level 1 — Basic Semantic Search: purely vector-based similarity search using cosine similarity or dot product. It works well for natural-language queries but fails on exact terms (product numbers, technical terms with a specific spelling).

Level 2 — Hybrid Search: Combination of semantic search with classic keyword search (BM25). In practice, a weighted mix of 70% semantic + 30% BM25 delivers the best results for most enterprise use cases. The exact weighting should be optimized through evaluation.
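The weighted mix can be sketched as score fusion with per-list min-max normalization; `hybrid_scores` is an illustrative helper, and reciprocal rank fusion (RRF) is a common alternative that avoids score normalization entirely:

```python
def hybrid_scores(semantic, bm25, alpha=0.7):
    """Fuse semantic and BM25 rankings into one score per document.

    semantic and bm25 map doc_id -> raw score; alpha weights the semantic
    side (0.7 semantic / 0.3 BM25, per the recommendation above). Each
    score list is min-max normalized so the two scales become comparable.
    """
    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {d: (s - lo) / span for d, s in scores.items()}

    sem, kw = normalize(semantic), normalize(bm25)
    docs = set(sem) | set(kw)
    fused = {d: alpha * sem.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```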

Level 3 — Advanced Retrieval Techniques: query rewriting (reformulating the user question before retrieval), multi-query retrieval (searching with several paraphrases and merging the results), HyDE (embedding a hypothetical answer instead of the question), and Parent Document Retrieval (searching over small chunks but passing their larger parent sections to the LLM).

Level 4 — Re-Ranking: After initial retrieval, the top 20 results are re-sorted by a cross-encoder (e.g., Cohere Rerank, BGE-Reranker, or a fine-tuned cross-encoder). Cross-encoders are significantly more accurate than bi-encoders but too slow for initial search. Therefore: bi-encoder for recall, cross-encoder for precision. In practice, re-ranking improves the quality of the top 5 results by 15–30%.
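The bi-encoder-for-recall, cross-encoder-for-precision split can be sketched as a two-stage pipeline; `cheap_score` and `expensive_score` are assumed callables you would wrap around real models (e.g. a sentence-transformers bi-encoder and a reranker API):

```python
def two_stage_retrieve(query, corpus, cheap_score, expensive_score,
                       recall_k=20, top_k=5):
    """Stage 1: rank the whole corpus with the cheap scorer (recall).
    Stage 2: re-rank only the top recall_k with the expensive scorer (precision)."""
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d),
                        reverse=True)[:recall_k]
    return sorted(candidates, key=lambda d: expensive_score(query, d),
                  reverse=True)[:top_k]
```

The point of the structure: the expensive scorer is called only `recall_k` times per query, regardless of corpus size.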

4. Generation with Context

The LLM receives the retrieved documents as context and generates the answer. Prompt engineering and context management determine quality and cost here.

Prompt Engineering for RAG: A good RAG prompt includes a clear role instruction (“You are an assistant for internal policies at company X”), the explicit instruction to answer only based on the provided context, a fallback rule (“If the context does not answer the question, say so explicitly”), and formatting guidelines (bullet points, length, language). For deeper techniques, see our post on fine-tuning vs. prompt engineering.

Context Window Management: Even with 128k-token context windows, “pack everything in” is not a good strategy. Studies show that LLMs use information at the beginning and end of the context better than in the middle (the “Lost in the Middle” effect). Place the most relevant chunks at the beginning, limit the context to 3,000–5,000 tokens (5–10 chunks), and sort by relevance score in descending order.

Citation Handling: Number the source chunks in the prompt ([1], [2], [3]) and instruct the LLM to provide references for its statements. This significantly increases verifiability and user trust.
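Putting the last three points together, a sketch of prompt assembly (relevance-sorted, token-capped, numbered sources); word counts stand in for a real tokenizer, and the role wording is illustrative:

```python
def build_rag_prompt(question, chunks, max_context_tokens=4000):
    """Assemble a RAG prompt: highest-scoring chunks first, numbered
    sources, explicit grounding instruction and fallback rule.

    chunks is a list of {"text": ..., "score": ...} dicts.
    """
    ranked = sorted(chunks, key=lambda c: c["score"], reverse=True)
    context_lines, used = [], 0
    for i, chunk in enumerate(ranked, start=1):
        cost = len(chunk["text"].split())  # crude token estimate
        if used + cost > max_context_tokens:
            break
        context_lines.append(f"[{i}] {chunk['text']}")
        used += cost
    context = "\n\n".join(context_lines)
    return (
        "You are an assistant for internal company policies.\n"
        "Answer ONLY from the numbered sources below and cite them as [n].\n"
        "If the sources do not answer the question, say so explicitly.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```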

Evaluation: The RAGAS Framework

Without systematic evaluation, any optimization is flying blind. We recommend the RAGAS framework (Retrieval-Augmented Generation Assessment), which defines four core metrics:

Faithfulness: Measures whether the generated answer is supported by the retrieved context. A faithfulness score below 0.8 indicates hallucinations.

Answer Relevancy: Measures how relevant the answer is to the question asked. Target value: > 0.85.

Context Precision: Measures whether the retrieved documents are relevant to the question. If this value is low, your retrieval has a problem.

Context Recall: Measures whether all information needed for the answer was retrieved. A low value indicates that relevant documents are not being found.

Practical implementation: Create a ground-truth dataset with 50–100 question-answer pairs validated by your domain experts. Run RAGAS evaluation after every change to chunking, retrieval, or prompt. Automate this process as part of your CI/CD pipeline.
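A minimal sketch of such a CI gate; the faithfulness and relevancy targets come from the metrics above, while the context precision/recall thresholds are illustrative defaults of our own:

```python
THRESHOLDS = {
    "faithfulness": 0.80,       # below this, hallucinations are likely (see above)
    "answer_relevancy": 0.85,   # target from the metric definition above
    "context_precision": 0.70,  # illustrative default
    "context_recall": 0.70,     # illustrative default
}

def evaluation_gate(scores, thresholds=THRESHOLDS):
    """Return every metric that falls below its target.

    scores maps metric name -> value averaged over the ground-truth set.
    An empty result means the gate passes and the change may deploy.
    """
    return {m: v for m, v in scores.items()
            if m in thresholds and v < thresholds[m]}
```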

Common Pitfalls

In our projects, we regularly encounter the same challenges:

1. Chunks Too Large: When chunks exceed 1,000 tokens, irrelevant content is retrieved alongside the relevant parts and dilutes the answer. Answer quality drops, token costs rise. Solution: Smaller chunks (300–500 tokens) with Parent Document Retrieval.

2. Missing Metadata: Without metadata, filtering by recency or document type is impossible. The LLM sees outdated policies alongside current ones — a recipe for wrong answers. Solution: Establish metadata extraction as a fixed part of the ingestion pipeline.

3. Ignoring Hybrid Search: Purely semantic search fails with exact terms like product names, article numbers, or acronyms. When a user searches for “SAP transaction VA01,” it must be found exactly — not a semantically similar document about order creation. Solution: Always build in BM25 as a fallback.

4. No Evaluation: Without a ground-truth dataset and systematic metrics, optimization is guesswork. Every change to the system can cause unintended degradation in other areas. Solution: RAGAS evaluation as part of the deployment process.

5. Missing Error Handling: What happens when retrieval finds no relevant documents? Without explicit handling, the LLM hallucinates an answer. Solution: Define a relevance threshold (e.g., cosine similarity > 0.7) and explicitly communicate when no matching information was found.
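A minimal sketch of that fallback, assuming `retrieve` returns chunks with scores sorted best-first and `generate` wraps the LLM call (both hypothetical helpers):

```python
NO_ANSWER = "I could not find matching information in the knowledge base."

def answer_or_refuse(query, retrieve, generate, threshold=0.7):
    """Only call the LLM when retrieval clears the relevance threshold.

    retrieve(query) -> (chunks, scores), scores sorted descending;
    generate(query, chunks) -> answer string.
    """
    chunks, scores = retrieve(query)
    if not chunks or scores[0] < threshold:
        return NO_ANSWER  # explicit refusal instead of a hallucinated answer
    return generate(query, chunks)
```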

Cost Estimate: RAG in Production

The costs of a RAG system consist of three components: embeddings, LLM inference, and infrastructure (vector database and compute). Here is a realistic calculation for 1 million queries per month:

| Cost Item | Calculation | Monthly |
|---|---|---|
| Embedding (queries) | 1M queries × 100 tokens × $0.02/1M tokens | ~$2 |
| LLM inference (GPT-4o-mini) | 1M × 2,000 input tokens × $0.15/1M + 1M × 500 output tokens × $0.60/1M | ~$600 |
| LLM inference (GPT-4o) | 1M × 2,000 input tokens × $2.50/1M + 1M × 500 output tokens × $10/1M | ~$10,000 |
| Vector DB (Qdrant Cloud) | 5M vectors, 1,536 dimensions | ~$100–200 |
| Compute (API server) | 2 vCPU, 8 GB RAM | ~$50–100 |
| Total (with GPT-4o-mini) | | ~$750–900 |
| Total (with GPT-4o) | | ~$10,150–10,300 |

The difference between models is dramatic. For many use cases — particularly knowledge bases and FAQ systems — GPT-4o-mini delivers sufficient quality at a fraction of the cost. For complex reasoning tasks, legal analysis, or code generation, GPT-4o or Claude Sonnet is worthwhile. Additional strategies for cost optimization of LLM inference can be found in our separate post.
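The inference rows of the calculation can be reproduced with a small helper, which also makes it easy to plug in your own volumes and prices:

```python
def monthly_llm_cost(queries, in_tokens, out_tokens, in_price, out_price):
    """Monthly LLM inference cost in USD; prices are USD per 1M tokens."""
    return queries * (in_tokens * in_price + out_tokens * out_price) / 1_000_000

# figures from the table above
gpt4o_mini = monthly_llm_cost(1_000_000, 2_000, 500, 0.15, 0.60)  # ~$600 / month
gpt4o = monthly_llm_cost(1_000_000, 2_000, 500, 2.50, 10.00)      # ~$10,000 / month
```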

Based on our project experience, we recommend the following starter stack:

- Chunking: Recursive Character Splitting, 400 tokens with 50-token overlap
- Embeddings: text-embedding-3-small, or BGE-M3 self-hosted for privacy-sensitive data
- Vector database: Qdrant via Docker, or pgvector if you already run PostgreSQL
- Retrieval: hybrid search (70% semantic, 30% BM25) with cross-encoder re-ranking
- Generation: GPT-4o-mini, upgrading to a stronger model only where evaluation shows it is needed
- Evaluation: RAGAS with a ground-truth dataset of 50–100 expert-validated question-answer pairs

Production Deployment Checklist

Before a RAG system goes live, verify:

- A ground-truth evaluation dataset exists and RAGAS metrics meet their targets
- Metadata extraction runs as a fixed part of the ingestion pipeline
- Hybrid search covers exact terms (product numbers, acronyms) as well as semantic queries
- A relevance threshold and an explicit "no matching information found" fallback are in place
- Answers cite their numbered sources
- Token costs are monitored and match the budget estimate

Conclusion

RAG is not a plug-and-play solution, but with the right approach it is a powerful tool that transforms enterprise LLMs from “impressive demo” to “business-critical system.” The key lies in careful data preparation, thoughtful retrieval, and — above all — continuous evaluation.

Start simple, measure everything, and optimize systematically. A good RAG system with simple retrieval beats a complex system without evaluation every time.

Need help implementing a RAG system? Contact us for individual consultation.