Building production RAG systems: lessons from 6 deployments
Retrieval-Augmented Generation sounds simple until you hit production. Here is what we learned about chunking, embedding, and relevance scoring.
We have deployed Retrieval-Augmented Generation systems for legal research, medical documentation, technical support, and financial analysis. Every deployment taught us something new about the gap between demo and production.
The naive approach
Most RAG tutorials follow this pattern:
- Load documents
- Split by fixed character count
- Embed with OpenAI text-embedding-ada-002
- Store in Pinecone
- Retrieve top-3 chunks for every query
- Stuff into GPT-4 prompt
This works for demos. It fails in production.
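The failure mode is easiest to see in code. Here is a minimal sketch of the naive pipeline, with a toy bag-of-words embedding standing in for text-embedding-ada-002 and an in-memory list standing in for Pinecone (both substitutions are for illustration only):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real pipeline would call an
    # embedding API such as text-embedding-ada-002.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def naive_rag_prompt(document: str, query: str,
                     chunk_size: int = 200, top_k: int = 3) -> str:
    # 1. Split by fixed character count -- exactly the step that
    #    destroys semantic boundaries in production.
    chunks = [document[i:i + chunk_size]
              for i in range(0, len(document), chunk_size)]
    # 2. Embed and store (in-memory stand-in for a vector database).
    index = [(embed(c), c) for c in chunks]
    # 3. Retrieve top-k chunks by vector similarity.
    q = embed(query)
    top = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)[:top_k]
    # 4. Stuff into the prompt.
    context = "\n---\n".join(c for _, c in top)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Every step works, and every step hides a production problem: the fixed-size split, the generic embedding, and the hard-coded top-3.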
Chunking is everything
Fixed-size chunking destroys semantic boundaries. We have seen legal clauses split mid-sentence, code examples broken across chunks, and tables rendered unreadable.
Our current approach uses hierarchical chunking:
- Level 1: Document sections (H1/H2 boundaries)
- Level 2: Semantic paragraphs (NLTK sentence tokenization)
- Level 3: Sliding windows for overlap (20% overlap between adjacent chunks)
For structured data (JSON, tables, code), we preserve structural integrity. A table row never splits across chunks.
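The three levels compose naturally. A condensed sketch, using a regex heading split in place of a full markdown parser and a naive sentence split in place of NLTK's tokenizer (both substitutions are assumptions for illustration; window size is a tunable we leave at 5 sentences here):

```python
import re

def split_sections(doc: str) -> list[str]:
    # Level 1: split on markdown H1/H2 boundaries.
    parts = re.split(r"(?m)^(?=#{1,2} )", doc)
    return [p.strip() for p in parts if p.strip()]

def split_sentences(section: str) -> list[str]:
    # Level 2: naive sentence split; NLTK sentence tokenization
    # would replace this in a real pipeline.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", section) if s.strip()]

def sliding_windows(sentences: list[str], size: int = 5,
                    overlap: float = 0.2) -> list[str]:
    # Level 3: windows of `size` sentences with 20% overlap
    # between adjacent chunks.
    step = max(1, int(size * (1 - overlap)))
    return [" ".join(sentences[i:i + size])
            for i in range(0, len(sentences), step)]

def chunk(doc: str) -> list[str]:
    chunks = []
    for section in split_sections(doc):
        chunks.extend(sliding_windows(split_sentences(section)))
    return chunks
```

Because windows never cross a section boundary, a chunk never mixes two unrelated headings, and the overlap keeps context that straddles a window edge retrievable from both sides.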
Embedding models matter
text-embedding-3-large significantly outperforms ada-002, but the real gains come from domain-specific fine-tuning. For a medical client, we fine-tuned on PubMed abstracts and saw retrieval accuracy improve from 71% to 94%.
The process:
- Collect 10,000+ query-document relevance pairs
- Fine-tune with Multiple Negatives Ranking loss
- Evaluate with MRR and NDCG@10
- Deploy with fallback to base model if fine-tuned model drifts
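The metrics in the evaluation step are standard and cheap to compute. A minimal pure-Python sketch (the ranked lists would come from the fine-tuned retriever; the graded relevance labels come from the query-document pairs):

```python
import math

def mrr(rankings: list[list[str]], relevant_ids: list[str]) -> float:
    # Mean Reciprocal Rank: average of 1/rank of the first
    # relevant document, over all queries.
    total = 0.0
    for ranking, rel in zip(rankings, relevant_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id == rel:
                total += 1.0 / rank
                break
    return total / len(rankings)

def ndcg_at_10(ranking: list[str], relevance: dict[str, int]) -> float:
    # NDCG@10: discounted cumulative gain over the top 10 results,
    # normalized by the DCG of the ideal ordering.
    dcg = sum(relevance.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(ranking[:10]))
    ideal = sorted(relevance.values(), reverse=True)[:10]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0
```

Both metrics reward putting relevant documents early; NDCG additionally handles graded (not just binary) relevance, which is why we pair them.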
Re-ranking is non-optional
Vector similarity finds roughly relevant chunks. A cross-encoder re-ranker finds precisely relevant chunks. We use ms-marco-MiniLM-L-6-v2 for re-ranking because it is fast enough for real-time use.
- Without re-ranking: 23% of top-3 chunks were irrelevant
- With re-ranking: 7% of top-3 chunks were irrelevant
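The two-stage shape is simple: vector search over-fetches candidates, then a cross-encoder scores each (query, chunk) pair jointly. A sketch with the scorer left as a parameter (in production the scorer would be ms-marco-MiniLM-L-6-v2; the token-overlap scorer below is a toy stand-in so the sketch runs without a model):

```python
def rerank(query: str, candidates: list[str], score_fn,
           top_k: int = 3) -> list[str]:
    # score_fn(query, chunk) -> float. A cross-encoder reads both
    # texts together, unlike the bi-encoder used for first-stage
    # retrieval, so it can judge precise relevance.
    scored = sorted(candidates, key=lambda c: score_fn(query, c),
                    reverse=True)
    return scored[:top_k]

def overlap_score(query: str, chunk: str) -> float:
    # Toy stand-in scorer: fraction of query tokens found in the chunk.
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0
```

A common pattern is to over-fetch (say, the top 20 by vector similarity) and re-rank down to top-3; the exact candidate count is a tuning knob, not something the numbers above prescribe.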
Query rewriting
Users do not write optimal queries. "That thing with the database" is a real query we received. We implemented a query rewriting step:
- Extract entities and intent
- Expand acronyms and synonyms
- Rewrite as a declarative sentence
- Generate 3 query variants
- Retrieve for all variants and deduplicate
This increased recall by 34%.
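The fan-out-and-merge part of the steps above can be sketched as follows. Variant generation and retrieval are shown as pluggable functions because we use an LLM and a vector store for those; the acronym table is a hypothetical example:

```python
def expand_acronyms(query: str, acronyms: dict[str, str]) -> str:
    # e.g. {"db": "database"} -- the real table is domain-specific.
    return " ".join(acronyms.get(tok.lower(), tok) for tok in query.split())

def retrieve_multi(variants: list[str], retrieve_fn,
                   top_k: int = 3) -> list[str]:
    # Retrieve for every query variant, then deduplicate the merged
    # results while preserving first-seen order.
    seen, merged = set(), []
    for variant in variants:
        for chunk in retrieve_fn(variant, top_k):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged
```

Deduplication matters because variants of the same query tend to hit the same chunks; without it, the merged context wastes prompt budget on repeats.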
Evaluation framework
Production RAG needs continuous evaluation. Our pipeline:
- Daily: Sample 100 queries, log retrieval results
- Weekly: Human annotators score relevance
- Monthly: Retrain re-ranker if drift detected
- Quarterly: Evaluate embedding model performance, consider fine-tuning
The hard truth
RAG is not a solved problem. Every deployment requires domain-specific tuning. The companies that treat it as infrastructure rather than a feature are the ones that succeed.