Building production RAG systems: lessons from 6 deployments
Retrieval-Augmented Generation sounds simple until you hit production. Here is what we learned about chunking, embedding, and relevance scoring.
We have deployed Retrieval-Augmented Generation systems for legal research, medical documentation, technical support, and financial analysis. Every deployment taught us something new about the gap between demo and production.
The naive approach
Most RAG tutorials follow this pattern:
- Load documents
- Split by fixed character count
- Embed with OpenAI text-embedding-ada-002
- Store in Pinecone
- Retrieve top-3 chunks for every query
- Stuff into GPT-4 prompt
This works for demos. It fails in production.
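The failure mode is easiest to see in code. Here is a minimal sketch of the naive pipeline, with a toy bag-of-words embedding standing in for text-embedding-ada-002 and an in-memory list standing in for Pinecone (both substitutions are for illustration only):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real pipeline would call an
    # embedding API such as text-embedding-ada-002.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def naive_rag_prompt(document: str, query: str,
                     chunk_size: int = 200, top_k: int = 3) -> str:
    # 1. Split by fixed character count -- exactly the step that
    #    destroys semantic boundaries in production.
    chunks = [document[i:i + chunk_size]
              for i in range(0, len(document), chunk_size)]
    # 2. Embed and store (in-memory stand-in for a vector database).
    index = [(embed(c), c) for c in chunks]
    # 3. Retrieve top-k chunks by vector similarity.
    q = embed(query)
    top = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)[:top_k]
    # 4. Stuff into the prompt.
    context = "\n---\n".join(c for _, c in top)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Every step works, and every step hides a production problem: the fixed-size split, the generic embedding, and the hard-coded top-3.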
Chunking is everything
Fixed-size chunking destroys semantic boundaries. We have seen legal clauses split mid-sentence, code examples broken across chunks, and tables rendered unreadable.
Our current approach uses hierarchical chunking:
- Level 1: Document sections (H1/H2 boundaries)
- Level 2: Semantic paragraphs (NLTK sentence tokenization)
- Level 3: Sliding windows for overlap (20% overlap between adjacent chunks)
For structured data (JSON, tables, code), we preserve structural integrity. A table row never splits across chunks.
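The three levels compose naturally. A condensed sketch, using a regex heading split in place of a full markdown parser and a naive sentence split in place of NLTK's tokenizer (both substitutions are assumptions for illustration; window size is a tunable we leave at 5 sentences here):

```python
import re

def split_sections(doc: str) -> list[str]:
    # Level 1: split on markdown H1/H2 boundaries.
    parts = re.split(r"(?m)^(?=#{1,2} )", doc)
    return [p.strip() for p in parts if p.strip()]

def split_sentences(section: str) -> list[str]:
    # Level 2: naive sentence split; NLTK sentence tokenization
    # would replace this in a real pipeline.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", section) if s.strip()]

def sliding_windows(sentences: list[str], size: int = 5,
                    overlap: float = 0.2) -> list[str]:
    # Level 3: windows of `size` sentences with 20% overlap
    # between adjacent chunks.
    step = max(1, int(size * (1 - overlap)))
    return [" ".join(sentences[i:i + size])
            for i in range(0, len(sentences), step)]

def chunk(doc: str) -> list[str]:
    chunks = []
    for section in split_sections(doc):
        chunks.extend(sliding_windows(split_sentences(section)))
    return chunks
```

Because windows never cross a section boundary, a chunk never mixes two unrelated headings, and the overlap keeps context that straddles a window edge retrievable from both sides.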
Embedding models matter
text-embedding-3-large significantly outperforms ada-002, but the real gains come from domain-specific fine-tuning. For a medical client, we fine-tuned on PubMed abstracts and saw retrieval accuracy improve from 71% to 94%.
The process:
- Collect 10,000+ query-document relevance pairs
- Fine-tune with Multiple Negatives Ranking loss
- Evaluate with MRR and NDCG@10
- Deploy with fallback to base model if fine-tuned model drifts
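The metrics in the evaluation step are standard and cheap to compute. A minimal pure-Python sketch (the ranked lists would come from the fine-tuned retriever; the graded relevance labels come from the query-document pairs):

```python
import math

def mrr(rankings: list[list[str]], relevant_ids: list[str]) -> float:
    # Mean Reciprocal Rank: average of 1/rank of the first
    # relevant document, over all queries.
    total = 0.0
    for ranking, rel in zip(rankings, relevant_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id == rel:
                total += 1.0 / rank
                break
    return total / len(rankings)

def ndcg_at_10(ranking: list[str], relevance: dict[str, int]) -> float:
    # NDCG@10: discounted cumulative gain over the top 10 results,
    # normalized by the DCG of the ideal ordering.
    dcg = sum(relevance.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(ranking[:10]))
    ideal = sorted(relevance.values(), reverse=True)[:10]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0
```

Both metrics reward putting relevant documents early; NDCG additionally handles graded (not just binary) relevance, which is why we pair them.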
Re-ranking is non-optional
Vector similarity finds roughly relevant chunks. A cross-encoder re-ranker finds precisely relevant chunks. We use ms-marco-MiniLM-L-6-v2 for re-ranking because it is fast enough for real-time use.
- Without re-ranking: 23% of top-3 chunks were irrelevant
- With re-ranking: 7% of top-3 chunks were irrelevant
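The two-stage shape is simple: vector search over-fetches candidates, then a cross-encoder scores each (query, chunk) pair jointly. A sketch with the scorer left as a parameter (in production the scorer would be ms-marco-MiniLM-L-6-v2; the token-overlap scorer below is a toy stand-in so the sketch runs without a model):

```python
def rerank(query: str, candidates: list[str], score_fn,
           top_k: int = 3) -> list[str]:
    # score_fn(query, chunk) -> float. A cross-encoder reads both
    # texts together, unlike the bi-encoder used for first-stage
    # retrieval, so it can judge precise relevance.
    scored = sorted(candidates, key=lambda c: score_fn(query, c),
                    reverse=True)
    return scored[:top_k]

def overlap_score(query: str, chunk: str) -> float:
    # Toy stand-in scorer: fraction of query tokens found in the chunk.
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0
```

A common pattern is to over-fetch (say, the top 20 by vector similarity) and re-rank down to top-3; the exact candidate count is a tuning knob, not something the numbers above prescribe.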
Query rewriting
Users do not write optimal queries. "That thing with the database" is a real query we received. We implemented a query rewriting step:
- Extract entities and intent
- Expand acronyms and synonyms
- Rewrite as a declarative sentence
- Generate 3 query variants
- Retrieve for all variants and deduplicate
This increased recall by 34%.
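The fan-out-and-merge part of the steps above can be sketched as follows. Variant generation and retrieval are shown as pluggable functions because we use an LLM and a vector store for those; the acronym table is a hypothetical example:

```python
def expand_acronyms(query: str, acronyms: dict[str, str]) -> str:
    # e.g. {"db": "database"} -- the real table is domain-specific.
    return " ".join(acronyms.get(tok.lower(), tok) for tok in query.split())

def retrieve_multi(variants: list[str], retrieve_fn,
                   top_k: int = 3) -> list[str]:
    # Retrieve for every query variant, then deduplicate the merged
    # results while preserving first-seen order.
    seen, merged = set(), []
    for variant in variants:
        for chunk in retrieve_fn(variant, top_k):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged
```

Deduplication matters because variants of the same query tend to hit the same chunks; without it, the merged context wastes prompt budget on repeats.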
Evaluation framework
Production RAG needs continuous evaluation. Our pipeline:
- Daily: Sample 100 queries, log retrieval results
- Weekly: Human annotators score relevance
- Monthly: Retrain re-ranker if drift detected
- Quarterly: Evaluate embedding model performance, consider fine-tuning
The hard truth
RAG is not a solved problem. Every deployment requires domain-specific tuning. The companies that treat it as infrastructure rather than a feature are the ones that succeed.