AI · 2 min read

Building production RAG systems: lessons from 6 deployments

Retrieval-Augmented Generation sounds simple until you hit production. Here is what we learned about chunking, embedding, and relevance scoring.

Aisha Patel
AI Engineer

We have deployed Retrieval-Augmented Generation systems for legal research, medical documentation, technical support, and financial analysis. Every deployment taught us something new about the gap between demo and production.

The naive approach

Most RAG tutorials follow this pattern:

  1. Load documents
  2. Split by fixed character count
  3. Embed with OpenAI text-embedding-ada-002
  4. Store in Pinecone
  5. Retrieve top-3 chunks for every query
  6. Stuff into GPT-4 prompt

This works for demos. It fails in production.
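
For concreteness, here is a minimal sketch of that query path (the index name, the stored "text" metadata field, and the prompt wording are hypothetical; ingestion is omitted), using the OpenAI and Pinecone Python clients:

  # Minimal sketch of the naive query path. Assumes OPENAI_API_KEY and
  # PINECONE_API_KEY are set and documents were already embedded and upserted.
  from openai import OpenAI
  from pinecone import Pinecone

  client = OpenAI()
  index = Pinecone().Index("demo-rag")  # hypothetical index name

  def naive_rag(query: str) -> str:
      # Embed the raw query with ada-002, exactly as typed.
      emb = client.embeddings.create(
          model="text-embedding-ada-002", input=query
      ).data[0].embedding
      # Always retrieve the top-3 chunks, regardless of query type.
      hits = index.query(vector=emb, top_k=3, include_metadata=True)
      context = "\n\n".join(m.metadata["text"] for m in hits.matches)
      # Stuff everything into a single GPT-4 prompt.
      resp = client.chat.completions.create(
          model="gpt-4",
          messages=[{"role": "user",
                     "content": f"Context:\n{context}\n\nQuestion: {query}"}],
      )
      return resp.choices[0].message.content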

Chunking is everything

Fixed-size chunking destroys semantic boundaries. We have seen legal clauses split mid-sentence, code examples broken across chunks, and tables rendered unreadable.

Our current approach uses hierarchical chunking:

  • Level 1: Document sections (H1/H2 boundaries)
  • Level 2: Semantic paragraphs (NLTK sentence tokenization)
  • Level 3: Sliding windows for overlap (20% overlap between adjacent chunks)

For structured data (JSON, tables, code), we preserve structural integrity. A table row never splits across chunks.
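
A stripped-down sketch of the hierarchy, assuming markdown-style headings and NLTK's sentence tokenizer (the window size is illustrative; the production version also treats tables, JSON, and code blocks as indivisible units):

  # Hierarchical chunking sketch: sections -> sentences -> sliding windows.
  # Requires the NLTK "punkt" tokenizer data to be downloaded.
  import re
  from nltk.tokenize import sent_tokenize

  def hierarchical_chunks(doc: str, window: int = 8, overlap: float = 0.2):
      # Level 1: split on H1/H2 heading boundaries.
      sections = re.split(r"\n(?=#{1,2} )", doc)
      step = max(1, int(window * (1 - overlap)))  # 20% window overlap
      for section in sections:
          # Level 2: sentence tokenization within the section.
          sents = sent_tokenize(section)
          # Level 3: sliding windows of sentences with overlap.
          for i in range(0, len(sents), step):
              chunk = " ".join(sents[i:i + window])
              if chunk:
                  yield chunk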

Embedding models matter

text-embedding-3-large outperforms ada-002 significantly, but domain-specific fine-tuning is where the real gains are. For a medical client, we fine-tuned on PubMed abstracts and saw retrieval accuracy improve from 71% to 94%.

The process:

  1. Collect 10,000+ query-document relevance pairs
  2. Fine-tune with Multiple Negatives Ranking loss
  3. Evaluate with MRR and NDCG@10
  4. Deploy with fallback to base model if fine-tuned model drifts
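
Step 2 maps directly onto sentence-transformers' MultipleNegativesRankingLoss, where the other passages in a batch serve as in-batch negatives. A rough sketch (the base model, batch size, and toy pair are illustrative, not our production settings):

  # Fine-tuning sketch with in-batch negatives via sentence-transformers.
  from sentence_transformers import SentenceTransformer, InputExample, losses
  from torch.utils.data import DataLoader

  model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in base model
  relevance_pairs = [  # toy stand-in for the 10,000+ mined pairs
      ("symptoms of hypokalemia",
       "Hypokalemia commonly presents with muscle weakness and cramps."),
  ]
  train = [InputExample(texts=[q, doc]) for q, doc in relevance_pairs]
  loader = DataLoader(train, shuffle=True, batch_size=64)
  loss = losses.MultipleNegativesRankingLoss(model)
  model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)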

Re-ranking is non-optional

Vector similarity finds roughly relevant chunks. A cross-encoder re-ranker finds precisely relevant chunks. We use ms-marco-MiniLM-L-6-v2 for re-ranking because it is fast enough for real-time use.

  • Without re-ranking: 23% of top-3 chunks were irrelevant
  • With re-ranking: 7% of top-3 chunks were irrelevant
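
The re-ranking step itself is small. A sketch using the cross-encoder named above (how many candidates to over-fetch before re-ranking is up to you; retrieve more than you need, then let the cross-encoder pick):

  # Over-retrieve with the bi-encoder, then re-score with the cross-encoder.
  from sentence_transformers import CrossEncoder

  reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

  def rerank(query: str, candidates: list[str], k: int = 3) -> list[str]:
      # Score each (query, chunk) pair jointly; this is far more precise
      # than comparing independently pre-computed embeddings.
      scores = reranker.predict([(query, c) for c in candidates])
      ranked = sorted(zip(scores, candidates), reverse=True)
      return [chunk for _, chunk in ranked[:k]]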

Query rewriting

Users do not write optimal queries. "That thing with the database" is a real query we received. We implemented a query rewriting step:

  1. Extract entities and intent
  2. Expand acronyms and synonyms
  3. Rewrite as a declarative sentence
  4. Generate 3 query variants
  5. Retrieve for all variants and deduplicate

This increased recall by 34%.
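
The rewriting step can be a single LLM call. A hedged sketch of the variant-generation piece (the prompt wording and model choice are illustrative, not our exact production prompt):

  # Query rewriting sketch: expand one messy user query into clean variants.
  from openai import OpenAI

  client = OpenAI()

  def rewrite_query(raw: str) -> list[str]:
      prompt = (
          "Rewrite the user query below as 3 clear, self-contained search "
          "queries. Expand acronyms, resolve vague references, and state "
          "the intent explicitly. Return one query per line.\n\n"
          f"Query: {raw}"
      )
      resp = client.chat.completions.create(
          model="gpt-4", messages=[{"role": "user", "content": prompt}]
      )
      lines = resp.choices[0].message.content.splitlines()
      variants = [line.strip() for line in lines if line.strip()]
      # Retrieve for the original plus all variants, then deduplicate hits.
      return [raw] + variants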

Evaluation framework

Production RAG needs continuous evaluation. Our pipeline:

  • Daily: Sample 100 queries, log retrieval results
  • Weekly: Human annotators score relevance
  • Monthly: Retrain re-ranker if drift detected
  • Quarterly: Evaluate embedding model performance, consider fine-tuning
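
The two metrics from the fine-tuning loop, MRR and NDCG@10, are each only a few lines to compute over the annotated samples. The judgment format below (ranked doc ids per query plus a set of relevant ids) is an assumption for illustration:

  # MRR and binary-relevance NDCG@10 over logged retrieval results.
  import math

  def mrr(results: dict[str, list[str]],
          relevant: dict[str, set[str]]) -> float:
      # Mean reciprocal rank of the first relevant hit per query.
      total = 0.0
      for q, ranked in results.items():
          for rank, doc in enumerate(ranked, start=1):
              if doc in relevant[q]:
                  total += 1.0 / rank
                  break
      return total / len(results)

  def ndcg_at_10(ranked: list[str], rel: set[str]) -> float:
      # DCG with binary gains, truncated at rank 10, normalized by the ideal.
      dcg = sum(1.0 / math.log2(i + 2)
                for i, d in enumerate(ranked[:10]) if d in rel)
      ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(rel), 10)))
      return dcg / ideal if ideal else 0.0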

The hard truth

RAG is not a solved problem. Every deployment requires domain-specific tuning. The companies that treat it as infrastructure rather than a feature are the ones that succeed.

RAG · LLM · Vector DB · Embeddings · AI Agents