Building Production RAG Systems: Lessons Learned
Chunking strategies, embedding model selection, hybrid retrieval, and evaluation — practical lessons from building RAG in production.
RAG (Retrieval-Augmented Generation) is the most practical way to give LLMs access to your data without fine-tuning. But production RAG is harder than the tutorials suggest.
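In our experience, chunking strategy affects retrieval quality more than the choice of embedding model. Recursive chunking (split on paragraphs, then sentences, then a fixed token limit) with 10-20% overlap between adjacent chunks handles most documents well. Semantic chunking (start a new chunk when the embedding similarity between consecutive sentences drops) tends to work better for dense technical documents.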
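To make the recursive strategy concrete, here is a minimal sketch using only the standard library. It makes several assumptions that are not from the original text: whitespace-separated words stand in for model tokens, the function names (split_units, chunk_text) and the default sizes are illustrative, and the sentence splitter is a simple regex rather than a proper tokenizer.

```python
import re


def split_units(text, max_words):
    """Recursively split text into units of at most max_words words:
    paragraphs first, then sentences, then fixed-size word windows."""
    words = text.split()
    if len(words) <= max_words:
        return [" ".join(words)] if words else []
    # Level 1: paragraphs separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    if len(paragraphs) > 1:
        return [u for p in paragraphs for u in split_units(p, max_words)]
    # Level 2: sentences (naive split after ., ! or ?).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) > 1:
        return [u for s in sentences for u in split_units(s, max_words)]
    # Level 3: hard cut into word windows.
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def chunk_text(text, max_words=200, overlap_ratio=0.15):
    """Pack units into chunks of at most max_words words. Each new chunk
    starts with the last `overlap` words of the previous chunk."""
    overlap = round(max_words * overlap_ratio)
    # Units leave room for the overlap so a chunk never exceeds max_words.
    units = split_units(text, max_words - overlap)
    chunks, current = [], []
    for unit in units:
        unit_words = unit.split()
        if current and len(current) + len(unit_words) > max_words:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.extend(unit_words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

In a real pipeline you would likely count tokens with the embedding model's own tokenizer instead of words, and keep each chunk's source offsets as metadata so retrieved chunks can be traced back to the document.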
For embedding models, OpenAI's text-embedding-3-small offers the best cost/quality ratio for English text. For code, Voyage AI's voyage-3 performs significantly better.
Hybrid retrieval (dense vectors + BM25 keyword search) combined with reciprocal rank fusion outperforms either method alone. Always add a re-ranking step: cross-encoders like Cohere Rerank narrow your top-20 candidates down to the 5 most relevant.
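Reciprocal rank fusion itself is only a few lines: each document scores the sum of 1 / (k + rank) over every ranked list it appears in. The sketch below assumes the dense and BM25 retrievers have already returned ranked lists of document IDs. The function name, the example IDs, and the top_n cutoff are illustrative, and k = 60 is the commonly used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=20):
    """Fuse several ranked lists of document IDs into one list.
    A document's score is the sum of 1 / (k + rank) across the lists,
    with ranks starting at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


# Hypothetical retriever outputs, best match first.
dense_hits = ["a", "b", "c"]
bm25_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Because the fusion uses ranks only, it needs no score normalization between the vector index and BM25. The fused top-20 list is what you would then pass to the re-ranker.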
The most important metric is faithfulness: is the answer actually grounded in the retrieved context? Use Ragas for automated evaluation on a golden test set of 100+ labeled query-answer pairs.
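As a rough illustration of what a faithfulness score measures, the sketch below computes the share of claims in an answer that a judge finds supported by the retrieved context, which is approximately how Ragas describes its metric. This is not the Ragas API: the function names are hypothetical, the claims are assumed to be extracted already, and the word-overlap judge is a toy stand-in for the LLM-based judgment a real evaluator would use.

```python
from typing import Callable, List


def faithfulness(claims: List[str], context: str,
                 is_supported: Callable[[str, str], bool]) -> float:
    """Fraction of answer claims that the judge finds supported by the context."""
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def naive_judge(claim: str, context: str) -> bool:
    # Toy placeholder: a claim counts as supported only if every one of its
    # words appears in the context. A real judge would use an LLM or NLI model.
    context_words = set(context.lower().split())
    return all(word in context_words for word in claim.lower().split())


# Hypothetical example: one grounded claim, one unsupported claim.
context = "the api rate limit is 100 requests per minute"
claims = ["rate limit is 100 requests per minute", "limits reset at midnight"]
score = faithfulness(claims, context, naive_judge)
```

Averaging this score over the golden test set gives one number to track across pipeline changes, and the lowest-scoring queries are the ones worth inspecting by hand.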
Dive deeper with our AI/LLM Engineering career path.