
Building Production RAG Systems: Lessons Learned

Chunking strategies, embedding model selection, hybrid retrieval, and evaluation — practical lessons from building RAG in production.

RAG (Retrieval-Augmented Generation) is the most practical way to give LLMs access to your data without fine-tuning. But production RAG is harder than the tutorials suggest.

Chunking matters more than the embedding model. Recursive chunking (split on paragraphs, then sentences, then tokens) with 10-20% overlap handles most documents well. Semantic chunking (split when embedding similarity drops) works better for dense technical documents.
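As a rough illustration of recursive chunking, the sketch below tries coarse separators first and only falls back to finer ones when a piece is still too long. It then greedily merges the pieces and prefixes each chunk with the tail of the previous one. It makes several assumptions that are not from the text above: lengths are measured in characters rather than tokens, and the function names, the separator list, and the default sizes are all made up for the example.

```python
from typing import List, Sequence

# Separators tried in order: paragraphs, lines, sentences, words (illustrative choice).
SEPARATORS = ("\n\n", "\n", ". ", " ")


def recursive_split(text: str, max_len: int,
                    separators: Sequence[str] = SEPARATORS) -> List[str]:
    """Split text into pieces of at most max_len characters."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_len:
        return [text]
    if not separators:
        # Nothing left to split on: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    parts = text.split(sep)
    pieces: List[str] = []
    for i, part in enumerate(parts):
        if i < len(parts) - 1:
            part += sep.strip()  # put back the "." that split() removed
        pieces.extend(recursive_split(part, max_len, rest))
    return pieces


def chunk(text: str, max_len: int = 200, overlap: float = 0.15) -> List[str]:
    """Greedily merge pieces up to max_len, then add overlap between neighbours."""
    merged: List[str] = []
    current = ""
    for piece in recursive_split(text, max_len):
        candidate = f"{current} {piece}".strip()
        if len(candidate) <= max_len:
            current = candidate
        else:
            merged.append(current)
            current = piece
    if current:
        merged.append(current)

    # Overlap: prefix each chunk with the last tail_len characters of the
    # previous one, so a chunk can reach max_len + tail_len + 1 characters.
    tail_len = int(max_len * overlap)
    chunks = merged[:1]
    for prev, cur in zip(merged, merged[1:]):
        chunks.append(f"{prev[-tail_len:]} {cur}" if tail_len else cur)
    return chunks
```

A character-based tail can start mid-word. A production version would more likely count tokens with the embedding model's tokenizer and overlap on sentence boundaries.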

For embedding models, OpenAI's text-embedding-3-small offers the best cost/quality ratio for English text. For code, Voyage AI's voyage-3 performs significantly better.

Hybrid retrieval (dense vectors + BM25 keyword search) combined with reciprocal rank fusion outperforms either method alone. Always add a re-ranking step: a cross-encoder such as Cohere Rerank narrows your top-20 candidates down to the 5 most relevant.
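Reciprocal rank fusion is simple enough to sketch in a few lines. Each document scores the sum of 1 / (k + rank) over every ranking it appears in, so the fusion needs only ranks, not comparable scores. The function name and document IDs below are illustrative, and k = 60 is the commonly used default rather than a value taken from this post.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one list, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked highly in any list earn a larger contribution.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from a dense retriever and a BM25 index.
dense_hits = ["a", "b", "c"]
bm25_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Here "b" and "a" rise to the top because both retrievers returned them. The fused list is what you would then pass to the re-ranker.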

The most important metric is faithfulness: is the answer actually grounded in the retrieved context? Use Ragas for automated evaluation on a golden test set of 100+ labeled query-answer pairs.
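To make the metric concrete: as we understand it, faithfulness is usually computed as the fraction of claims in the answer that the retrieved context supports, with an LLM extracting and judging the claims. The sketch below shows only that ratio. It is not the Ragas API. `faithfulness`, `naive_judge`, and the sample data are hypothetical, and the substring judge is a toy stand-in for an LLM-based one.

```python
from typing import Callable, List


def faithfulness(claims: List[str], context: str,
                 is_supported: Callable[[str, str], bool]) -> float:
    """Fraction of answer claims that the judge finds supported by the context."""
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def naive_judge(claim: str, context: str) -> bool:
    # Toy stand-in: a real judge would be an LLM or an NLI model,
    # not a case-insensitive substring match.
    return claim.lower() in context.lower()


# Hypothetical retrieved context and the claims extracted from an answer.
context = "The refund window is 30 days. Shipping is free over $50."
claims = ["the refund window is 30 days", "returns require a receipt"]
score = faithfulness(claims, context, naive_judge)
```

The second claim is not in the context, so only half of the answer counts as grounded. Tracking this score across the golden test set shows whether a retrieval or prompt change made answers more or less grounded.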
