Building Production RAG Applications: A Complete Guide
Learn how to build Retrieval-Augmented Generation systems that actually work in production — from chunking strategies to evaluation frameworks.
Retrieval-Augmented Generation (RAG) combines the knowledge of your documents with the reasoning ability of LLMs. But getting it to work reliably in production is harder than tutorials make it seem.
The RAG Pipeline
A production RAG system has four stages:
- Ingestion — Process and chunk your documents
- Indexing — Create embeddings and store in a vector database
- Retrieval — Find the most relevant chunks for a query
- Generation — Use retrieved context to generate an answer
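The four stages can be sketched as a simple data flow. Every function below is a hypothetical placeholder (toy word-overlap "embeddings", a stub generator) meant only to show how the stages hand off to each other, not a real library API:

```python
# Minimal skeleton of the four-stage pipeline; all helpers are toy stand-ins.

def ingest(raw_docs):
    # Stage 1: clean and chunk raw documents (here: split on blank lines)
    return [chunk for doc in raw_docs for chunk in doc.split("\n\n") if chunk.strip()]

def index(chunks):
    # Stage 2: embed each chunk; a toy "embedding" here is just its word set
    return {chunk: set(chunk.lower().split()) for chunk in chunks}

def retrieve(store, query, k=2):
    # Stage 3: score chunks against the query (toy overlap score), take top-k
    q = set(query.lower().split())
    return sorted(store, key=lambda c: len(store[c] & q), reverse=True)[:k]

def generate(query, context):
    # Stage 4: in a real system this calls an LLM with the retrieved context
    return f"Answer to {query!r} using {len(context)} retrieved chunk(s)"

docs = ["RAG combines retrieval with generation.\n\nChunking splits documents."]
chunks = ingest(docs)
store = index(chunks)
context = retrieve(store, "how does chunking work?")
print(generate("how does chunking work?", context))
```

The rest of this guide fills in each placeholder with production-grade pieces.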
Step 1: Smart Chunking
The most common mistake is naive fixed-size chunking by character count. Instead, split recursively along document structure, so chunks follow natural boundaries like paragraphs and sentences:
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=512,
chunk_overlap=50,
separators=["\n\n", "\n", ". ", " ", ""],
length_function=len,
)
chunks = splitter.split_documents(documents)
Chunk Size Matters
Chunks that are too small lose context. Chunks that are too large dilute relevance. Start with 512 tokens and adjust based on your evaluation metrics.
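Note that with length_function=len the splitter above measures characters, not tokens. To size chunks in tokens, you can pass a tokenizer-backed length function instead. The sketch below uses a rough whitespace proxy; a real setup would count with your model's actual tokenizer (e.g. tiktoken), which is an assumption outside this snippet:

```python
# Rough token-count proxy: whitespace-separated words. For accurate counts,
# swap in your model's tokenizer (e.g. len(enc.encode(text)) with tiktoken).
def rough_token_len(text: str) -> int:
    return len(text.split())

# Passed to the splitter in place of `len`:
# splitter = RecursiveCharacterTextSplitter(
#     chunk_size=512,            # now interpreted as ~512 tokens
#     chunk_overlap=50,
#     length_function=rough_token_len,
# )

print(rough_token_len("Retrieval-Augmented Generation combines documents with LLMs."))
```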
Step 2: Embedding and Indexing
Use a high-quality embedding model and store vectors in a purpose-built database:
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db",
)
Step 3: Retrieval Strategies
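Under the hood, similarity search compares the query embedding against every stored chunk embedding, most commonly by cosine similarity. A minimal sketch of that comparison, using toy 3-dimensional vectors rather than real embeddings:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vec = [0.1, 0.9, 0.2]
chunk_vecs = {"chunk_a": [0.1, 0.8, 0.3], "chunk_b": [0.9, 0.1, 0.0]}

# Rank chunks by similarity to the query, highest first
ranked = sorted(chunk_vecs,
                key=lambda c: cosine_similarity(query_vec, chunk_vecs[c]),
                reverse=True)
print(ranked)  # chunk_a points in nearly the same direction as the query
```

Vector databases like Chroma do exactly this ranking, just over millions of high-dimensional vectors with approximate-nearest-neighbor indexes instead of a brute-force sort.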
Basic similarity search is rarely enough. Layer these techniques:
# MMR: fetch a broad candidate pool, then re-select for relevance + diversity
retriever = vectorstore.as_retriever(
search_type="mmr", # Maximum Marginal Relevance
search_kwargs={
"k": 5,
"fetch_k": 20,
"lambda_mult": 0.7, # Balance relevance vs diversity
},
)
Pro Tip
Add a reranking step using a cross-encoder model. It dramatically improves precision by scoring query-document pairs directly rather than relying on embedding similarity alone.
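A real reranker would load a cross-encoder (for example via the sentence-transformers library) and score each (query, document) pair jointly. The sketch below shows only the rerank pattern itself, with a toy word-overlap scorer standing in for the model:

```python
def cross_encoder_score(query: str, doc: str) -> float:
    # Stand-in for a real cross-encoder: a true model jointly encodes the
    # (query, doc) pair; here we fake a score with simple word overlap.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def rerank(query, docs, top_n=3):
    # Score every retrieved candidate against the query, keep the best top_n
    scored = sorted(docs, key=lambda d: cross_encoder_score(query, d), reverse=True)
    return scored[:top_n]

candidates = [
    "Chunk overlap preserves context across boundaries.",
    "Vector databases store embeddings for fast search.",
    "Rerank retrieved chunks with a cross-encoder for precision.",
]
top = rerank("how does a cross-encoder rerank chunks?", candidates, top_n=2)
print(top)
```

The pattern is the important part: retrieve a wide candidate pool cheaply (the fetch_k above), then spend the expensive pairwise scoring only on those candidates.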
Step 4: Generation with Context
Structure your prompt to make the best use of retrieved context:
from langchain_core.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_messages([
("system", """You are a helpful assistant. Answer the question based ONLY
on the provided context. If the context doesn't contain enough information,
say so clearly. Cite your sources by referencing document names.
Context:
{context}"""),
("human", "{question}"),
])
Evaluation Framework
You can't improve what you can't measure. Track these metrics:
| Metric | What it measures | Tool |
|---|---|---|
| Faithfulness | Does the answer stay true to the context? | RAGAS |
| Relevance | Are retrieved chunks relevant to the query? | RAGAS |
| Answer correctness | Is the final answer correct? | Human eval |
| Latency | End-to-end response time | Custom |
| Cost | Token usage and API costs | LangSmith |
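A minimal harness for the latency and correctness rows might look like the sketch below. The rag_answer function is a hypothetical stand-in for your full pipeline, and the exact-match check is where RAGAS scoring or human evaluation would plug in:

```python
import time

def rag_answer(question: str) -> str:
    # Hypothetical stand-in for the full retrieve-then-generate pipeline
    canned = {"What does RAG stand for?": "Retrieval-Augmented Generation"}
    return canned.get(question, "I don't know")

eval_set = [
    {"question": "What does RAG stand for?", "expected": "Retrieval-Augmented Generation"},
    {"question": "Who invented RAG?", "expected": "I don't know"},
]

results = []
for case in eval_set:
    start = time.perf_counter()
    answer = rag_answer(case["question"])
    latency = time.perf_counter() - start
    results.append({
        # Exact match is crude; swap in RAGAS faithfulness or human grading
        "correct": answer == case["expected"],
        "latency_s": latency,
    })

accuracy = sum(r["correct"] for r in results) / len(results)
print(f"accuracy={accuracy:.0%}")
```

Run this on every change to your chunking, retrieval, or prompts, and track the numbers over time rather than eyeballing individual answers.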
Common Pitfalls
- Not preprocessing documents — Clean HTML, remove boilerplate, normalize formatting
- Ignoring metadata — Add source, date, section headers as filterable metadata
- No fallback — When retrieval confidence is low, say "I don't know" instead of hallucinating
- Skipping evaluation — Build an eval set of 50+ question-answer pairs before optimizing
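The "no fallback" pitfall above can be handled with an explicit confidence gate before generation. A sketch, assuming a hypothetical similarity threshold of 0.5 that you would tune against your own eval set:

```python
FALLBACK = "I don't know — the retrieved documents don't cover this."
CONFIDENCE_THRESHOLD = 0.5  # hypothetical cutoff; tune against your eval set

def answer_with_fallback(scored_chunks, generate):
    # scored_chunks: list of (chunk_text, similarity_score) from the retriever
    confident = [(c, s) for c, s in scored_chunks if s >= CONFIDENCE_THRESHOLD]
    if not confident:
        return FALLBACK  # refuse rather than hallucinate
    return generate([c for c, _ in confident])

# Usage with a stub generator:
stub_generate = lambda chunks: f"Answer grounded in {len(chunks)} chunk(s)"
print(answer_with_fallback([("relevant chunk", 0.82)], stub_generate))
print(answer_with_fallback([("weak match", 0.21)], stub_generate))
```

The right threshold depends on your embedding model and score distribution, which is another reason to build the eval set first.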
What's Next
In the next post, we'll cover advanced RAG patterns: hypothetical document embeddings (HyDE), query decomposition, and agentic RAG architectures.