Building Production RAG Applications: A Complete Guide
Learn how to build Retrieval-Augmented Generation systems that actually work in production — from chunking strategies to evaluation frameworks.
Retrieval-Augmented Generation (RAG) combines the knowledge of your documents with the reasoning ability of LLMs. But getting it to work reliably in production is harder than tutorials make it seem.
The RAG Pipeline
A production RAG system has four stages:
- Ingestion — Process and chunk your documents
- Indexing — Create embeddings and store in a vector database
- Retrieval — Find the most relevant chunks for a query
- Generation — Use retrieved context to generate an answer
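The four stages can be sketched as a simple data flow. Every function below is a hypothetical placeholder (toy word-overlap "embeddings", a stub generator) meant only to show how the stages hand off to each other, not a real library API:

```python
# Minimal skeleton of the four-stage pipeline; all helpers are toy stand-ins.

def ingest(raw_docs):
    # Stage 1: clean and chunk raw documents (here: split on blank lines)
    return [chunk for doc in raw_docs for chunk in doc.split("\n\n") if chunk.strip()]

def index(chunks):
    # Stage 2: embed each chunk; a toy "embedding" here is just its word set
    return {chunk: set(chunk.lower().split()) for chunk in chunks}

def retrieve(store, query, k=2):
    # Stage 3: score chunks against the query (toy overlap score), take top-k
    q = set(query.lower().split())
    return sorted(store, key=lambda c: len(store[c] & q), reverse=True)[:k]

def generate(query, context):
    # Stage 4: in a real system this calls an LLM with the retrieved context
    return f"Answer to {query!r} using {len(context)} retrieved chunk(s)"

docs = ["RAG combines retrieval with generation.\n\nChunking splits documents."]
chunks = ingest(docs)
store = index(chunks)
context = retrieve(store, "how does chunking work?")
print(generate("how does chunking work?", context))
```

The rest of this guide fills in each placeholder with production-grade pieces.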
Step 1: Smart Chunking
The most common mistake is naive fixed-size chunking by character count. Instead, split recursively along document structure, so chunks follow natural boundaries like paragraphs and sentences:
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=512,
chunk_overlap=50,
separators=["\n\n", "\n", ". ", " ", ""],
length_function=len,
)
chunks = splitter.split_documents(documents)
Chunk Size Matters
Chunks that are too small lose context. Chunks that are too large dilute relevance. Start with 512 tokens and adjust based on your evaluation metrics.
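Note that with length_function=len the splitter above measures characters, not tokens. To size chunks in tokens, you can pass a tokenizer-backed length function instead. The sketch below uses a rough whitespace proxy; a real setup would count with your model's actual tokenizer (e.g. tiktoken), which is an assumption outside this snippet:

```python
# Rough token-count proxy: whitespace-separated words. For accurate counts,
# swap in your model's tokenizer (e.g. len(enc.encode(text)) with tiktoken).
def rough_token_len(text: str) -> int:
    return len(text.split())

# Passed to the splitter in place of `len`:
# splitter = RecursiveCharacterTextSplitter(
#     chunk_size=512,            # now interpreted as ~512 tokens
#     chunk_overlap=50,
#     length_function=rough_token_len,
# )

print(rough_token_len("Retrieval-Augmented Generation combines documents with LLMs."))
```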
Step 2: Embedding and Indexing
Use a high-quality embedding model and store vectors in a purpose-built database:
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db",
)
Step 3: Retrieval Strategies
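Under the hood, similarity search compares the query embedding against every stored chunk embedding, most commonly by cosine similarity. A minimal sketch of that comparison, using toy 3-dimensional vectors rather than real embeddings:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vec = [0.1, 0.9, 0.2]
chunk_vecs = {"chunk_a": [0.1, 0.8, 0.3], "chunk_b": [0.9, 0.1, 0.0]}

# Rank chunks by similarity to the query, highest first
ranked = sorted(chunk_vecs,
                key=lambda c: cosine_similarity(query_vec, chunk_vecs[c]),
                reverse=True)
print(ranked)  # chunk_a points in nearly the same direction as the query
```

Vector databases like Chroma do exactly this ranking, just over millions of high-dimensional vectors with approximate-nearest-neighbor indexes instead of a brute-force sort.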
Basic similarity search is rarely enough. Layer these techniques:
# MMR: fetch a broad candidate pool, then re-select for relevance + diversity
retriever = vectorstore.as_retriever(
search_type="mmr", # Maximum Marginal Relevance
search_kwargs={
"k": 5,
"fetch_k": 20,
"lambda_mult": 0.7, # Balance relevance vs diversity
},
)
Pro Tip
Add a reranking step using a cross-encoder model. It dramatically improves precision by scoring query-document pairs directly rather than relying on embedding similarity alone.
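A real reranker would load a cross-encoder (for example via the sentence-transformers library) and score each (query, document) pair jointly. The sketch below shows only the rerank pattern itself, with a toy word-overlap scorer standing in for the model:

```python
def cross_encoder_score(query: str, doc: str) -> float:
    # Stand-in for a real cross-encoder: a true model jointly encodes the
    # (query, doc) pair; here we fake a score with simple word overlap.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def rerank(query, docs, top_n=3):
    # Score every retrieved candidate against the query, keep the best top_n
    scored = sorted(docs, key=lambda d: cross_encoder_score(query, d), reverse=True)
    return scored[:top_n]

candidates = [
    "Chunk overlap preserves context across boundaries.",
    "Vector databases store embeddings for fast search.",
    "Rerank retrieved chunks with a cross-encoder for precision.",
]
top = rerank("how does a cross-encoder rerank chunks?", candidates, top_n=2)
print(top)
```

The pattern is the important part: retrieve a wide candidate pool cheaply (the fetch_k above), then spend the expensive pairwise scoring only on those candidates.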
Step 4: Generation with Context
Structure your prompt to make the best use of retrieved context:
from langchain_core.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_messages([
("system", """You are a helpful assistant. Answer the question based ONLY
on the provided context. If the context doesn't contain enough information,
say so clearly. Cite your sources by referencing document names.
Context:
{context}"""),
("human", "{question}"),
])
Evaluation Framework
You can't improve what you can't measure. Track these metrics:
| Metric | What it measures | Tool |
|---|---|---|
| Faithfulness | Does the answer stay true to the context? | RAGAS |
| Relevance | Are retrieved chunks relevant to the query? | RAGAS |
| Answer correctness | Is the final answer correct? | Human eval |
| Latency | End-to-end response time | Custom |
| Cost | Token usage and API costs | LangSmith |
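A minimal harness for the latency and correctness rows might look like the sketch below. The rag_answer function is a hypothetical stand-in for your full pipeline, and the exact-match check is where RAGAS scoring or human evaluation would plug in:

```python
import time

def rag_answer(question: str) -> str:
    # Hypothetical stand-in for the full retrieve-then-generate pipeline
    canned = {"What does RAG stand for?": "Retrieval-Augmented Generation"}
    return canned.get(question, "I don't know")

eval_set = [
    {"question": "What does RAG stand for?", "expected": "Retrieval-Augmented Generation"},
    {"question": "Who invented RAG?", "expected": "I don't know"},
]

results = []
for case in eval_set:
    start = time.perf_counter()
    answer = rag_answer(case["question"])
    latency = time.perf_counter() - start
    results.append({
        # Exact match is crude; swap in RAGAS faithfulness or human grading
        "correct": answer == case["expected"],
        "latency_s": latency,
    })

accuracy = sum(r["correct"] for r in results) / len(results)
print(f"accuracy={accuracy:.0%}")
```

Run this on every change to your chunking, retrieval, or prompts, and track the numbers over time rather than eyeballing individual answers.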
Common Pitfalls
- Not preprocessing documents — Clean HTML, remove boilerplate, normalize formatting
- Ignoring metadata — Add source, date, section headers as filterable metadata
- No fallback — When retrieval confidence is low, say "I don't know" instead of hallucinating
- Skipping evaluation — Build an eval set of 50+ question-answer pairs before optimizing
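The "no fallback" pitfall above can be handled with an explicit confidence gate before generation. A sketch, assuming a hypothetical similarity threshold of 0.5 that you would tune against your own eval set:

```python
FALLBACK = "I don't know — the retrieved documents don't cover this."
CONFIDENCE_THRESHOLD = 0.5  # hypothetical cutoff; tune against your eval set

def answer_with_fallback(scored_chunks, generate):
    # scored_chunks: list of (chunk_text, similarity_score) from the retriever
    confident = [(c, s) for c, s in scored_chunks if s >= CONFIDENCE_THRESHOLD]
    if not confident:
        return FALLBACK  # refuse rather than hallucinate
    return generate([c for c, _ in confident])

# Usage with a stub generator:
stub_generate = lambda chunks: f"Answer grounded in {len(chunks)} chunk(s)"
print(answer_with_fallback([("relevant chunk", 0.82)], stub_generate))
print(answer_with_fallback([("weak match", 0.21)], stub_generate))
```

The right threshold depends on your embedding model and score distribution, which is another reason to build the eval set first.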
What's Next
In the next post, we'll cover advanced RAG patterns: hypothetical document embeddings (HyDE), query decomposition, and agentic RAG architectures.