RAG (Retrieval-Augmented Generation) — Chunking, Embeddings, Context Windows, and Relevance

Why RAG? The Problem RAG Solves

Claude's training data has a knowledge cutoff. It doesn't know your proprietary information, your company's internal docs, or anything published after training ended. RAG lets you supply that information in the context at query time.

Retrieval-augmented generation means: search through your data to find information relevant to the user's question, then put that information into Claude's context. Claude then answers based on what you've retrieved. For adding knowledge, this is often more practical than fine-tuning: it is quick to set up, needs no retraining when your documents change, and scales to large document collections.

The tradeoff: you have to search correctly. If you retrieve irrelevant documents or miss relevant ones, Claude can only work with what you've provided. Answer quality is capped by retrieval quality.

Chunking: Breaking Documents Into Pieces

Feeding entire documents to Claude on every request is rarely practical. You need to split them into chunks, index those chunks, then retrieve only the relevant ones. Chunking sounds simple but is critical: a bad chunking strategy breaks RAG.

Fixed-size chunks: Split documents every N tokens (e.g., 500 tokens per chunk). Simple but naive. A chunk might end mid-sentence or split a paragraph that should stay together. Works as a baseline, but individual chunks can carry incomplete context.
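
A minimal sketch of fixed-size chunking, to make the mid-sentence problem visible. It assumes whitespace-separated words stand in for tokens (a real pipeline would count tokens with the tokenizer of its embedding model), and the function name `fixed_size_chunks` is illustrative, not from any library.

```python
def fixed_size_chunks(text: str, chunk_size: int = 500) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size tokens.

    Assumption: tokens are approximated by whitespace-separated words.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


sample = "The refund window is 30 days. After that, store credit applies."
chunks = fixed_size_chunks(sample, chunk_size=5)
for chunk in chunks:
    print(repr(chunk))
# 'The refund window is 30'
# 'days. After that, store credit'
# 'applies.'
```

The first chunk stops at "30" and loses "days.", which is exactly the kind of boundary damage described above.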

Semantic chunking: Split at logical boundaries (paragraphs, sections, headings). Requires understanding document structure. For markdown or HTML, parse the structure. For plain text, look for double newlines. This preserves semantic meaning within chunks.
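
For plain text, the double-newline approach might look like the sketch below. It is one possible implementation, assuming paragraphs are separated by blank lines and that word counts approximate token counts; `paragraph_chunks` and `max_tokens` are hypothetical names chosen for illustration.

```python
import re


def paragraph_chunks(text: str, max_tokens: int = 500) -> list[str]:
    """Group whole paragraphs into chunks without splitting any paragraph.

    Assumption: word count approximates token count. A single paragraph
    longer than max_tokens is kept intact as its own oversized chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if adding this paragraph would overflow it.
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


doc = "Intro paragraph here.\n\nSecond paragraph with more words.\n\nThird."
para_chunks = paragraph_chunks(doc, max_tokens=8)
print(len(para_chunks))  # 2
```

Markdown or HTML sources would need a real parser to find headings and sections; the regular expression here only handles blank-line paragraph breaks.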

Optimal chunk size: Too small (around 50 tokens) and each chunk lacks the surrounding context it needs, so you end up retrieving many fragments to cover one idea. Too large (around 2000 tokens) and each retrieved chunk carries less-relevant text alongside the relevant part. A common starting range is 300-700 tokens, but experiment with your own data.

Overlap between chunks: If a key concept spans two adjacent chunks, overlapping them slightly ensures you don't lose context at boundaries. 10-20% overlap is common.
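
Overlap can be sketched as a sliding window whose step is smaller than the chunk size. The defaults below (500-token chunks, 75-token overlap, i.e. 15%) are illustrative, words again stand in for tokens, and `overlapping_chunks` is a hypothetical helper name.

```python
def overlapping_chunks(text: str, chunk_size: int = 500, overlap: int = 75) -> list[str]:
    """Sliding-window chunking: each chunk repeats the last `overlap`
    tokens of the previous one.

    Assumption: tokens are approximated by whitespace-separated words.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this window has reached the end of the text,
        # so no tail-only duplicate chunk is emitted.
        if start + chunk_size >= len(words):
            break
    return chunks


windows = overlapping_chunks("a b c d e f g h i j", chunk_size=4, overlap=1)
print(windows)  # ['a b c d', 'd e f g', 'g h i j']
```

Each boundary word ("d", "g") appears in both neighbors, so a concept that straddles a boundary is fully present in at least one chunk as long as it is shorter than the overlap.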

Embeddings and Vector Search

Once you have chunks, you need to find the relevant ones quickly. Embeddings convert text into vectors (lists of numbers). Text with similar meaning has similar vectors. Vector search finds chunks whose vectors are closest to the user's query vector.
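
The "closest vectors" idea is usually measured with cosine similarity. The sketch below uses hand-written three-dimensional toy vectors purely for illustration; real embeddings come from an embedding model and have hundreds or thousands of dimensions. The names `cosine_similarity`, `top_k`, and `toy_index` are assumptions of this example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 3):
    """Brute-force search: score every chunk, return the k best (id, score)."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors, NOT real embeddings.
toy_index = {
    "pricing": [0.9, 0.1, 0.0],
    "auth": [0.1, 0.9, 0.1],
    "changelog": [0.0, 0.2, 0.9],
}
results = top_k([0.8, 0.2, 0.0], toy_index, k=2)
print([cid for cid, _ in results])  # ['pricing', 'auth']
```

This brute-force loop scores every chunk on every query, which is why it only suits small collections.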

Embedding quality matters. Some embedding models are trained specifically for semantic search; others are general-purpose. The choice of embedding model directly affects retrieval quality. Test a few models on your data: embed a query, embed some relevant and some irrelevant chunks, and check that the relevant chunks score closer to the query than the irrelevant ones.

Vector search scaling: For small collections (thousands of chunks), in-memory similarity search works fine. For larger collections (millions of chunks), use vector databases like Pinecone or Weaviate that can scale and provide fast approximate nearest-neighbor search.

Hybrid search: Embeddings capture meaning but can miss exact keyword matches. Many systems use hybrid search: combine vector search with keyword search, then rank results by both signals. For example, a user searches for "API pricing": keyword search finds documents mentioning "pricing", vector search finds docs about "API cost" and "payment tiers", and you return the best of both.
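
One common way to rank by both signals is reciprocal rank fusion (RRF), which combines ranked lists without needing their scores to be on the same scale. This is a swapped-in technique for illustration, not something the text above prescribes; the document ids are made up, and `k=60` is a conventional smoothing constant rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in
    (rank starts at 1); documents are returned by descending total.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score, breaking exact ties alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists for the query "API pricing".
keyword_hits = ["pricing-page", "billing-faq"]
vector_hits = ["api-cost-guide", "pricing-page", "payment-tiers"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
# ['pricing-page', 'api-cost-guide', 'billing-faq', 'payment-tiers']
```

"pricing-page" wins because both searches found it, even though vector search ranked it second.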

Context Window Management

Claude's context window is finite. If you're building a chatbot that retrieves documents, you have to budget tokens: some for the conversation history, some for the system prompt, some for retrieved documents, and some for Claude's response.
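
The budgeting arithmetic can be made explicit. The numbers below are illustrative placeholders, not the limits of any particular Claude model, and the `TokenBudget` class is a hypothetical helper for this example.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Token allocation for one request (all values are assumptions)."""
    context_window: int    # total tokens the model accepts
    system_prompt: int     # tokens used by the system prompt
    history: int           # tokens used by conversation history
    response_reserve: int  # tokens held back for the model's answer

    def available_for_retrieval(self) -> int:
        used = self.system_prompt + self.history + self.response_reserve
        return max(0, self.context_window - used)


budget = TokenBudget(
    context_window=16_000,
    system_prompt=1_500,
    history=6_000,
    response_reserve=4_000,
)
chunk_tokens = 500  # assumed average chunk size
max_chunks = budget.available_for_retrieval() // chunk_tokens

print(budget.available_for_retrieval())  # 4500
print(max_chunks)                        # 9
```

As conversation history grows, the retrieval share shrinks, so the number of chunks you can pass should be recomputed on each turn.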

Retrieve conservatively. Pass Claude only as many chunks as your token budget allows. If you can fit 5 relevant chunks, pass 5. Don't pass 20 hoping Claude will ignore the irrelevant ones: irrelevant text wastes tokens and can distract from the passages that matter.

Rank and re-rank. Retrieve more candidate chunks than you plan to pass along, then use a re-ranker model to score them by relevance. Keep only the top N. This helps ensure you're passing Claude the most relevant information within your token budget.
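
The over-retrieve, re-rank, trim flow could be sketched as follows. The scoring function is pluggable; `word_overlap_score` is a deliberately crude stand-in for a real re-ranker model and exists only so the example runs without external dependencies. Word counts again approximate tokens, and all names are hypothetical.

```python
from typing import Callable


def rerank_and_trim(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    token_budget: int,
) -> list[str]:
    """Sort candidates by score_fn, then keep the best ones that fit."""
    ranked = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    kept: list[str] = []
    used = 0
    for chunk in ranked:
        cost = len(chunk.split())  # assumption: words approximate tokens
        if used + cost > token_budget:
            continue  # this chunk does not fit; a shorter one still might
        kept.append(chunk)
        used += cost
    return kept


def word_overlap_score(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query words present in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0


candidates = [
    "api pricing starts at ten dollars",
    "our office is closed on holidays",
    "pricing tiers for the api are listed here",
]
kept = rerank_and_trim("api pricing tiers", candidates, word_overlap_score, token_budget=14)
print(kept)
# ['pricing tiers for the api are listed here', 'api pricing starts at ten dollars']
```

The irrelevant "office" chunk scores zero and is dropped because the budget is already spent on the two relevant chunks.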

Summarization when needed. If you've retrieved very long documents and need to fit them in context, summarize sections, or use Claude itself to extract the key points, before passing them to your main request.

Measuring RAG Quality

How do you know if your RAG system is working? Measure it. Collect queries where you know the right answer comes from your documents. Run the query through your RAG system and check: Did retrieval return the relevant document? Did Claude use it to answer correctly?

Retrieval metrics: Precision (of retrieved chunks, how many are actually relevant) and recall (of relevant chunks, how many did you retrieve). You need both: a retrieval system that finds 90% of relevant chunks (high recall) but buries them in junk (low precision) wastes context and can degrade answers.
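
Both metrics are simple set arithmetic once you have a labeled test query. The sketch below assumes chunks are identified by string ids and that you have hand-labeled which ids are relevant; the ids and the function name `precision_recall` are made up for illustration.

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one query.

    precision = relevant retrieved / all retrieved
    recall    = relevant retrieved / all relevant
    """
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Retrieval returned 4 chunks; 3 chunks are labeled relevant for this query.
p, r = precision_recall(["c1", "c2", "c3", "c4"], {"c1", "c4", "c9"})
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

Here two of the four retrieved chunks are relevant (precision 0.50), and two of the three relevant chunks were found (recall 0.67). Averaging these over a set of test queries gives a number you can track as you change chunking or embedding choices.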

End-to-end metrics: Does Claude answer the question correctly? If you retrieve perfect information but Claude misinterprets it, the system still fails. You need both accurate retrieval and accurate answers.
