RAG & Context Management

Question 1

What is the core idea of retrieval-augmented generation?

Accepted Answer

Retrieving relevant context at query time and supplying it in the prompt — RAG fetches relevant passages from an external store at query time and injects them into the prompt as grounding, letting the model answer from current, proprietary, or large corpora without retraining.

Answer

Fine-tuning the model on your documents

Answer

Increasing `max_tokens`

Answer

Lowering temperature to 0
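
As an illustration of the accepted answer to Question 1, here is a minimal sketch of the retrieve-then-inject flow. The word-overlap retriever, the `STORE` contents, and the prompt wording are all assumptions made for the example; a real system would typically use an embedding or keyword index instead.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Rank passages by word overlap with the query (toy stand-in for a real retriever)."""
    q = tokenize(query)
    ranked = sorted(passages, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, passages: list[str]) -> str:
    """Inject the retrieved passages into the prompt as numbered grounding context."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

# Hypothetical external store, invented for this example.
STORE = [
    "The refund window is 30 days from delivery.",
    "Support is available on weekdays from 9 to 5.",
    "Shipping to Canada takes 5 to 7 business days.",
]

query = "How long is the refund window?"
prompt = build_prompt(query, retrieve(query, STORE, k=1))
print(prompt)
```

The model itself is untouched: only the prompt changes from query to query, which is why no retraining is needed.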

Question 2

Why add a short, document-aware summary to each chunk before embedding (contextual retrieval)?

Accepted Answer

To preserve surrounding context so isolated chunks remain meaningful and retrievable — Chunks stripped of their document context retrieve poorly. Prepending a brief, situating summary to each chunk before embedding improves retrieval accuracy by keeping each chunk self-explanatory.

Answer

To reduce the embedding dimension

Answer

To bypass the context window

Answer

To avoid using an embedding model
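
To make the accepted answer to Question 2 concrete, the sketch below shows only the prepending step. The `contextualize` helper, the sample chunk, and the hand-written summary are hypothetical; how the summary is generated (for example, by a model that sees the whole document) is outside this example.

```python
def contextualize(chunk_text: str, summary: str) -> str:
    """Return the text to embed: a short situating summary followed by the chunk."""
    return f"{summary.strip()}\n\n{chunk_text.strip()}"

# Invented example data. On its own, the chunk does not say which company
# or which quarter it refers to, so it would be hard to retrieve.
chunk_text = "Revenue grew 3% over the previous quarter."
summary = "From the ACME Q2 2023 report, section on financial results."

to_embed = contextualize(chunk_text, summary)
print(to_embed)
```

The string in `to_embed` is what would be passed to the embedding model in place of the bare chunk, so a query that mentions the company or the quarter has something to match against.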

Question 3

To make answers auditable, what should a grounded RAG system ask the model to produce?

Accepted Answer

Citations or quotes pointing to the source passages used — Requiring citations or supporting quotes tied to retrieved passages makes responses verifiable and discourages unsupported claims, which is essential for trust in production RAG.

Answer

Longer answers

Answer

Higher temperature output

Answer

A second opinion from another model
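
One way to use the citations from the accepted answer to Question 3 is to check them mechanically. The sketch below assumes the model was asked to return a list of `{"passage_id", "quote"}` records; that output shape, the helper name, and the sample data are assumptions for illustration.

```python
def unsupported_quotes(citations: list[dict], passages: dict[str, str]) -> list[dict]:
    """Return the citations whose quote does not appear verbatim in the cited passage."""
    bad = []
    for citation in citations:
        source = passages.get(citation["passage_id"], "")
        if citation["quote"] not in source:
            bad.append(citation)
    return bad

# Invented retrieved passages and model citations.
passages = {
    "p1": "The warranty covers parts and labor for two years.",
    "p2": "Returns are accepted within 30 days.",
}
citations = [
    {"passage_id": "p1", "quote": "parts and labor for two years"},
    {"passage_id": "p2", "quote": "within 90 days"},  # not in the source
]

flagged = unsupported_quotes(citations, passages)
print(flagged)
```

Exact substring matching is deliberately strict. A production check might normalize whitespace or allow minor differences, but that is a design choice beyond this sketch.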

Question 4

Retrieval quality is poor because chunks are too large and mix unrelated topics. What is the most direct fix?

Accepted Answer

Use smaller, semantically coherent chunks with some overlap — Oversized, topic-mixed chunks dilute embeddings. Smaller, coherent chunks (often with slight overlap to avoid cutting ideas at boundaries) embed and retrieve more precisely.

Answer

Increase `max_tokens`

Answer

Raise temperature

Answer

Remove the system prompt
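
The accepted answer to Question 4 can be sketched as a sentence-window chunker with overlap. Treating sentence boundaries as a proxy for semantic coherence is an assumption of this example, as are the function name and default sizes; real pipelines often split on headings or paragraphs instead.

```python
import re

def chunk_sentences(text: str, max_sentences: int = 3, overlap: int = 1) -> list[str]:
    """Split text into chunks of up to max_sentences, sharing `overlap` sentences
    between neighbouring chunks so ideas are not cut at a boundary."""
    if not 0 <= overlap < max_sentences:
        raise ValueError("overlap must be >= 0 and smaller than max_sentences")
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    step = max_sentences - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + max_sentences]))
        if start + max_sentences >= len(sentences):
            break  # the last sentence is already covered
    return chunks

text = "A one. B two. C three. D four. E five."
for chunk in chunk_sentences(text, max_sentences=3, overlap=1):
    print(chunk)
```

With these settings the sentence "C three." appears at the end of the first chunk and the start of the second, which is the overlap the answer describes.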

Question 5

When the relevant knowledge base fits comfortably within the context window, what is a valid simplification over a vector database?

Accepted Answer

Placing the full corpus directly in the prompt, ideally with prompt caching — If the corpus fits in the context window, you can skip retrieval infrastructure and put the whole corpus in the prompt. Prompt caching keeps this cost-effective across repeated queries.

Answer

Fine-tuning instead

Answer

Using a smaller model

Answer

Disabling retrieval entirely and relying on training data
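
For the accepted answer to Question 5, a rough sketch of the decision and of a cache-friendly prompt layout follows. The 4-characters-per-token estimate, the function names, and the `<doc>` wrapper are assumptions; use the provider's tokenizer and caching documentation for real limits. The layout assumes prefix-based caching, where the stable corpus comes first and the changing question last.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the 4-characters-per-token ratio is an assumption."""
    return int(len(text) / chars_per_token) + 1

def can_inline_corpus(docs: list[str], context_window: int, reserved_tokens: int) -> bool:
    """True if the whole corpus fits while leaving room for the question and the answer."""
    total = sum(estimate_tokens(d) for d in docs)
    return total <= context_window - reserved_tokens

def build_prompt_parts(docs: list[str], question: str) -> tuple[str, str]:
    """Return (stable_prefix, variable_suffix). Keeping the corpus identical and
    first across queries is what lets a prefix cache reuse it."""
    corpus_prefix = "\n\n".join(
        f"<doc id={i}>\n{d}\n</doc>" for i, d in enumerate(docs, start=1)
    )
    return corpus_prefix, f"Question: {question}"

docs = ["a" * 400, "b" * 400]  # placeholder documents
if can_inline_corpus(docs, context_window=1000, reserved_tokens=200):
    prefix, suffix = build_prompt_parts(docs, "What does document 2 say?")
    print(len(prefix), suffix)
```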

Question 6

In a vector-search RAG pipeline, what does the embedding model actually produce?

Accepted Answer

A numeric vector capturing semantic meaning, so similar text lands nearby in vector space — Embeddings map text to dense numeric vectors whose distances reflect semantic similarity. At query time the question is embedded and nearest-neighbor search returns the most semantically relevant chunks, capturing meaning beyond exact keyword overlap.

Answer

A natural-language summary of each chunk

Answer

A keyword index of exact terms

Answer

A compressed copy of the document
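
The nearest-neighbor step in the accepted answer to Question 6 can be sketched with cosine similarity. The three-dimensional vectors below are hand-made stand-ins, not output from any real embedding model, and the brute-force search is an assumption chosen for clarity over speed.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest(query_vec: list[float], index: dict[str, list[float]], k: int = 1) -> list[str]:
    """Return the k chunk texts whose vectors are most similar to the query vector."""
    ranked = sorted(index, key=lambda text: cosine(query_vec, index[text]), reverse=True)
    return ranked[:k]

# Toy index: chunk text -> pretend embedding.
INDEX = {
    "reset your password": [0.9, 0.1, 0.0],
    "change login credentials": [0.8, 0.2, 0.1],
    "quarterly revenue report": [0.0, 0.1, 0.9],
}
query_vec = [0.85, 0.15, 0.05]  # pretend embedding of "forgot my password"

print(nearest(query_vec, INDEX, k=2))
```

Note that "change login credentials" ranks close to the query even though it shares no keyword with "forgot my password", which is the behavior the answer attributes to embeddings.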

Question 7

Pure vector search misses results that depend on exact terms like error codes or product SKUs. What is a common remedy?

Accepted Answer

Hybrid search combining semantic (vector) retrieval with keyword/lexical search — Hybrid retrieval runs semantic and lexical (e.g. BM25/keyword) search together and merges the results, so exact identifiers are caught by keyword matching while conceptual matches come from embeddings. A reranking step often refines the merged set.

Answer

Increase `temperature`

Answer

Switch to a larger generation model

Answer

Remove the embedding step
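
The merge step in the accepted answer to Question 7 can be done in several ways. The sketch below uses reciprocal rank fusion (RRF), which is one common choice and not something the question prescribes. The document IDs, the two input rankings, and the constant `k = 60` are assumptions for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each document scores 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results for a query that contains an exact SKU.
vector_hits = ["doc_concept", "doc_both", "doc_other"]   # semantic search
keyword_hits = ["doc_sku", "doc_both"]                   # lexical search

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Here `doc_sku` was found only by keyword matching and still survives into the fused list, while `doc_both`, returned by both searches, rises to the top. A reranker could then reorder this merged set.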