Citation Prompting: Integrating Retrieval Results with LLMs in RAG
Previous posts have explored optimizing the "R" in RAG (Retrieval-Augmented Generation) through better chunking, embedding models, and hybrid search. Now, we turn our attention heads to the "A" (Augmentation) and "G" (Generation) phases – specifically, I want to provide some tips for effectively integrating the retrieved context into the LLM prompt and evaluating the resulting output.
Simply retrieving relevant documents and shoving them into the context (usually) isn't enough. The way one presents this information to the LLM is vital for generating accurate, trustworthy, and grounded responses. Mishandling this hand-off, or completely overlooking its importance, is a common pitfall that can negate the benefits of even the best retrieval system.
We call the prompt that wraps the retrieved context the citation prompt. This post is a non-exhaustive list of tips and suggestions; please reach out with salient ones I’ve missed.
Structuring Context for LLM Consumption
The prompt serves as the LLM's instruction manual and its source of truth for a given query. It tells the LLM what the passed context is and how to use it. How you structure your citation prompt, especially the retrieved context, significantly impacts the output quality.
Some tips for structuring the retrieved context:
Numbering/Labeling: Assign unique identifiers (e.g., [doc-1], [source-A]) to each chunk. This facilitates accurate citation later.
Delimiters/Tags: Use clear separators or even XML-like tags (e.g., <context source="doc_xyz.pdf" page="3">...</context>) to demarcate different pieces of retrieved information. This helps the LLM distinguish between different sources and the main query/instructions.
Context Ordering: Does the order matter? Often, yes. Placing the most relevant chunks (perhaps identified by a reranker, as discussed previously) earlier in the context window might improve recall of facts from those chunks. LLMs often exhibit a "lost in the middle" problem where information in the middle of long contexts is less attended to.
Metadata Inclusion: If there are useful bits of metadata attached to your corpus docs, include them. This might be stuff like the original document title or URL, the date the doc was created or last updated, or perhaps the author. A sketch of how these structuring tips might come together follows this list.
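As a minimal sketch of how these structuring tips might fit together, assuming an illustrative Chunk record and XML-like tags (the field names and tag attributes here are examples, not a standard):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    # Illustrative fields; your retriever's payload will look different.
    source: str    # e.g. original filename or URL
    page: int
    updated: str   # e.g. last-updated date
    text: str
    score: float   # relevance score from retrieval/reranking

def format_context(chunks: list[Chunk]) -> str:
    """Render retrieved chunks with unique labels, XML-like tags, and metadata,
    ordered so the highest-scoring chunks appear first."""
    ordered = sorted(chunks, key=lambda c: c.score, reverse=True)
    blocks = []
    for i, c in enumerate(ordered, start=1):
        blocks.append(
            f'<context id="doc-{i}" source="{c.source}" page="{c.page}" updated="{c.updated}">\n'
            f"{c.text}\n"
            f"</context>"
        )
    return "\n\n".join(blocks)
```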
LLMs have finite context windows. If your retrieved context exceeds this limit, you need a strategy to make the best use of the context tokens you have. Some approaches are:
Naive Truncation: Simply cutting off excess context is easy but risks losing vital information.
Selective Inclusion: Prioritize chunks based on relevance scores (from retrieval/reranking) and include only the top K that fit (a rough sketch of this follows the list).
Context Summarization: Use another LLM call (or a cheaper model) to summarize the retrieved chunks before inserting them into the main generation prompt – this trades off potential detail loss for fitting more information overall.
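A rough sketch of selective inclusion, reusing the Chunk record from the earlier sketch; the 4-characters-per-token estimate is a crude stand-in for your model's actual tokenizer:

```python
def select_chunks(chunks: list[Chunk], token_budget: int) -> list[Chunk]:
    """Greedily keep the highest-scoring chunks that fit within the budget."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = max(1, len(chunk.text) // 4)  # rough token estimate; use a real tokenizer in practice
        if used + cost > token_budget:
            continue  # skip chunks that don't fit (or `break` to stop at the first overflow)
        selected.append(chunk)
        used += cost
    return selected
```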
Don't assume the LLM knows how to use the context. While some LLM providers are now releasing models more focused on RAG, you shouldn’t assume much of anything when working with LLMs. Your prompt must explicitly guide the LLM. You really can’t be too specific here; this is where you can imbue the prompt with some domain expertise and define behaviour. Examples (a fuller citation prompt combining these is sketched after the list):
"Answer the following query based *solely* on the provided context documents. Do not use any prior knowledge."
"Synthesize the information from the provided sources [doc-1], [doc-2], ... to answer the user's question."
"If the context does not contain the answer, state that the information is unavailable in the provided documents."
Evaluating Generation
Standard LM evaluation metrics (like BLEU, ROUGE for summarization, or general coherence/fluency assessments) are insufficient for RAG (arguably insufficient for anything outside translation). We need metrics focused on the interaction between the generated answer and the provided context.
Key RAG Generation Metrics:
Faithfulness / Groundedness: Is the generated answer supported solely by information present in the provided context? Does it contain contradictions or claims not supported by the sources (hallucinations)?
Answer Relevance: Is the generated answer directly relevant to the original user query? (This is distinct from retrieval relevance, which measures if the retrieved chunks were relevant to the query). An LLM might faithfully summarize irrelevant context, resulting in a faithful but irrelevant answer.
Evaluation Methodologies:
LLM-as-a-Judge: Use an “oracle” LLM with carefully crafted prompts to assess the generated answer against the query and context for faithfulness and relevance. Structured outputs are super helpful here; at least while iterating on this prompt, ask the LLM to give a reason for its decision, as this will help you hone the eval quickly. If you have a golden dataset, you can use an LLM to compare the generated answer with the golden answer in the same way. A sketch of such a judge prompt follows this list.
Human Evaluation: The gold standard, but expensive, boring and slow. This is especially difficult if you’re building RAG in a domain where you are not a subject matter expert.
Automated “Canary” Metrics: Simpler methods like checking n-gram overlap between the answer and context can provide proxy signals for faithfulness, but lack semantic understanding. Tools and frameworks specifically for RAG evaluation (e.g., RAGAs, TruLens) are emerging to automate aspects of this, but I’m yet to see value commensurate with the lift required to set them up.
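To make the LLM-as-a-judge idea concrete, here is a sketch of a judge prompt that asks for structured verdicts on faithfulness and answer relevance, each with a reason; call_llm is a placeholder for whichever client you use, and the 1-5 rubric is an assumption to adapt:

```python
import json

JUDGE_PROMPT = """\
You are evaluating an answer produced by a RAG system.

Context:
{context}

Query: {query}

Answer: {answer}

Return a JSON object with exactly these fields:
- "faithfulness": integer 1-5, is every claim in the answer supported by the context?
- "faithfulness_reason": one sentence justifying the score
- "answer_relevance": integer 1-5, does the answer actually address the query?
- "answer_relevance_reason": one sentence justifying the score
"""

def judge(query: str, context: str, answer: str, call_llm) -> dict:
    """call_llm is a stand-in: any function that takes a prompt string and
    returns the model's text completion."""
    raw = call_llm(JUDGE_PROMPT.format(context=context, query=query, answer=answer))
    return json.loads(raw)  # in practice, validate or repair the JSON before trusting it
```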
It’s a PITA, but you’re probably going to have to manually write some query-answer pairs. While laborious, I promise you this exercise is worth it. It pays dividends, both in the output artifact and in familiarizing yourself with every part of the end-to-end system.
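For what it’s worth, those hand-written pairs don’t need a special format; something as simple as the following (fields and values are purely illustrative) is enough to drive the judge sketch above or a golden-answer comparison:

```python
golden_set = [
    {
        "query": "What is the notice period for terminating the contract?",
        "golden_answer": "Either party may terminate with 30 days' written notice.",
        "expected_sources": ["contract_v2.pdf"],
    },
    # ... a few dozen of these, covering the query patterns you actually see
]
```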
Domain Specificity Matters in Prompting
The best citation prompt structure, instructions, and even the choice of LLM can vary significantly depending on the application domain. In these examples I’m short-cutting the prompt-structuring advice I give above for brevity’s sake; a sketch of how domain-specific instructions might be wired into a shared template follows these examples.
Legal RAG: Prompts might need to enforce extreme literalness: "Extract the exact definition of 'Force Majeure' from clause 12.b of [doc-contract]. Do not interpret or paraphrase." Citation accuracy is paramount in the legal domain.
Customer Support RAG: Prompts might encourage synthesis and empathy: "Review the following chat logs [log-1], [log-2] and knowledge base article [kb-article-4] regarding printer setup issues. Provide a step-by-step solution for the customer in a friendly tone, citing the article if applicable."
Technical/Code RAG: Prompts need to handle code formatting and potentially reason across multiple technical documents: "Based on the API documentation [api-docs] and the error message log [error-log], explain why the 'authentication failed' error might be occurring and suggest code fixes, citing relevant API endpoints."
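One way to wire this up, assuming the shared scaffolding from the earlier sketches, is to keep the domain-specific instruction separate from the rest of the prompt; the domain keys and wording below are illustrative:

```python
DOMAIN_INSTRUCTIONS = {
    "legal": (
        "Quote definitions and clauses verbatim from the cited documents. "
        "Do not interpret or paraphrase. Always cite the document id and clause."
    ),
    "support": (
        "Provide a step-by-step solution in a friendly tone, "
        "citing the knowledge base article if applicable."
    ),
    "code": (
        "Explain the likely cause of the error and suggest fixes, "
        "citing the relevant API endpoints from the documentation."
    ),
}

def build_domain_prompt(domain: str, query: str, chunks: list[Chunk]) -> str:
    # format_context and Chunk come from the structuring sketch above.
    instruction = DOMAIN_INSTRUCTIONS[domain]
    return f"{instruction}\n\n{format_context(chunks)}\n\nQuery: {query}"
```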
Other things to think about
Context Processing Overhead: Complex context structuring, summarization, or selection adds latency before the main generation call. You have to think about your end-user here, the impact of this increased latency may or may not matter. With the advent of “reasoning” models and tools like deep research, I do think that people are becoming more comfortable with waiting on good, well supported answers.
Instruction Following Reliability: LLMs, especially smaller/faster ones, might struggle with adherence (e.g., fail to cite, use outside knowledge, hallucinate). This often requires iterative prompt refinement, few-shot examples in the prompt, or potentially using more capable (but slower/costlier) models.
Balancing Act: There's a constant trade-off between prompt and citation processing complexity (for better control and grounding) and system latency/cost.
RAG is more than just sophisticated retrieval—it's about the artful choreography between context presentation and generation. By investing time in thoughtfully structuring your citation prompts and establishing rigorous evaluation processes, you'll create systems that not only retrieve relevant information but transform it into trustworthy, grounded responses that truly serve your users' needs.