RAG in Production: Avoiding Common Pitfalls
Retrieval-Augmented Generation (RAG) has emerged as a powerful technique to enhance Large Language Models (LLMs) by grounding them in external knowledge sources. While building a basic RAG prototype can seem straightforward, deploying a robust, reliable, and scalable RAG system in production presents a unique set of challenges. There’s so much involved that it can feel overwhelming.
In this post I want to explore a few pitfalls encountered when implementing RAG in production and how to think about mitigating them. I’ll save deeper dives on the points raised for follow-up posts.
Naive Chunking Strategies
When building RAG over documents we often want to implement semantic search, which typically means embedding the documents with an encoder language model. Since most of these models have a maximum sequence length of 512 tokens (some exist with much longer sequence lengths, but there are questions around the efficacy of those models), we need to chunk our documents into manageable passages. While simply splitting documents into fixed-size chunks is a natural approach, and should be tried first in my opinion, it’s rarely the most appropriate strategy.
The Problem: Fixed-size chunks can awkwardly split sentences or paragraphs, severing context. They might separate a question from its answer, a joke from its punchline, or a term from its definition. This can lead to irrelevant retrieved passages and poor generation quality. Chunking this way also ignores document structure: headings, lists, tables and so on. There’s a reason we structure documents; ignoring it makes the LLM’s job much harder at the G step of RAG.
Mitigation: Explore more sophisticated chunking methods, especially ones that match the documents in YOUR domain. Consider semantic chunking (grouping text based on topic similarity), sentence splitting, or recursive chunking that respects document boundaries (paragraphs, sections). Tailor the strategy to your specific data types and expected queries. If you have to build generally, look to chunk with overlaps to help mitigate chunk boundary issues.
One could write an entire series of blog posts on chunking.
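Still, as a concrete starting point, here’s a minimal sketch of the overlap idea: paragraph-aware packing with a word-count budget standing in for a real tokenizer. The function name and limits are placeholders to tune for your own documents, not a recommendation.

```python
def chunk_document(text: str, max_words: int = 200, overlap_words: int = 40) -> list[str]:
    """Split on paragraph boundaries, pack paragraphs into chunks of roughly
    max_words words, and repeat the tail of each chunk to soften boundary cuts."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []

    for para in paragraphs:
        words = para.split()
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = current[-overlap_words:]  # carry the last few words over as overlap
        current.extend(words)

    if current:
        chunks.append(" ".join(current))
    return chunks
```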
Suboptimal Embedding Models
The choice of embedding model is crucial for effective retrieval, and the default option isn't always the best. Look into the training data for the model: is data like yours represented there?
The Problem: Generic, off-the-shelf embedding models might not capture the nuances of domain-specific jargon or specialized concepts within your knowledge base. This leads to poor semantic similarity matching and irrelevant search results.
Mitigation: Evaluate different embedding models, including those specifically trained for certain domains (e.g., finance, biomedical). Consider fine-tuning an embedding model on your own data to improve its understanding of your specific terminology and relationships.
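A lightweight way to compare candidates is a tiny retrieval eval over your own data. Here’s a sketch assuming the sentence-transformers library; the passages and labelled pairs below are toy stand-ins you’d replace with examples from your corpus.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy stand-ins: replace with passages and labelled (query, relevant passage index)
# pairs drawn from YOUR corpus
passages = [
    "The API rate limit is 100 requests per minute per key.",
    "Refunds are processed within 5 business days.",
    "Enterprise plans include single sign-on and audit logs.",
]
eval_pairs = [("how fast are refunds issued", 1), ("does the enterprise tier support SSO", 2)]

def recall_at_1(model_name: str) -> float:
    model = SentenceTransformer(model_name)
    doc_vecs = model.encode(passages, normalize_embeddings=True)
    hits = 0
    for query, gold_idx in eval_pairs:
        q_vec = model.encode([query], normalize_embeddings=True)[0]
        scores = doc_vecs @ q_vec  # cosine similarity, since vectors are normalised
        hits += int(np.argmax(scores) == gold_idx)
    return hits / len(eval_pairs)

# Two general-purpose candidates; add your domain-specific or fine-tuned models here
for name in ["all-MiniLM-L6-v2", "multi-qa-MiniLM-L6-cos-v1"]:
    print(name, recall_at_1(name))
```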
Ineffective Retrieval Mechanisms
Getting the right information to the LLM is harder than it seems. So much discourse on RAG assumes vector search as THE way to do RAG - that’s simply not the case. There are many ways to retrieve relevant information; semantic search is just one avenue and has its flaws.
The Problem: Basic vector similarity search might retrieve documents that are semantically close but not actually relevant to the user's specific query. It can also struggle with keywords or specific entities. Furthermore, retrieving too many or too few chunks can overwhelm the LLM or starve it of necessary context.
Mitigation: Explore hybrid search approaches that combine semantic search with traditional keyword search techniques (BM25 is often used). Use reranking models after the initial retrieval step to prioritize the most relevant chunks based on the specific query. Experiment with query transformations (e.g., expanding queries, generating hypothetical answers) to improve retrieval recall and precision.
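One common hybrid pattern is to run BM25 and vector search separately and fuse the rankings. A minimal sketch of reciprocal rank fusion follows; the doc ids are made up for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc ids (e.g. BM25 and vector search)
    into a single ranking using reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from two retrievers over the same corpus
bm25_ranking = ["doc_7", "doc_2", "doc_9"]
vector_ranking = ["doc_2", "doc_4", "doc_7"]
print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))
```

The nice thing about fusing ranks rather than raw scores is that you don’t have to worry about BM25 and cosine similarity living on different scales.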
Poor Integration with the LLM
While one of the impressive things about modern LLMs is their robustness to distractions in the prompt, simply stuffing retrieved context into the LLM's prompt often yields suboptimal results. You want to make the LLM’s job as simple as possible; poor retrieval leads to distracting citations and weaker adherence to the instructions given.
The Problem: LLMs have (ever increasing) finite context windows. Overloading the prompt with too much retrieved text can push out crucial instructions or the original query. The LLM might also struggle to synthesize information scattered across multiple retrieved chunks or ignore the provided context altogether and make up some bullshit (hallucinations).
Mitigation: Prompt engineer and evaluate your citation prompts. Spend the time to craft eval datasets for this; it’s so worth it. In your prompt, clearly instruct the LLM on how to use the provided context. Experiment with different context presentation formats (e.g., numbering sources, summarizing chunks). If you’ve got latency wiggle room, experiment with strategies to condense, summarize or otherwise select the most vital information from retrieved citations before passing them to the LLM.
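To illustrate the "numbered sources" format, here’s a minimal sketch of a context-assembly function. The assumption that chunks arrive as dicts with title and text fields is just for the example; the wording of the instructions is a starting point to evaluate, not a recipe.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt with numbered sources so the model can cite
    them, and tell it what to do when the context is insufficient."""
    sources = "\n\n".join(
        f"[{i + 1}] ({c['title']}) {c['text']}" for i, c in enumerate(chunks)
    )
    return (
        "Answer the question using ONLY the numbered sources below. "
        "Cite sources inline like [1]. If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )
```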
Scalability and Performance Bottlenecks
Production systems face demands far exceeding those of a simple prototype.
The Problem: Handling millions or billions of documents, managing high query throughput, and maintaining low latency requires careful engineering. Vector databases can become bottlenecks if not properly configured, indexed, and scaled. The embedding process itself can be computationally expensive.
Mitigation: Choose a vector database that meets your scalability requirements. Optimize indexing strategies (e.g., IVF, HNSW parameters). Implement caching mechanisms for frequently accessed data or queries. Distribute embedding computation and retrieval processes. Monitor system performance closely and proactively scale resources; better yet, implement auto-scaling to actively meet demand.
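To give a feel for the indexing knobs, here’s a minimal HNSW sketch using FAISS as one possible index (random vectors stand in for real embeddings, and the parameter values are starting points to benchmark, not recommendations). The same recall/latency trade-offs exist under different names in most vector databases.

```python
import numpy as np
import faiss  # one possible choice; most vector stores expose similar knobs

d = 384                                                 # embedding dimensionality (model dependent)
vectors = np.random.rand(10_000, d).astype("float32")   # stand-in for real embeddings

index = faiss.IndexHNSWFlat(d, 32)    # M=32 neighbours per node: memory vs recall trade-off
index.hnsw.efConstruction = 200       # higher = better graph quality, slower build
index.hnsw.efSearch = 64              # higher = better recall, higher query latency
index.add(vectors)

query = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query, 10)  # top-10 nearest neighbours
```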
Evaluation and Monitoring Gaps
Evals, evals, evals! Knowing if your RAG system actually works well in production can be a complex endeavour.
The Problem: Simple accuracy metrics used in prototyping don't capture the full picture. How relevant are the retrieved documents? Is the generated answer faithful to the source material? Is the system fast enough? Without robust evaluation and monitoring, it's impossible to iterate and improve.
Mitigation: Define a comprehensive set of evaluation metrics beyond just answer correctness, and try to find metrics that map to the broader desired outcomes. Why are you building RAG in the first place? Measure your system’s impact on that. Evaluate the components independently and together. Include retrieval metrics (e.g., NDCG, precision@k), generation quality metrics (e.g., faithfulness, relevance, fluency), and operational metrics (e.g., latency, throughput). Implement logging and tracing to understand system behavior. Establish feedback loops (e.g., user ratings) to continuously refine the system.
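The retrieval metrics at least are simple enough to sketch directly. This assumes binary relevance judgements keyed by doc id; the example ids are made up.

```python
import math

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """NDCG with binary relevance: discounted gain over the ideal ordering."""
    dcg = sum(1 / math.log2(i + 2) for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

# Hypothetical example: which of the retrieved ids were judged relevant
print(precision_at_k(["d3", "d1", "d8"], {"d1", "d5"}, k=3))
print(ndcg_at_k(["d3", "d1", "d8"], {"d1", "d5"}, k=3))
```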
Data Management and Freshness
Most of the time the knowledge base underpinning your RAG system isn't static. Data changes, models drift, the march of time is relentless.
The Problem: Documents get updated, new information becomes available, and old data becomes obsolete. Keeping all components synchronized with the source data is critical. Stale data leads to incorrect or outdated answers. Nothing frustrates end users and erodes their trust more than being misled by the system.
Mitigation: Build robust data ingestion pipelines that automatically process and index new or updated documents. Implement strategies for efficiently updating or removing stale artifacts, for example old vectors from the index. Ideally you can do this without having to reindex your entire corpus. Version your data, track creation and update times, and implement time-based relevance filtering.
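One simple pattern here is content hashing, so only documents that have actually changed get re-embedded. A minimal sketch follows; the embed function and index.upsert call are hypothetical stand-ins for your embedding step and your vector store’s API.

```python
import hashlib

indexed_hashes: dict[str, str] = {}  # doc_id -> content hash at last indexing time

def sync_document(doc_id: str, text: str, embed, index) -> None:
    """Re-embed and upsert a document only when its content has changed,
    so the corpus never needs a full reindex on every sync."""
    content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if indexed_hashes.get(doc_id) == content_hash:
        return                        # unchanged since the last sync, nothing to do
    vector = embed(text)              # your embedding function
    index.upsert(doc_id, vector)      # hypothetical vector-store upsert call
    indexed_hashes[doc_id] = content_hash
```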
Conclusion
RAG offers immense potential, but transitioning from a promising prototype to a production-grade system requires overcoming significant hurdles. The hope is that by anticipating and mitigating these common pitfalls, teams can build more robust, reliable, and effective RAG applications that deliver real value.