Choosing Embedding Models for RAG

In a vector search powered Retrieval-Augmented Generation (RAG) system, the embedding model acts as the translator, converting vast amounts of text into numerical vectors that aim to capture semantic meaning. The quality of this translation directly dictates how well the system can find relevant information to answer user queries. Choosing the first embedding model you find, or even a popular general-purpose one, can often be a trap, leading to suboptimal performance that undermines the entire RAG pipeline.

Understanding why a model might be suboptimal and knowing how to evaluate alternatives is crucial for building a high-performing RAG system. What follows is a non-exhaustive list of things to keep in mind when picking embedding models for your RAG system.

A quick aside

Not every RAG system needs semantic vector search. If yours doesn’t, you don’t need to worry about embedding models and you can skip this blog post.

What Makes an Embedding Model a Poor Fit for Your Use Case?

Not all embedding models are created equal, and even state-of-the-art models can perform poorly if misaligned with your specific needs and data. Here’s how a model can fall short:

  1. Poor Domain Alignment: Models trained primarily on general web text (like CommonCrawl, The Pile, etc.) may struggle to grasp the nuances of specialized domains like legal contracts, medical research papers, financial reports, or internal company jargon. They might fail to recognize synonyms or related concepts specific to your field, leading to poor or, worse, irrelevant search results.

  2. Inadequate Granularity: Some models excel at capturing the overall topic of a paragraph but perform poorly when comparing the meaning of individual sentences. Others might do the reverse. If your chunking strategy creates sentence-level chunks, but your model is better at paragraph-level semantics, you'll have a mismatch that hurts retrieval precision.

  3. Mismatched Dimensionality: Embedding vectors have a specific number of dimensions.

    • Higher dimensions can potentially capture more detailed semantic information but come with higher computational costs (more storage, slower search) and can sometimes suffer from the "curse of dimensionality," where distances become less meaningful.

    • Lower dimensions are faster and require less storage but might compress information too much, losing important nuances. The optimal dimensionality often depends on the dataset size, complexity, and performance requirements.

  4. Language Mismatches: Using a model predominantly trained on English for querying documents in German, or vice versa, will yield poor results. Even multilingual models have varying levels of proficiency across languages; ensure the model strongly supports your target language(s).

  5. Sensitivity to Phrasing: Some models are less robust to variations in how a query is phrased compared to the text in the knowledge base. Ideally, the model should retrieve the same relevant chunks whether the user asks "What were Q3 profits?" or "Show me the third-quarter earnings report." Other models require task-specific prefixes for retrieval, and performance drops significantly if these are omitted (see the sketch after this list).

  6. Outdated Knowledge: Embedding models capture relationships based on the data they were trained on. A model trained several years ago might not understand newer concepts, technologies, or terminology relevant to your current knowledge base. See what BERT thinks is similar to “COVID”.
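
To make the prefix point concrete, here is a minimal sketch, assuming the sentence-transformers library and an E5-style model (E5 documents "query: " and "passage: " prefixes on its model card). Always check the model card for the exact strings your model expects:

```python
from sentence_transformers import SentenceTransformer, util

# E5-style models are trained with task prefixes; the model below is just an
# example and can be swapped for any other prefix-style embedding model.
model = SentenceTransformer("intfloat/e5-base-v2")

query = "query: What were Q3 profits?"
passages = [
    "passage: Third-quarter earnings rose 12% year over year.",
    "passage: The company opened a new office in Berlin.",
]

q_emb = model.encode(query, normalize_embeddings=True)
p_embs = model.encode(passages, normalize_embeddings=True)

# Cosine similarities; dropping the prefixes typically lowers these scores.
print(util.cos_sim(q_emb, p_embs))
```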

How to Choose the Right Embedding Model

Define Requirements

Of course, domain-specific evaluations are key for picking an embedding model. But before you get to evaluating models, understand your context:

  • Domain: What subject matter does your knowledge base cover? Is it general or highly specialized?

  • Data Characteristics: What is the typical length and structure of your text chunks?

  • Query Types: What kinds of questions will users ask? Are they keyword-based, natural language questions, or something else?

  • Language(s): What language(s) are your documents and queries in?

  • Performance Needs: What are your latency requirements for retrieval? What are your computational/budget constraints?

Explore Candidate Models

Start with a small model, get a baseline, hill climb from there.

  • Check leaderboards like MTEB (Massive Text Embedding Benchmark), which evaluate models on diverse tasks. (Use this as a guide, not gospel; benchmarks are not without their drawbacks.)

  • Look for models specifically trained or fine-tuned for your domain (e.g., BioBERT for biomedical text, FinBERT for financial text).

  • Consider model size and dimensionality in relation to your performance needs. If you are tight on resources, consider quantized models or models that support Matryoshka truncation of their embeddings (see the sketch below).
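
As a rough illustration of that last point, here is a sketch of Matryoshka-style truncation. It assumes a model trained with Matryoshka Representation Learning; the model name is a placeholder, so check the model card before truncating:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder model name; truncation only makes sense for models trained
# with Matryoshka Representation Learning.
model = SentenceTransformer("your-org/matryoshka-embedding-model")

full = model.encode(["What were Q3 profits?"], normalize_embeddings=True)

# Keep only the first 256 dimensions, then re-normalize so cosine
# similarity still behaves as expected.
truncated = full[:, :256]
truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

print(full.shape, "->", truncated.shape)
```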

Evaluate Offline with a Golden Dataset

Create a representative test set consisting of sample queries and the corresponding "ground truth" relevant chunks from your knowledge base.

  • Embed your test chunks using candidate models.

  • Run the sample queries against the embedded chunks.

  • Measure retrieval performance using metrics like:

    • Hit Rate (is the correct chunk in the top K results?)

    • MRR (Mean Reciprocal Rank - how high up is the first correct result?)

    • NDCG (Normalized Discounted Cumulative Gain - considers the position of all relevant results).
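
A minimal sketch of this kind of offline evaluation, using a toy golden set and a small candidate model (both stand-ins for your own data and shortlist), might look like this:

```python
import numpy as np
from sentence_transformers import SentenceTransformer, util

golden = [
    # (query, index of the ground-truth chunk in `chunks`)
    ("What were Q3 profits?", 0),
    ("Where did the company expand?", 1),
]
chunks = [
    "Third-quarter earnings rose 12% year over year.",
    "The company opened a new office in Berlin.",
    "Employee headcount stayed flat in 2023.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # candidate model under test
chunk_embs = model.encode(chunks, normalize_embeddings=True)

k = 2
hits, reciprocal_ranks = 0, []
for query, gold_idx in golden:
    q_emb = model.encode(query, normalize_embeddings=True)
    scores = util.cos_sim(q_emb, chunk_embs)[0]
    ranking = np.argsort(-scores.numpy())  # best match first
    rank = int(np.where(ranking == gold_idx)[0][0]) + 1
    hits += int(rank <= k)
    reciprocal_ranks.append(1.0 / rank)

print(f"Hit rate@{k}: {hits / len(golden):.2f}")
print(f"MRR: {np.mean(reciprocal_ranks):.2f}")
```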

Evaluate Qualitatively (Online/User Testing)

Offline metrics are helpful but don't tell the whole story.

  • Integrate promising candidate models into a prototype RAG system.

  • Test with real user queries. Are these searches finding what you expect?

  • Assess the actual relevance of the retrieved chunks through manual review (Yes, look at your data!), expert feedback, or user testing. Sometimes a model scores well on metrics but retrieves subtly irrelevant information.

Balance Cost vs. Performance

Compare the retrieval quality achieved by different models against their associated costs: embedding computation time, vector storage size, and search latency. A slightly less accurate but significantly faster/cheaper model might be the pragmatic choice for production.
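
A back-of-envelope comparison helps frame that trade-off. All numbers below are illustrative assumptions, not measurements:

```python
# Compare raw vector storage for two hypothetical candidate models.
def index_size_gb(num_chunks: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw float32 vector storage, ignoring index overhead."""
    return num_chunks * dims * bytes_per_value / 1e9

corpus = 5_000_000  # hypothetical chunk count

for name, dims in [("small-384d-model", 384), ("large-1536d-model", 1536)]:
    print(f"{name}: ~{index_size_gb(corpus, dims):.1f} GB of raw vectors")

# A 4x difference in dimensionality means roughly 4x the storage and
# brute-force search cost; weigh that against the measured quality gap.
```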

Fine-Tuning

What if no off-the-shelf model performs well enough on your specific data? This is where fine-tuning comes in.

  • What? You can take a pre-trained embedding model and continue its training process using your own data. This typically involves creating pairs (or triplets) of queries and relevant/irrelevant document passages from your domain.

  • Why? Fine-tuning helps the model adapt to your specific terminology, nuances, and data distribution, often leading to significant improvements in retrieval relevance. Sometimes you’ll see this referred to as “domain adaptation”.

  • But! Fine-tuning requires curated training data, compute, and some experience in model training / debugging. It's a more involved process than simply selecting a pre-trained model. That being said, it’s never been easier to fine-tune your own embedding models.

    Author’s note: I’m working on a guide to fine-tuning that includes how to use LLMs to build a synthetic dataset for training - coming soon.
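
As a rough sketch of what this looks like in practice, here is a simplified fine-tuning loop using the sentence-transformers training API; the (query, relevant passage) pairs below are placeholders for your own mined or synthetic data:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")  # base model to adapt

train_examples = [
    InputExample(texts=["What were Q3 profits?",
                        "Third-quarter earnings rose 12% year over year."]),
    InputExample(texts=["Where did the company expand?",
                        "The company opened a new office in Berlin."]),
    # ... thousands more (query, positive passage) pairs from your domain
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# MultipleNegativesRankingLoss treats the other in-batch passages as
# negatives, so plain (query, positive) pairs are enough to start.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
model.save("my-domain-adapted-model")
```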

A note on API-based models

If you're not constrained by privacy concerns and don't have the infrastructure to self-host an embedding model, API-based solutions offer a convenient alternative. However, this convenience comes with its own set of considerations that can significantly impact your RAG system's performance, reliability, and cost structure.

Key Considerations for API-Based Embedding Models

Training Data Transparency

Few API providers share or explain the data used to train their embedding models. Is your data well represented in that training data? The more esoteric your domain, the less likely this is to be true.

Version Stability and Deprecation Policies

API providers may update their models without warning or deprecate older versions. What happens when the embedding model you've built your system around changes? Ensure the provider has clear version control and deprecation policies that allow you sufficient time to adapt.

Consistent Vector Spaces

When embedding models change, the vector space they generate may shift dramatically. This means previously embedded documents and newly embedded queries might not be meaningfully comparable, effectively breaking your retrieval system. Look for providers that guarantee stable vector spaces across updates or provide migration paths.

Latency Implications

API calls introduce network latency that a self-hosted model doesn't face. For time-sensitive applications, this additional delay (often 100-500ms per request) can accumulate and impact user experience, especially during high-volume periods or when embedding large document collections.

Usage-Based Pricing

Most API embedding services charge based on the volume of text processed. This creates a direct relationship between your data size and operational costs. Calculate the total embedding cost for your initial corpus and expected query volume before committing to an API-based approach.
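
A quick sketch of that calculation, with placeholder prices and volumes you should replace with your provider's actual numbers:

```python
# All values below are illustrative assumptions, not real provider pricing.
price_per_million_tokens = 0.10   # hypothetical USD rate
avg_tokens_per_chunk = 300        # depends on your chunking strategy
num_chunks = 2_000_000            # initial corpus
monthly_queries = 1_000_000
avg_tokens_per_query = 20

initial_cost = num_chunks * avg_tokens_per_chunk / 1e6 * price_per_million_tokens
monthly_query_cost = monthly_queries * avg_tokens_per_query / 1e6 * price_per_million_tokens

print(f"One-off corpus embedding: ~${initial_cost:,.0f}")
print(f"Ongoing query embedding:  ~${monthly_query_cost:,.2f}/month")

# Remember to budget for re-embedding the corpus whenever you change
# models or chunking.
```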

Rate Limiting and Throughput

API providers typically impose rate limits that can bottleneck batch processing jobs. Understanding these limits is crucial when planning large-scale document ingestion or reindexing operations. Some providers offer enterprise tiers with higher limits, but these come at premium prices.
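
A common mitigation is to batch requests and back off on rate-limit errors. Here is a generic sketch in which embed_batch stands in for whatever client call your provider exposes:

```python
import random
import time

def embed_with_backoff(embed_batch, texts, batch_size=100, max_retries=5):
    """Embed `texts` in batches, retrying with exponential backoff on failure."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                vectors.extend(embed_batch(batch))
                break
            except Exception:  # in practice, catch the provider's rate-limit error
                time.sleep((2 ** attempt) + random.random())
        else:
            raise RuntimeError(f"Batch starting at {start} failed after retries")
    return vectors
```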

Data Privacy and Compliance

While providers may promise not to store your data, sending sensitive information to third-party services might still violate internal policies or regulations like GDPR, HIPAA, or industry-specific requirements. Verify the provider's privacy policies and compliance certifications.

API Availability and SLAs

Your RAG system will inherit the reliability of its embedding API. Review the provider's uptime guarantees, incident history, and support response times. For mission-critical applications, consider implementing fallback mechanisms or multi-provider strategies.

Request Size Limitations

Many APIs limit the maximum text length for a single embedding request. If your chunking strategy produces segments that exceed these limits, you'll need to implement additional splitting logic or choose a different chunking approach.
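
A naive guard might look like the sketch below. Providers usually express the limit in tokens, so the character cap here is a crude, assumed stand-in for your provider's documented maximum:

```python
def enforce_max_length(chunks, max_chars=8000, overlap=200):
    """Split any chunk longer than max_chars into overlapping sub-chunks."""
    safe_chunks = []
    for chunk in chunks:
        if len(chunk) <= max_chars:
            safe_chunks.append(chunk)
            continue
        step = max_chars - overlap
        for start in range(0, len(chunk), step):
            safe_chunks.append(chunk[start:start + max_chars])
    return safe_chunks
```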

Balancing the Trade-offs

The decision to use API-based embedding models involves weighing these factors against your specific needs. For many teams, especially those in early development or with moderate data volumes, the simplicity and reduced operational overhead of API solutions outweigh their limitations; they are certainly fine for prototypes without privacy constraints. As your system scales or requirements evolve, you can always transition to self-hosted or third-party-hosted open models if necessary.

When evaluating API providers, don't just look at benchmark performance – consider the entire operational picture. A slightly less accurate model with better reliability, clearer versioning policies, and more predictable pricing might be the better choice for production environments.

Remember that even with API-based embeddings, the principles of evaluation we discussed earlier still apply. Test multiple providers, create a representative evaluation dataset, and measure both quantitative metrics and qualitative results before making your final selection.

Conclusion

In summary, if you’re building semantic-search-powered RAG, the choice of embedding model can be a hugely impactful decision. The embedding model is the cornerstone of your RAG system's retrieval capability, directly determining how well it connects users with relevant information. Start simple, build evals, look at your data.
