
How to Build Reliable RAG: A Deep Dive into 7 Failure Points and Evaluation Frameworks


Retrieval-Augmented Generation (RAG) is critical for modern AI architecture, serving as an essential framework for building context-aware agents.

But moving from a basic prototype to a production-ready system involves navigating significant hurdles in data retrieval, context consolidation, and response synthesis.

This article provides a deep dive into seven typical RAG failure points and the evaluation frameworks that mitigate them, with practical coding examples.

The Anatomy of RAG Breakdown – 7 Failure Points (FPs)

According to researchers Barnett et al., Retrieval Augmented Generation (RAG) systems encounter seven specific Failure Points (FPs) throughout the pipeline.

The diagram below illustrates these stages:

Figure A. Indexing and Query processes required for creating a RAG system. The indexing process is done at development time and queries at runtime. Failure points identified in this study are shown in red boxes (source)

Let us explore each FP arranged according to the pipeline sequence, following the top-left to bottom-right progression shown in Figure A.

FP1. Missing Content

Missing content happens when the system is asked a question that cannot be answered because the relevant information is not present in the available vector store in the first place.

The failure occurs when an LLM provides a plausible-sounding but incorrect response instead of stating it doesn’t know.

FP2. Missed the Top-Ranked Documents

This is a situation where a correct document exists in the vector store, but the retriever fails to rank it highly enough to include it in the top-k documents fed to an LLM as context.

In consequence, the correct information never reaches the LLM.
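A minimal sketch of how this failure arises, using toy 2-D embeddings and cosine similarity (the vectors and document names are invented for illustration):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k):
    # Rank document vectors by similarity to the query and keep the top k.
    ranked = sorted(doc_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Toy embeddings: "doc_c" holds the answer but embeds far from the query,
# so with k=2 it never reaches the LLM.
docs = {
    "doc_a": [0.9, 0.1],
    "doc_b": [0.8, 0.3],
    "doc_c": [0.2, 0.9],  # the correct document
}
query = [1.0, 0.0]
print(top_k(query, docs, k=2))  # doc_c is missed
```

Raising k to 3 would surface doc_c, which is exactly the trade-off the consolidation stage (FP3) then has to manage.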

FP3. Not in Context (Consolidation Strategy Limitations)

This is a situation where a correct document exists and is retrieved from the vector store, but is excluded during the consolidation process.

This happens when too many documents are returned and the system must filter them down to fit within an LLM’s context window, token limits, or rate limits.

FP4. Not Extracted

This is a situation where an LLM fails to identify the correct information in the context, even though that information is in the vector store and was successfully retrieved and consolidated.

This happens when the context is overly noisy or contains contradictory information that confuses the LLM.

FP5. Wrong Format

This is a situation where storage, retrieval, consolidation, and LLM interpretation are successfully handled, but the LLM fails to follow specific formatting instructions provided in the prompt, such as a table, a bulleted list, or a JSON schema.

FP6. Incorrect Specificity

The answer is technically present in the LLM’s output, but it is either too general or too complex relative to the user’s needs.

For example, an LLM generates simple answers to a user query with a complex professional goal.

FP7. Incomplete Answers

This is a situation where an LLM generates an output not necessarily wrong, but missing key pieces of information that were available in the context.

For example, when a user asks a complex question like “What are the key points in documents A, B, and C?”, the LLM addresses only one or two of the sources.

How FPs Compromise RAG Pipeline Performance

Each of these FPs impacts the performance of RAG pipelines in a different way:

Data Integrity & Trust Failures

When missing or incorrect information is present, the system is no longer a reliable source of information. Primary FPs include:

  • FP1 (Missing Content): The answer is not in the doc in the first place.
  • FP4 (Not Extracted): The LLM decides to ignore the correct answer in the doc.
  • FP7 (Incomplete): The LLM gives half-truths, missing important pieces.

Retrieval & Efficiency Bottlenecks

The RAG pipeline can be inefficient when it misses key information in the retrieval and consolidation stages. Primary FPs include:

  • FP2 (Missed Top Ranked): The retriever fails to rank the correct document within the top-k results.
  • FP3 (Consolidation Strategy): The script to trim docs to fit the LLM limits drops the most important parts.

User Experience & Formatting Errors

Although correct, an output with poor readability or in a wrong format can compromise user experience. Primary FPs include:

  • FP5 (Wrong Format): The LLM fails to follow the specific output format like JSON.
  • FP6 (Incorrect Specificity): The LLM generates a lengthy output for a simple yes/no question, or vice versa (a too-brief answer to a complicated question).

The Evaluation Stack: Frameworks to Mitigate FPs

Evaluation frameworks are designed to systematically detect and mitigate these FPs.

This section explores the major frameworks with practical use cases.

Major RAG Evaluation Frameworks:

  • DeepEval
  • RAGAS
  • TruLens
  • Arize Phoenix
  • Braintrust

DeepEval – The Unit Test before Deployment

DeepEval calculates a weighted score across user-defined criteria, with an LLM-as-a-judge (e.g., GPT-4o) evaluating each criterion against the LLM’s output.

Under the hood, DeepEval leverages G-Eval, a chain-of-thought (CoT) framework that takes a multi-step approach to evaluating the output:

  1. Define the criteria to measure (e.g., “coherence,” “fluency,” or “relevance”).
  2. Generate evaluation steps (using an evaluator LLM).
  3. Follow the evaluation steps to analyze the input and the LLM’s output.
  4. Calculate an expected weighted sum of the scores across criteria.
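G-Eval’s actual expectation is weighted by the judge model’s token probabilities; as a rough intuition, the final weighted-sum step can be sketched in plain Python (the criteria names and weights below are illustrative):

```python
def g_eval_score(criteria_scores, weights):
    # Weighted sum of per-criterion judge scores, normalized back to a 0-1 scale.
    # criteria_scores: e.g. {"coherence": 0.9, ...}, each on a 0-1 scale.
    total_weight = sum(weights.values())
    return sum(criteria_scores[c] * w for c, w in weights.items()) / total_weight

scores = {"coherence": 0.9, "fluency": 0.8, "relevance": 0.6}
weights = {"coherence": 1.0, "fluency": 1.0, "relevance": 2.0}
print(round(g_eval_score(scores, weights), 3))  # 0.725
```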

Common Scenario in Practice

  • Situation: A technical documentation assistant (bot) for a complex software product seems to keep working every time the engineering team updates the codebase.
  • Problem: There is no quantitative proof that the bot can still answer user queries (you just “think” it’s working…).
  • Solution: Integrate a PyTest function as a CI/CD regression suite into GitHub Actions, where DeepEval runs G-Eval and other metrics over a test case.
  • Expected results: If any metric score drops below the threshold (0.85), PyTest raises an AssertionError – immediately failing the CI build and preventing the silent regression from reaching production.
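A minimal sketch of such a CI gate, using plain assertions rather than DeepEval’s own test-runner API (the threshold and metric names are illustrative):

```python
THRESHOLD = 0.85  # minimum acceptable score per metric

def assert_no_regression(metric_scores, threshold=THRESHOLD):
    # Fails the CI build if any metric falls below the threshold,
    # mirroring a PyTest assertion inside a GitHub Actions job.
    failing = {m: s for m, s in metric_scores.items() if s < threshold}
    assert not failing, f"Metrics below {threshold}: {failing}"

# A passing run:
assert_no_regression({"g_eval": 0.91, "faithfulness": 0.88})

# A regressed run would raise AssertionError and fail the build:
# assert_no_regression({"g_eval": 0.91, "faithfulness": 0.61})
```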

Pros

  • A variety of metrics (50+) including specialized bias and toxicity checks are available.
  • Seamlessly integrates with existing CI/CD pipelines.
  • No reference needed: assesses an output based solely on the prompt and provided context.

Cons

  • The quality of evaluation heavily depends on the judge LLM’s capabilities.
  • Computationally expensive when the judge LLM is a high-end model.

Developer Note – The Test Case for DeepEval
A set of LLMTestCase objects defines the test case that DeepEval runs.

In practice, this test case should contain the most important user queries and labeled outputs, along with the retrieved context.

These can be retrieved from a JSON or CSV file.
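A sketch of loading such a file, using a hypothetical `TestCase` dataclass as a stand-in for DeepEval’s `LLMTestCase` (the JSON content is invented):

```python
import json
from dataclasses import dataclass

@dataclass
class TestCase:
    # Minimal stand-in for the fields an LLMTestCase typically carries.
    input: str
    expected_output: str
    retrieval_context: list

def load_test_cases(json_text):
    # Parse a JSON array of labeled examples into test-case objects.
    return [TestCase(**record) for record in json.loads(json_text)]

raw = '''[
  {"input": "How do I reset my API key?",
   "expected_output": "Rotate the key from the settings page.",
   "retrieval_context": ["API keys are rotated from the settings page."]}
]'''
cases = load_test_cases(raw)
print(cases[0].input)  # How do I reset my API key?
```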

RAGAS – The Needle in a Haystack Optimizer

Retrieval Augmented Generation Assessment (RAGAS) aims to evaluate RAG systems without human-annotated datasets by generating synthetic test sets.

It then computes its flagship metrics:

Figure B. The RAGAS evaluation triad diagram connecting Question, Context, and Answer through Precision, Recall, Faithfulness, and Relevancy metrics (Created by Kuriko IWAI)

The flagship metrics fall into three groups:

  • Retrieval pipeline (black, solid line, Figure B): Context precision, context recall.
  • Generation pipeline (black, dotted line, Figure B): Faithfulness, answer relevancy.
  • Ground truth (red box, Figure B): Answer semantic similarity, answer correctness.

Common Scenario in Practice

  • Situation: The RAG system for legal contracts is missing key clauses, and you are unsure whether the problem lies in the search (retriever) or the reading (generator).
  • Problem: No idea of the optimal top-k (number of chunks retrieved).
  • Solution: Use RAGAS to create a synthetic test set with 100 pairs of questions and evidence. Then, run the RAG pipeline against the test set to calculate context recall and context precision.
  • Expected result: Depending on the metric results, the action plan can be as follows:

Metric | Score | Diagnostic | Action Plan
Context Recall | Low | The retriever missed the correct info. | Increase top-k; try hybrid search (BM25 + Vector).
Context Precision | Low | The top-k chunks contain too much filler and noise, confusing the LLM. | Decrease top-k; implement a reranker (e.g., Cohere).
Faithfulness | Low | The generator is hallucinating despite having the data. | Adjust the system prompt; check for context window limits.

Table 1. RAGAS Diagnostic Action Plan – Mapping Scores to System Adjustments.
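RAGAS computes these metrics with an LLM judge at the claim level; as a rough intuition, a set-overlap proxy over chunk IDs captures the same distinction (the chunk IDs below are invented):

```python
def context_recall(retrieved, evidence):
    # Fraction of ground-truth evidence chunks that appear in the retrieved set.
    if not evidence:
        return 1.0
    return len(set(retrieved) & set(evidence)) / len(set(evidence))

def context_precision(retrieved, evidence):
    # Fraction of retrieved chunks that are actually relevant.
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(evidence)) / len(set(retrieved))

retrieved = ["clause_1", "clause_9", "boilerplate"]
evidence = ["clause_1", "clause_2"]
print(context_recall(retrieved, evidence))     # 0.5  -> retriever missed clause_2
print(context_precision(retrieved, evidence))  # ~0.33 -> noisy top-k
```

Low recall points to the retriever (increase top-k); low precision points to noise in the context (decrease top-k or rerank), matching Table 1.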

Pros

  • Excellent for early-stage projects without ground-truth datasets, since RAGAS can generate a synthetic test set.

Cons

  • Synthetic test set might miss nuanced factual errors.
  • Requires a robust extractor model (e.g., gpt-4o) to break down answers into individual claims.

TruLens – The Feedback Loop Specialist

TruLens focuses on the internal mechanics of the RAG process rather than just the final output by using feedback functions.

It also uses an LLM-based score reflecting how well the response satisfies the query’s intent, on a 4-point Likert scale (0-3), making it well suited for ranking the quality of different search results.

Common Scenario in Practice

  • Situation: A medical advisor bot answers a user’s question correctly but adds a “pro tip” that isn’t in the vetted PDF knowledge base.
  • Problem: The add-on pro-tip might be helpful, but not grounded.
  • Solution: Use TruLens to implement a groundedness feedback function with a threshold like score > 0.8.
  • Expected results: When the LLM generates a response that contains information not present in the retrieved chunks, TruLens flags the record in your dashboard.
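A naive lexical sketch of a groundedness check (real TruLens feedback functions use an LLM judge rather than word overlap; the sentences below are invented):

```python
def groundedness(answer_sentences, context_chunks, threshold=0.8):
    # A sentence counts as "grounded" if most of its words appear
    # in at least one retrieved chunk (crude word-overlap proxy).
    def supported(sentence):
        words = set(sentence.lower().split())
        return any(len(words & set(chunk.lower().split())) / len(words) >= 0.5
                   for chunk in context_chunks)
    score = sum(supported(s) for s in answer_sentences) / len(answer_sentences)
    return score, score >= threshold

answer = ["Take the medication with food.",
          "Pro tip: double the dose on weekends."]
context = ["The leaflet says: take the medication with food."]
score, ok = groundedness(answer, context)
print(score, ok)  # 0.5 False -> the ungrounded pro-tip drags the score below 0.8
```

Any record where the flag comes back False would then be surfaced in the dashboard for review.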

Pros

  • Visualizes the reasoning chain to identify exactly where the agent went off-track.
  • Provides built-in support for grounding to catch hallucinations in real-time.

Cons

  • Learning curve for defining custom feedback functions.
  • The dashboard can feel heavyweight for simple scripts.

Arize Phoenix – The Silent Failure Map

Arize Phoenix is an open-source observability and evaluation tool to evaluate LLM outputs, including complex RAG systems.

Built on OpenTelemetry by Arize AI, it focuses on observability by treating LLM evaluation as a subset of MLOps.

In the context of RAG evaluation, Phoenix excels at embedding analysis, using Uniform Manifold Approximation and Projection (UMAP) to reduce high-dimensional vector embeddings into 2D/3D space.

This embedding analysis mathematically reveals if the failed queries are semantically grouped together, which indicates a gap in the vector database.

Common Scenario in Practice

  • Situation: A customer support bot works great for refunds, but gives nonsensical answers to warranty claims.
  • Problem: A data hole in the vector database (which cannot be found in the logs).
  • Solution: Use Arize Phoenix to generate a UMAP embedding visualization – a 2D/3D map of the vector database – and overlay user queries on the document chunks.
  • Expected results: You visually see a cluster of user queries landing in a dark zone where no documents exist, indicating that some documents were never uploaded to the vector store.
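The underlying “dark zone” idea can be sketched with toy 2-D projected points (Phoenix computes the UMAP projection for you; the coordinates and IDs below are invented):

```python
import math

def dark_zone_queries(query_points, doc_points, radius=1.0):
    # Flag query embeddings (already projected to 2D, e.g. via UMAP) whose
    # nearest document chunk is farther than `radius` -- a likely data hole.
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    return [qid for qid, qp in query_points.items()
            if min(dist(qp, dp) for dp in doc_points) > radius]

docs = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.4)]           # refund docs cluster
queries = {"refund_q": (0.2, 0.1), "warranty_q": (5.0, 5.0)}
print(dark_zone_queries(queries, docs))  # ['warranty_q']
```

The flagged queries tell you which topic’s documents are missing from the vector store.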

Pros

  • OpenTelemetry-native; integrates with existing enterprise monitoring stacks.
  • The best tool for visualizing blind spots of the vector store.

Cons

  • Less focused on scoring, more on observing.
  • Can be overkill for small-scale applications or single-agent tools.

Braintrust – The Prompt Regression Safety Net

Braintrust is designed for high-frequency iteration cycles by using cross-model comparison.

Common Scenario in Practice

  • Situation: An engineering team upgrades a prompt from “Answer the question” (Case A) to a more complex 500-word system instruction (Case B).
  • Problem: Improving the prompt for Case B might accidentally break Case A.
  • Solution: Use Braintrust to create a golden dataset of N perfect examples (e.g., N = 50), and let Braintrust run a side-by-side (SxS) comparison every time the team updates a single word in the prompt.
  • Expected result: A difference report showing exactly which cases in the golden dataset (N = 50) got better or worse.
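A minimal sketch of the side-by-side diff logic (Braintrust’s hosted evaluations produce this report for you; the example IDs and scores below are invented):

```python
def sxs_report(scores_a, scores_b, eps=1e-9):
    # Compare per-example scores for prompt A vs prompt B on the golden set.
    report = {"improved": [], "regressed": [], "unchanged": []}
    for example_id in scores_a:
        delta = scores_b[example_id] - scores_a[example_id]
        if delta > eps:
            report["improved"].append(example_id)
        elif delta < -eps:
            report["regressed"].append(example_id)
        else:
            report["unchanged"].append(example_id)
    return report

case_a = {"q1": 0.90, "q2": 0.70, "q3": 1.0}  # simple prompt
case_b = {"q1": 0.95, "q2": 0.60, "q3": 1.0}  # 500-word prompt
print(sxs_report(case_a, case_b))
# q2 regressed: the "better" prompt broke a previously passing case
```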

Pros

  • Extremely fast for pre-deployment testing.
  • Great UI for non-technical stakeholders to review and grade the output.

Cons

  • Proprietary/SaaS-focused (though they have open-source components).
  • Fewer built-in deep-tech metrics compared to DeepEval or Ragas.

Wrapping Up

When handled with proper evaluation frameworks, RAG can be a competitive tool for providing an LLM with the context most relevant to the user query.

Implementation Strategy: Mapping Metrics to Failure Points

Although there’s no one-size-fits-all solution, Table 2 shows which evaluation framework to apply for each FP we covered in this article:

Failure Point | Evaluation Framework | Feature to Use
FP1: Missing Content | RAGAS | Faithfulness / Answer Correctness
FP2: Missed Ranking | TruLens | Context Recall / Precision
FP3: Consolidation | Arize Phoenix | Retrieval Tracing & Latency Analysis
FP4: Not Extracted | DeepEval | Faithfulness / Contextual Recall
FP5: Wrong Format | DeepEval | G-Eval (Custom Rubric)
FP6: Specificity | Braintrust | Manual Grading & Side-by-Side Eval
FP7: Incomplete | RAGAS | Answer Relevancy

Table 2. The Failure Point Mitigation Matrix – Which Tool Solves Which FP?

DeepEval and RAGAS can leverage their faithfulness metrics to measure data integrity failures (FP1, FP4, FP7).

TruLens leverages its context precision / recall to measure the context relevance to the output – effectively assessing FP2.

Arize Phoenix provides a visual trace of the retrieval process, making it easy to see if the document retrieved was lost during the consolidation (FP3).

For UX failures (FP5, FP6), DeepEval supports custom metrics, while Braintrust excels at ground-truth dataset comparison.

Kuriko IWAI is a Senior ML Engineer at Kernel Labs, a research and engineering hub specializing in turning ML research into automated, production-ready pipelines.

She specializes in building ML systems, focusing on Generative AI architecture, ML Lineage, and Advanced NLP.
With extensive experience in product ownership throughout Southeast Asia, Kuriko excels at aligning technical experimentation with business value.

She is currently working with a team at Indeed to build automation pipelines.