| Developers (GitHub - auto-rag-eval) |
|---|
| Pouya Omran |
| Tanya Dixit |
| Jingyi Wang |
Introducing adaptive benchmarks for evaluating your RAG systems on Vertex AI
The RAG evaluation gap
Retrieval-Augmented Generation (RAG) has rapidly evolved from experimental prototypes to critical enterprise infrastructure. However, as organizations reach “Day 2” of adoption—where systems must be maintained, updated, and scaled—they face a significant “Evaluation Gap”. While it is easy to build a RAG prototype, evaluating its performance remains complex, unreliable, and resource-intensive.
Currently, development teams often rely on three suboptimal methods:
- Manual curation: Subject matter experts manually write question-answer pairs. This process is slow, expensive, and inherently biased towards the expert’s specific knowledge.
- Standardized benchmarks: Off-the-shelf benchmarks rarely reflect an organization’s specific domain, document structure, or user query patterns, leading to metrics that do not correlate with real-world effectiveness.
- Ad-hoc “vibes” testing: Informal testing lacks rigor and repeatability, making it impossible to confidently track regression or improvement over time.
To bridge this gap, we developed auto-rag-eval (also known internally as RAG-Crusher), an open-source framework built on Google Cloud’s Vertex AI. This solution automates the generation of high-quality, multi-faceted benchmarks directly from an organization’s own document corpus, enabling objective and consistent measurement of RAG system performance.
Architecture and methodology
The core of auto-rag-eval is a multi-stage pipeline designed to produce unbiased (Question, Context, Answer) triplets. Unlike simple generation scripts, this solution addresses a critical flaw in automated benchmarking: the “circular dependency,” where the evaluation data is generated using the same retrieval logic as the system being tested.
The framework leverages Vertex AI Search for document retrieval and Gemini models for intelligent content generation through the following stages:
1. Parallel Context Distillation
The foundation of a reliable benchmark is an unbiased “ground truth” context. Standard methods often pull raw document chunks, which can be noisy or incomplete. Our approach uses Parallel Context Distillation to construct a precise context.
- Clue generation: Rather than processing a raw chunk, the system first uses an LLM to identify “clues”—potential topics or questions—within the text.
- Dual-index retrieval: Using these clues, the system performs a semantic search across the entire data store. It evaluates sentences in two ways: individually (for precise fact retrieval) and collectively (for broader context).
- Distillation: The system merges these results to form a “distilled context”—a focused set of sentences that contains exactly the information needed to answer the potential question, stripped of irrelevant noise. This ensures the ground truth is constructed orthogonally to standard RAG retrieval methods, avoiding bias.
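The three stages above can be sketched as a small pipeline. This is a minimal illustration, not the framework’s actual code: the function names, the `DistilledContext` type, and the injected callables are all hypothetical stand-ins for the real Gemini and Vertex AI Search calls.

```python
# Sketch of Parallel Context Distillation (hypothetical interfaces; the real
# framework drives these callables with Gemini and Vertex AI Search).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class DistilledContext:
    clues: List[str]
    sentences: List[str]

def distill_context(
    chunk: str,
    generate_clues: Callable[[str], List[str]],           # LLM: chunk -> clue list
    search_individual: Callable[[str], List[str]],        # per-clue sentence retrieval
    search_collective: Callable[[List[str]], List[str]],  # joint retrieval over all clues
) -> DistilledContext:
    """Build a ground-truth context independently of the RAG retriever."""
    # 1. Clue generation: surface potential topics/questions in the chunk.
    clues = generate_clues(chunk)

    # 2. Dual-index retrieval: gather sentences individually and collectively.
    individual_hits = [s for clue in clues for s in search_individual(clue)]
    collective_hits = search_collective(clues)

    # 3. Distillation: merge both result sets, de-duplicated, order preserved.
    merged, seen = [], set()
    for sentence in individual_hits + collective_hits:
        if sentence not in seen:
            seen.add(sentence)
            merged.append(sentence)
    return DistilledContext(clues=clues, sentences=merged)

# Toy stand-ins so the sketch runs without any cloud dependency.
corpus = [
    "Alphabet reported revenue growth in Q3.",
    "Operating margin expanded year over year.",
    "The 10-Q discusses foreign exchange risk.",
]
ctx = distill_context(
    "Revenue and margin discussion.",
    generate_clues=lambda text: ["revenue", "margin"],
    search_individual=lambda clue: [s for s in corpus if clue in s.lower()],
    search_collective=lambda clues: [corpus[0]],
)
print(ctx.sentences)
```

Because the retrieval callables are injected, the distillation logic stays orthogonal to whatever retriever the system under test uses, which is the point of the design.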
2. Adaptive Profile Selection
To ensure the benchmark reflects the diversity of real-world interactions, we employ Adaptive Profile Selection. An LLM analyzes the distilled context and intelligently selects the most suitable Q&A “profiles”. These profiles define the characteristics of the question along three dimensions:
- Persona: The profiles include options ranging from “The Novice” requesting definitions to “The Expert” needing strategic synthesis.
- Difficulty: Levels are calibrated to include “Easy,” “Mild,” and “Hard.”
- Type: Formats vary, encompassing explanatory, comparative, or decision-support queries.
The system defaults to these dimensions, but users can define custom dimensions, values, and natural language descriptions for their profiles, ensuring generated questions are not only diverse but strictly relevant to the information present in the source text.
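As a rough sketch of what this profile space looks like, the snippet below enumerates candidate profiles from the three default dimensions. The schema and the middle persona (“The Analyst”) are illustrative assumptions, not the framework’s actual configuration; in the real pipeline an LLM then selects the subset of profiles that fits a given distilled context.

```python
# Hypothetical Q&A profile space; dimension names and values are illustrative.
from itertools import product

DEFAULT_DIMENSIONS = {
    "persona": ["The Novice", "The Analyst", "The Expert"],  # "The Analyst" is assumed
    "difficulty": ["Easy", "Mild", "Hard"],
    "type": ["explanatory", "comparative", "decision-support"],
}

def enumerate_profiles(dimensions=DEFAULT_DIMENSIONS):
    """All candidate profiles; an LLM would pick the subset suited to a context."""
    keys = list(dimensions)
    return [dict(zip(keys, combo)) for combo in product(*(dimensions[k] for k in keys))]

profiles = enumerate_profiles()
print(len(profiles))  # 3 * 3 * 3 = 27 candidate profiles
```

Custom dimensions would slot in by passing a different `dimensions` dict, which is how user-defined values and descriptions could extend the defaults.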
3. Unyielding quality control
Automated generation can suffer from hallucinations or low-quality outputs. To mitigate this, auto-rag-eval implements a rigorous multi-agent review system. Every generated Q&A pair is debated by three distinct AI agents—an “Analyst,” a “Synthesizer,” and a “Practical” critic. The pair is accepted into the final benchmark only if it passes a majority vote across two rounds of review, ensuring high integrity and reliability.
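The voting logic can be sketched as follows. This is an assumed reading of the review protocol—requiring a majority in each of the two rounds—and the agent callables here are toy stand-ins rather than real Gemini-backed critics.

```python
# Minimal sketch of a two-round, three-agent majority vote (assumed protocol:
# a majority must approve in every round).
from typing import Callable, Dict, List

def review_qa_pair(
    qa_pair: Dict[str, str],
    agents: List[Callable[[Dict[str, str], int], bool]],  # agent(qa, round) -> accept?
    rounds: int = 2,
) -> bool:
    """Accept a Q&A pair only if a majority of agents approves in every round."""
    for rnd in range(rounds):
        votes = sum(agent(qa_pair, rnd) for agent in agents)
        if votes <= len(agents) // 2:  # strict majority required each round
            return False
    return True

# Toy critics: the "Analyst" and "Synthesizer" always approve; the "Practical"
# critic rejects pairs whose answer is shorter than the question.
analyst = lambda qa, rnd: True
synthesizer = lambda qa, rnd: True
practical = lambda qa, rnd: len(qa["answer"]) >= len(qa["question"])

ok = review_qa_pair(
    {"question": "What is RAG?", "answer": "Retrieval-Augmented Generation."},
    [analyst, synthesizer, practical],
)
print(ok)  # True: all three critics approve in both rounds
```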
Experimental validation
We validated the framework using a real-world financial document (Alphabet Inc.'s Form 10-Q) and tested four distinct RAG configurations varying in model size (Gemini 1.5 Flash vs. Gemini 1.5 Pro) and retrieval depth. The experiments yielded two key findings regarding the benchmark’s reliability:
1. Distinguishing power
A primary concern with synthetic benchmarks is whether they can differentiate between good and bad systems. Our results showed that the benchmark correctly identified superior configurations.
- Performance consistently increased with stronger models (Pro > Flash) and deeper retrieval (10 chunks > 3 chunks).
- Crucially, a benchmark generated by a smaller model (Gemini Flash) successfully differentiated the performance of larger models (Gemini Pro), proving that the tool can be used to evaluate systems more powerful than the generator itself.
2. Difficulty calibration
The “Adaptive Profile” mechanism proved effective in creating a graded evaluation. We observed a consistent performance degradation across all RAG systems as the question difficulty increased.
- Easy: High accuracy (up to 67%) on foundational, definitional questions.
- Hard: Lower accuracy (dropping to ~33-45%) on complex, strategic synthesis questions.
This stratification allows developers to pinpoint exactly where their system fails—whether it struggles with basic retrieval or complex reasoning.
Future work: Closing the loop
The ultimate goal of evaluation is improvement. We are currently extending the framework to include Automated Root Cause Analysis.
In our proof-of-concept, the system ingests failed test cases and clusters them by error type (e.g., missing_steps, incomplete_information). It then analyzes these clusters to generate actionable recommendations for prompt revisions. For example, if an agent consistently fails to provide mandatory procedural details, the tool can automatically suggest adding a “Procedural Completeness” instruction to the system prompt.
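A proof-of-concept like this could be sketched as a small clustering-and-lookup step. The error labels come from the text above, but the `recommend_fixes` function and the recommendation table are hypothetical illustrations, not the tool’s actual output.

```python
# Hypothetical sketch of automated root-cause analysis: cluster failed test
# cases by error type and map each cluster to a suggested prompt revision.
from collections import defaultdict

PROMPT_FIXES = {
    "missing_steps": "Add a 'Procedural Completeness' instruction to the system prompt.",
    "incomplete_information": "Instruct the agent to cite every retrieved chunk it used.",
}

def recommend_fixes(failed_cases):
    """Group failures by error type and attach a prompt revision per cluster."""
    clusters = defaultdict(list)
    for case in failed_cases:
        clusters[case["error_type"]].append(case["question"])
    return {
        error: {
            "count": len(questions),
            "suggestion": PROMPT_FIXES.get(error, "Manual review needed."),
        }
        for error, questions in clusters.items()
    }

report = recommend_fixes([
    {"question": "How do I file the form?", "error_type": "missing_steps"},
    {"question": "List all risk factors.", "error_type": "incomplete_information"},
    {"question": "What are the filing steps?", "error_type": "missing_steps"},
])
print(report["missing_steps"]["count"])  # 2
```

In the full system, the clustering and the suggested revisions would themselves be produced by an LLM analyzing the failure transcripts, rather than a static lookup table.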
Conclusion
The auto-rag-eval framework transforms RAG evaluation from a manual bottleneck into a scalable, data-driven workflow. By leveraging Parallel Context Distillation and rigorous multi-agent review, organizations can now generate trustworthy benchmarks that drive rapid iteration and confident deployment.
We invite the community to explore the solution, which is available as an open-source repository.
Get started:
- Code repository: GitHub - auto-rag-eval
