Authors: Liam Connell, Aman Tyagi
Generative AI has revolutionized content generation and discovery. Yet, as organizations move from “Chat with your Data” demos to complex workflows, a distinct “Capability Gap” remains.
Many knowledge-intensive business processes remain difficult to enhance with AI. We delineate a class of these, which we call Diligence tasks. Unlike simple search, these scenarios require exhaustive, defensible findings derived from specific document sets.
Consider technical due diligence in M&A, regulatory compliance in Pharma, or audit defense in Finance. In these high-stakes domains, a simple summary is not enough. The output requires rigor, traceability, and critical thinking applied to the inherent messiness of enterprise knowledge.
This post explores how we are closing this gap with a new framework called the Targeted Assessment Agent. We will outline how developers can build structured reasoning layers on Gemini and Google Cloud to solve these persistent challenges.
The challenge: Discovery vs. Diligence
To understand the capability gap, we must first distinguish between two fundamental types of knowledge work: Discovery and Diligence.
Discovery focuses on exploration. When a user asks, “What is our remote work policy?”, the intent is to find the most relevant answer quickly. Standard Retrieval Augmented Generation (RAG) excels here. It uses probabilistic search to surface the most likely documents, prioritizing speed and relevance over exhaustiveness.
Diligence is about verification and risk assessment. The user asks, “Does this target company’s operational model align with our investment thesis?” The intent here is not just to learn, but to defend a position or certify a state of affairs.
The value of AI in diligence is not just finding information, but ensuring that no critical detail is overlooked. Consequently, diligence imposes a set of hard requirements that standard search architectures cannot meet:
- Exhaustive Consideration: Within the finite scope of evidence provided, no stone can be left unturned.
- Traceable Reasoning: The output must provide a clear logical path from each finding back to the specific evidence. Without auditability, the output is unusable.
- Critical Thinking: Enterprise documents are rarely perfectly consistent. They contain conflicting dates, biased perspectives, and superseded contracts. The system must not just read the text but appraise the validity of the source.
Viewed from this perspective, it becomes clear that enterprise AI tools have so far been designed primarily for discovery use cases. With advances in model capabilities, the time has come for a new class of products and solutions.
Structured reasoning and the Coding-Extraction-Synthesis methodology
To address these challenges, we developed the Coding-Extraction-Synthesis methodology. This approach applies structured reasoning to the entire corpus, mirroring the rigorous workflow of human qualitative researchers.
Instead of filtering documents before reading them, the agent processes the full curated corpus through three distinct stages:
1. Deductive Coding: The agent analyzes every document in parallel against a predefined rubric. Crucially, this stage performs critical appraisal alongside topic identification. Drawing inspiration from common qualitative research methods, the agent “codes” each document for relevance to a set of predefined topics and for other metadata such as credibility, bias, and validity. Effectively, this creates a structured map of why a document matters, not just what it contains.
2. Evidence Extraction: Using the metadata created in phase one, the agent revisits specific document-criteria pairs to extract verbatim snippets of evidence, along with annotations for what these snippets imply. This process also has a parallel to a human researcher, who might highlight and annotate a text while building an argument.
3. Findings Synthesis: Finally, the agent aggregates the extracted evidence to answer the user’s core questions. Because it has processed the entire corpus, it can synthesize findings that account for the full context, weighing contradictory evidence and highlighting gaps.
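To make the three stages concrete, here is a minimal orchestration sketch in Python. It is a sketch only: the `call_gemini` placeholder, the prompt wording, and the `DocumentCode` and `EvidenceSnippet` classes are illustrative assumptions, not the agent’s actual implementation.

```python
import json
from dataclasses import dataclass


def call_gemini(prompt: str) -> str:
    """Placeholder for a model call (e.g. via the google-genai SDK)."""
    raise NotImplementedError("wire up your Gemini client here")


@dataclass
class DocumentCode:
    """Stage 1 output: rubric-based metadata for one document."""
    doc_id: str
    relevant_criteria: list[str]  # rubric topics the document speaks to
    credibility: str              # e.g. "signed contract" vs. "unsent draft"
    bias: str
    recency: str


@dataclass
class EvidenceSnippet:
    """Stage 2 output: a verbatim quote plus the agent's annotation."""
    doc_id: str
    criterion: str
    quote: str
    annotation: str


def run_assessment(corpus: dict[str, str], rubric: list[str], questions: list[str]) -> str:
    # Stage 1 -- Deductive coding: appraise every document against the full rubric.
    codes = [
        DocumentCode(doc_id=doc_id, **json.loads(call_gemini(
            f"Code this document against the rubric {rubric}. Return JSON with "
            f"relevant_criteria, credibility, bias, recency.\n\n{text}")))
        for doc_id, text in corpus.items()
    ]

    # Stage 2 -- Evidence extraction: revisit only the flagged document/criterion pairs.
    snippets: list[EvidenceSnippet] = []
    for code in codes:
        for criterion in code.relevant_criteria:
            raw = call_gemini(
                f"Quote verbatim passages relevant to '{criterion}' and annotate what "
                f"each implies. Return a JSON list of {{quote, annotation}} objects.\n\n"
                f"{corpus[code.doc_id]}")
            snippets += [EvidenceSnippet(code.doc_id, criterion, s["quote"], s["annotation"])
                         for s in json.loads(raw)]

    # Stage 3 -- Findings synthesis: answer the questions from the full evidence set.
    return call_gemini(
        f"Answer these questions: {questions}. Weigh contradictory evidence and flag gaps.\n"
        f"Evidence: {json.dumps([vars(s) for s in snippets])}")
```

In production the Stage 1 calls can run in parallel, and each stage’s output is persisted so that later stages, and the Evidence Graph described below, can reference it.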
This method allows state-of-the-art models to make full use of their advanced reasoning capabilities. The resulting reports are akin to the work of a student who finally listened to the professor: “make an arguable assertion in your thesis rather than summarizing the content, and support it with clear lines of reasoning.”
We’ve applied this Coding-Extraction-Synthesis methodology in our Targeted Assessment Agent (pending public release), which has been battle-tested with several of our customers.
Critical Document Appraisal ensures exhaustive document consideration and enables high-level reasoning
One of the primary reasons high-value use cases fail in production is that knowledge documents are inherently “messy.”
In a real-world audit, you might find a signed contract from 2022, a draft amendment from 2023, and a biased email chain discussing the terms. A standard LLM treats all these text chunks as equal “truth.” A human auditor, however, applies Critical Document Appraisal.
The Targeted Assessment Agent engineers this judgment into the pipeline. During the “Coding” phase, we task the model with extracting higher-order attributes:
- Credibility: Is this a signed PDF or a draft Word doc? Is the source authoritative?
- Bias: Does the author have an incentive to spin the narrative (e.g., a sales pitch vs. a technical specification)?
- Recency: Has this information been superseded?
- Validity: Is the methodology sound?
By structuring these “arguable” judgments, we transform subjective nuance into structured data.
This enables a “Glass Box” interface. If a Subject Matter Expert (SME) disagrees with the AI’s assessment of a document’s credibility, they can simply toggle an attribute in a UI. The agent then re-synthesizes the findings based on the human’s guidance. This allows the AI to handle the heavy lifting of reading while keeping the human in the loop for critical judgment calls.
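As a sketch of what those structured judgments might look like in code, the snippet below defines the appraisal attributes as a Pydantic model that could be supplied as a structured-output schema for the coding call, together with a simple override hook for the SME toggle described above. The field names, the `Rating` scale, and the commented-out `resynthesize` step are illustrative assumptions rather than the agent’s actual interface.

```python
from enum import Enum
from pydantic import BaseModel


class Rating(str, Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


class DocumentAppraisal(BaseModel):
    """Higher-order attributes captured during the Coding phase (illustrative fields)."""
    doc_id: str
    credibility: Rating   # signed PDF vs. draft Word doc; authoritative source?
    bias: Rating          # does the author have an incentive to spin the narrative?
    recency: Rating       # has this information been superseded?
    validity: Rating      # is the methodology sound?
    rationale: str        # short justification, surfaced to the SME in the UI


def apply_sme_override(appraisal: DocumentAppraisal,
                       attribute: str, value: Rating) -> DocumentAppraisal:
    """Glass-box toggle: the SME corrects one attribute, then findings are rebuilt."""
    updated = appraisal.model_copy(update={attribute: value})
    # resynthesize(updated)  # hypothetical: re-run the synthesis stage with the correction
    return updated
```

Because the appraisal is plain structured data rather than free text, overriding a single attribute and re-running synthesis is a cheap, well-scoped operation.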
The Evidence Graph enables reasoning audit with traceable justifications
In regulated industries and other high-stakes use cases, an answer is useless if it’s not auditable.
Because the Targeted Assessment Agent persists data across all three stages, it naturally generates an Evidence Graph: a data structure that links every final conclusion back to the findings that support it, each finding to its verbatim evidence snippets, and each snippet back to its source document.
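A minimal sketch of such a graph, assuming simple in-memory dataclass nodes rather than the agent’s actual storage layer:

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    doc_id: str
    page: int
    quote: str        # verbatim snippet, reused for in-document highlighting


@dataclass
class Finding:
    statement: str
    evidence: list[Evidence] = field(default_factory=list)


@dataclass
class Conclusion:
    answer: str
    findings: list[Finding] = field(default_factory=list)


def trace(conclusion: Conclusion) -> list[tuple[str, str, int]]:
    """Walk a conclusion back to every (quote, document, page) that supports it."""
    return [(e.quote, e.doc_id, e.page)
            for f in conclusion.findings for e in f.evidence]
```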
This allows us to build user interfaces that prioritize auditability. A user reading a generated Due Diligence Report can click into any conclusion to see the findings and specific evidence that supports it, and click further to open the original document to the exact page with a highlighted snippet. This traceability is what creates the trust required to deploy AI in legal, financial, and healthcare settings.
Evaluation results and comparison with vanilla model
To validate the Coding-Extraction-Synthesis methodology, we conducted a rigorous evaluation of the Targeted Assessment Agent across nine different codebases. These datasets ranged from small infrastructure configurations to massive e-commerce platforms, covering areas such as security audits, architecture reviews, and API security.
We compared the agentic approach (using both Gemini 2.5 Pro and Flash) against a “Vanilla” Gemini implementation that simply loads as many files as possible in its context. The results highlight the distinct advantage of structured reasoning over simple context window stuffing.
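For reference, the baseline amounts to context stuffing along these lines. This is a hedged sketch: the token budget, the rough token estimate, and the prompt wording are assumptions rather than the evaluation harness we actually ran.

```python
def build_vanilla_prompt(corpus: dict[str, str], questions: list[str],
                         token_budget: int = 900_000) -> str:
    """Concatenate files until the context budget is exhausted; the result is sent to
    the model in a single call, with no coding or extraction stages."""
    parts: list[str] = []
    used = 0
    for doc_id, text in corpus.items():
        cost = len(text) // 4              # rough token estimate (~4 chars per token)
        if used + cost > token_budget:
            break                          # remaining files are silently dropped
        parts.append(f"--- {doc_id} ---\n{text}")
        used += cost
    return "\n\n".join(parts) + f"\n\nAnswer these questions against the rubric: {questions}"
```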
Aggregate Performance: Precision over Speed: Across the board, the agentic architecture prioritized accuracy. The Targeted Assessment Agent achieved the highest overall performance with an average F1 score of 0.723 and precision of 0.698. By comparison, while the Vanilla Gemini approach was faster (88.0s vs. 205.4s), it suffered significantly in precision (0.544), often hallucinating findings or misinterpreting the strict rubric.
The Scale Divergence: The true necessity of the Targeted Assessment architecture reveals itself when the data volume exceeds a standard context window.
In smaller datasets, the Vanilla model performed adequately because all files fit within the model's context window. However, when tested against a large E-commerce Platform consisting of 133 files, the performance gap widened dramatically.
- Targeted Assessment Agent (Pro/Flash): Maintained high rigor with a precision of 0.840 and an F1 score of 0.506.
- Vanilla Gemini: Collapsed under the complexity, with F1 dropping to 0.278 and recall plummeting to just 0.172.
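For readers reproducing this kind of evaluation, the scores above follow the standard definitions of precision, recall, and F1 over matched versus missed rubric findings. A minimal sketch; the numbers in the example comment are hypothetical, not drawn from our benchmark.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """precision = TP/(TP+FP), recall = TP/(TP+FN), F1 = harmonic mean of the two."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Hypothetical example: an agent that reports 18 correct findings, 2 unsupported ones,
# and misses 12 expected findings scores precision 0.90, recall 0.60, F1 0.72.
```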
Implications for Enterprise AI: These metrics confirm that “Chat with your Data” strategies struggle with “Diligence” tasks at scale. While Deep Research tools are built for open-domain exploration, our findings suggest that for specific, high-stakes audits, a domain-specific agentic solution provides the superior architecture. The Targeted Assessment Agent proves that to maintain high precision in large-scale environments, we must move beyond simple retrieval and embrace structured, agentic reasoning.
Conclusion: Operationalizing reasoning
Model reasoning capabilities are advancing rapidly, but raw intelligence alone is likely insufficient to unlock value in high-stakes knowledge domains. Even the most capable human analyst requires a methodology to produce reliable work. AI agents are no different.
To close the capability gap, we must stop treating AI as a magic search box and start treating it like a human colleague. This means prescribing workflows that follow established research methods, and demanding traceable justifications that can be easily interpreted by the humans who remain responsible for the resulting business decisions.
Structured reasoning appears to be the key to this transition. The Coding-Extraction-Synthesis methodology is likely just one example of a broader class of workflows that will emerge as the industry takes on knowledge-intensive enterprise applications. We invite developers to explore how Vertex AI and Gemini can be orchestrated to build these rigorous systems, turning potential intelligence into professional diligence.



