Agentic Transformation: Data challenges
Authors: Vijay Sekhri, Jamie Ulrich
Date: Jul 25, 2025
Introduction
The era of the agentic enterprise is upon us. Intelligent agents have the opportunity to revolutionize how we work by automating tasks, providing instant insights, and driving decisions with unprecedented speed. But this gleaming future has an Achilles’ heel: the data it is built upon. When these autonomous systems consume enterprise data, they inherit its flaws, turning silent data issues into active operational risks.
What happens when an agent acts on last quarter’s sales numbers, gets confused by three conflicting versions of a project plan, or bases a critical recommendation on an inaccurate policy document? The promise of efficiency quickly crumbles, replaced by flawed decisions, user confusion, and a fundamental lack of trust. As data is increasingly consumed and acted upon by agents, the human oversight that might have previously caught these errors is now a step removed, creating an urgent need for dedicated agents that can evaluate and ensure data quality.
This blog dives into the four primary data challenges that can sabotage your agentic transformation: stale, duplicated, incorrect, and missing data. We will move beyond theory to offer concrete strategies and a detailed reference design for a “Data Quality Guardian” (DQG)—a specialized multi-agent system built to proactively identify and remediate these issues. Read on to learn how to fortify your data foundation and ensure your intelligent agents are architects of progress, not chaos.
Challenge #1: Stale data
One significant hurdle to the seamless operation of agentic transformation is the pervasive issue of stale data. When data becomes outdated, its utility diminishes rapidly, leading to a cascade of negative consequences within agentic systems. Specifically, reliance on such information can cause the system to deliver inaccurate or obsolete insights. This translates directly into flawed decision-making, as the agents operate on a foundation of incorrect assumptions. Consequently, the actions these agents take, derived from those flawed decisions, will also be misguided and potentially detrimental. This underscores the critical importance of ensuring data freshness and accuracy to maintain the integrity and effectiveness of any agentic transformation initiative.
Experience from the field
Experience developing agents for the enterprise has taught us that stale data has compounding, detrimental effects on the development, testing, and deployment phases of projects. During a recent project that relied on indexing websites to feed a RAG system, the agent being built was routinely returning incorrect responses grounded in stale data. For testers interacting with the agent via a GUI, it was difficult to discern whether the agent itself was failing (hallucinating, choosing the wrong tool, failing to understand the user request) or whether the agent was doing an excellent job of finding out-of-date information. In this scenario, stale data was not only creating bad outputs, which erodes confidence in the solution and its value, but also adding an extra step to evaluation and troubleshooting. A foundation of relevant, fresh data lets developers and testers more easily determine that a bad response was a mechanical failure of the agent, rather than a well-executed task that surfaced out-of-date information. This challenge is particularly detrimental when considered in the context of an enterprise's personnel: the teams that manage website content are typically separate from the organizations that build and develop agents, so addressing stale website content requires reaching outside the development team to understand why the content was still published and to fix it.
Prescriptions
- Analyze all data sources connected to agents. Determine the required data freshness for each source based on its usage (e.g., real-time sales data vs. quarterly financial reports).
- Implement robust data synchronization mechanisms:
  - Real-time: for critical, frequently changing data, if using Agentspace, leverage its planned "real-time sync" capabilities.
  - Incremental syncs: for less volatile data, configure incremental syncs to capture changes since the last update. This is more efficient than full syncs.
  - Scheduled full syncs: for a comprehensive refresh, schedule periodic full syncs, especially for sources where incremental syncs might be complex or less reliable. Balance frequency against system load.
- Establish data lifecycle management policies. Define policies for:
  - Content expiry and review: implement alerts for content owners to review or update information before it becomes stale.
  - Archival: define rules for archiving old but potentially still valuable data. Agentspace should be configurable to search within archives when needed or to exclude them by default; this can be done with standard filters.
  - Deletion: establish clear criteria and processes for deleting truly obsolete data from source systems. Consider whether stale data can be detected in the source system itself.
- Implement content versioning (at the source):
  - Encourage or enforce version control in source systems. This allows tracking changes and, if needed, reverting to previous versions. Consider whether source data can always be versioned under a standard scheme across many different data sources, or whether a one-time data analysis can assign versions to existing sources.
  - The solution could be enhanced to recognize and differentiate between versions if this metadata is exposed through connector data attributes; boost-and-bury controls can then be applied to those attributes.
- Introduce a user feedback mechanism for stale content:
  - Allow users of the agentic application to flag content they believe is outdated. This feedback should trigger a review process by the content owner.
  - Integrate a feedback button within the agentic UI next to search results or document previews. Aggregate this data and review it biweekly to build concrete action items for fixing the source data. This post-ingestion strategy is useful when the data at the source cannot be fixed and versioning cannot be added: the stale data is already indexed, so you are relying on users to surface it.
- Monitor and alert on stale data:
  - Develop metrics to track data freshness (e.g., the percentage of documents not updated in X months). Set up alerts for content owners or administrators when data exceeds its defined freshness threshold. This may not apply to data sources, such as HR policies, that change infrequently.
  - Utilize the agent application's datastore schema to track content age, if it can be added during data ingestion. This helps identify potentially stale but frequently accessed items.
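The freshness metric above can be sketched as a small check over document metadata. This is a minimal illustration, assuming each indexed document exposes a `last_modified_date` field in ISO 8601 format; the field name and threshold are assumptions, not a specific datastore schema:

```python
from datetime import datetime, timezone

def freshness_report(documents, max_age_days):
    """Flag documents whose last_modified_date exceeds a freshness threshold.

    `documents` is assumed to be a list of dicts with 'id' and
    'last_modified_date' (ISO 8601, timezone-aware) fields, as might come
    from a datastore schema that tracks content age at ingestion.
    """
    now = datetime.now(timezone.utc)
    stale = []
    for doc in documents:
        modified = datetime.fromisoformat(doc["last_modified_date"])
        age_days = (now - modified).days
        if age_days > max_age_days:
            stale.append({"id": doc["id"], "age_days": age_days})
    pct_stale = 100 * len(stale) / len(documents) if documents else 0.0
    return {"stale_documents": stale, "percent_stale": round(pct_stale, 1)}
```

In practice such a report would run on a schedule, with alerts routed to content owners when `percent_stale` crosses an agreed threshold.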
Challenge #2: Duplicated data
Duplicate data creates confusion, skews analytics, and hinders Agentic systems. Users lose trust due to conflicting information (e.g., multiple addresses for one contact), wasting time and causing frustration. Analytics suffer as duplicate records inflate metrics, leading to flawed decisions (e.g., understated revenue per customer). Agentic systems retrieve inconsistent information, causing erroneous responses, incorrect actions, and incomplete tasks, undermining efficiency and personalized experiences.
Experience from the field
During a project for a retail company, an agent-building team faced duplicated-data challenges relating to specific products. The agent had access to high-level product descriptions, the type commonly seen in marketing materials and on ecommerce websites, but also to deep technical specifications contained in PDFs, with both types of assets stored in the same repository. While neither source of data was incorrect, inconsistency in which source was chosen affected the quality of a given agent output. During testing, the variation in responses created a suboptimal experience. The solution was to split the assets into different repositories and to better prompt the agent about which data source to use, based on the nature of the user query.
Prescriptions
- Identify a "single source of truth" (SSOT):
  - For critical data entities, designate a primary system or database as the SSOT, and strive to have agents prioritize or heavily weight information from it.
  - Connector configurations could allow weighting or prioritization of certain data sources.
- Implement deduplication techniques during ingestion/indexing:
  - On the source side, generate unique fingerprints (hashes) for documents or data chunks during ingestion. If an identical hash is encountered, the item is flagged as a duplicate.
  - This would be a core enhancement to the agent's data ingestion pipeline: the data connector would need to honor some form of fuzzy similarity logic and skip very similar documents. Out of the box, this functionality may not be available in agent frameworks.
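The fingerprinting step can be sketched as follows. This is a simplified illustration of exact-duplicate detection only; it assumes chunks arrive as plain strings, and fuzzy similarity would need embeddings or shingling on top of it:

```python
import hashlib

def dedupe_by_fingerprint(chunks):
    """Drop exact-duplicate chunks by SHA-256 fingerprint.

    Returns (unique, duplicates). Whitespace is normalized first so
    trivially reformatted copies of the same text collide on one hash.
    """
    seen = set()
    unique, duplicates = [], []
    for chunk in chunks:
        # Normalize whitespace before hashing.
        fingerprint = hashlib.sha256(
            " ".join(chunk.split()).encode("utf-8")
        ).hexdigest()
        if fingerprint in seen:
            duplicates.append(chunk)
        else:
            seen.add(fingerprint)
            unique.append(chunk)
    return unique, duplicates
```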
- Utilize a knowledge graph for deduplication. Some agentic services, such as Agentspace, may provide knowledge graph access:
  - Enhance the knowledge graph to identify and link duplicate entities or documents. For example, if two documents refer to the same project under slightly different titles, the knowledge graph can help resolve them.
  - The knowledge graph can create relationships such as "duplicate of" or "version of", allowing Agentspace to present a consolidated view or a preferred version.
- Standardize data entry processes (at the source):
  - Implement standardized templates and naming conventions for document creation and data entry in source systems. This reduces the chance of inadvertent duplication.
- Provide tools for manual review and merging:
  - Where automated deduplication is uncertain, provide an interface for data stewards or administrators to review potential duplicates and decide whether to merge, delete, or keep them.
  - Admin controls in the agent UI could include a "potential duplicates" queue.
- Enable user reporting of duplicates:
  - As with stale content, allow users to flag suspected duplicate content within the agent application.
  - This is a feedback mechanism in the UI and serves as a post-hoc corrective action.
- Use vector embeddings:
  - Generate vector embeddings that capture the semantic content of the data. During ingestion, or as a scheduled post-process, compute vector similarities and generate a report of highly similar items for rectification.
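The embedding-similarity report can be sketched with plain-Python cosine similarity. The embeddings themselves are assumed to come from an external model (for example, a Vertex AI text-embedding model); the 0.95 threshold is an illustrative assumption to be tuned per corpus:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def near_duplicate_report(doc_embeddings, threshold=0.95):
    """Pairwise-compare document embeddings and report pairs above threshold.

    `doc_embeddings` maps document IDs to embedding vectors; the threshold
    is a tunable assumption, not a recommended default.
    """
    ids = list(doc_embeddings)
    report = []
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            sim = cosine_similarity(doc_embeddings[ids[i]], doc_embeddings[ids[j]])
            if sim >= threshold:
                report.append((ids[i], ids[j], round(sim, 3)))
    return report
```

The brute-force pairwise loop is fine for small corpora; at scale, an approximate-nearest-neighbor index would replace it.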
Challenge #3: Incorrect or inaccurate data
The detrimental impact of incorrect data on the efficacy of agentic systems is profound. When agents are fed erroneous information, they inevitably propagate inaccuracies, leading to a cascade of negative consequences. This isn't merely an inconvenience; it directly translates to the dissemination of false information, which can erode trust, mislead users, and undermine the credibility of the entire system.
Furthermore, flawed data directly drives flawed decisions. Agentic systems, by their very nature, are designed to analyze information and make recommendations or take actions based on that analysis. If the foundational data is compromised, the decisions derived from it will be inherently flawed. This can manifest in various ways, from inefficient resource allocation and misguided strategic planning to critical errors in sensitive operations. The potential for harm extends across diverse domains, including financial decisions, medical diagnoses, logistical operations, and even critical infrastructure management. The integrity of data is therefore paramount, as it forms the bedrock upon which reliable, intelligent, and impactful agentic systems are built. Without accurate and clean data, the promise of agentic transformation remains elusive and potentially dangerous.
Experience from the field
Working with a healthcare provider customer, part of the agentic solution was to make recommendations to patients about where to seek care. One input to the agent’s reasoning about where a patient should consider getting care was the hours of operation for each location, accessed via an API. The team discovered quickly that the developed agent could perfectly execute reasoning, but would still end up making an illogical recommendation based on incorrect hours associated with particular locations in the database. In digging into the issue, the complications became apparent quickly. Special hours could be added for a location, just a few days before the day in question. A one time update to the system wouldn’t account for the fact that facilities updated their hours based on holidays and staffing on relatively short notice. The solution, in this case, was to implement a policy around lead time ahead of implementing special hours and to make that policy known to the agent. In this way, when reasoning about which facilities may be the best for a patient within that lead time, the agent could speak to the hours of the facility with confidence.
Prescriptions
- Establish strong upstream data governance and authoring processes:
  - This is fundamental. Implement clear ownership, accountability, and review/approval workflows for content creation and modification in source systems, and train employees on data accuracy standards.
- Implement data validation rules at ingestion:
  - Configure agent connectors (or the source systems themselves) with validation rules to check data types, formats, ranges, and consistency before data is indexed. For example, a "project start date" should always precede the "project end date".
  - Advanced connector-specific settings, such as content filtering at ingestion, could be extended to include validation rules.
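Validation rules like the start/end-date check can be expressed as simple predicates evaluated before indexing. This is a hedged sketch; the field names (`project_start`, `project_end`, `budget`) and the rules themselves are illustrative, not a real connector schema:

```python
from datetime import date

def validate_record(record, rules):
    """Apply simple validation rules to a record before it is indexed.

    Each rule is a (name, predicate) pair; predicates return True when
    the record is valid. Returns the names of failed rules (an empty
    list means the record passes).
    """
    return [name for name, check in rules if not check(record)]

# Illustrative rules for a hypothetical project record.
RULES = [
    ("start_before_end",
     lambda r: r["project_start"] < r["project_end"]),
    ("budget_non_negative",
     lambda r: r.get("budget", 0) >= 0),
]
```

Records that fail validation would be quarantined or routed back to the data owner rather than silently indexed.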
- Utilize AI-powered anomaly detection:
  - Employ AI models to identify outliers or data points that deviate significantly from established patterns, potentially indicating errors.
  - Google's AI capabilities can be applied to the ingested data to flag anomalies for review.
- Integrate user feedback and correction workflows:
  - Allow users to flag incorrect information within the agent application. This should trigger a workflow for the data owner to verify and correct the information in the source system.
  - Feedback mechanisms, and integration with task management systems via custom agents, can be developed.
- Implement citations and traceability:
  - Ensure the agent provides clear citations for its answers, linking back to the source documents. This allows users to verify information and identify the source of any inaccuracies.
- Conduct regular data audits and quality scoring:
  - Conduct periodic audits of key data sources to assess accuracy, and develop data quality scores for different datasets.
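One way to turn audit findings into a dataset-level quality score is a weighted penalty over the four challenge dimensions discussed in this post. The weights below are illustrative assumptions and should be tuned per dataset:

```python
def data_quality_score(stats):
    """Combine per-dimension defect rates into a single 0-100 score.

    `stats` holds fractions in [0, 1] for each audited dimension
    (share of stale, duplicate, incorrect, and missing records);
    the weights are illustrative, not a standard.
    """
    weights = {"stale": 0.25, "duplicate": 0.25, "incorrect": 0.35, "missing": 0.15}
    penalty = sum(weights[k] * stats.get(k, 0.0) for k in weights)
    return round(100 * (1 - penalty), 1)
```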
Challenge #4: Missing data
Missing data poses significant hurdles, leading to several critical issues within an organization. Primarily, it results in incomplete search results, meaning that users are unable to retrieve a comprehensive view of available information. This directly impacts the effectiveness of information retrieval systems and can lead to missed opportunities or flawed decision-making due to a lack of complete data.
Furthermore, missing data severely cripples the ability of automated agents or AI systems to perform tasks effectively. These agents rely on complete and accurate datasets to execute their functions, whether it’s processing requests, automating workflows, or providing intelligent insights. When data is absent, agents either fail to initiate tasks, produce erroneous outputs, or are unable to complete their assigned duties, negating the benefits of automation.
Finally, and perhaps most broadly, missing data prevents a complete and accurate understanding of enterprise knowledge. Enterprise knowledge encompasses all the information, insights, and experiences accumulated within an organization. Gaps in data mean that critical pieces of this knowledge are absent, leading to an incomplete and potentially misleading picture of the business landscape, operational processes, and customer interactions. This fragmented understanding can hinder strategic planning, innovation, and overall organizational efficiency.
Prescriptions
- Monitor search queries and agent task failures for gaps:
  - Track searches that yield no results or results with low user engagement (e.g., no clicks). This often indicates missing information.
  - Monitor when agents fail to complete tasks due to missing required data.
  - Utilize the agent's built-in analytics to identify common queries with poor results.
- Implement automated alerts and workflows for content creation:
  - When missing-data patterns are identified (e.g., multiple users searching for a non-existent "Policy X"), trigger alerts or automated tasks for designated content owners or subject matter experts to create the missing content.
  - Custom agents could be built to manage this workflow.
- Perform proactive content gap analysis:
  - Identify key knowledge areas critical to business operations or upcoming projects.
  - Use agents to search for existing content related to these areas and identify gaps.
  - Build search capabilities and potentially specialized "content audit" agents.
- Enable user-driven content requests:
  - Provide a mechanism for users to formally request information or documentation they could not find.
  - Develop a dedicated feature, or an agent designed for content requests, as an enhanced custom UI.
- Utilize imputation techniques (cautiously):
  - For certain types of structured data, where appropriate, consider statistical or ML-based imputation to fill in missing values. Do this with caution, clearly marking imputed data, as it can introduce inaccuracies if not handled correctly.
  - Advanced AI models could potentially support sophisticated imputation for specific use cases, again clearly flagging such data.
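A cautious imputation sketch consistent with the advice above: missing values are filled with the column mean and loudly flagged so downstream agents can discount them. The flag-field naming convention (`<field>_imputed`) is an assumption for illustration:

```python
def impute_with_flag(records, field):
    """Fill missing numeric values with the column mean, marking imputed rows.

    Imputed values are explicitly flagged rather than silently
    substituted, so consumers can tell real data from estimates.
    """
    present = [r[field] for r in records if r.get(field) is not None]
    if not present:
        return records  # nothing to impute from
    mean = sum(present) / len(present)
    for r in records:
        if r.get(field) is None:
            r[field] = mean
            r[f"{field}_imputed"] = True  # loud marker for downstream consumers
    return records
```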
Reference design
Here is a design for another multi-agent application that can help solve these data challenges for other agentic applications. The design uses the Agent Development Kit (ADK) for its flexibility in creating complex, multi-step agentic workflows and integrating various tools. The DQG acts as a root orchestrator agent.
1. DQG Root Agent (ADK Workflow Agent)
Orchestration Logic
- Runs on a configurable schedule (e.g., weekly or biweekly), initiated by Cloud Scheduler triggering a Vertex AI Pipeline.
- Invokes specialized "detector" child agents/tools.
- Aggregates findings from all detectors.
- Prioritizes issues based on severity/impact (configurable rules).
- Formats alerts, incorporating links to remediation playbooks.
- Sends alerts via a notification module (e.g., Pub/Sub connected to email, Slack, or a ticketing system such as Jira/ServiceNow via their APIs, or potentially future Agentspace actions).
- Logs all operations and findings to Cloud Logging for audit and monitoring.
State Management (ADK Session/State)
- Maintains context for each run, tracks previously alerted issues (to avoid re-alerting issues that are unresolved or explicitly snoozed), and stores run configurations.
Configuration
- Managed through a configuration file or a simple UI (if extended), defining:
  - Data sources/connectors in scope for Agentspace.
  - Staleness thresholds (e.g., "documents in the SharePoint 'Old Policies' site not updated in 6 months").
  - Similarity thresholds for duplicate detection.
  - Keywords/topics for missing-data analysis.
  - Notification channels and recipients.
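Such a run configuration might look like the following fragment. Every field name and value here is an illustrative assumption, not an ADK or Agentspace schema:

```python
# Hypothetical DQG run configuration; field names are illustrative.
DQG_CONFIG = {
    "schedule": "weekly",
    "data_sources": ["sharepoint_policies", "confluence_eng"],
    "staleness_thresholds": {
        # source -> maximum age in days before a document is flagged
        "sharepoint_policies": 180,
        "confluence_eng": 90,
    },
    "duplicate_similarity_threshold": 0.95,
    "missing_data_topics": ["onboarding", "expense policy"],
    "notification": {"channel": "pubsub", "topic": "dqg-alerts"},
}
```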
2. Specialized Detector Child Agents / Tools (Built with ADK)
These can be a mix of ADK FunctionTool (for deterministic logic) and LlmAgent (where more nuanced understanding is needed).
Knowledge Gap Identifier (ADK LLM Agent + FunctionTool)
Mechanisms
Agents Search Log Analysis (FunctionTool)
- Requires access to Agentspace's search analytics (planned for Q2 '25 in the roadmap).
- Analyzes queries with zero results, low click-through rates, or consistently poor user feedback (if available).
Targeted Probing (LLM Agent)
- Maintains a list of critical business topics or common employee questions. This is a golden dataset that evolves and is kept up to date.
- Periodically, the LLM agent formulates natural-language questions on these topics and "asks" the agent, simulating a user query.
- It then evaluates the quality, completeness, and relevance of the agent's retrieved results and generated answers; poor or missing answers indicate a knowledge gap. The LLM uses its reasoning to make this assessment.
Content Inventory Check (FunctionTool)
- Scans for the presence of expected document types or sections related to critical processes (e.g., "Is there an onboarding guide for X role?").

Alert Content

- Frequent searches for 'term' yield no or poor results in the agent; content may be needed.
- Agentspace provided a weak answer when asked about a critical topic, suggesting a knowledge gap.
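The search-log analysis can be sketched as a small function over exported analytics entries. The log format (`query`, `num_results`) is an assumed export shape for illustration, not the actual Agentspace analytics API:

```python
from collections import Counter

def knowledge_gap_candidates(search_log, min_occurrences=3):
    """Surface queries that repeatedly return zero results.

    `search_log` is assumed to be a list of
    {'query': str, 'num_results': int} entries exported from the
    search analytics backend; queries are normalized before counting.
    """
    zero_hits = Counter(
        entry["query"].strip().lower()
        for entry in search_log
        if entry["num_results"] == 0
    )
    return [q for q, n in zero_hits.items() if n >= min_occurrences]
```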
Stale Content Detector (ADK FunctionTool)
Mechanism
- Queries the agent's index for document metadata (e.g., last_modified_date, created_date).
- Compares dates against configured staleness thresholds per source/category.
Alert Content
- Document X (link) from Source Y appears stale (last updated YYYY-MM-DD). Policy states review after N months.
Duplicate Content Detector (ADK LLM Agent or advanced FunctionTool)
Mechanism
Candidate Generation
- Identifies potential duplicates based on title similarity, file-name patterns, or very similar short descriptions/summaries fetched via the agent.
Deep Analysis (using Vertex AI)
- For candidate pairs/clusters, fetches full content or significant excerpts via the agent.
- Generates embeddings for the content using the Vertex AI Embeddings API.
- Calculates similarity scores; documents above a configured threshold are flagged.
- Optionally, an LLM (Gemini) can be used to review highly similar pairs for semantic equivalence beyond textual overlap.
Alert Content
- Potential duplicates found: Doc A (link) and Doc B (link) share X% similarity. Review for consolidation.
Inaccurate Data Detector (ADK LLM Agent + FunctionTool)
Mechanisms
User-Reported Inaccuracies
- Integrates with a system where users can flag content within the agent application (this might be a custom feedback tool if not native). The DQG agent ingests these reports.
Pattern-Based Detection (FunctionTool)
- For specific, known types of errors in structured or semi-structured data available via the agent (e.g., product SKU X listed with conflicting prices across Document A and Document B).
Contradiction Detection (LLM Agent - Experimental)
- For critical topics, extract summaries from multiple trusted documents via the agent, then use an LLM (Gemini API) prompted to identify direct contradictions.
Alert Content
- User X flagged content Y as incorrect regarding Z.
- Contradictory information found regarding 'topic' in Doc A and Doc B.
3. Notification & Playbook Module
Alerting
- The DQG root agent uses an alerting tool that publishes structured alert messages to Google Cloud Pub/Sub.
- Cloud Functions (or other subscribers) trigger on Pub/Sub messages to:
  - Send formatted emails via SendGrid or another email API.
  - Post messages to Slack channels using the Slack API.
  - Create tickets in Jira/ServiceNow via their APIs, or via "create Jira ticket" actions.
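The alerting path might be sketched as a testable payload builder plus a thin publish helper. The payload fields and playbook URL are illustrative assumptions; the publish helper assumes the `google-cloud-pubsub` client (`PublisherClient.topic_path` / `publish`) and needs GCP credentials to actually run:

```python
import json

def format_alert(issue_type, resource, detail):
    """Build the structured alert payload the DQG publishes.

    Field names are illustrative; subscribers (email, Slack, ticketing)
    can route on 'issue_type'. The playbook URL is a hypothetical example.
    """
    return json.dumps({
        "issue_type": issue_type,  # e.g. "stale", "duplicate", "inaccurate", "missing"
        "resource": resource,      # link or ID of the affected document
        "detail": detail,
        "playbook": f"https://wiki.example.com/dqg/{issue_type}",
    }).encode("utf-8")

def publish_alert(publisher, project_id, topic, payload):
    """Publish the payload to a Pub/Sub topic.

    `publisher` is a google.cloud.pubsub_v1.PublisherClient, passed in
    so the payload builder above stays testable offline.
    """
    topic_path = publisher.topic_path(project_id, topic)
    return publisher.publish(topic_path, data=payload)  # returns a future
```

Keeping payload construction separate from publishing lets the alert format be unit-tested without any GCP dependency.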
Conclusion
The success of your agentic transformation hinges on the quality of your data. Stale, duplicated, incorrect, and missing data can undermine the effectiveness of your intelligent agents, leading to flawed decisions and a lack of trust. By implementing the strategies outlined in this article and considering a proactive approach like the Data Quality Guardian, you can fortify your data foundation and ensure that your intelligent agents are powerful catalysts for progress. The journey to an agentic enterprise requires a commitment to data quality, and with the right tools and strategies, you can unlock the full potential of this transformative technology.
