From Chaos to Clarity: Fast-Track Data Engineering with Autocleaning

From Chaos to Clarity: Fast-Track Data Engineering with Autocleaning

Data engineers spend too much time cleaning messy data. Hours are lost writing complex parsing logic, profiling raw sources, and fixing pipeline crashes.

What if an agent could do that for you?

In this post, we’ll show how the Data Engineering Agent uses its Autocleaning feature to instantly turn unstandardized raw sources into clean, production-ready tables—saving you hours of tedious work.

The Challenge: A2Z Retail’s Data Integration Obstacles

Imagine A2Z Retail (a fictitious sample company) just acquired a third-party platform. They need to merge the new platform’s user data (partner_users_raw) into their core subscription_users table to power their marketing intelligence. If you want to follow along, point your Data Engineering Agent to the messiest file or table you can think of in your environment, or use the schema provided below to create a similar table. The instructions and steps will be very similar.

Let’s look at the characteristics of the raw partner data:

  • Nested Structs: The user’s full name was trapped inside a user_profile struct and needed to be split into first_name and last_name.
  • Arrays of Mixed Data: Contact info was an array containing emails, phone numbers, and social handles all mixed together. A2Z only wanted the email.
  • Unstructured JSON Blobs: Geographic tracking and demographics were stored as stringified JSON blobs with inconsistent keys (like age_str vs age or gender_code vs sex).
  • Inconsistent Dates: The registration_date came in multiple formats (e.g., MM-dd-yyyy, MMM d, yyyy) and even included invalid dates like 99/99/2022.

Manually mapping and cleaning this would require significant time spent writing custom SQL transformations.

The Solution: Enter Autocleaning

Instead of doing this manually, the user provided a highly simple prompt with no specific cleansing instructions to the Data Engineering Agent:

“Create a plan and pause for approval to clean and append the new data from a2z_retail_sources.partner_users_raw into existing table a2z_retail_sources.subscription_users.”

An "Ask agent" prompt requesting to clean new user data.

The Agent utilized its Autocleaning capabilities to:

  1. Proactively Profile: It analyzed the source data to identify anomalies and understand the schema characteristics.

  2. Map Schemas: It looked at the schema of both the source and target table and identified the necessary conversions.

  3. Correct Data Types: It cleaned string prefixes using type-casting.

  4. Deconstruct Complex Types: It automatically parsed nested JSON and unnested array elements.

  5. Split and Restructure Values: It separated compound strings like full names into distinct columns.

  6. Detect Invalid Dates: It safely handled mixed date formats and neutralized impossible dates (e.g., “99/99/2020”) without breaking the query.

  7. Standardize Values: It mapped inconsistent raw entries (like ‘Fb_ads’, ‘Organic Search’) into clean, uniform labels (like ‘Facebook’ or ‘Search’).

  8. Construct Geographic Objects: It extracted coordinates from JSON and constructed valid GEOGRAPHY types.

Schema Details

To understand the complexity, let’s look at the schemas of both tables.

Source Table (partner_users_raw) Schema:

Column Name Data Type Mode
partner_user_id STRING NULLABLE
user_profile RECORD NULLABLE
user_profile.full_name STRING NULLABLE
user_profile.contact_info RECORD REPEATED
user_profile.contact_info.type STRING NULLABLE
user_profile.contact_info.value STRING NULLABLE
user_profile.demographics JSON NULLABLE
location_data STRING NULLABLE
registration_date STRING NULLABLE
referral_source STRING NULLABLE

Target Table (subscription_users) Schema:

Column Name Data Type Mode
id INTEGER NULLABLE
first_name STRING NULLABLE
last_name STRING NULLABLE
email STRING NULLABLE
age INTEGER NULLABLE
gender STRING NULLABLE
street_address STRING NULLABLE
city STRING NULLABLE
state STRING NULLABLE
postal_code STRING NULLABLE
country STRING NULLABLE
latitude FLOAT NULLABLE
longitude FLOAT NULLABLE
traffic_source STRING NULLABLE
created_at TIMESTAMP NULLABLE
user_geom GEOGRAPHY NULLABLE

The Proposed Plan

Below is the exact Autocleaning plan generated by the Data Engineering Agent to standardize the raw dataset. Users have full flexibility to adjust this plan by replying with conversational comments (e.g., “Exclude middle names during split”), and the agent will autonomously regenerate the strategy to match:

A 'new_users' data pipeline interface with autocleaning steps.

Autocleaning Steps

  1. partner_user_id:
    • Issue Detected: The id in the destination table is an integer, while partner_user_id is a string with a ‘P-’ prefix.
    • Transformation Applied: Remove the ‘P-’ prefix and cast the remaining string to an INTEGER.
  2. user_profile.full_name:
    • Issue Detected: The destination table has separate first_name and last_name columns, while the source has a single full_name.
    • Transformation Applied: Split full_name into first_name (the first word) and last_name (the remaining part).
  3. user_profile.contact_info:
    • Issue Detected: The email needs to be extracted from an array of structs.
    • Transformation Applied: Unnest the contact_info array and filter for the entry where type is ‘email’ to extract the value.
  4. user_profile.demographics:
    • Issue Detected: age is a string with negative values, and gender is a code.
    • Transformation Applied: Parse the JSON, and for age, cast age_str to INTEGER and take the absolute value. For gender, map the gender_code to the corresponding full string value (e.g., ‘F’ for ‘female’, ‘M’ for ‘male’, ‘unknown’ to NULL).
  5. location_data:
    • Issue Detected: This is a JSON string with nested data and inconsistent country codes.
    • Transformation Applied: Parse the JSON to extract address, city, region, zip, country_code, lat, and long. Standardize country_code (e.g., ‘USA’ to ‘US’). Create the user_geom field using ST_GEOGPOINT.
  6. registration_date:
    • Issue Detected: Contains multiple formats and invalid dates (e.g., ‘99/99/2020’).
    • Transformation Applied: Use COALESCE with SAFE.PARSE_TIMESTAMP for different formats (‘%m/%d/%Y’, ‘%b %d, %Y’) to convert the date to a TIMESTAMP. Invalid dates will become NULL.
  7. referral_source:
    • Issue Detected: Inconsistent values for the same traffic source (e.g., ‘Google’ vs. ‘Organic Search’).
    • Transformation Applied: Use a CASE statement to standardize the values (e.g., ‘Google’, ‘Organic Search’, ‘Search_Organic’ all become ‘Google’).

The Result: Rapid Generation of Production-Ready SQL

Generated data pipeline with autocleaning

Below is the actual query the agent generated autonomously:

SELECT
  SAFE_CAST(REPLACE(raw.partner_user_id, 'P-', '') AS INT64) AS id,
  (SELECT part FROM UNNEST(SPLIT(raw.user_profile.full_name, ' ')) part LIMIT 1) AS first_name,
  SUBSTR(raw.user_profile.full_name, STRPOS(raw.user_profile.full_name, ' ') + 1) AS last_name,
  (SELECT value FROM UNNEST(raw.user_profile.contact_info) WHERE type = 'email' LIMIT 1) AS email,
  ABS(SAFE_CAST(JSON_VALUE(raw.user_profile.demographics, '$.age_str') AS INT64)) AS age,
  CASE
    WHEN JSON_VALUE(raw.user_profile.demographics, '$.gender_code') = 'F'
    THEN 'Female'
    WHEN JSON_VALUE(raw.user_profile.demographics, '$.gender_code') = 'M'
    THEN 'Male'
    ELSE NULL
  END AS gender,
  JSON_VALUE(SAFE.PARSE_JSON(raw.location_data), '$.address') AS street_address,
  JSON_VALUE(SAFE.PARSE_JSON(raw.location_data), '$.city') AS city,
  JSON_VALUE(SAFE.PARSE_JSON(raw.location_data), '$.region') AS state,
  JSON_VALUE(SAFE.PARSE_JSON(raw.location_data), '$.zip') AS postal_code,
  CASE
    WHEN JSON_VALUE(SAFE.PARSE_JSON(raw.location_data), '$.country_code') IN ('USA', 'US') THEN 'United States'
    ELSE JSON_VALUE(SAFE.PARSE_JSON(raw.location_data), '$.country_code')
  END AS country,
  SAFE_CAST(JSON_VALUE(SAFE.PARSE_JSON(raw.location_data), '$.loc.lat') AS FLOAT64) AS latitude,
  SAFE_CAST(JSON_VALUE(SAFE.PARSE_JSON(raw.location_data), '$.loc.long') AS FLOAT64) AS longitude,
  ST_GEOGPOINT(SAFE_CAST(JSON_VALUE(SAFE.PARSE_JSON(raw.location_data), '$.loc.long') AS FLOAT64), SAFE_CAST(JSON_VALUE(SAFE.PARSE_JSON(raw.location_data), '$.loc.lat') AS FLOAT64)) AS user_geom,
  CASE
    WHEN raw.referral_source IN ('Google', 'Organic Search', 'Search_Organic')
    THEN 'Google'
    WHEN raw.referral_source IN ('Fb_ads', 'Facebook')
    THEN 'Facebook'
    ELSE raw.referral_source
  END AS traffic_source,
  COALESCE(
    SAFE.PARSE_TIMESTAMP('%m/%d/%Y', raw.registration_date),
    SAFE.PARSE_TIMESTAMP('%b %d, %Y', raw.registration_date)
  ) AS created_at
FROM
  ${ref("partner_users_raw")} raw

Key Highlights of the Generated Query:

  • Resilient Date Parsing: The COALESCE with multiple SAFE.PARSE_TIMESTAMP format patterns ensures that mixed date formats (e.g., MM/dd/yyyy, MMM d, yyyy) are handled safely without breaking the sequence.
  • Safe JSON & Geographic Processing: The query utilizes SAFE.PARSE_JSON securely and constructs a ST_GEOGPOINT autonomously from loose coordinates for geographic analysis.
  • String & Array Restructuring: Unnesting techniques extract unique target elements securely (like the single target email), while string functions split compound full names autonomously.
  • Numeric Cleanups: It applies mathematical rules autonomously (like taking the absolute value of age) to rectify entry errors securely.
  • Standardizing categories: Uses robust CASE expression mappings to group inconsistent strings into clean uniform labels autonomously.

Are you following along with your own tables? We’d love to hear how the Data Engineering Agent performed! Drop a comment below to share your results, feedback, or questions.

Next Chapter: Campaign Attribution using Raw GCS CSV Logs

With the core user table successfully cleaned, A2Z Retail turns its attention to their next objective: identifying exactly which marketing campaigns drove these users to sign up. The acquired system drops daily campaign click records as flat CSV files into a Google Cloud Storage (GCS) bucket.

Here is a sample of the raw data sitting in GCS:

click_id,user_id,click_timestamp,campaign_metadata,cost,session_duration,revenue
CLK001,P-101,2025-03-01T15:41:00Z,source=email;medium=ad;campaign=back_to_school,USD 3.00,57s,€1385.76
CLK002,P-102,2025-03-02T10:22:00Z,source=email;medium=post;campaign=user_retention,£17.85,118s,997.68 USD
CLK004,P-104,2025-03-16T01:30:00Z,source=organic;medium=search;campaign=back_to_school,€14.08,6m 17s,$165.69

Notice the multiple unstandardized formats and units: the campaign_metadata column is a single string with semicolon-separated key-value pairs, cost has mixed currency symbols, session_duration uses various time formats, and revenue contains mixed textual indicators. This presents a complex data cleaning challenge for raw unmanaged object storage.

How the Agent Solves It:

With another highly minimal prompt carrying no explicit instruction rules on string parsing or type casting:

“Create a plan to clean the campaign click logs in gs://campaign_clicks and join them with our subscription users to attribute campaigns.”

Upon scanning the CSV flat file in the GCS bucket, the Data Engineering Agent autonomously profiles the column data to formulate a transformation plan safely:

Generated plan for GCS cleaning.

Autocleaning Steps for GCS Logs

  1. user_id:
    • Issue Detected: The user_id has a ‘P-’ prefix and is a string.
    • Transformation Applied: Remove the ‘P-’ prefix and cast the remaining string to an INTEGER.
  2. campaign_metadata:
    • Issue Detected: The campaign metadata is a single string of key-value pairs.
    • Transformation Applied: Parse the string to extract source, medium, and campaign into separate columns.
  3. cost:
    • Issue Detected: The cost column contains different currency symbols (‘€’, ‘£’) and is a string.
    • Transformation Applied: Standardize the cost to USD by applying conversion rates (1 EUR = 1.1 USD, 1 GBP = 1.2 USD) and cast the result to a FLOAT64.
  4. session_duration:
    • Issue Detected: The session_duration column has multiple formats (e.g., ‘5m’, ‘6m 17s’, ‘573s’).
    • Transformation Applied: Convert the duration to seconds. A regular expression will be used to extract minutes and seconds and calculate the total duration in seconds as an INTEGER.
  5. revenue:
    • Issue Detected: The revenue column contains different currency symbols (‘€’) and suffixes (‘USD’).
    • Transformation Applied: Standardize the revenue to USD by applying conversion rates and removing currency symbols and suffixes. The result will be cast to a FLOAT64.

The agent handles this by:

  1. Creating a Dataform operations file to define an external table for the GCS CSV files.

    CREATE OR REPLACE EXTERNAL TABLE ${self()} (
      click_id STRING,
      user_id STRING,
      click_timestamp TIMESTAMP,
      campaign_metadata STRING,
      cost STRING,
      session_duration STRING,
      revenue STRING
    )
    OPTIONS (
      format = 'CSV',
      uris = ['gs://campaign_clicks/logs.csv'],
      skip_leading_rows = 1
    );
    
  2. Using Autocleaning to autonomously profile and resolve inconsistencies:

    • Standardizing user_id by removing the ‘P-’ prefix autonomously and type-casting to INT64.
    • Parsing the campaign_metadata string using regular expressions to unbundle source, medium, and campaign into separate columns.
    • Unifying cost into cost_usd by stripping non-numeric characters, applying proportional exchange conversions, and casting to FLOAT64.
    • Normalizing session_duration into total seconds by using regular expressions to sum minute and second markers into a standard INT64.
    • Trimming revenue into revenue_usd by removing currency signs or text flags safely and converting to USD autonomously before type-casting to FLOAT64.

Generated code with autocleaning.

  SELECT
    click_id,
    SAFE_CAST(REPLACE(user_id, 'P-', '') AS INT64) AS user_id,
    click_timestamp,
    REGEXP_EXTRACT(campaign_metadata, r'source=([^;]+)') AS source,
    REGEXP_EXTRACT(campaign_metadata, r'medium=([^;]+)') AS medium,
    REGEXP_EXTRACT(campaign_metadata, r'campaign=([^;]+)') AS campaign,
    SAFE_CAST(
      CASE
        WHEN REGEXP_CONTAINS(cost, '€') THEN CAST(REGEXP_REPLACE(cost, r'[^0-9.]', '') AS FLOAT64) * 1.1
        WHEN REGEXP_CONTAINS(cost, '£') THEN CAST(REGEXP_REPLACE(cost, r'[^0-9.]', '') AS FLOAT64) * 1.2
        ELSE CAST(REGEXP_REPLACE(cost, r'[^0-9.]', '') AS FLOAT64)
      END AS FLOAT64
    ) AS cost_usd,
    SAFE_CAST(
      COALESCE(SAFE_CAST(REGEXP_EXTRACT(session_duration, r'(\d+)m') AS INT64) * 60, 0) +
      COALESCE(SAFE_CAST(REGEXP_EXTRACT(session_duration, r'(\d+)s') AS INT64), 0) +
      COALESCE(SAFE_CAST(REGEXP_EXTRACT(session_duration, r'^(\d+)$') AS INT64), 0)
    AS INT64) AS session_duration_seconds,
    SAFE_CAST(
      CASE
        WHEN REGEXP_CONTAINS(revenue, '€') THEN CAST(REGEXP_REPLACE(revenue, r'[^0-9.]', '') AS FLOAT64) * 1.1
        ELSE CAST(REGEXP_REPLACE(revenue, r'[^0-9.]', '') AS FLOAT64)
      END AS FLOAT64
    ) AS revenue_usd
  FROM
    ${ref("campaign_clicks_external")}
  1. Generating the SQL to join this cleaned click data with the subscription_users table we created earlier, producing a final campaign_attribution table.

    SELECT
      s.id AS user_id,
      s.first_name,
      s.last_name,
      s.email,
      s.age,
      s.gender,
      s.country,
      c.click_id,
      c.click_timestamp,
      c.source AS campaign_source,
      c.medium AS campaign_medium,
      c.campaign AS campaign_name,
      c.cost_usd,
      c.session_duration_seconds,
      c.revenue_usd
    FROM
      ${ref("subscription_users")} s
    JOIN
      ${ref("campaign_clicks_staging")} c
      ON s.id = c.user_id
    

This is the power of the Data Engineering Agent in action. By autonomously cleaning messy CSV flat files, it transforms days of tedious string-parsing into minutes of rapid review. Engineers can now seamlessly link GCS CSV datasets with production tables without writing a single manual regular expression.

Conclusion: Seamless Autocleaning for BigQuery and GCS CSV Files

Data integration doesn’t have to be a grind. Whether you are unnesting complex structures in BigQuery or standardizing messy CSV flat files in Cloud Storage, the Data Engineering Agent’s Autocleaning feature handles the heavy lifting of regex extraction, JSON parsing, and type casting autonomously. Days of manual profiling shrink into simple, single-click approvals. Engineers are finally free to focus on strategy and architecture without compromising on top-tier data quality!

This is really cool! I see it is adding the pipelines to a default Dataform repo. Will it support adding them to a determined existing repo?

Yes, it should work with an existing dataform repository as well. You can engage with the agent from the Dataform editor in Google Cloud.