DataPlex’s Data Quality Scan, when configured in incremental mode with the Updated_TS timestamp column, uses a watermark mechanism to determine which records to process. In the initial run—absent any previously stored state—the scan processes all records in the table, spanning two years of data, and then records the maximum Updated_TS value as the high watermark. This watermark serves as the baseline for subsequent scans, which will only consider records updated after this recorded timestamp.
However, this default behavior does not align with the specific goal of scanning only records updated during the previous day. In consecutive runs, the scan will include any records with an Updated_TS greater than the previous high watermark, regardless of whether those records fall within the desired “previous day” timeframe. This means that without further refinement, the scan may process data outside the intended window or even return no results if no updates exceed the existing watermark.
To precisely target records updated between 12:00 AM and 11:59 PM on the day before the scan runs, a row filter must be applied. For example, in BigQuery one might add the condition:
DATE(Updated_TS) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
This row filter restricts the scan to only those records whose date part of Updated_TS matches yesterday’s date. With the filter in place, even the initial run will effectively limit processing to records from the previous day, storing the maximum timestamp from that filtered set as the high watermark. Subsequent runs will continue to use both the incremental logic and the row filter, ensuring that only the intended subset of data is evaluated each day.
It is essential to ensure that the time zones for Updated_TS, the DataPlex scan settings, and the CURRENT_DATE() function are all aligned, as discrepancies can lead to incorrect filtering. Additionally, if there is a possibility of late-arriving or backfilled data, further adjustments—such as implementing a lookback window—may be necessary.
While the default incremental scan behavior in DataPlex does not restrict scans to a specific day’s records, incorporating an appropriate row filter enables the scan to target only records updated on the previous day, thereby meeting the desired objective efficiently.