Dataform - Incremental Update (MERGE), But Delete When Not Matched By Source

Avoiding a full table scan for the deletion of unmatched rows in an incremental update scenario can be challenging, but there are strategies you can employ to make the process more efficient. The key is to limit the scope of the deletion to only those rows that are likely to be affected by the recent changes.
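For context, the pattern being optimized is a BigQuery MERGE with a WHEN NOT MATCHED BY SOURCE clause. A minimal sketch (table and column names are placeholders):

```sql
merge `target_table` t
using `source_table` s
on t.order_id = s.order_id
when matched then
  update set t.status = s.status, t.last_updated = s.last_updated
when not matched then
  insert (order_id, status, last_updated)
  values (s.order_id, s.status, s.last_updated)
-- this clause deletes target rows absent from the source,
-- and is what forces the full scan of the target table
when not matched by source then
  delete;
```

The approaches below all aim to shrink the set of target rows that this clause (or an equivalent post-operation delete) has to consider.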

Here are some approaches:

1. Timestamp-Based Filtering:

  • Leverage timestamp columns (e.g., last_updated) to filter rows needing deletion.
  • Scan only rows updated since the last Dataform run.

Example SQL:

post_operations {
  delete from `target_table`
  where last_updated < [timestamp of last successful run]
    -- note: `not in` matches nothing if the subquery returns any NULL order_id
    and order_id not in (select order_id from `source_table`);
}

2. Incremental Deletion with Temporary Table:

  • Create a temporary table storing processed IDs (e.g., order_id) from the current run.
  • Use the temporary table to limit deletion to unprocessed rows.

Example SQL:

-- Create a temporary table with the IDs processed in this run
create temp table processed_ids as
select order_id from `source_table`;

-- Delete target rows whose IDs were not processed
-- (`not exists` is null-safe, unlike `not in`)
delete from `target_table` t
where not exists (
  select 1 from processed_ids p where p.order_id = t.order_id
);

3. Partitioning:

  • Utilize table partitioning, especially by date, to restrict deletion to relevant partitions.
  • Reduces data scanned during deletion.
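A sketch of a partition-pruned delete, assuming `target_table` is partitioned on a hypothetical delivery_date DATE column and that deletions only affect recent data:

```sql
delete from `target_table` t
-- the partition filter lets BigQuery prune all older partitions
where t.delivery_date >= date_sub(current_date(), interval 7 day)
  and not exists (
    select 1 from `source_table` s where s.order_id = t.order_id
  );
```

The 7-day window is an example value; it must cover the furthest-back partition the source can still change.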

4. Change Data Capture (CDC):

  • If your source system supports CDC, track deletions and only delete CDC-flagged rows in the target table.
  • Requires additional setup in the source system.
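A sketch, assuming a hypothetical source_changelog table with operation and change_time columns populated by the CDC pipeline:

```sql
delete from `target_table`
where order_id in (
  select order_id
  from `source_changelog`
  where operation = 'DELETE'
    and change_time > [timestamp of last successful run]
);
```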

5. Use of Indexes / Clustering:

  • On databases with conventional indexes, index the columns used in the join and deletion predicates (e.g., order_id).
  • BigQuery has no traditional indexes; clustering the target table on those columns serves a similar purpose and can significantly reduce the data scanned.
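In BigQuery (Dataform's usual warehouse) there are no traditional indexes; the closest equivalent is clustering, which can be declared in the Dataform config block. A sketch with placeholder names (delivery_date is a hypothetical DATE column):

```sqlx
config {
  type: "incremental",
  bigquery: {
    partitionBy: "delivery_date",
    clusterBy: ["order_id"]
  }
}
```

Clustering on order_id co-locates rows with the same key, so the anti-join behind the delete reads far fewer blocks.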

6. Batch Processing:

  • For large data volumes, consider breaking down the deletion into smaller batches.
  • Improves manageability and resource optimization.
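A sketch using BigQuery's procedural scripting to delete in bounded chunks (the batch size of 10000 is an arbitrary example):

```sql
repeat
  delete from `target_table`
  where order_id in (
    select t.order_id
    from `target_table` t
    left join `source_table` s using (order_id)
    where s.order_id is null
    limit 10000  -- batch size
  );
until @@row_count = 0
end repeat;
```

Each iteration touches at most one batch, which keeps individual statements short and makes progress easier to monitor and cancel.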

Remember:

  • Each method has trade-offs and may require additional setup or maintenance.
  • Choose the best approach based on your specific data, update frequency, and source system capabilities.
  • Thoroughly test any new method to ensure it meets your performance and accuracy requirements.