Equivalent of Insert Overwrite strategy in DataForm

Hello, i’m testing DataForm instead of DBT, and i’m missing a feature we use a lot.

In DBT incremental mode we have a mode called incremental_overwrite, what it does is the following:


- create a temp table with the data you want to add incrementally (table_temp)

- set partitions_to_delete = select distinct <partition_column> from table_temp

- MERGE on <table> source USING(table_temp dest) ON False

when not matched by source and dest.<partition_column> IN (partitions_to_delete) then delete

when not matched by source then insert ...

It’s pretty nice when the data volume is importance (making merge too slow) and no clear id exists to deduplicate.

I’m looking to do this and struggle a bit, i saw those conversations about custom merge logic: https://www.googlecloudcommunity.com/gc/Data-Analytics/Use-custom-merge-logic-for-incremental-table-in-Dataform/m-p/681288 and the option Creating a Custom Operations SQLX File.

Would those same options apply ? or is there an easier way ? or a repo where i could check a few merge strategies already written ?

Thanks

2 Likes

While Dataform doesn’t have a direct equivalent to incremental_overwrite, you can achieve a similar effect with the following approach:

Custom Merge Logic:

  1. Incremental Table: Configure your Dataform table as type: "incremental".
  2. Custom SQLX Operations: Create a SQLX operations file to house your merge logic within a post_operations block. This will handle partition deletion and insertion based on your data.

Example:

config { type: "incremental" } 

pre_operations {
  -- Logic to create/populate 'table_temp' if needed
}

post_operations {
  DELETE FROM `${self()}` 
  WHERE <partition_column> IN (SELECT DISTINCT <partition_column> FROM table_temp);

  INSERT INTO `${self()}`
  SELECT * FROM table_temp;
}

Important Considerations:

  • Partitioning: Partition your target table by the partition_column for optimal efficiency.
  • Error Handling: Incorporate error handling in your SQLX operations to ensure robustness.
  • Dataform Version: Be mindful of potential syntax differences between Dataform versions.
2 Likes