There are definitely ways to address this scenario in Dataform that offer more flexibility than a static 24-hour interval for partition filtering. Below are some options and recommendations:
- Dynamic Partition Filtering with SQLX and Variables
Core Idea: Use a BigQuery scripting variable, declared from within your SQLX file, to calculate a dynamic timestamp for filtering. The timestamp can be based on your data's actual update patterns rather than a fixed 24-hour window.
Implementation:
- Declare a Variable: Declare a BigQuery scripting variable (e.g., last_update_timestamp) in a pre_operations block of your incremental SQLX file; variables declared there are in scope for the MERGE that Dataform generates.
- Set Variable Value: Give the variable a DEFAULT expression that determines the appropriate timestamp. This could be based on the maximum timestamp of your source data, a metadata table, or other logic specific to your data update frequency.
-- In your incremental SQLX file (see the complete sketch below)
pre_operations {
  DECLARE last_update_timestamp TIMESTAMP DEFAULT (
    SELECT MAX(update_time) FROM your_source_data_table
  );
}
- Use Variable in updatePartitionFilter:
config {
  type: "incremental",
  uniqueKey: ["id"],
  bigquery: {
    partitionBy: "DATE(timestamp)",
    updatePartitionFilter: "timestamp >= last_update_timestamp"
  }
}
Note that updatePartitionFilter sits inside the bigquery block alongside partitionBy, references the declared variable by name (BigQuery has no @variable syntax), and only takes effect when uniqueKey causes Dataform to generate a MERGE.
Benefits:
- Highly flexible; adapts to the actual update patterns of your data.
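For reference, here are the two pieces assembled into a single incremental file; a minimal sketch, assuming placeholder names (your_source_data_table, update_time, timestamp, id) carried over from the snippets above:
-- definitions/your_incremental_table.sqlx (sketch)
config {
  type: "incremental",
  uniqueKey: ["id"],
  bigquery: {
    partitionBy: "DATE(timestamp)",
    updatePartitionFilter: "timestamp >= last_update_timestamp"
  }
}

pre_operations {
  -- Cutoff derived from the source's actual update pattern
  DECLARE last_update_timestamp TIMESTAMP DEFAULT (
    SELECT MAX(update_time) FROM your_source_data_table
  );
}

SELECT id, column1, column2, timestamp
FROM your_source_data_table
-- Only scan recent source rows on incremental runs; full scan on refresh
WHERE ${when(incremental(), `timestamp >= last_update_timestamp`, `TRUE`)}
On incremental runs, Dataform generates a MERGE keyed on id and applies updatePartitionFilter on the target side, so only recent partitions are scanned.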
- Incremental Merge with Backfill
Core Idea: If your incremental updates are infrequent or periodic, you could perform a backfill merge on a less frequent schedule. This would involve merging a larger dataset less often, while still handling smaller incremental updates in between.
Implementation:
- Backfill Merge: Schedule a Dataform action to perform a merge over a wider date range (e.g., weekly, monthly) without a partition filter; see the sketch after the considerations below.
- Incremental Merge: For smaller, more frequent updates, use your dynamic filtering approach to merge only the recent changes.
Considerations:
- Best suited when incremental updates are relatively small compared to the overall dataset.
- May require careful orchestration of Dataform actions and data update schedules.
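A rough sketch of the backfill action, assuming the same placeholder tables and a tag-based schedule (the tag name here is illustrative):
-- definitions/weekly_backfill.sqlx (sketch)
config {
  type: "operations",
  tags: ["weekly_backfill"] // attach this tag to a weekly or monthly schedule
}

-- Full-range merge: no partition filter, so every target partition is scanned
MERGE INTO your_target_table AS target
USING (SELECT * FROM your_source_data_table) AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET
  column1 = source.column1,
  column2 = source.column2
WHEN NOT MATCHED THEN INSERT (id, column1, column2, timestamp)
VALUES (source.id, source.column1, source.column2, source.timestamp);
The frequent incremental action keeps its dynamic updatePartitionFilter, so only the periodic backfill pays the full-scan cost.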
- Custom SQLX Operations
Core Idea: If the above options don’t fully fit your scenario, you have complete control with a custom SQLX file. This gives you the most granular control over the merge logic, including complex filtering.
Benefits:
- Maximum flexibility; ideal for scenarios that require highly specific logic.
Example: Custom SQLX Operation with Dynamic Partition Filtering
Because the MERGE is written by hand here, the file uses type: "operations" rather than type: "incremental" (which would generate its own MERGE):
-- SQLX file (e.g., custom_merge.sqlx)
config {
  type: "operations"
}

-- Derive the cutoff dynamically from the source data
DECLARE last_update_timestamp TIMESTAMP DEFAULT (
  SELECT MAX(update_time) FROM your_source_data_table
);
-- Incremental merge with a dynamic filter on both source and target
MERGE INTO your_target_table AS target
USING (
  SELECT * FROM your_source_data_table
  WHERE timestamp >= last_update_timestamp
) AS source
ON target.id = source.id
  -- Prunes target partitions; assumes a row's timestamp is stable across updates
  AND target.timestamp >= last_update_timestamp
WHEN MATCHED THEN UPDATE SET
  column1 = source.column1,
  column2 = source.column2
WHEN NOT MATCHED THEN INSERT (id, column1, column2, timestamp)
VALUES (source.id, source.column1, source.column2, source.timestamp);
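As a variation, referencing the source through ref() lets Dataform include the custom operation in its dependency graph; a sketch, assuming your_source_data_table is declared or defined elsewhere in the project:
-- definitions/custom_merge_with_refs.sqlx (sketch)
config {
  type: "operations"
}

DECLARE last_update_timestamp TIMESTAMP DEFAULT (
  SELECT MAX(update_time) FROM ${ref("your_source_data_table")}
);

MERGE INTO your_target_table AS target
USING (
  SELECT * FROM ${ref("your_source_data_table")}
  WHERE timestamp >= last_update_timestamp
) AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET column1 = source.column1
WHEN NOT MATCHED THEN INSERT (id, column1, timestamp)
VALUES (source.id, source.column1, source.timestamp);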
Choosing the Right Approach
Factors to Consider:
- Data Update Frequency: How often does your source data change?
- Data Volume: How large are the incremental updates?
- Complexity: How specific does your merge logic need to be?
Start with the dynamic filtering approach as it’s flexible and relatively straightforward to implement. If you need more specialized logic, explore custom SQLX operations for the ultimate control.