There are definitely ways to address this scenario in Dataform that offer more flexibility than a static 24-hour interval for partition filtering. Below are some options and recommendations:
- Dynamic Partition Filtering with SQLX and Variables
Core Idea: Use a BigQuery scripting variable, declared from within your SQLX file, to calculate a dynamic timestamp for filtering. The timestamp can be based on your data's actual update patterns rather than a fixed 24-hour window.
Implementation:
- Declare a Variable: Declare a BigQuery scripting variable (e.g., last_update_timestamp) in a pre_operations block of your incremental SQLX file; variables declared there are in scope for the MERGE that Dataform generates.
- Set Variable Value: Give the variable a DEFAULT expression that determines the appropriate timestamp. This could be based on the maximum timestamp of your source data, a metadata table, or other logic specific to your data update frequency.
-- In your incremental SQLX file (see the complete sketch below)
pre_operations {
  DECLARE last_update_timestamp TIMESTAMP DEFAULT (
    SELECT MAX(update_time) FROM your_source_data_table
  );
}
- Use Variable in updatePartitionFilter:
config {
  type: "incremental",
  uniqueKey: ["id"],
  bigquery: {
    partitionBy: "DATE(timestamp)",
    updatePartitionFilter: "timestamp >= last_update_timestamp"
  }
}
Note that updatePartitionFilter sits inside the bigquery block alongside partitionBy, references the declared variable by name (BigQuery has no @variable syntax), and only takes effect when uniqueKey causes Dataform to generate a MERGE.
Benefits:
- Highly flexible; adapts to the actual update patterns of your data.
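For reference, here are the two pieces assembled into a single incremental file; a minimal sketch, assuming placeholder names (your_source_data_table, update_time, timestamp, id) carried over from the snippets above:
-- definitions/your_incremental_table.sqlx (sketch)
config {
  type: "incremental",
  uniqueKey: ["id"],
  bigquery: {
    partitionBy: "DATE(timestamp)",
    updatePartitionFilter: "timestamp >= last_update_timestamp"
  }
}

pre_operations {
  -- Cutoff derived from the source's actual update pattern
  DECLARE last_update_timestamp TIMESTAMP DEFAULT (
    SELECT MAX(update_time) FROM your_source_data_table
  );
}

SELECT id, column1, column2, timestamp
FROM your_source_data_table
-- Only scan recent source rows on incremental runs; full scan on refresh
WHERE ${when(incremental(), `timestamp >= last_update_timestamp`, `TRUE`)}
On incremental runs, Dataform generates a MERGE keyed on id and applies updatePartitionFilter on the target side, so only recent partitions are scanned.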
- Incremental Merge with Backfill
Core Idea: If your incremental updates are infrequent or periodic, you could perform a backfill merge on a less frequent schedule. This would involve merging a larger dataset less often, while still handling smaller incremental updates in between.
Implementation:
- Backfill Merge: Schedule a Dataform action to perform a merge over a wider date range (e.g., weekly, monthly) without a partition filter; see the sketch after the considerations below.
- Incremental Merge: For smaller, more frequent updates, use your dynamic filtering approach to merge only the recent changes.
Considerations:
- Best suited when incremental updates are relatively small compared to the overall dataset.
- May require careful orchestration of Dataform actions and data update schedules.
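A rough sketch of the backfill action, assuming the same placeholder tables and a tag-based schedule (the tag name here is illustrative):
-- definitions/weekly_backfill.sqlx (sketch)
config {
  type: "operations",
  tags: ["weekly_backfill"] // attach this tag to a weekly or monthly schedule
}

-- Full-range merge: no partition filter, so every target partition is scanned
MERGE INTO your_target_table AS target
USING (SELECT * FROM your_source_data_table) AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET
  column1 = source.column1,
  column2 = source.column2
WHEN NOT MATCHED THEN INSERT (id, column1, column2, timestamp)
VALUES (source.id, source.column1, source.column2, source.timestamp);
The frequent incremental action keeps its dynamic updatePartitionFilter, so only the periodic backfill pays the full-scan cost.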
- Custom SQLX Operations
Core Idea: If the above options don’t fully fit your scenario, you have complete control with a custom SQLX file. This gives you the most granular control over the merge logic, including complex filtering.
Benefits:
- Maximum flexibility; ideal for scenarios that require highly specific logic.
Example: Custom SQLX Operation with Dynamic Partition Filtering
Because the MERGE is written by hand here, the file uses type: "operations" rather than type: "incremental" (which would generate its own MERGE):
-- SQLX file (e.g., custom_merge.sqlx)
config {
  type: "operations"
}

-- Derive the cutoff dynamically from the source data
DECLARE last_update_timestamp TIMESTAMP DEFAULT (
  SELECT MAX(update_time) FROM your_source_data_table
);
-- Incremental merge with a dynamic filter on both source and target
MERGE INTO your_target_table AS target
USING (
  SELECT * FROM your_source_data_table
  WHERE timestamp >= last_update_timestamp
) AS source
ON target.id = source.id
  -- Prunes target partitions; assumes a row's timestamp is stable across updates
  AND target.timestamp >= last_update_timestamp
WHEN MATCHED THEN UPDATE SET
  column1 = source.column1,
  column2 = source.column2
WHEN NOT MATCHED THEN INSERT (id, column1, column2, timestamp)
VALUES (source.id, source.column1, source.column2, source.timestamp);
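As a variation, referencing the source through ref() lets Dataform include the custom operation in its dependency graph; a sketch, assuming your_source_data_table is declared or defined elsewhere in the project:
-- definitions/custom_merge_with_refs.sqlx (sketch)
config {
  type: "operations"
}

DECLARE last_update_timestamp TIMESTAMP DEFAULT (
  SELECT MAX(update_time) FROM ${ref("your_source_data_table")}
);

MERGE INTO your_target_table AS target
USING (
  SELECT * FROM ${ref("your_source_data_table")}
  WHERE timestamp >= last_update_timestamp
) AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET column1 = source.column1
WHEN NOT MATCHED THEN INSERT (id, column1, timestamp)
VALUES (source.id, source.column1, source.timestamp);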
Choosing the Right Approach
Factors to Consider:
- Data Update Frequency: How often does your source data change?
- Data Volume: How large are the incremental updates?
- Complexity: How specific does your merge logic need to be?
Start with the dynamic filtering approach as it’s flexible and relatively straightforward to implement. If you need more specialized logic, explore custom SQLX operations for the ultimate control.