Hey guys! I have 3 queries below, I’m struggling to understand why the bytes billed are so different between Q2 & Q3. The results of all 3 queries are below. When I manually type in the value of the max(timestamp), bytes billed are in the KB, but when I run a subquery it’s 92 MB, whereas just the actual subquery it self is only 2 MB?
Is there a way I can reduce billing when I run this in Dataform? I tried pre_operations, regular declare, everything seems to return an error.
#Q1
select max(timestamp) from dummy_a;
#Q2
select * from dummy_b where timestamp = '2025-06-01 23:48:30 UTC';
Q3
select * from dummy_b where timestamp = (select max(timestamp) from dummy_a);
Q1:
Q2:
Q3:
Hi @Pavitra ,
Welcome to Google Cloud Community!
The difference in bytes billed is due to partition pruning and how BigQuery’s optimizer handles literals vs. subquery results in the WHERE clause.
- Q1: select max(timestamp) from dummy_a; (2.48 MB)
- BigQuery scans dummy_a (or relevant parts if dummy_a itself is partitioned/clustered on timestamp) to find the maximum timestamp. The 2.48 MB is the amount of data read from dummy_a.
- Q2: select * from dummy_b where timestamp = ‘2025-06-01 23:48:30 UTC’; (1.25 KB)
- This is extremely efficient. The 1.25 KB strongly suggests that dummy_b is partitioned by the timestamp column (e.g., daily, hourly).
- When you provide a literal timestamp value in the WHERE clause, BigQuery’s query planner can immediately identify which specific partition(s) contain this timestamp. It then only scans those partitions.
- If the data for that specific timestamp within its partition is very small, or if timestamp is also a clustering column within the partition, the amount of data scanned will be minimal.
- Q3: select * from dummy_b where timestamp = (select max(timestamp) from dummy_a); (97.56 MB)
- Even though the subquery (select max(timestamp) from dummy_a) resolves to the exact same literal value (‘2025-06-01 23:48:30 UTC’), the BigQuery optimizer might not be able to “see” this resolved value at the stage where it decides which partitions of dummy_b to scan.
- Optimization Fence: The subquery can act as an “optimization fence.” The planner knows it needs a value from the subquery, but it might not be able to leverage that dynamic value for partition pruning on dummy_b as effectively as it does with a compile-time literal.
- As a result, it might scan a much larger portion of dummy_b (perhaps many partitions, or even the whole table if pruning completely fails) before filtering down to the rows that match the subquery’s result. The 97.56 MB is likely the size of dummy_b or a significant, unpruned portion of it that was scanned.
With regards to reduction of billing in Dataform and Big Query in general, the goal is to evaluate the MAX(timestamp) once and then use its literal result in the query against dummy_b so that partition pruning can kick in effectively. BigQuery Scripting is perfect for this. You can use BigQuery scripting within your Dataform SQLX files.
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.
Hi @ruthseki ,
Thank you for your response! That makes sense in theory but seems very sub optimal for BQ to do in practice.
Anyway, could you please give me an example script to use in Dataform? I’ve tried using a declare and set in a pre-operations
config, but it rendered to no scripting variable available. Thank you in advance!