My table has a very large logical size (in Tb’s) and is already partitioned, clustered, and benefits from query caching, but costs are still high because the connected application is used frequently. I am considering a multi-tenant design with separate sharded tables for each customer’s data source. I have read that querying across many shards (max 8-9 per customer, in my case) can lead to performance issues when multiple data sources need to be queried together. Is it true? Also, should I instead shard only by customer, and would that help reduce cost? I have not found much reliable information on this, so I wanted guidance on what would be the most effective BigQuery design in this case.
My understanding is that explicitly shareded multiple tables can lead to more I/O because under the covers, BigQuery can’t optimize the storage on your behalf. Functionally, it will work. My common recommendation is to run experiments. Setup a some copies of your data for short periods and run some queries. Know how to interpret the resource consumption of your queries and see what structuring best serves your needs.
I didn’t follow your thoughts on “… because the connected application is used frequently …”.
It sounds like you have the core notions in hand … BigQuery partitioning and BigQuery clustering. Usually having all your data in fewer tables and using the partitioning/clustering are the way to go. Run your experiments and post back your findings and we can review the results.
Hey,
Hope you’re keeping well.
Sharding in BigQuery can reduce scanned data cost only if queries consistently target a small subset of shards, but it often increases complexity and can hurt performance when you need to join or union many tables. BigQuery executes parallel scans across shards, but cross-shard joins or unions bypass some optimizations and can lead to higher slot usage, especially if clustering and partition pruning aren’t effective. If most queries are per-customer, a single partitioned table with a customer_id column and proper clustering can be more efficient, since BigQuery will prune partitions and clusters automatically without extra metadata overhead. You can monitor actual scanned bytes per query in the Query Plan tab to verify pruning effectiveness, and consider materialized views or table decorators for frequently accessed subsets. Sharding by customer is generally only worth it if isolation is needed for governance or billing, not purely for compute cost control.
Thanks and regards,
Taz