Datastream granularity explosion when ingesting Cloud SQL Postgres partitioned tables

We use Datastream to mirror (Cloud SQL) Postgres tables in BigQuery. We are hitting an issue when trying to replicate **partitioned** Postgres tables into BQ: our Datastream pipelines never finish backfilling, and the data throughput is 10 or even 100× the total size of the target tables. Is there someone who can help us debug this? We have experimented with our Cloud SQL Postgres setup in accordance with the docs and looked around online, but we haven't found a solution. This problem will be a deal breaker for using Datastream if we can't resolve it.

Handling partitioned tables seems like table stakes for a CDC system, so I'm hoping there is a simple solution.

Hello @Rob_Senseman,

Not knowing exactly what you've already tried, here's what I would attempt:

  • Delete every Datastream-related object:
    • the Datastream stream itself
    • the BigQuery resources
    • the publication and replication slot on PostgreSQL
  • Recreate the publication with publish_via_partition_root=true, as mentioned in the doc you shared (see the SQL sketch just after this list)
  • Create a fresh Datastream stream, targeting (at first) only one table, and set an appropriate max_staleness
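To make the publication step concrete, here is a minimal sketch, assuming logical replication with the pgoutput plugin; the publication, slot, and table names are placeholders for whatever your setup actually uses:

```sql
-- Drop the old publication so nothing keeps publishing at the
-- child-partition level.
DROP PUBLICATION IF EXISTS datastream_publication;

-- Recreate it so changes on child partitions are published under the
-- parent (root) table, which is the table Datastream mirrors into BigQuery.
CREATE PUBLICATION datastream_publication
    FOR TABLE public.my_partitioned_table
    WITH (publish_via_partition_root = true);

-- If you also dropped the replication slot, recreate it before starting
-- the new stream.
SELECT PG_CREATE_LOGICAL_REPLICATION_SLOT('datastream_slot', 'pgoutput');
```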

I think that deleting everything is a good start, to avoid any leftover configuration still being picked up, especially on BigQuery. Also, I remember hitting an issue with runaway throughput last year that turned out to be related to max_staleness.
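For reference, a sketch of setting max_staleness on the BigQuery side; the project, dataset, table names, and the 15-minute interval are placeholders, not recommendations:

```sql
-- Let BigQuery serve results up to 15 minutes stale instead of merging
-- staged CDC changes on every read, which keeps apply costs down.
ALTER TABLE `my_project.my_dataset.my_partitioned_table`
SET OPTIONS (max_staleness = INTERVAL 15 MINUTE);
```

My recollection is that a very low (or unset) max_staleness forces frequent merging of staged changes, which can show up as processed bytes far above the actual table size, so it's worth ruling out before digging deeper.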