Hi! In my team we have a Datastream pipeline PostgreSQL 13.20 → BigQuery. Recently, one of our tables had a schema issue and we needed to (pause the Datastream and) recreate the affected table in order to fix the schema. After restarting the stream, queries on the recreated table became very slow!
We ruled out a BigQuery slot issue: the slowdown didn’t happen before (and doesn’t happen on the backup table), the jobs explorer showed slots available, and, most tellingly, the execution plan shows that queries no longer read from the table updated by Datastream but from a table named CDC_TABLE_xxxxx_table_name.
I haven’t found any reference to this behavior in Datastream documentation or forums.
If anyone can help, I’d really appreciate it!
And if you could also share any paper or technical deep-dive on Datastream (if it exists) that would be great to better understand what’s going on under the hood.
When Datastream resumes after a schema change, BigQuery may temporarily read from a CDC staging table (e.g., CDC_TABLE_xxxxx_table_name) instead of the final destination table. This behavior is part of Datastream’s internal mechanism to ensure transactional consistency during catch-up or schema reconciliation.
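To confirm whether this is staleness-related, you can inspect the CDC options on the target table. A minimal sketch, assuming placeholder project, dataset, and table names:

```sql
-- List the max_staleness option on the Datastream target table
-- (my_project, my_dataset, and table_name are placeholders)
SELECT option_name, option_value
FROM `my_project.my_dataset`.INFORMATION_SCHEMA.TABLE_OPTIONS
WHERE table_name = 'table_name'
  AND option_name = 'max_staleness';
```

If `max_staleness` comes back empty or very low, queries are likely merging pending CDC rows at read time, which matches the symptoms you describe.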
Here are some possible solutions you can consider:
Re-establish Table and Stream: A possible solution is to pause the Datastream, drop the affected BigQuery target table, and then restart the stream to trigger a full historical backfill. While you paused and restarted the stream after recreating the source table, the missing step was likely dropping the BigQuery target table to allow a complete and clean re-synchronization. This ensures the target table is rebuilt correctly, with proper primary key recognition and a clean internal state.
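If you go this route, the drop itself is a single statement; the pause/resume is done in the Datastream console or via gcloud. A sketch, assuming placeholder names:

```sql
-- With the stream paused, drop the target table so the restarted
-- stream performs a clean backfill and recreates it with the
-- correct schema and primary keys (names are placeholders)
DROP TABLE IF EXISTS `my_project.my_dataset.table_name`;
```

After the backfill completes, verify in the execution plan that queries read from the destination table again rather than the CDC_TABLE_xxxxx staging table.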
Optimize max_staleness: Datastream sets max_staleness on the table to control when background merges happen. If staleness is too low, every query may trigger a runtime merge, scanning the full table even if filters exist.
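If merges are happening at query time, raising max_staleness shifts that work to background apply jobs. A hedged example, with placeholder names and an illustrative interval:

```sql
-- Allow queries to read the last merged snapshot (up to 15 minutes
-- stale) instead of merging pending CDC rows on every query
-- (table name and interval are placeholders; tune to your needs)
ALTER TABLE `my_project.my_dataset.table_name`
SET OPTIONS (max_staleness = INTERVAL 15 MINUTE);
```

The trade-off is freshness: queries may return data up to the configured interval old, in exchange for not paying the merge cost on every query.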
Proactive Schema Management: Implement strict change control to avoid DROP TABLE followed by CREATE TABLE operations on replicated source tables. These operations are known to disrupt Datastream’s internal state and cause performance issues. Prefer adding new columns, migrating data, and then deprecating the old ones over in-place modifications.
Datastream Health Metrics: Configure alerts on key Datastream metrics such as data freshness and throughput to detect and address replication lag or failures promptly.
You can also check the following documents for a deeper understanding: