I have a PostgreSQL database and I’m looking into the possibility of sending data directly to BigQuery via Pub/Sub, without using Dataflow.
I tested Debezium and Sequin; both sent messages to the Pub/Sub topic containing the operation performed (insert, update, or delete), but not in the format BigQuery requires (with the _CHANGE_TYPE field), and I haven't found anything to transform the messages to match the BigQuery schema.
My question is:
Is it possible to move data from Pub/Sub into BigQuery without the _CHANGE_TYPE attribute in the messages, or do I need something in between to transform the data before it reaches BigQuery?
My understanding is that for BigQuery CDC to engage, the message payload published to Pub/Sub must conform to BigQuery's expected format; if it doesn't, BigQuery CDC won't process it. If the messages published to Pub/Sub DO contain the necessary information but NOT in the expected structure, then you will indeed need to transform them before delivering them to BigQuery.
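For concreteness, for a hypothetical table with id and name columns, messages that BigQuery CDC could apply directly would look something like this (the _CHANGE_TYPE values BigQuery recognizes are UPSERT and DELETE):

```
{"id": 42, "name": "Alice", "_CHANGE_TYPE": "UPSERT"}
{"id": 42, "_CHANGE_TYPE": "DELETE"}
```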
You can choose any pipeline to perform the transformation. You mentioned Dataflow (Apache Beam), but there are others such as Dataproc (Apache Spark), Cloud Data Fusion, and more.
If you want to minimize the number of processing engines, one possibility is the relatively new "Single Message Transform" (SMT) capability of Pub/Sub (ref). This is a technique for transforming messages between publish and delivery entirely within Pub/Sub. At a high level, you are given a copy of the source message as a JavaScript object, and from there you can use the power of JavaScript to build a corresponding transformed message. Pub/Sub then owns the invocation of the JavaScript to transform the message.
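As a rough sketch, an SMT JavaScript UDF that stamps _CHANGE_TYPE onto a Debezium-style change event might look like the following. The function name, the envelope fields (op, before, after), and the op-code mapping are assumptions based on Debezium's default envelope; verify the exact UDF signature and supported JavaScript features against the current Pub/Sub SMT docs before relying on this.

```javascript
// Hypothetical SMT UDF. Pub/Sub hands the UDF the message with `data`
// as a string plus an attributes map, and uses whatever the function
// returns (returning null drops the message).
function addChangeType(message, metadata) {
  // Assumes a Debezium-style envelope: { op, before, after, ... }.
  const event = JSON.parse(message.data);

  // Debezium op codes: c = create, r = snapshot read, u = update, d = delete.
  const isDelete = event.op === 'd';
  const row = isDelete ? event.before : event.after;
  if (!row) return null; // no usable row image; drop the message

  // BigQuery CDC recognizes UPSERT and DELETE in _CHANGE_TYPE.
  row._CHANGE_TYPE = isDelete ? 'DELETE' : 'UPSERT';

  message.data = JSON.stringify(row);
  return message;
}
```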
BigQuery subscriptions from Pub/Sub can only ingest messages whose format matches the target table schema (when the subscription is configured to use the table schema), so if your messages don't already carry the _CHANGE_TYPE field and the required columns, they won't load directly. Short of transforming inside Pub/Sub itself, you'll need an intermediate layer to reshape the CDC events before BigQuery ingestion. The typical lightweight option is a Cloud Function or Cloud Run service triggered by the Pub/Sub topic that parses the Debezium/Sequin message, adds the _CHANGE_TYPE field, and outputs a JSON row matching your BigQuery schema. You can then write to BigQuery directly via the Storage Write API, or publish to a second topic with a BigQuery subscription once the format matches exactly.
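A minimal sketch of that middle layer as a Node.js Cloud Function (2nd gen) subscribed to the CDC topic. The function name, the output topic name cdc-for-bigquery, and the Debezium envelope handling are all assumptions for illustration; adapt them to your actual connector output and topology:

```javascript
const functions = require('@google-cloud/functions-framework');
const { PubSub } = require('@google-cloud/pubsub');

const pubsub = new PubSub();
// Hypothetical second topic with a BigQuery subscription attached.
const outTopic = pubsub.topic('cdc-for-bigquery');

// Maps Debezium op codes to BigQuery CDC change types.
const CHANGE_TYPES = { c: 'UPSERT', r: 'UPSERT', u: 'UPSERT', d: 'DELETE' };

functions.cloudEvent('transformCdc', async (cloudEvent) => {
  // Pub/Sub delivers the message payload base64-encoded.
  const raw = Buffer.from(cloudEvent.data.message.data, 'base64').toString('utf8');
  const event = JSON.parse(raw);

  // Debezium puts the row image in `after` (or `before` for deletes)
  // and the operation code in `op`; adjust for your producer's envelope.
  const changeType = CHANGE_TYPES[event.op];
  if (!changeType) return; // ignore events we don't handle

  const row = changeType === 'DELETE' ? event.before : event.after;
  row._CHANGE_TYPE = changeType;

  await outTopic.publishMessage({ json: row });
});
```

Attach a BigQuery subscription (with the table-schema write option enabled) to the output topic; since the republished rows now carry _CHANGE_TYPE, BigQuery can apply them as upserts and deletes.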