Hi @Nikita_G, your proposed architecture:
MySQL → DataStream → Pub/Sub → Cloud Function → BigQuery
is definitely creative, but there are a few key things to keep in mind that might help you choose the best path forward:
Is that flow feasible without GCS and Dataflow?
Technically, it can be done, but here’s the thing: DataStream isn’t designed to publish directly to Pub/Sub.
Typically, DataStream writes to Cloud Storage, with Dataflow then loading the data into BigQuery; it can also be configured with BigQuery itself as the destination.
Current limitations to be aware of:
- There’s no native integration between DataStream and Pub/Sub.
- You’d need to build a workaround to read files from GCS and forward them to Pub/Sub, which kind of brings GCS back into the picture anyway.
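To give a feel for what that workaround involves, here is a minimal sketch in Python. It assumes DataStream is writing newline-delimited JSON files to GCS and that something (for example, a storage-triggered function) has already read one file into a string. The `publish` callable is a hypothetical stand-in for whatever sends bytes to your Pub/Sub topic; it is not a real client API.

```python
import json
from typing import Callable


def forward_file_to_pubsub(file_text: str, publish: Callable[[bytes], None]) -> int:
    """Publish each non-empty JSON line of one file as a separate message.

    Assumption: the file holds one JSON event per line.
    `publish` is a hypothetical hook that sends bytes to your topic.
    Returns the number of messages published.
    """
    count = 0
    for line in file_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        event = json.loads(line)  # raises on malformed records
        # Re-serialize compactly so every message has a consistent shape.
        publish(json.dumps(event, separators=(",", ":")).encode("utf-8"))
        count += 1
    return count
```

Even in this stripped-down form you still need the bucket, a trigger, and error handling for partial files, which is why this route rarely ends up simpler than the standard setup.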
Simpler (and recommended) alternatives:
Option 1: The classic GCP architecture
MySQL → DataStream → GCS → Dataflow → BigQuery
This is the most common and well-supported setup on GCP. It’s scalable and reliable, though it may require a bit more initial setup.
Option 2: No-fuss ETL (no code, no GCS)
MySQL → Windsor.ai → BigQuery
With tools like Windsor.ai, you can connect MySQL as a data source and automatically send the data to BigQuery, without having to manage any infrastructure in between. You can also add basic transformation logic if needed.
Perfect if you’re looking for a fully managed solution without building and maintaining custom pipelines.
Option 3: Custom CDC setup
MySQL → Debezium (CDC) → Pub/Sub → Cloud Function → BigQuery
This route gives you more flexibility and control, but you’d need to handle and maintain the components yourself.
It’s ideal for hybrid environments or cases where you need precise control over change events.
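As a rough illustration of the Cloud Function step, here is a minimal Python sketch that turns one Pub/Sub message carrying a Debezium change event into a flat row. It assumes the message body is Debezium's JSON envelope (`op`, `before`, `after`, `ts_ms`, optionally wrapped in `payload`) and that the function receives the Pub/Sub event as a dict with a base64-encoded `data` field. The function name and the `_op` / `_source_ts_ms` column names are made up for this example.

```python
import base64
import json
from typing import Any, Dict, Optional

# Debezium operation codes mapped to readable names.
OP_NAMES = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot"}


def change_event_to_row(pubsub_event: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Convert one Pub/Sub-wrapped Debezium event into a row dict.

    Returns None for messages that are not row changes (e.g. tombstones).
    """
    raw = base64.b64decode(pubsub_event["data"]).decode("utf-8")
    envelope = json.loads(raw)
    if not isinstance(envelope, dict):
        return None  # e.g. a null tombstone message
    # With schemas enabled the event sits under "payload"; otherwise it is top-level.
    payload = envelope.get("payload", envelope)
    op = payload.get("op")
    if op not in OP_NAMES:
        return None
    # Deletes only carry the previous row image; everything else carries the new one.
    record = payload.get("before") if op == "d" else payload.get("after")
    row = dict(record or {})
    row["_op"] = OP_NAMES[op]  # hypothetical metadata column
    row["_source_ts_ms"] = payload.get("ts_ms")  # hypothetical metadata column
    return row
```

The resulting dict could then be handed to the BigQuery client (for example via `insert_rows_json`) inside the function. Deduplication, ordering, and schema changes are left out here, and those are exactly the parts you’d have to maintain yourself on this route.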
Final recommendation:
If you just need a daily or scheduled sync (not real-time), a tool like Windsor.ai can save you tons of time and configuration. But if real-time streaming and full control are what you’re after, then going with Debezium or the official GCP flow is your best bet.