Do we really need Dataflow for stream processing?

I think we’ll need to paint a broader picture. I’m hearing you say that you do indeed have events arriving. Are these events in Kafka, Pub/Sub or some other messaging system? Are they incoming REST requests? Are they micro-batches of files or direct database inserts? All of these will influence our discussion. Next comes the question of what processing has to be performed on an event when it arrives. I think I’m hearing that the event payload is going to be replicated or distributed across many tables. What is the nature of the back-end database? Is it BigQuery, Cloud SQL or something else?

Now let’s talk generics … to my ears, stream processing is the ingestion of data, its processing as part of a pipeline and then its disposition to live in a particular format at rest at the end. Many customers want as low a latency as possible from the time an event arrives at the enterprise to when it can be of value downstream.

If we didn’t use a stream processing engine, what would be the alternative? We could (I think) let the ingested events sit in a file or other storage medium when they arrive and then batch process them. This would add latency, since an event can’t be of value downstream until the next batch runs. Alternatively, we could write ourselves an application that receives the incoming events, processes them and deposits them … but if we wrote this application from scratch, we would effectively be re-creating what is provided by a stream processing engine today. I see a stream processing engine as a “platform” that you can use as a significant starting point for building stream processing solutions.

As for which stream engine to choose, you do indeed have a variety of options … many of which can run on Google Cloud. The default story from Google is Dataflow. Let’s realize that Dataflow is a Google marketing name for “managed Apache Beam”. The skills and techniques needed to build a stream pipeline are 100% open source Apache Beam. Dataflow is Google’s serverless environment for getting your Apache Beam job running with as little fuss as possible, with autoscaling and other useful features.
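To make the “write it ourselves” option a little more concrete, here is a minimal sketch in plain Python (standard library only, no Beam) of an application that receives events, processes them and deposits them into several tables. Everything in it is an assumption for illustration: the event shape, the table names `orders` and `users_seen`, and the in-memory lists standing in for a real database.

```python
import queue
import threading
from collections import defaultdict


def route(event):
    """Fan one event out into rows for each destination table.

    The event fields and table names are hypothetical.
    """
    return [
        ("orders", {"user": event["user"], "amount": event["amount"]}),
        ("users_seen", {"user": event["user"]}),
    ]


def worker(events, tables, lock):
    """Pull events off the queue until a None sentinel arrives."""
    while True:
        event = events.get()
        if event is None:
            break
        for table, row in route(event):
            # In-memory lists stand in for the real back-end database.
            with lock:
                tables[table].append(row)


def run(incoming, num_workers=2):
    """Ingest events, process them on worker threads, return the 'tables'."""
    events = queue.Queue()
    tables = defaultdict(list)
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(events, tables, lock))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for event in incoming:
        events.put(event)
    for _ in threads:
        events.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return dict(tables)


if __name__ == "__main__":
    print(run([{"user": "a", "amount": 1.0}, {"user": "b", "amount": 2.5}]))
```

Even this toy has to deal with queuing, worker threads and shutdown. It says nothing yet about retries, recovery after a crash, scaling with load or late-arriving data, which is the sort of plumbing I’d expect a stream processing engine to take off our hands.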

Let’s turn our conversation over to the question “If not stream processing, what is the alternative?” Looking forward to hearing back.
