Hybrid workloads: Can run both batch and streaming workloads.
Cons:
Steeper learning curve: Requires familiarity with Kubernetes concepts and best practices.
Additional costs: GKE resources and services may be billed in addition to Dataproc itself.
Dataproc Serverless
Pros:
Fully managed: No cluster management required.
Pay-as-you-go: Only pay for the resources used, making it cost-effective for intermittent workloads.
Scalability: Automatically scales to meet workload demands.
Cons:
Limited control: Offers less control over cluster configuration compared to Dataproc on GCE.
Potential for cold starts: May experience delays when starting new jobs after periods of inactivity.
Operational Considerations and Cost
Operational overhead: Dataproc on GCE requires the most operational overhead, while Dataproc Serverless requires the least.
Cost: Dataproc Serverless is generally the most cost-effective option for intermittent workloads, while Dataproc on GCE can be more cost-effective for long-running clusters.
Workload requirements: Consider the specific requirements of your workloads, such as batch processing, streaming, or machine learning, to determine the most suitable option.
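To make the cost point above concrete, here is a rough back-of-the-envelope sketch in Python. It only compares "pay for the whole period" (a long-running cluster) with "pay only while jobs run" (an ephemeral cluster or a serverless batch). Every rate, duration, and function name below is a hypothetical placeholder chosen for illustration, not real GCP pricing or a Dataproc API, so plug in numbers from the pricing pages before drawing conclusions.

```python
# Back-of-the-envelope cost model. All rates and durations are
# HYPOTHETICAL placeholders, not real Google Cloud prices.


def always_on_cost(hourly_rate: float, hours_in_period: float) -> float:
    """Cost of a cluster that stays up for the whole period, busy or idle."""
    return hourly_rate * hours_in_period


def per_job_cost(hourly_rate: float, job_hours: float, jobs: int,
                 startup_hours: float = 0.0) -> float:
    """Cost when resources exist only while a job runs.

    startup_hours models any billed time spent provisioning before the
    job does useful work (an assumption; set it to 0.0 to ignore it).
    """
    return hourly_rate * (job_hours + startup_hours) * jobs


def break_even_utilization(always_on_rate: float, per_job_rate: float) -> float:
    """Fraction of the period the cluster must be busy for both options
    to cost the same: always_on_rate * T == per_job_rate * busy_hours."""
    return always_on_rate / per_job_rate


if __name__ == "__main__":
    # Hypothetical scenario: 60 one-hour jobs in a 720-hour month.
    long_running = always_on_cost(hourly_rate=2.0, hours_in_period=720)
    on_demand = per_job_cost(hourly_rate=2.5, job_hours=1.0, jobs=60)
    print(f"long-running cluster: {long_running:.2f}")
    print(f"per-job resources:    {on_demand:.2f}")
    print(f"break-even utilization: {break_even_utilization(2.0, 2.5):.0%}")
```

With numbers like these, the per-job options win easily at low utilization, and the long-running cluster only catches up once it is busy most of the time. The model ignores everything except compute time, so treat it as a starting point.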
One question, though I have accepted your solution: cost-wise, can I achieve the same benefits with Dataproc on GCE as with Dataproc Serverless by using ephemeral clusters, meaning clusters that are deleted once their jobs complete? With this approach, I don't see a huge difference between the two variants. @jaia