Dataproc variants - Pros / Cons and Use Cases

There are three Dataproc options:

  1. Dataproc on GCE
  2. Dataproc with GKE
  3. Dataproc Serverless

What are the pros and cons of each? I fully understand containers, serverless, and IaaS, but I'm looking at this more from an operations and cost perspective.

Hello,

Thank you for contacting Google Cloud Community!

Dataproc on GCE (Google Compute Engine)

Pros:

  • Full control: Offers the highest level of control over your cluster configuration.
  • Flexibility: Can be tailored to specific workloads and performance requirements.
  • Cost-effective for long-running clusters: Ideal for workloads that require consistent compute resources.

Cons:

  • Requires management: Requires more operational overhead to manage and scale clusters.
  • Cost of idle capacity: You pay for the provisioned VMs for as long as the cluster is running, even when no jobs are using them.

Dataproc with GKE (Google Kubernetes Engine)

Pros:

  • Managed Kubernetes: Leverages the managed Kubernetes platform for cluster management.
  • Container orchestration: Provides advanced container orchestration capabilities.
  • Hybrid workloads: Can run both batch and streaming workloads.

Cons:

  • Steeper learning curve: Requires familiarity with Kubernetes concepts and best practices.
  • Additional costs: May incur additional costs for GKE resources and services.

Dataproc Serverless

Pros:

  • Fully managed: No cluster management required.
  • Pay-as-you-go: Only pay for the resources used, making it cost-effective for intermittent workloads.
  • Scalability: Automatically scales to meet workload demands.

Cons:

  • Limited control: Offers less control over cluster configuration compared to Dataproc on GCE.
  • Potential for cold starts: May experience delays when starting new jobs after periods of inactivity.

Operational Considerations and Cost

  • Operational overhead: Dataproc on GCE requires the most operational overhead, while Dataproc Serverless requires the least.
  • Cost: Dataproc Serverless is generally the most cost-effective option for intermittent workloads, while Dataproc on GCE can be more cost-effective for long-running clusters.
  • Workload requirements: Consider the specific requirements of your workloads, such as batch processing, streaming, or machine learning, to determine the most suitable option.
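
To make the cost point a bit more concrete, here is a minimal Python sketch of the trade-off described above. The hourly rates are made-up placeholders, not Google Cloud prices, and the function names are only illustrative. It also assumes each option can be summarized by a single hourly rate, which is a simplification, so please plug in figures from the current pricing pages before drawing conclusions.

```python
HOURS_PER_MONTH = 730  # common approximation for the hours in one month


def always_on_cost(cluster_rate_per_hour: float) -> float:
    """Monthly cost of a cluster that runs all month, busy or idle."""
    return cluster_rate_per_hour * HOURS_PER_MONTH


def per_job_cost(rate_per_hour: float, job_hours: float, runs: int,
                 overhead_hours_per_run: float = 0.0) -> float:
    """Monthly cost when you are billed only while each job runs.

    overhead_hours_per_run models billed time that is not useful work,
    for example cluster startup and teardown (an assumption of this sketch).
    """
    return rate_per_hour * (job_hours + overhead_hours_per_run) * runs


def break_even_hours(cluster_rate_per_hour: float,
                     per_use_rate_per_hour: float) -> float:
    """Billed hours per month above which the always-on cluster is cheaper."""
    return always_on_cost(cluster_rate_per_hour) / per_use_rate_per_hour


if __name__ == "__main__":
    CLUSTER_RATE = 2.0   # hypothetical $/hour for a long-running cluster
    PER_USE_RATE = 3.0   # hypothetical $/hour for pay-per-use capacity

    # One 1-hour job per day, 30 runs per month.
    print("always-on:", always_on_cost(CLUSTER_RATE))
    print("pay-per-use:", per_job_cost(PER_USE_RATE, 1.0, 30))
    print("break-even hours:", round(break_even_hours(CLUSTER_RATE, PER_USE_RATE), 1))
```

With these placeholder numbers, a daily one-hour job is far cheaper on pay-per-use billing, and the long-running cluster only wins once it is busy for most of the month. The same per_job_cost function can model a cluster that is created and deleted per job by passing the cluster rate and a non-zero overhead_hours_per_run.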

Regards,

Jai Ade

Thanks so much @jaia

One question, though I accepted your solution: cost-wise, can I achieve the same benefits as Dataproc Serverless with Dataproc on GCE by using ephemeral clusters, meaning clusters that are deleted once their jobs are completed? With this approach, I don't see a huge difference between the two variants, @jaia.