Configuring a custom node pool service account for Dataproc

In my gcloud environment, the default service accounts have been deleted for security-posture reasons. This poses an issue when I attempt to create a Dataproc virtual cluster on GKE: I can find how to specify the service account that the pods adopt when running Dataproc jobs, but I can't figure out where to specify the service account under which the node pool itself will run.

In GKE itself this is possible, of course (both in the console and in Terraform), but Dataproc doesn't seem to allow use of a node pool it didn't specifically create.

Does anybody know how I might configure this when provisioning my Dataproc on GKE cluster?

Can any Google team member answer this question, please? Thanks so much.

Hi @matt-deboer ,

Welcome to Google Cloud Community!

The key is that you need to specify the node pool service account during the Dataproc cluster creation process. Dataproc provides configuration options to define the node pools it manages, and that includes the service account.

Make sure the service account exists (e.g., a custom node service account such as my-node-sa@my-project.iam.gserviceaccount.com), has the appropriate IAM roles, and that the Dataproc Service Agent has permission to use it.
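If the account doesn't exist yet, here is a minimal sketch for creating it in Terraform (the project ID and account ID below are placeholders, not values from your environment):

```terraform
# Hypothetical example: create the custom node pool service account.
# "my-project" and "my-node-sa" are placeholder values.
resource "google_service_account" "node_sa" {
  project      = "my-project"
  account_id   = "my-node-sa"
  display_name = "Custom Dataproc-on-GKE node pool service account"
}
```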

Here are some workarounds you can try:

  1. Using [gcloud dataproc clusters gke create](https://cloud.google.com/sdk/gcloud/reference/dataproc/clusters/gke/create) (CLI)
  • The --node-pool flag is the primary way to configure node pools when creating a Dataproc-on-GKE cluster using the gcloud command-line tool.

  • Within the --node-pool flag, you can use the config.serviceAccount property to specify the service account for the nodes in that pool. If you already have a Google Kubernetes Engine cluster, you can also update its node pools later through the gcloud CLI.

    gcloud dataproc clusters gke create CLUSTER_NAME \
            --project=PROJECT_ID \
            --region=REGION \
            --gke-cluster=GKE_CLUSTER_NAME \
            --node-pool="name=NODE_POOL_NAME,roles=default,locations=ZONE,config.machineType=MACHINE_TYPE,config.serviceAccount=CUSTOM_NODE_SA"
    

Replace:

  • CLUSTER_NAME: The name for your Dataproc cluster.
  • PROJECT_ID: Your GCP project ID.
  • REGION: The region to deploy the cluster.
  • GKE_CLUSTER_NAME: The name or full resource name of your existing GKE cluster.
  • NODE_POOL_NAME: A name for the Dataproc-managed node pool
  • ZONE: The zone where you want to create the nodes (must be a valid zone for the GKE cluster).
  • MACHINE_TYPE: The machine type to use for the nodes.
  • CUSTOM_NODE_SA: The full email address of the custom service account you want to use for the node pool (e.g., my-node-sa@my-project.iam.gserviceaccount.com).
  2. Using Terraform
  • Terraform provides a [google_dataproc_cluster](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataproc_cluster) resource that you can use to define your Dataproc on GKE cluster. Within the virtual_cluster_config, you have kubernetes_cluster_config and gke_cluster_config.
  • The node pool configuration goes in node_pool_config, specifically under node_pools.
  • The service account is specified under config.service_account.

Example:

    resource "google_dataproc_cluster" "default" {
      project = "PROJECT_ID"
      region  = "REGION"
      name    = "CLUSTER_NAME"

      virtual_cluster_config {
        staging_bucket = "BUCKET_NAME"
        kubernetes_cluster_config {
          gke_cluster_config {
            gke_cluster_target = "GKE_CLUSTER_NAME"
            node_pool_config {
              node_pools {
                name = "NODE_POOL_NAME"
                roles = ["DEFAULT"]
                locations = ["ZONE"]
                config {
                  machine_type   = "MACHINE_TYPE"
                  service_account = "CUSTOM_NODE_SA"
                }
              }
            }
          }
        }
      }
    }
  3. Required Permissions for the Service Account
    The custom node service account (CUSTOM_NODE_SA) needs the IAM roles required to function as a GKE node. It typically needs:
  • [roles/logging.logWriter](https://cloud.google.com/logging/docs/access-control#considerations): To write logs to Cloud Logging.
  • [roles/monitoring.metricWriter](https://cloud.google.com/monitoring/access-control): To write metrics to Cloud Monitoring.
  • [roles/artifactregistry.reader](https://cloud.google.com/artifact-registry/docs/access-control#roles) or [roles/storage.objectViewer](https://cloud.google.com/storage/docs/access-control/iam-roles#standard-roles): To pull container images from Artifact Registry (or Container Registry, if you're using it). See the IAM documentation on granting a single role.
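As an illustrative sketch, these roles could be granted in Terraform with google_project_iam_member (the project ID and service account email are placeholders; adjust the role list to your environment):

```terraform
# Hypothetical example: grant typical node roles to the custom node
# pool service account. Replace the placeholder project and service
# account email with your own values.
locals {
  node_sa_email = "my-node-sa@my-project.iam.gserviceaccount.com"
  node_roles = [
    "roles/logging.logWriter",
    "roles/monitoring.metricWriter",
    "roles/artifactregistry.reader",
  ]
}

resource "google_project_iam_member" "node_sa_roles" {
  for_each = toset(local.node_roles)
  project  = "my-project"
  role     = each.value
  member   = "serviceAccount:${local.node_sa_email}"
}
```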

If you have any questions and need further assistance with specific configurations, please reach out to our Google Cloud Support team.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

Thank you so much for your reply. I have tried this configuration, but in my case a policy requires the default Compute Engine service account to be disabled. Could you please disable the default Compute Engine service account (for example, by deleting it; you can restore it for up to 30 days after deletion) and then try the terraform apply?

Thanks @reinc; I'm attempting to apply your suggestions, but I'm finding that the Terraform example you provided doesn't work with the most recent Terraform Google provider (https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataproc_cluster). If you look at the example they've provided, you'll find that service_account is not a valid property on virtual_cluster_config.kubernetes_cluster_config.gke_cluster_config.node_pool_target.node_pool_config.config, which is why I asked the original question.
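For context, here is a minimal sketch of the block shape the current provider docs show for the node pool (field names follow the registry example; all values are placeholders, and I've trimmed fields not relevant here):

```terraform
# Sketch of the node pool structure from the google provider docs for
# google_dataproc_cluster (Dataproc on GKE). Placeholder values only.
resource "google_dataproc_cluster" "example" {
  name   = "CLUSTER_NAME"
  region = "REGION"

  virtual_cluster_config {
    kubernetes_cluster_config {
      gke_cluster_config {
        gke_cluster_target = "GKE_CLUSTER_NAME"
        node_pool_target {
          node_pool = "NODE_POOL_NAME"
          roles     = ["DEFAULT"]
          node_pool_config {
            config {
              machine_type = "MACHINE_TYPE"
              # No service_account argument is accepted here,
              # which is the gap described above.
            }
            locations = ["ZONE"]
          }
        }
      }
    }
  }
}
```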

Also, it looks like the documentation you referenced for `gcloud dataproc clusters gke create` shows an example argument `--node-pool` which is not supported by that command:

    ERROR: (gcloud.dataproc.clusters.gke.create) unrecognized arguments: --node-pool=... (did you mean '--pools'?)

The closest available argument (as referenced [here](https://cloud.google.com/sdk/gcloud/reference/dataproc/clusters/gke/create)) is --pools, but once again that flag does not support the config.serviceAccount property.

Based on this, I'm wondering whether you're somehow using different versions of these commands that support your suggested solution? Or is there something else you can suggest I try?