ControlPlane Nodes Missing - Data Disk Issue

Hello,

We have a GDC VMware cluster currently running on version 1.31.

  • The cluster was originally created on version 1.29 and then upgraded sequentially to 1.30 and now 1.31.

  • After the upgrade to 1.31, we observed an issue with one admin cluster control plane node and one user cluster control plane node.

  • These control plane nodes show as missing because their VMs are attaching to the wrong data disks.

Steps we tried:

  1. We edited the control plane object using:

    kubectl edit controlplane --kubeconfig kubeconfig

  2. We manually corrected the dataDisk name (for example, changing the suffix from .disk-ooo1 back to .disk).

  3. After this change, the control plane node came back online and started running normally.

  4. However, after some time, the node re-attaches to the wrong data disk and the issue repeats.
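For reference, the rename pattern we keep reverting by hand can be sketched as a small helper. This is only an illustration of the observed drift (from `<vm-name>.disk` to a suffixed name like `<vm-name>.disk-ooo1`), not a supported tool:

```shell
# Illustration only: the dataDisk value drifts from "<vm-name>.disk"
# to a suffixed name such as "<vm-name>.disk-ooo1".
# This helper strips everything after the ".disk" extension to
# recover the expected name.
fix_disk_name() {
  local bad="$1"
  printf '%s\n' "${bad%%.disk*}.disk"
}

fix_disk_name "vm-name-01.disk-ooo1"   # -> vm-name-01.disk
```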

Hi @shayan,

Kindly check the fixed vulnerabilities listed for your Google Distributed Cloud patch version.

As per release notes:

Google Distributed Cloud (software only) for VMware 1.31.800-gke.32 is now available for download. To upgrade, see Upgrade a cluster. Google Distributed Cloud 1.31.800-gke.32 runs on Kubernetes v1.31.10-gke.300.

If you are using a third-party storage vendor, check the GDC Ready storage partners document to make sure the storage vendor has already passed the qualification for this release.

After a release, it takes approximately 7 to 14 days for the version to become available for use with GKE On-Prem API clients: the Google Cloud console, the gcloud CLI, and Terraform.

For troubleshooting volumes that fail to attach:

If a virtual disk is attached to the wrong virtual machine, you can manually detach it by using the following steps:

  1. Drain the node. You can optionally include the --ignore-daemonsets and --delete-local-data flags (renamed --delete-emptydir-data in newer kubectl versions) in your kubectl drain command.
  2. Power off the VM.
  3. Edit the VM’s hardware config in vCenter to remove the volume.
  4. Power on the VM.
  5. Uncordon the node.
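Put together as commands, the flow above looks roughly like this. NODE_NAME is a placeholder, and steps 2 to 4 are manual vCenter actions; the commands are printed for review rather than executed:

```shell
# Sketch of the manual detach flow above; NODE_NAME is a placeholder.
NODE_NAME="example-cp-node"

PLAN="kubectl drain ${NODE_NAME} --ignore-daemonsets --delete-local-data
# steps 2-4: power off the VM, remove the stale volume in vCenter, power it on
kubectl uncordon ${NODE_NAME}"

# Print the plan so it can be reviewed before running anything.
echo "$PLAN"
```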

Hello,

In my case, the node is using the wrong data disk. For example, if the correct disk name is vm-name-01.disk, the node tries to use a wrongly suffixed name such as vm-name-01.disk-ooo1. When I edit the control plane object and change the data disk name back to vm-name-01.disk, the node comes up again.

However, after 1 or 2 days, the node starts using the wrong disk again.
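Until the root cause is found, a periodic drift check might catch the reversion early. This is only a sketch: the exact field holding the dataDisk name varies, so read the object first with `kubectl get controlplane -o yaml` and feed the value in yourself:

```shell
# Hypothetical drift check: compare the disk name the ControlPlane object
# reports against the expected one. The caller supplies both values.
check_disk() {
  local expected="$1" actual="$2"
  if [ "$expected" = "$actual" ]; then
    echo "ok: $actual"
  else
    echo "DRIFT: expected $expected, got $actual"
  fi
}

check_disk "vm-name-01.disk" "vm-name-01.disk-ooo1"
```

Run from cron (or a simple loop) against both the admin and user cluster kubeconfigs, this would at least alert you before the node goes missing again.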

This issue is occurring on one node of both the user and admin clusters.

I am considering running the gkectl repair admin cluster command.

Is there any other way I can resolve this issue?
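For what it's worth, the documented command for recreating the admin cluster control-plane node is `gkectl repair admin-master`. The file names below are placeholders for your actual kubeconfig and cluster config paths, and the command is only printed here for review:

```shell
# Documented admin control-plane repair command; paths are placeholders.
CMD="gkectl repair admin-master --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config ADMIN_CLUSTER_CONFIG"

# Print rather than execute, so the paths can be filled in first.
echo "$CMD"
```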

Note: We are using a vSphere datastore (every configuration follows the documentation).