When I delete an Autopilot cluster and recreate a new Autopilot cluster with the same name, pods seem unable to resolve DNS within the cluster. The error I get whenever services try to reach each other is always a variation of “Name or service not known”. Only if I wait a while (until the next morning) before redeploying the cluster with the same name do pods work properly again.
I’ve noticed this issue since GKE Autopilot switched to Cloud DNS. I’m wondering whether there’s some reconciliation that isn’t done on the old cluster, causing the new cluster to not function properly.
A solution is to give the new cluster another name or a random suffix, but honestly this has downstream impacts, as some of our work depends heavily on the cluster name.
I’m curious whether anyone else is experiencing this issue and whether there’s a better workaround I could explore.
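A quick way to confirm that it is DNS resolution failing, and not the Services themselves, is to run a lookup from a throwaway pod; a minimal sketch (the service and namespace names are placeholders):
# Resolve a Service FQDN from inside the cluster:
kubectl run dns-test --image=busybox:1.36 --restart=Never --rm -it -- nslookup my-service.my-namespace.svc.cluster.local
If this lookup fails while the Service exists, the problem is DNS rather than the workload.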
We see a lot of “Name or service not known” errors in a large cluster (Standard, not Autopilot) on 1.26 and 1.27, with Cloud DNS in use.
Edit: In our case a headless Service has issues being exposed in Cloud DNS, so it seems additional work on the clouddns-controller might be required. There are no records for the headless Service in Cloud DNS.
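One way to check is to look directly at the managed zone related to the cluster; a sketch, where ZONE_NAME and the Service/namespace names are placeholders:
# Confirm the Service exists in the cluster, then look for a matching record in Cloud DNS:
kubectl get svc my-headless-svc -n my-namespace
gcloud dns record-sets list --zone=ZONE_NAME --filter="name~my-headless-svc"
In our case the second command returns nothing for the headless Service, even though records for regular Services show up.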
There are a few possible explanations for why you are experiencing this issue:
- DNS caching: When a pod resolves a DNS name, the result may be cached for a period of time. If you delete and recreate a cluster with the same name, pods in the new cluster may still be served cached records that belonged to the old cluster, leaving them unable to resolve names in the new cluster (see the resolver check sketched after this list).
- Stale DNS records: The DNS records for the old cluster may not be deleted promptly, so pods in the new cluster end up resolving the old cluster’s records instead of the new cluster’s.
- Issues with Cloud DNS: There may also be a problem in Cloud DNS itself that prevents pods in the new cluster from resolving DNS names.
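To narrow down which of these applies, it helps to see which resolver the pods are actually pointed at; a minimal sketch, using an arbitrary busybox image:
# Print the resolver configuration a pod actually receives:
kubectl run resolv-test --image=busybox:1.36 --restart=Never --rm -it -- cat /etc/resolv.conf
The nameserver shown should indicate whether lookups go to kube-dns or to the resolver path used by Cloud DNS, which in turn tells you where any caching would happen.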
Workaround
The workaround that you mentioned, which is to create the new cluster with a different name, is a good one. However, I understand that this may not be ideal for your situation.
Another possible workaround is to manually flush the DNS cache. Where systemd-resolved is in use (typically on a node rather than inside a pod), this can be done using the following command:
sudo resolvectl flush-caches
You can also try restarting the pods in the new cluster. This may force the pods to resolve DNS names again.
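For example, a minimal sketch with placeholder names:
# Recreate the pods of a workload so they resolve DNS afresh:
kubectl rollout restart deployment my-deployment -n my-namespace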
How to troubleshoot the issue
If you are still experiencing the issue after trying the workarounds above, you can try troubleshooting the issue by following these steps:
- Check the DNS records for the old and new clusters. Make sure the records for the old cluster have been deleted and that the records for the new cluster are correct (see the zone listing sketched after these steps).
- Check the DNS cache on the pods in the new cluster. If the DNS cache is not empty, flush the DNS cache on the pods.
- Restart the pods in the new cluster.
- If you are still experiencing the issue, contact Google Cloud support for assistance.
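For the first step, a sketch of how to look at what Cloud DNS currently holds; the zone names are project-specific placeholders:
# List the private managed zones in the project and inspect the one tied to the cluster:
gcloud dns managed-zones list --filter="visibility=private"
gcloud dns record-sets list --zone=ZONE_NAME
If a zone or records tied to the deleted cluster are still present, that would point at the stale-records explanation above.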
Additional information
I am not aware of any other users who have reported this issue, though it is possible that others are experiencing it and simply have not reported it yet.
I am also not aware of a better workaround, but I will continue to investigate and see if I can find one.
I hope this information is helpful. Please let me know if you have any other questions.
Thank you for the elaborate answer, but nothing is cached if the Cloud DNS record in the managed zone related to the cluster is not created in the first place. Maybe there are some issues within the cluster, or in the (to us) invisible clouddns-controller?
You’re right, if the Cloud DNS record in the managed zone related to the cluster is not created in the first place, then there is nothing to be cached. In this case, it is possible that there is an issue with the cluster or with the clouddns-controller.
To troubleshoot the issue, you can try the following steps:
- Check the logs for the clouddns-controller. You can view the logs by running the following command:
kubectl logs -n kube-system -l app=clouddns-controller
Look for any errors in the logs.
- Check the status of the clouddns-controller. You can view the status of the clouddns-controller by running the following command:
kubectl get pods -n kube-system -l app=clouddns-controller
Make sure that the clouddns-controller is running and healthy.
- If you are still experiencing the issue, contact Google Cloud support for assistance.
Hi, are you sure we should be able to see the controller? Isn’t it in the control plane, to which we do not have access?
$ kubectl logs -n kube-system -l app=clouddns-controller
No resources found in kube-system namespace.
How are you trying to access the pods and/or services? e.g. what FQDNs are you using?
You are correct, the clouddns-controller is a control plane component and we do not have direct access to it. However, we can still get information about the clouddns-controller by running the following command:
kubectl get pods -n kube-system -l app=clouddns-controller
This command will show us the status of the clouddns-controller pod. If the pod is running and healthy, then the clouddns-controller is likely working properly. However, if the pod is not running or is not healthy, then there may be a problem with the clouddns-controller.
If you are still seeing the error message “No resources found in kube-system namespace” after running the command above, then it is possible that the clouddns-controller pod has not been created yet. This can happen if the cluster is still initializing. In this case, you can wait a few minutes and try running the command again.
If you are still seeing the error message after waiting a few minutes, then there may be a problem with the cluster. In this case, you should contact Google Cloud support for assistance.
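Since the controller runs on the Google-managed control plane, about the only thing visible from the customer side is the cluster’s DNS configuration. A sketch of how to check it (CLUSTER_NAME and LOCATION are placeholders, and the exact field path may differ between gcloud versions):
# Confirm the cluster is actually configured for Cloud DNS, and at which scope:
gcloud container clusters describe CLUSTER_NAME --location=LOCATION --format="yaml(networkConfig.dnsConfig)"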
Maybe this will help. I am also trying to understand it well and am still learning it myself.
Let me know what happens next.
Hi @pgstorm148, Cloud DNS is working, but not for headless Services. There are no Pods we can see.
<svc>.<ns>.svc.cluster.local
But for us there is not even a record in Cloud DNS for the headless Service.
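For context, with a working headless Service one would expect lookups like these to return the pod IPs (names are placeholders; the per-pod form applies to StatefulSet pods):
# The headless Service name should resolve to the IPs of the backing pods:
nslookup my-headless-svc.my-namespace.svc.cluster.local
# StatefulSet pods additionally get per-pod records:
nslookup my-pod-0.my-headless-svc.my-namespace.svc.cluster.local
In our case there is no corresponding record in Cloud DNS for either form.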
If Cloud DNS is working but you are unable to see pods for headless services, it is possible that there is a problem with the way that the headless services are configured.
To troubleshoot the issue, you can try the following steps:
- Check the configuration of the headless Services. Make sure each headless Service is configured correctly, i.e. clusterIP: None with a selector that actually matches the pods (see the endpoint check sketched after these steps).
- Check the DNS records for the headless Services. Make sure the records for the headless Services are correct.
- Flush the DNS cache. Where systemd-resolved is in use (typically on a node rather than inside a pod), you can do this using the following command:
sudo resolvectl flush-caches
- Restart the pods in the cluster.
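For the first step, since the DNS records of a headless Service are derived from its endpoints, it is worth confirming that endpoints actually exist; a sketch with placeholder names:
# Endpoints/EndpointSlices must exist for the headless Service to get DNS records:
kubectl get endpoints my-headless-svc -n my-namespace
kubectl get endpointslices -n my-namespace -l kubernetes.io/service-name=my-headless-svc
If the endpoints are there but the Cloud DNS records are not, the problem is more likely on the clouddns-controller side than in the Service configuration.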
If you are still experiencing the issue after trying the steps above, you can contact Google Cloud support for assistance.
We experience the same issue with headless Services and Cloud DNS.
We use a Standard cluster (not Autopilot) for CI and create/destroy namespaces on demand.
It is sporadic for us, in the sense that it works for weeks and then suddenly breaks for a day or two.
The specifics are reported in a GitHub issue for Strimzi, as we are looking for ways to avoid depending on headless services: https://github.com/strimzi/strimzi-kafka-operator/issues/9551