Hello all,
I am having an issue with a GKE pod and I’m trying to diagnose it.
The pod in question serves as the main entry point for our websites.
The issue is that the pod on one of our sites keeps randomly restarting. Sometimes it’ll restart once a day and sometimes several times in a single hour. There doesn’t seem to be any rhyme or reason to it.
Each time it does, it brings our website offline for 10 - 20 seconds while a new one starts up.
There’s nothing in the logs, and when I run kubectl get events, all it says is that the readiness and liveness probes failed. As such, I don’t know how to find out what is going on.
When I look at the CPU/memory of the pod, there is often little to no change at the time of a restart, and there are plenty of resources available.
Also, as stated above, this is only on one of our sites. Other sites with this same pod image do not have issues with the restarting.
How can I diagnose this further? What resources do I have to examine what is going on in a pod to cause it to restart without errors or warnings?
Yes, I do. I am getting logs from the workload for lots of other things, including web traffic information and other errors that occur (even though they don’t cause it to restart).
Also, when it restarts, I get all the logging of the start up process.
There are just no errors that anything went wrong.
The logs simply show traffic logging as normal, and then I can see where it restarted because of the startup process logs. Just no logs that would indicate why or what went wrong to cause it to restart.
You mentioned that you’re getting the error along the lines of “readiness and liveness probes failed”.
Readiness and liveness checks typically fail for one of the following reasons:
Connection Refused: This typically indicates that the container isn’t listening on the expected port. Resolving this issue requires ensuring that the application within the container is set up to accept connections on that port. Additionally, it’s possible that the probe configuration specifies an incorrect port.
Context Deadline Exceeded: This error generally occurs when the kubelet doesn’t receive a response within the timeoutSeconds specified. The troubleshooting steps for this error vary depending on the probe type. For example, with an exec type probe, it might indicate that the command executed is taking longer than anticipated to run.
HTTP Probe Failed with StatusCode: This error signifies that the server is technically responding to requests, but it’s returning an unexpected error code. This issue is specific to the application’s behavior, meaning an investigation into the HTTP status codes returned by the application is necessary.
Can you run the following query in your Logs Explorer so we can troubleshoot further:
log_id("events")
resource.type="k8s_pod"
resource.labels.cluster_name=*CHANGE TO YOUR CLUSTER NAME*
jsonPayload.message=~"Liveness probe failed"
Also, kindly post the result of kubectl describe pod *POD_NAME* here.
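While you’re at it, the container’s last termination state is often a quick pointer as well; a minimal sketch (POD_NAME is a placeholder):

# How many times the container has restarted, and how it last terminated
kubectl get pod POD_NAME -o jsonpath='{.status.containerStatuses[0].restartCount}{"\n"}'
kubectl get pod POD_NAME -o jsonpath='{.status.containerStatuses[0].lastState.terminated}{"\n"}'

If the reason comes back as OOMKilled, the container ran out of memory; otherwise the exit code can hint at whether the process crashed on its own or was stopped by the kubelet.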
"
But when I try to send that request to the pod IP on that port, I get an UP response immediately:
/ # curl http://10.33.19.5:9020/api/health
{"status":"UP","groups":["liveness","readiness"]}
/ #
I’ve also tried setting timeoutSeconds to 30 and failureThreshold to 10, in case it’s just not getting a response in time, but it’s still restarting.
On running that query, I’m getting this in the logs for the pods of two particular deployments:
message: "Liveness probe failed: Get "http://10.33.19.5:9020/actuator/health": dial tcp 10.33.19.5:9020: connect: connection refused"
I’ve confirmed that this endpoint is actually working by calling it from another pod, and I get the response:
/ # curl http://10.33.33.6:9020/actuator/health
{"status":"UP","groups":["liveness","readiness"]}
/ #
I have tried to allow for slow responses by setting timeoutSeconds to 30 and failureThreshold to 10, but it still ends up restarting.
What might be the issue here?
You’re facing a frustrating issue with random pod restarts, especially when there’s no clear indication in the logs or resource metrics. Let’s work through a systematic troubleshooting approach to pinpoint the root cause.
1. Deep Dive into Readiness and Liveness Probes
Examine Probe Definitions:
Carefully review the readinessProbe and livenessProbe definitions in your pod’s YAML.
Pay attention to:
httpGet, tcpSocket, or exec types.
initialDelaySeconds, periodSeconds, timeoutSeconds, successThreshold, and failureThreshold values.
The specific path or command being executed by the probes.
Simulate Probe Execution:
If you’re using httpGet or tcpSocket probes, try manually accessing the specified path or port from within the pod using kubectl exec.
If you’re using exec probes, manually run the command inside the pod.
This will help you determine if the probe itself is failing or if there’s an underlying issue; see the sketch at the end of this section.
Increase Probe Verbosity:
If possible, modify the probe to log more detailed information. For example, if it’s an HTTP probe, log the response code and body. If it’s an exec probe, log the command’s output and exit code.
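As a rough sketch of the first two points above (the deployment, pod and port names are placeholders, and it assumes curl is available in the image):

# Dump the probe definitions as they are actually deployed
kubectl get deployment my-frontend -o yaml | grep -B 2 -A 12 -E 'livenessProbe:|readinessProbe:'

# Exercise the same endpoint from inside the pod, printing status code and timing
kubectl exec -it my-frontend-pod -- curl -sS -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' http://localhost:9020/actuator/health

Comparing the port and path in the probe definition with what the application actually serves often surfaces a mismatch quickly.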
2. Network Connectivity Issues
DNS Resolution:
If your application relies on external services, ensure that DNS resolution is working correctly within the pod.
Use kubectl exec to run nslookup or dig commands (see the sketch at the end of this section).
Network Latency or Packet Loss:
Use kubectl exec to run ping or traceroute commands to check for network latency or packet loss.
If your application connects to a database or other networked service, ensure that the network connection is stable.
Firewall Rules:
Check if there are any firewall rules that might be blocking network traffic to or from the pod.
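A minimal sketch of these checks from inside the pod (the pod and host names are placeholders, and the commands assume those tools exist in the image):

kubectl exec -it my-frontend-pod -- nslookup my-database.internal
kubectl exec -it my-frontend-pod -- ping -c 3 my-database.internal

If DNS resolves and latency looks sane, the network path to that dependency is probably not the culprit.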
3. Application-Specific Issues
Resource Leaks:
Even if overall CPU and memory usage seems normal, there might be resource leaks within the application that eventually lead to probe failures.
Use application-specific monitoring tools or profiling tools to check for memory leaks, file descriptor leaks, or other resource exhaustion.
Concurrency Issues:
If your application handles concurrent requests, there might be race conditions or deadlocks that cause it to become unresponsive.
External Dependencies:
If your application relies on external services (databases, APIs, etc.), ensure that those services are stable and responsive.
If the external service has a period of high latency or is unavailable, this could cause your pod’s probes to fail.
Configuration Differences:
You mention that other sites using the same image do not have the same issue. This strongly suggests a configuration difference.
Check for differences in:
Environment variables.
ConfigMaps.
Secrets.
Mounted volumes.
Even slight differences in configuration can cause drastically different behavior.
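One way to make that comparison concrete is to dump both workloads and diff them; a sketch, assuming the working and failing sites are separate deployments (all names are placeholders):

kubectl get deployment site-working -o yaml > working.yaml
kubectl get deployment site-failing -o yaml > failing.yaml
diff working.yaml failing.yaml

# Also compare the effective environment inside each pod
kubectl exec working-pod -- env | sort > working-env.txt
kubectl exec failing-pod -- env | sort > failing-env.txt
diff working-env.txt failing-env.txt

Pay particular attention to probe settings, resource requests/limits, and anything injected from ConfigMaps or Secrets.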
4. Kubernetes-Specific Issues
Node Issues:
Although unlikely, there might be underlying issues with the node where the pod is running.
Try scheduling the pod onto a different node, for example with node affinity or by cordoning its current node (see the sketch after this section).
Kubernetes Network Policies:
Review any network policies that might be affecting the pod’s network traffic.
Storage Issues:
If the application is writing to a persistent volume, there could be underlying storage issues that are causing the application to hang.
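For the node question, a simple way to rule the node in or out is to check its conditions and force the pod onto a different node; a sketch (the node and pod names are placeholders):

kubectl get pod my-frontend-pod -o wide        # shows which node it is running on
kubectl describe node gke-node-abc123 | grep -A 10 Conditions
kubectl cordon gke-node-abc123                 # stop new pods landing on this node
kubectl delete pod my-frontend-pod             # the Deployment reschedules it elsewhere
kubectl uncordon gke-node-abc123               # once you are done testing

If the restarts stop on another node, dig into that node’s logs; if they continue, the node is probably not the cause.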
5. Enhanced Logging and Monitoring
Application-Level Logging:
Increase the verbosity of your application’s logging.
Add logging for critical sections of code, especially around the probe execution paths.
Kubernetes Events:
Use kubectl describe pod to get more detailed information about the pod’s events (see the sketch at the end of this section).
Container Runtime Logs:
Check the container runtime logs on the node where the pod is running.
Google Cloud Monitoring:
Utilize Google Cloud Monitoring to get a deeper understanding of the pod’s metrics.
Create custom metrics and alerts.
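For the events and node-level logs, a couple of commands are often enough on GKE (the pod name is a placeholder):

# All recent events for the pod, oldest first
kubectl get events --field-selector involvedObject.name=my-frontend-pod --sort-by=.lastTimestamp

# Events across all namespaces around the restart window
kubectl get events -A --sort-by=.lastTimestamp | tail -50

On GKE, the kubelet and container runtime logs from the node itself should also be visible in Cloud Logging under the k8s_node resource type.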
Troubleshooting Steps
Examine Probe Definitions: Start by thoroughly reviewing the readiness and liveness probe definitions.
Simulate Probe Execution: Manually test the probes to ensure they are working as expected.
Check Network Connectivity: Verify DNS resolution, network latency, and firewall rules.
Investigate Application-Specific Issues: Look for resource leaks, concurrency issues, and external dependency problems.
Compare Configurations: Compare the configuration of the working pods to the non-working pods.
By following these steps, you should be able to gather more information and pinpoint the cause of the random pod restarts.