Regular outages on Compute Engine VMs

Hi,

I have 4 VMs running different apps that host services supporting an open source project.

Two of these VMs often stop responding completely, and I have to reset them. Over the last 6 months I have observed that this tends to happen on weekends. Today is Saturday and I just reset both VMs.

The fact that it always seems to happen on Saturdays makes me wonder: does Google run any scheduled maintenance or backup activity on weekends that increases CPU or memory usage on VMs and, given the small size of my VMs, causes them to become unresponsive?

Has anyone else noticed anything similar?

Hi @nemesifier,

Welcome to Google Cloud Community!

Issues like this often surface during off-peak times, such as weekends, but it’s highly unlikely that Google’s scheduled maintenance is directly causing your VMs to become unresponsive and require a reset.

Here are the relevant documentation and related sources on VM unresponsiveness, scheduled maintenance, and best practices for handling such issues:

  1. Google Cloud Scheduled Maintenance Information
    Google Cloud Service Health Dashboard: This dashboard provides real-time and historical data on Google Cloud service availability, including any scheduled maintenance. Google typically notifies users in advance of planned maintenance via the dashboard, email, or the Google Cloud Console.

  2. Troubleshooting VM Issues in GCP
    This guide outlines steps to diagnose and resolve common VM issues, such as unresponsiveness, high CPU usage, or network connectivity problems. It includes checking logs, verifying resource utilization, and resetting VMs if necessary.

  3. Compute Engine Maintenance Events
    This page explains how Google handles maintenance for Compute Engine VMs, including live migration (which typically doesn’t cause downtime) and host maintenance events. Google ensures minimal disruption, and users can configure VMs to automatically restart or migrate during maintenance.

  4. Best Practices for VMs
    This guide recommends using managed instance groups, enabling auto-healing, and configuring monitoring to improve VM reliability. It also suggests distributing workloads across multiple zones to mitigate issues during off-peak hours.

  5. Monitoring and Alerting for Off-Peak Issues
    Google Cloud Monitoring allows users to track VM performance metrics and set alerts for anomalies, which can help identify issues during off-peak times like weekends.
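
Before setting up alerts, it can help to confirm that the outages really cluster on weekends. Here is a minimal sketch in plain Python that tallies reset times by weekday. The timestamps are hypothetical placeholders, not data from this thread, and the function name `outages_by_weekday` is my own; replace the list with the reset times you have recorded.

```python
from collections import Counter
from datetime import datetime

# Fixed English names so the output does not depend on the system locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def outages_by_weekday(timestamps):
    """Count outages per weekday from ISO 8601 timestamp strings."""
    days = [DAY_NAMES[datetime.fromisoformat(ts).weekday()] for ts in timestamps]
    return Counter(days)


# Hypothetical reset times (assumption); use your own records here.
resets = [
    "2025-03-01T09:15:00",
    "2025-03-08T22:40:00",
    "2025-03-16T03:05:00",
    "2025-03-19T14:00:00",
]

counts = outages_by_weekday(resets)
weekend = counts["Saturday"] + counts["Sunday"]
print(counts.most_common())
print(f"weekend share: {weekend}/{len(resets)}")
```

If most resets land on Saturday or Sunday, the next step is to compare those times against the CPU and memory graphs for the same windows.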

For more information, you may check About Host Events, which describes maintenance events and how VMs respond based on their configuration.
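
To test whether host events line up with your outages, you could export the host events for each VM and compare timestamps. The sketch below is only an illustration under assumptions: the event records, their `time` and `type` keys, and the helper `events_near` are hypothetical, and the operation type names are the ones I understand Compute Engine uses for host events, so please verify them against the documentation before relying on them.

```python
from datetime import datetime, timedelta

# Operation types assumed to mark host events; verify against the docs.
HOST_EVENT_TYPES = {
    "compute.instances.hostError",
    "compute.instances.migrateOnHostMaintenance",
    "compute.instances.terminateOnHostMaintenance",
    "compute.instances.automaticRestart",
}


def events_near(outage, events, window=timedelta(hours=1)):
    """Return host events whose time falls within `window` of the outage."""
    outage_time = datetime.fromisoformat(outage)
    return [
        event for event in events
        if event["type"] in HOST_EVENT_TYPES
        and abs(datetime.fromisoformat(event["time"]) - outage_time) <= window
    ]


# Hypothetical exported events (assumption); not real log data.
events = [
    {"time": "2025-03-01T09:05:00",
     "type": "compute.instances.migrateOnHostMaintenance"},
    {"time": "2025-03-01T09:10:00", "type": "some.unrelated.operation"},
    {"time": "2025-03-05T12:00:00", "type": "compute.instances.hostError"},
]

# Hypothetical outage times (assumption); use your own reset records.
outages = ["2025-03-01T09:15:00", "2025-03-08T22:40:00"]

for outage in outages:
    matches = events_near(outage, events)
    print(outage, [e["type"] for e in matches] or "no host event nearby")
```

Outages with no host event nearby point toward something inside the VM, such as memory exhaustion on a small instance, rather than platform maintenance.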

If the issue persists and you need further assistance, please feel free to reach out to our Google Cloud Support team.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.