I’m trying to write some PromQL alert policies. Specifically, I’m trying to detect when a service goes down on a Windows VM. If the VM itself is off, I want to ignore the absence of the metric. I’m using the process cpu_time metric to determine whether the service is running, and I’m thinking of using cpu_utilization to tell whether the machine is on.
I have a query that narrows the results down to machines that have both metrics, but I have no idea how to single out the machines where the service is off while the machine is still on. When the service is off, its count is not 0; the series is simply absent (null), and that seems to make this very difficult. Any suggestions? Here is the query I have so far:
count by (instance_id) (
  agent_googleapis_com:processes_cpu_time{
    monitored_resource="gce_instance",
    command_line=~"(?i).process.exe."
  }
) and count by (instance_id) (
  compute_googleapis_com:instance_cpu_utilization
)
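
One possible direction, given as an untested sketch and not a confirmed fix: PromQL's `unless` operator keeps the series on the left-hand side that have no series with the same labels on the right-hand side. Putting the machine metric on the left and the process metric on the right should therefore leave only instances that report CPU utilization (machine on) but have no matching process series (service absent). The sketch reuses the metric names and selectors from the query above unchanged, and it assumes both metrics carry the same `instance_id` label values.

```promql
# Instances that are on (they report CPU utilization) ...
count by (instance_id) (
  compute_googleapis_com:instance_cpu_utilization
)
# ... minus the instances where the process still reports CPU time.
unless
count by (instance_id) (
  agent_googleapis_com:processes_cpu_time{
    monitored_resource="gce_instance",
    command_line=~"(?i).process.exe."
  }
)
```

Because a stopped VM produces no series on the left-hand side, it drops out of the result on its own, so an alert on this expression would only fire for running machines.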