Issue with Autoscaler Stabilization Period Causing Unnecessary Instance Creation

FNogueira · August 9, 2024, 10:51am

Hi,

I’m encountering an issue with the autoscaler configuration for my Instance Group on Google Cloud, and I’m hoping for some guidance.

Here’s the scenario: My Instance Group processes tasks that vary significantly in duration, ranging from 30 seconds to several hours. I have a custom metric called tasks_in_queue that tracks the number of tasks either waiting to be processed or currently being executed. The value of this metric decreases as tasks are completed.

My autoscaler is configured for scale-out only, with single_instance_assignment set to 1. This means that if tasks_in_queue has a value of 20, the autoscaler should launch 19 new instances. The scale-out process works flawlessly. However, the issue arises during scale-in.
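The arithmetic above can be sketched as follows. This is only an illustration of how a per-group metric with a single-instance assignment maps to a target size; the function name and the `min_size`/`max_size` values are assumptions, not taken from the actual configuration, and the "19 new instances" figure assumes one instance is already running.

```python
import math

def recommended_size(tasks_in_queue, single_instance_assignment=1.0,
                     min_size=1, max_size=100):
    """Target group size for a per-group metric (simplified model).

    min_size and max_size are hypothetical limits chosen for illustration.
    """
    # Each instance is expected to handle `single_instance_assignment` tasks.
    wanted = math.ceil(tasks_in_queue / single_instance_assignment)
    # The target is clamped to the group's configured limits.
    return max(min_size, min(max_size, wanted))

current_size = 1  # assumption: the group is sitting at its minimum size
target = recommended_size(20)
print(target, target - current_size)  # target size, instances to add
```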

Scale-in isn’t managed by the autoscaler but instead by the instances themselves. Each instance is programmed to monitor its idle time. If an instance remains idle for more than 120 seconds, it sends a request to another service to terminate itself via a Google Cloud API call.
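A minimal sketch of the idle check described above, assuming a periodic loop on each instance. The `request_termination` callback is hypothetical and stands in for the request to the other service; the real call is not shown in this post.

```python
import time

IDLE_LIMIT_SECONDS = 120  # threshold mentioned in the post

class IdleWatchdog:
    """Tracks idle time and requests termination at most once."""

    def __init__(self, request_termination, idle_limit=IDLE_LIMIT_SECONDS,
                 clock=time.monotonic):
        self._request_termination = request_termination  # hypothetical callback
        self._idle_limit = idle_limit
        self._clock = clock
        self._idle_since = clock()  # a fresh instance starts out idle
        self._requested = False

    def task_started(self):
        # While a task is running the instance is not idle.
        self._idle_since = None

    def task_finished(self):
        # Restart the idle timer when the task completes.
        self._idle_since = self._clock()

    def check(self):
        """Call periodically; returns True once termination was requested."""
        if self._requested or self._idle_since is None:
            return self._requested
        if self._clock() - self._idle_since > self._idle_limit:
            self._request_termination()
            self._requested = True
        return self._requested
```

Injecting the clock keeps the logic testable without waiting two minutes in real time.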

While this approach generally works, I’ve noticed that the autoscaler seems to react unexpectedly when instances terminate themselves. Even when the tasks_in_queue metric drops to zero, the autoscaler sometimes creates new instances instead of allowing the group size to decrease naturally. As a result, the Instance Group size oscillates in a triangular wave pattern, with instances being created and terminated unnecessarily, before eventually returning to the minimum size. The figure below shows this behavior: the blue line is my metric and the cyan line is the size of the Instance Group.

This behavior seems to be related to the autoscaler’s stabilization period, though I haven’t found any documentation on how to reduce or eliminate this period.

Is this an issue with my configuration, or is there a way to adjust the stabilization settings to prevent this unwanted behavior?

Thanks for your help!

francislouie · August 14, 2024, 10:32pm

Hi @FNogueira ,

Welcome to Google Cloud Community!

The stabilization period is 10 minutes by default; however, if your application takes longer than 10 minutes to initialize on a new VM, the autoscaler uses the initialization period instead of the default 10 minutes. The stabilization period is a built-in feature of autoscaling and is not configurable.
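To illustrate the effect, here is a simplified model of a stabilization window, assuming the recommended size follows the peak load seen over the last 10 minutes, which is how the autoscaler documentation describes scale-in decisions. The class and method names are hypothetical, and this is a sketch, not the autoscaler’s actual implementation.

```python
from collections import deque

class StabilizedRecommender:
    """Recommends the peak size observed within a sliding time window."""

    def __init__(self, window_seconds=600):  # 10-minute default window
        self.window_seconds = window_seconds
        self._samples = deque()  # (timestamp, instantaneous size) pairs

    def observe(self, timestamp, instantaneous_size):
        self._samples.append((timestamp, instantaneous_size))
        # Drop samples that are older than the window.
        cutoff = timestamp - self.window_seconds
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()
        # The recommendation is the peak still inside the window.
        return max(size for _, size in self._samples)

r = StabilizedRecommender()
print(r.observe(0, 20))    # 20
print(r.observe(300, 0))   # 20, the earlier peak is still in the window
print(r.observe(601, 0))   # 0, the peak has aged out
```

Under this model, a metric that has already dropped to zero can still produce a recommendation of 20 for several minutes, which would be consistent with instances being recreated after they terminate themselves.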

By default, the Initialization period is 60 seconds. Actual initialization times vary because of numerous factors. It is recommended that you test how long your application takes to initialize. To do this, create an instance and time the startup process from when the instance becomes RUNNING until the application is ready.
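One way to time this is a small polling helper like the sketch below. The `is_ready` probe is a hypothetical callable that you would supply (for example, a health check against your application), and the measurement assumes the helper is started at roughly the moment the instance becomes RUNNING.

```python
import time

def measure_initialization(is_ready, poll_interval=1.0, timeout=900.0,
                           clock=time.monotonic, sleep=time.sleep):
    """Returns the seconds elapsed until is_ready() first returns True."""
    start = clock()
    while True:
        if is_ready():
            return clock() - start
        if clock() - start >= timeout:
            raise TimeoutError("application did not become ready in time")
        sleep(poll_interval)
```

Running this on a few fresh instances and taking the slowest result gives a conservative value to compare against the 60-second default.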

Based on the graph that you provided, it seems like the instance termination is triggered before the autoscaler’s stabilization period has ended. There’s a possibility that when the stabilization period ends, there’s still a workload or task being executed.

Since the number of tasks is decreasing, it is possible that some VM instances complete their tasks, become idle, and eventually terminate themselves.

Here is what I can recommend:

- Adjust the initialization period
- Adjust the stabilization settings
  - Increase the minimum and maximum number of instances
- Adjust the termination logic
- Review the logs related to instances and autoscaling in Logs Explorer to see how instances are being added or removed.

If the issue still persists and you need further assistance, you can file a ticket with our support team.

I hope the above information is helpful.

talkenig · September 8, 2024, 3:07pm

I have a very similar configuration and the same issue. The fact that you don’t allow changing the stabilization period is a bug. None of the “recommendations” you made will do anything to resolve this problem, and some are not even actionable. What does “Adjust the stabilization settings” mean? There’s no such thing in the instance group settings, so I’m not sure what you mean.
