Hi,
I’m encountering an issue with the autoscaler configuration for my Instance Group on Google Cloud, and I’m hoping for some guidance.
Here’s the scenario: My Instance Group processes tasks that vary significantly in duration, ranging from 30 seconds to several hours. I have a custom metric called tasks_in_queue that tracks the number of tasks either waiting to be processed or currently being executed. The value of this metric decreases as tasks are completed.
My autoscaler is configured for scale-out only, with single_instance_assignment set to 1. This means that if tasks_in_queue has a value of 20, the autoscaler should bring the group to 20 instances, launching 19 new ones if one is already running. The scale-out process works flawlessly. The problem appears during scale-in.
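To make the sizing I expect concrete, here is a minimal Python sketch of the arithmetic as I understand it. The function name `expected_group_size` and the `min_size` parameter are hypothetical names for illustration, and I’m assuming one instance is already running; this is not the autoscaler’s actual implementation.

```python
import math


def expected_group_size(tasks_in_queue: int,
                        single_instance_assignment: float = 1.0,
                        min_size: int = 1) -> int:
    """Group size I expect the autoscaler to aim for (my assumption)."""
    # One instance per `single_instance_assignment` units of the metric.
    needed = math.ceil(tasks_in_queue / single_instance_assignment)
    # Never go below the group's minimum size.
    return max(min_size, needed)


current_size = 1  # assumed: one instance is already running
target = expected_group_size(20)
print(target, target - current_size)  # 20 19
```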
Scale-in isn’t managed by the autoscaler but by the instances themselves. Each instance monitors its own idle time. If it stays idle for more than 120 seconds, it asks a separate service to terminate it, and that service does so through a Google Cloud API call.
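For reference, the idle check on each instance works roughly like the sketch below. The names `IdleWatchdog` and `request_termination` are hypothetical stand-ins, and the real request to the terminating service is left out; only the 120-second threshold comes from my setup.

```python
import time
from typing import Callable

IDLE_LIMIT_SECONDS = 120


class IdleWatchdog:
    """Requests termination once the instance has been idle for too long."""

    def __init__(self,
                 request_termination: Callable[[], None],
                 idle_limit: float = IDLE_LIMIT_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._request_termination = request_termination  # hypothetical callback
        self._idle_limit = idle_limit
        self._clock = clock
        self._idle_since = clock()
        self._busy = False
        self._requested = False

    def task_started(self) -> None:
        self._busy = True

    def task_finished(self) -> None:
        # The idle timer restarts every time a task completes.
        self._busy = False
        self._idle_since = self._clock()

    def check(self) -> bool:
        """Call periodically; returns True once termination was requested."""
        if self._requested:
            return True
        idle_for = self._clock() - self._idle_since
        if not self._busy and idle_for > self._idle_limit:
            self._request_termination()
            self._requested = True
        return self._requested
```

The worker loop would call `task_started()` and `task_finished()` around each task and `check()` on a timer; the termination request is sent at most once.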
While this approach generally works, I’ve noticed that the autoscaler reacts unexpectedly when instances terminate themselves. Even after the tasks_in_queue metric drops to zero, the autoscaler sometimes creates new instances instead of letting the group size decrease naturally. As a result, the Instance Group size oscillates in a triangular wave pattern, with instances being created and terminated unnecessarily, before eventually returning to the minimum size. The figure below shows this behavior: the blue line is my metric and the cyan line is the size of the instance group.
This behavior seems to be related to the autoscaler’s stabilization period, though I haven’t found any documentation on how to reduce or eliminate this period.
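My working guess at the mechanism, as a toy model: I’m assuming the autoscaler recommends the peak size seen over a trailing window and, in scale-out-only mode, recreates any instances missing below that recommendation. The function `simulate`, the 10-sample window, and the one-step self-termination are all simplifying assumptions of mine, not documented autoscaler behavior.

```python
from collections import deque


def simulate(metric_samples, window=10, min_size=1):
    """Toy model: returns (tasks, size_after_scale_out, size_after_self_termination)."""
    recent = deque(maxlen=window)  # sizes implied by recent metric samples
    size = min_size
    history = []
    for tasks in metric_samples:
        recent.append(max(min_size, tasks))
        recommended = max(recent)  # assumed: peak over the trailing window
        # Scale-out only: the autoscaler may grow the group, never shrink it.
        size = max(size, recommended)
        after_scale_out = size
        # Idle instances then terminate themselves (simplified to one step).
        size = max(min_size, min(size, tasks))
        history.append((tasks, after_scale_out, size))
    return history


for row in simulate([5] + [0] * 12):
    print(row)
```

In this model the group keeps being pushed back to 5 instances for as long as the peak sample stays inside the window, even though the metric is already at zero, which looks a lot like the pattern in my figure.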
Is this an issue with my configuration, or is there a way to adjust the stabilization settings to prevent this unwanted behavior?
Thanks for your help!