Today, from around 10 AM to 10 PM (GMT+07:00), a large number of our Spot instances were terminated at around the same time. Normally we see 2-3 Spot instance shutdowns per day, but during this period more than 10 instances shut down in the morning alone, and close to 30 had shut down by 4 PM.
We have around 28 nodes total. Having so many nodes go down at once put a strain on our system and caused disruptions.
Has anyone else seen multiple Spot instances shut down concurrently and impact their workloads? We are running in the asia-east1-a zone (Taiwan). Are there any best practices or preventative measures to avoid or mitigate this kind of situation?
Thank you.