GKE Autopilot: Preventing Scale-Down Interruptions for Long-Running Quartz Jobs

Hi,
I’ve been tasked with migrating an old monolith to GKE Autopilot (my first time working with it).
The monolith can scale horizontally, but it runs jobs using Quartz, and these jobs can last for hours. The issue is that during scale-down, pod termination should be deferred until the jobs currently being processed have finished, because some of them are critical and their work would be lost.
The monolith is large, we don’t fully know everything it does, and a full rewrite isn’t feasible right now, but I can add a few patches—for example, two APIs:

  • one to stop enqueuing jobs
  • another API to report whether there are running executions (a rough sketch of both is just below this list).
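
For context, this is roughly what I have in mind for those two patches. It’s only a sketch: the class and method names are made up, and I’m assuming I can get at the monolith’s org.quartz.Scheduler instance. standby() stops triggers from firing without touching jobs that are already executing, and getCurrentlyExecutingJobs() only reports executions in the local scheduler instance, which is what I want per pod.

```java
// Minimal sketch of the two patch APIs; names are hypothetical and the
// wiring to the monolith's existing Quartz Scheduler is assumed.
import java.util.List;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;

public class JobDrainControl {

    private final Scheduler scheduler;

    public JobDrainControl(Scheduler scheduler) {
        this.scheduler = scheduler;
    }

    /** "Stop enqueuing" API: standby() pauses trigger firing without
     *  interrupting jobs that are already executing. */
    public void stopEnqueuing() throws SchedulerException {
        scheduler.standby();
    }

    /** "Running executions" API: reports whether any job is still in flight
     *  on this instance (getCurrentlyExecutingJobs is per scheduler instance). */
    public boolean hasRunningExecutions() throws SchedulerException {
        List<JobExecutionContext> running = scheduler.getCurrentlyExecutingJobs();
        return !running.isEmpty();
    }
}
```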

Given that GKE Autopilot’s maximum terminationGracePeriodSeconds is 10 minutes,
what would be a robust strategy to migrate this to GKE Autopilot without interrupting in-flight work?

Ideas I’m considering:

  • Set safe-to-evict to false whenever a job is taken. The Autopilot documentation mentions “up to 7 days” of protection against scale-down and auto-upgrade.
    A few questions here: does Autopilot honor the annotation if the pod sets it on itself at runtime, or is it expected to live in the pod template and stay fixed? And is the 7-day window counted from the image pull, or does it restart each time the annotation is toggled?
  • Set a high controller.kubernetes.io/pod-deletion-cost whenever a job is enqueued, to mitigate the issue (a sketch of patching both annotations at runtime follows this list).
  • Add a preStop hook that stops enqueuing new jobs and waits for running jobs to finish.
  • Even with these in place, I’m unsure about the robustness; I suspect concurrency issues—for example, scaling down a pod right as it picks up a job. Is there an alternative that eliminates this race?
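
This is the kind of runtime patching I’m picturing for the first two ideas. It assumes the fabric8 Kubernetes client, POD_NAME/POD_NAMESPACE injected through the Downward API, and a ServiceAccount allowed to patch its own pod; whether Autopilot actually honors the toggle is exactly my open question above.

```java
// Sketch of toggling the eviction-related annotations from inside the pod.
// Assumes the fabric8 client and POD_NAME/POD_NAMESPACE env vars from the
// Downward API; the ServiceAccount needs RBAC permission to patch pods.
import io.fabric8.kubernetes.api.model.PodBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class EvictionShield {

    private static final String SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict";
    private static final String DELETION_COST = "controller.kubernetes.io/pod-deletion-cost";

    private final KubernetesClient client = new KubernetesClientBuilder().build();
    private final String namespace = System.getenv("POD_NAMESPACE");
    private final String podName = System.getenv("POD_NAME");

    /** Call when this instance picks up a job. */
    public void protect() {
        annotate(SAFE_TO_EVICT, "false");   // ask the autoscaler not to evict this pod
        annotate(DELETION_COST, "10000");   // prefer other replicas when the ReplicaSet scales in
    }

    /** Call when the last running job finishes. */
    public void unprotect() {
        annotate(SAFE_TO_EVICT, "true");
        annotate(DELETION_COST, "0");
    }

    private void annotate(String key, String value) {
        client.pods().inNamespace(namespace).withName(podName)
              .edit(pod -> new PodBuilder(pod)
                      .editMetadata().addToAnnotations(key, value).endMetadata()
                      .build());
    }
}
```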

For migrating a long-running-job monolith to GKE Autopilot, a robust approach is to combine PodDisruptionBudgets (PDBs), a preStop hook, and the safe-to-evict annotation, but avoid relying solely on dynamic annotation changes: Autopilot may not honor a value flipped mid-lifecycle, since the protection is typically read from the pod spec at scheduling, so keeping the annotation in the pod template is the safer baseline.

Use a preStop hook that stops enqueuing new jobs and then polls your "running executions" API until in-flight jobs complete (a sketch of such a drain routine is below). Keep in mind that the preStop wait is bounded by terminationGracePeriodSeconds, so for multi-hour jobs it is a backstop rather than the primary protection. Also consider setting a high controller.kubernetes.io/pod-deletion-cost on busy pods, which reduces the chance that the ReplicaSet picks them first when the Deployment scales in.

To eliminate the race you describe, introduce a central job coordinator or a lease mechanism so that a pod only picks up a job once it is fully protected and its own termination has not started; that way no job begins on a pod that has already been selected for scale-down. A sketch of the pod-side handshake follows the drain example.
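
As a concrete example, the preStop hook can call a local drain endpoint (for instance an exec hook that curls localhost) backed by something like the sketch below. The polling interval and the idea of passing a budget slightly under the grace period are assumptions to tune, not Autopilot requirements.

```java
// Minimal blocking drain routine a preStop-triggered endpoint could run;
// the time budget should stay under terminationGracePeriodSeconds.
import java.time.Duration;
import java.time.Instant;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;

public class PreStopDrain {

    private final Scheduler scheduler;

    public PreStopDrain(Scheduler scheduler) {
        this.scheduler = scheduler;
    }

    /** Blocks until running jobs finish or the budget is spent, so the preStop
     *  hook (and therefore pod termination) waits as long as it is allowed to. */
    public void drain(Duration budget) throws SchedulerException, InterruptedException {
        scheduler.standby(); // stop firing new triggers; running jobs keep going
        Instant deadline = Instant.now().plus(budget);
        while (!scheduler.getCurrentlyExecutingJobs().isEmpty()
                && Instant.now().isBefore(deadline)) {
            Thread.sleep(5_000); // poll every 5s for in-flight executions
        }
    }
}
```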
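
And a minimal pod-side version of the handshake could look like the following. It narrows the race window by protecting the pod before committing to a job and then re-checking its own deletionTimestamp; a real coordinator or lease service (outside this sketch) is what fully closes it. The env var names, annotation handling, and the "decline and re-enqueue" decision are assumptions about your setup.

```java
// "Protect first, then check" handshake before accepting a job.
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.PodBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class JobAdmission {

    private static final String SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict";

    private final KubernetesClient client = new KubernetesClientBuilder().build();
    private final String namespace = System.getenv("POD_NAMESPACE");
    private final String podName = System.getenv("POD_NAME");

    /** Returns true only if this pod may safely start a job right now. */
    public boolean tryAcquireJobSlot() {
        // Step 1: claim protection before doing any work.
        setAnnotation(SAFE_TO_EVICT, "false");

        // Step 2: re-check our own state; if termination has already begun,
        // decline the job so it can be retried on another pod.
        Pod self = client.pods().inNamespace(namespace).withName(podName).get();
        if (self.getMetadata().getDeletionTimestamp() != null) {
            setAnnotation(SAFE_TO_EVICT, "true");
            return false;
        }
        return true;
    }

    private void setAnnotation(String key, String value) {
        client.pods().inNamespace(namespace).withName(podName)
              .edit(pod -> new PodBuilder(pod)
                      .editMetadata().addToAnnotations(key, value).endMetadata()
                      .build());
    }
}
```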