Intro
GKE offers many consumption options for TPUs, including on-demand, Spot, DWS flex-start, and TPU reservations. Unless you have the budget to obtain all capacity via reservation, you are faced with the reality that your preferred consumption option may be unavailable when you need it. Therefore, it's common to mix and match different consumption options in a single GKE cluster to achieve objectives like:
- Increasing obtainability: Due to highly constrained TPU availability, allowing flexibility to use on-demand, flex-start, or spot machines increases your chances of securing TPU capacity.
- Optimizing costs: For workloads that tolerate preemptions well, achieve significant cost savings by using Spot machines when available and automatically falling back to on-demand when they are not.
- Prioritizing reservations: For customers with some capacity on reservation and the rest via on-demand, fully utilize all reserved capacity first before adding on-demand nodes during usage spikes.
Managing this complexity is difficult: each consumption option and each multi-host TPU topology requires its own node pool, quickly creating an unmanageable number of node pools for ML platform teams. The GKE Custom Compute Class (CCC) API was designed specifically to address this by automating and streamlining the entire provisioning workflow.
A real world problem
Consider an ML platform team that manages a GKE cluster shared by multiple internal groups. Each group may need different TPU slice sizes depending on the workloads it runs: hypothetically, N workloads requiring a 4x4 topology and M requiring an 8x8 topology. Since each multi-host TPU slice requires a separate node pool, this translates to N+M node pools. Because this is a non-production environment, the team decided to optimize costs by creating these node pools with preemptible Spot machines and enabling autoscaling to tear the node pools down when not in use.
One day, the team found workloads stuck in an unschedulable state because Spot machines were temporarily unavailable. To unblock them, they had to manually create on-demand node pools as replacements, which is time-consuming and creates an operational burden. Automating this process is challenging because GKE's Cluster Autoscaler follows a strict rule: prioritize scaling up the cheapest available machine first. With many node pools at the same cost priority, all of them must be attempted, fail, and be placed in cooldown before the next cost priority is tried. Depending on the number of node pools, it may take hours before the priority fallback happens.
How CCC solves it
Custom Compute Class (CCC) solves this issue by grouping multiple node pools into a single priority. Based on the previous example, a cost-based priority CCC looks something like this:
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: tpu-4x4
spec:
  priorities:
  - spot: true
    tpu:
      type: tpu-v6e-slice
      count: 4
      topology: 4x4
  - tpu:
      type: tpu-v6e-slice
      count: 4
      topology: 4x4
When the Cluster Autoscaler sees a workload with nodeSelector: cloud.google.com/compute-class: tpu-4x4, it will try to scale up any existing node pool in the cluster that matches the first priority defined, namely "node pools with the Spot consumption option and a 4x4 TPU topology". If Spot machines are unavailable at the moment, the failed scale-up attempt places the first priority into a 5-minute cooldown, and the next priority will be tried. Essentially, CCC modifies the default behavior of the Cluster Autoscaler and makes it much more flexible.
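To see how a workload opts into the class, here is a minimal sketch of a Job manifest that requests it via the node selector. The job name, container name, and image are hypothetical placeholders; only the compute-class selector key and value, and the TPU chip count, come from the example above:

apiVersion: batch/v1
kind: Job
metadata:
  name: tpu-training-job           # hypothetical name
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/compute-class: tpu-4x4   # matches the ComputeClass name
      containers:
      - name: trainer              # hypothetical container
        image: example.com/trainer:latest          # hypothetical image
        resources:
          limits:
            google.com/tpu: 4      # TPU chips per node, matching tpu.count
      restartPolicy: Never

With this selector in place, the scheduling and fallback behavior described above is handled entirely by CCC; the workload itself never has to name a specific node pool.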
CCC node pool auto-creation and more
CCC can automate the creation and deletion of node pools with its node pool auto-creation feature. This greatly reduces the toil of creating and managing node pools manually. To use it, simply modify the previous example by adding the nodePoolAutoCreation field:
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: tpu-4x4
spec:
  priorities:
  - ...
  nodePoolAutoCreation: # ADD THIS
    enabled: true
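For reference, putting the two snippets together, the complete manifest with both the cost-based priorities and auto-creation enabled would look like this:

apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: tpu-4x4
spec:
  priorities:
  - spot: true
    tpu:
      type: tpu-v6e-slice
      count: 4
      topology: 4x4
  - tpu:
      type: tpu-v6e-slice
      count: 4
      topology: 4x4
  nodePoolAutoCreation:
    enabled: true

With this in place, GKE creates a matching node pool on demand when a workload selects the class, and removes it when it is no longer needed.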
CCC also allows targeting specific sets of node pools by name, which enables even more fine-grained control of node pool priority. For example:
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: tpu-4x4
spec:
  priorities:
  - nodepools: [nodepool-1] # nodepool-1 will be attempted before nodepool-2 and nodepool-3
  - nodepools: [nodepool-2, nodepool-3]
Try CCC with TPUs today
CCC has a lot more to offer to make TPU provisioning more flexible. Try CCC today and learn more at the following links: