Run your AI/HPC workloads with the GKE Managed Lustre CSI driver

Organizations today invest heavily in high-performance accelerators like GPUs and TPUs to power cutting-edge AI and HPC workloads, yet these costly compute resources often sit idle. The surprising culprit isn't a lack of compute power: it's a lack of data. When your accelerators are forced to wait for data during loading, checkpointing, or writing outputs, you're facing a critical bottleneck. This hidden waiting game can severely degrade performance, inflate your total cost of ownership (TCO), and push your deadlines further out.

Google Cloud Managed Lustre

Google Cloud Managed Lustre is designed to solve this challenge head-on. By providing massive throughput and ultra-low latency, it acts as a high-speed fuel line for your compute resources. It ensures that your accelerators remain fully utilized, your data pipelines flow without interruption, and your results arrive faster.

By making Managed Lustre a foundational component of your stack, you unlock the full potential of your AI/HPC workloads and build a more competitive AI Hypercomputer on Google Cloud. Stop waiting for data and start accelerating your results.

GKE & Managed Lustre: High-performance file storage for demanding applications

Google Kubernetes Engine (GKE) now offers native integration with Google Cloud Managed Lustre through the new GKE Managed Lustre CSI driver. This integration provides a seamless control plane to provision and manage Lustre file systems directly from your clusters, exposing them as standard Kubernetes Persistent Volumes (PVs).

While built on the open-source Lustre CSI driver, our new GKE Managed version removes the operational complexity. You get a fully managed experience with the performance and reliability of a native Google Cloud service, all while preserving the flexibility and scalability your workloads demand.
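For dynamic provisioning, you describe the file system once in a StorageClass and let the driver create Lustre-backed volumes on demand. Here is a minimal sketch: the provisioner name follows the GKE Lustre CSI driver, but the parameter names and values are assumptions to verify against the current GKE documentation for your project and VPC.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: lustre-rwx
provisioner: lustre.csi.storage.gke.io   # GKE Managed Lustre CSI driver
volumeBindingMode: WaitForFirstConsumer  # bind when the first pod schedules
parameters:
  network: default                       # assumption: VPC that can reach Managed Lustre
  perUnitStorageThroughput: "1000"       # assumption: throughput tier; check the docs
```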

Figure: GKE Lustre CSI driver architecture

Key benefits of the GKE Managed Lustre CSI driver

:white_check_mark: One-step enablement — No kernel module management required

Traditionally, using Lustre on cloud compute was a hands-on process. You had to manually manage kernel modules and user-space utilities on every node. This created significant operational overhead, and simple events like node upgrades or reboots could become major headaches, potentially disrupting critical workloads.

The GKE Managed Lustre CSI driver changes everything.

All that complexity is now handled for you. Simply enable the Lustre CSI driver feature in your cluster, and you’re done. There’s no need to touch kernel modules or worry about lifecycle management during node changes. Connecting your applications to high-performance Lustre storage is now as straightforward as defining a standard Pod and PVC.
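As a hedged sketch of that workflow, the manifests below request a volume from the hypothetical `lustre-rwx` StorageClass above and mount it into a pod; the capacity is illustrative, since Managed Lustre instances have minimum size requirements.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes:
    - ReadWriteMany              # Lustre volumes are shared across many pods
  storageClassName: lustre-rwx
  resources:
    requests:
      storage: 18000Gi           # illustrative; check minimum instance sizes
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  containers:
    - name: trainer
      image: busybox
      command: ["sh", "-c", "ls /data && sleep 3600"]  # placeholder workload
      volumeMounts:
        - name: training-data
          mountPath: /data
  volumes:
    - name: training-data
      persistentVolumeClaim:
        claimName: training-data
```

Once the PVC binds, any pod that references it gets the Lustre file system at its mount path with no node-level setup.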

:white_check_mark: Shared-mount architecture — Scale efficiently with fewer resources

In modern AI/HPC, it’s common for hundreds or even thousands of pods to require access to the same shared dataset. The traditional architecture, where each pod mounts the storage volume independently, quickly becomes a bottleneck, consuming excessive resources and struggling to scale.

The Lustre CSI driver is engineered to solve this problem.

It uses a shared-mount architecture, where a Lustre volume is mounted only once per node. All pods running on that node then access the data through this single, streamlined connection. This intelligent design dramatically reduces overhead and is built for massive scale.

The benefits are clear:

  • Maximized connectivity: A single Lustre instance can serve numerous workloads efficiently.

  • Reduced overhead: Eliminating duplicate mounts frees up valuable system resources.

  • Peak performance: The architecture is ideal for large-scale, read-heavy HPC workloads.
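To make the shared-mount behavior concrete, here is a minimal sketch reusing the hypothetical `training-data` claim from the example above: one hundred replicas reference the same ReadWriteMany volume, yet on any given node the driver establishes a single Lustre mount that all co-located pods share.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dataset-readers
spec:
  replicas: 100                  # many pods, one shared dataset
  selector:
    matchLabels:
      app: dataset-readers
  template:
    metadata:
      labels:
        app: dataset-readers
    spec:
      containers:
        - name: reader
          image: busybox
          # Each replica streams the shared dataset; replicas scheduled onto
          # the same node reuse that node's single Lustre mount.
          command: ["sh", "-c", "cat /data/* > /dev/null 2>&1; sleep 3600"]
          volumeMounts:
            - name: shared-dataset
              mountPath: /data
              readOnly: true
      volumes:
        - name: shared-dataset
          persistentVolumeClaim:
            claimName: training-data
```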

:white_check_mark: Kernel-level performance — High throughput without userspace overhead

Unlike userspace-based solutions, the Lustre CSI driver operates natively at the kernel level. This means:

  • No userspace emulation layers,

  • No pod-level annotations or sidecar containers,

  • No tuning of per-pod memory limits to avoid OOMs.

You get direct, high-performance I/O designed for extreme scale and parallelism—ideal for the demands of AI model training, HPC simulations, and large-scale data pipelines.
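A quick way to observe that direct I/O path is a throughput smoke test from inside the cluster. The Job below is a sketch under stated assumptions: it reuses the hypothetical `training-data` claim, installs fio into a stock Alpine image at startup, and issues O_DIRECT sequential reads so the page cache stays out of the measurement. Actual numbers will depend on your instance size and node placement.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: lustre-throughput-check
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: fio
          image: alpine:3.20
          command:
            - sh
            - -c
            # Sequential 1 MiB reads with O_DIRECT exercise the kernel-level
            # Lustre client directly, bypassing the page cache.
            - |
              apk add --no-cache fio
              fio --name=seqread --directory=/data --rw=read \
                  --bs=1m --size=2g --numjobs=8 --ioengine=psync \
                  --direct=1 --group_reporting
          volumeMounts:
            - name: lustre-vol
              mountPath: /data
      volumes:
        - name: lustre-vol
          persistentVolumeClaim:
            claimName: training-data
```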

Wonderful information on AI/HPC workloads and data loading.

Thank you for the feedback! Let us know if you are looking for specific information on any topics. Happy to help!

This is a great update :clap:. The shared-mount architecture and kernel-level performance really stand out—removing the complexity of manual Lustre management while still delivering the throughput needed for AI and HPC is a huge win. The one-step enablement also lowers the barrier for teams who want performance without heavy ops overhead. Excited to see how this helps scale large training and simulation workloads more efficiently :rocket:.

Thank you for your feedback!

Hello, is there a CMEK option available for the Lustre CSI driver?

It will be available soon; we are working on it. If you have a specific requirement, just drop me a note and I'm happy to help: poonamlamba@google.com. Thanks!

great work
