Optimise AI training cost and reduce storage QPS via GCSFuse async prefetch


For AI/ML engineers and data scientists, the speed at which models can ingest data is a primary determinant of training efficiency. GCSFuse serves as a critical bridge, letting users work with massive datasets in Google Cloud Storage (GCS) buckets as if the buckets were local filesystems. By providing high-performance, cost-effective access to these buckets, GCSFuse is foundational for running training, serving, and checkpointing workloads at scale.

With the async metadata prefetch feature of GCSFuse, customers can make AI training:

  • Faster: Save up to 20ms per metadata call and improve execution performance by up to 48%.
  • Cost efficient: Reduce metadata read costs by up to 108x by issuing up to 200x fewer queries per second (QPS) to GCS.

The bottleneck: Metadata lookup latency

In data-heavy AI/ML pipelines, the performance of the filesystem is often limited not by throughput, but by metadata lookup latency. Every time a workload accesses a file, a “stat” operation is required to confirm file type and metadata. Historically, these individual, blocking network calls to GCS have been costly and slow, with latencies ranging from 20–100ms.

When these workloads are dealing with billions of files (as seen in TensorFlow datasets or massive Hugging Face datasets), these micro-delays compound into significant “read stalls.” Furthermore, traditional caching mechanisms often suffer from “cold starts” or frequent evictions, forcing the system back to the network for every operation.
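To see how these micro-delays compound, the short Python calculation below totals the time spent waiting on blocking lookups when each one pays its own network round trip. The one-million-lookup count is a hypothetical example and serial_stall_hours is an illustrative helper, not part of GCSFuse. Real workloads overlap lookups across many readers, so this is cumulative waiting time rather than wall-clock time: about 5.6 hours at 20ms per lookup and about 27.8 hours at 100ms.

```python
def serial_stall_hours(num_lookups: int, latency_ms: int) -> float:
    """Cumulative blocking time if every lookup waits for its own round trip."""
    return num_lookups * latency_ms / 3_600_000  # milliseconds -> hours


# Hypothetical workload size; latencies taken from the 20-100ms range above.
for latency_ms in (20, 100):
    hours = serial_stall_hours(1_000_000, latency_ms)
    print(f"1M lookups at {latency_ms} ms: {hours:.1f} h of cumulative stall")
```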

Previously, GCSFuse relied on loading metadata for the entire bucket at mount time. While this reduces metadata lookup latency, it has two fundamental problems:

  1. Signal-to-noise ratio: Customers often keep additional data (noise) in the same bucket as the core training data (signal). Loading metadata for this additional data bloats the metadata cache.
  2. Poor handling of data drift: Because metadata was loaded at mount time, when data changes over time the system must either use a shorter time-to-live (TTL) or fetch metadata for the entire bucket again. There is no way to balance metadata cache size against TTL so that only active datasets are refreshed with an aggressive TTL.

Optimising metadata lookup

Async metadata prefetch introduces an intelligent background mechanism that proactively batch-retrieves a directory's metadata, triggered the first time that directory is accessed. It scales better than the previous whole-bucket approach because it targets only active directories within a bucket and can refresh cache entries dynamically while the workload runs.

For each file operation (for example Read, Write, or Stat), GCSFuse needs a metadata lookup, which is handled by FUSE's LookUpInode call. Each lookup is served by two network calls to GCS: either two StatObject calls, or one StatObject call plus one ListObjects/GetFolder call. The results can then be cached to serve subsequent requests. At the directory level, these lookups can be optimised by issuing a single ListObjects call for the whole directory instead of two calls for every object in it.
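The difference between per-object lookups and directory-level batching can be modelled in a few lines. The Python sketch below is a simplified, synchronous model under stated assumptions, not GCSFuse's implementation: FakeBucket, lookup_without_prefetch, PrefetchingLookup, and ttl_secs are hypothetical names, the second StatObject call is assumed to probe for a same-named directory, and the real feature runs the list call in the background. It counts calls for 100 files in one directory and shows a stale directory being refreshed with a fresh list call.

```python
import time
from dataclasses import dataclass


@dataclass
class FakeBucket:
    """Hypothetical in-memory bucket that only counts GCS-style calls."""
    objects: dict
    stat_calls: int = 0
    list_calls: int = 0

    def stat_object(self, name):
        self.stat_calls += 1
        return self.objects.get(name)

    def list_objects(self, prefix):
        # One call returns every object directly under `prefix` (no recursion).
        self.list_calls += 1
        return {n: size for n, size in self.objects.items()
                if n.startswith(prefix) and "/" not in n[len(prefix):]}


def lookup_without_prefetch(bucket, name):
    # Per-file pattern (assumed): stat the object, then probe for a same-named directory.
    meta = bucket.stat_object(name)
    bucket.stat_object(name + "/")
    return meta


class PrefetchingLookup:
    """Directory-level prefetch: the first lookup in a directory lists it once."""

    def __init__(self, bucket, ttl_secs=60.0, clock=time.monotonic):
        self.bucket = bucket
        self.ttl_secs = ttl_secs  # hypothetical refresh interval
        self.clock = clock
        self.listings = {}        # directory -> {object name: size}
        self.fetched_at = {}      # directory -> time of its last list call

    def lookup(self, name):
        directory = name.rsplit("/", 1)[0] + "/" if "/" in name else ""
        last = self.fetched_at.get(directory)
        if last is None or self.clock() - last > self.ttl_secs:
            # Missing or stale: replace this directory's entries with one list call.
            self.listings[directory] = self.bucket.list_objects(directory)
            self.fetched_at[directory] = self.clock()
        return self.listings[directory].get(name)


# 100 training shards plus unrelated "noise" data in the same bucket.
objects = {f"train/shard-{i:03d}.tfrecord": 1 << 20 for i in range(100)}
objects["logs/run-001.txt"] = 512
train_files = [n for n in objects if n.startswith("train/")]

baseline = FakeBucket(dict(objects))
for name in train_files:
    lookup_without_prefetch(baseline, name)

now = [0.0]  # fake clock so the TTL behaviour is deterministic
prefetched = FakeBucket(dict(objects))
cache = PrefetchingLookup(prefetched, ttl_secs=60.0, clock=lambda: now[0])
for name in train_files:
    cache.lookup(name)

print("without prefetch:", baseline.stat_calls, "stat,", baseline.list_calls, "list")
print("with prefetch:   ", prefetched.stat_calls, "stat,", prefetched.list_calls, "list")
```

Only the directory that is actually read (train/) is listed; the logs/ directory never enters the cache, which is the signal-to-noise benefit described earlier.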

By prefetching a directory's metadata on first access, async prefetch provides the following benefits:

1. Performance

By proactively batch-retrieving metadata for active directories only, async prefetch eliminates the need for individual, blocking network calls. Testing shows substantial execution-time reductions that translate directly into faster training iterations. In FIO-based benchmarks reading 10K 1MiB files across different bucket types, read performance improved by up to 48%. The Improvement column below is computed as read time without prefetch divided by read time with prefetch, minus one; in wall-clock terms, read time fell by roughly 18% to 33%.

Bucket structure      Read time (without prefetch)  Read time (with prefetch)  Improvement (%)
Flat bucket           13m 53s                       9m 28s                     46%
Flat (implicit dirs)  9m 15s                        6m 13s                     48%
HNS (hierarchical)    8m 49s                        7m 16s                     21%
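The Improvement column can be reproduced from the read times. In the minimal Python sketch below (to_seconds and improvements are illustrative helpers), improvement is read time without prefetch divided by read time with prefetch, minus one; truncated to whole percent this matches the 46%, 48%, and 21% shown. The same rows expressed as a reduction in wall-clock read time come to about 32%, 33%, and 18%.

```python
def to_seconds(duration: str) -> int:
    """Parse strings such as "13m 53s" into seconds."""
    minutes, seconds = duration.replace("s", "").split("m")
    return int(minutes) * 60 + int(seconds)


def improvements(without_prefetch: str, with_prefetch: str) -> tuple:
    """Return (speedup, fraction of read time removed) for one benchmark row."""
    before, after = to_seconds(without_prefetch), to_seconds(with_prefetch)
    return before / after - 1, 1 - after / before


RESULTS = {
    "Flat bucket": ("13m 53s", "9m 28s"),
    "Flat (implicit dirs)": ("9m 15s", "6m 13s"),
    "HNS (hierarchical)": ("8m 49s", "7m 16s"),
}

for bucket, (without, with_) in RESULTS.items():
    speedup, time_saved = improvements(without, with_)
    print(f"{bucket}: improvement {speedup:.1%}, read time reduced by {time_saved:.1%}")
```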

2. Cost efficiency

Metadata operations incur API costs. A single ListObjects call, when used for batching, is significantly more efficient than issuing thousands of individual GetObjectMetadata requests. For high-volume customers, this shift can lead to massive cost reductions.

Cost savings analysis (assuming 100 files in an active directory), based on the following operation prices:

  • Flat bucket: $0.005 per 1000 Class A operations and $0.0004 per 1000 Class B operations.
  • HNS bucket: $0.0065 per 1000 Class A operations and $0.0005 per 1000 Class B operations.

Scenario (100 files per directory)       API pattern                                            Estimated cost  Notes
Prefetch enabled                         1 ListObjects call (Class A)                           $0.005 × 10⁻³   Baseline
Prefetch disabled (HNS)                  100 × 2 GetMetadata (Class B)                          $0.1 × 10⁻³     20x more expensive
Prefetch disabled (flat, implicit dirs)  100 GetMetadata (Class B) + 100 ListObjects (Class A)  $0.54 × 10⁻³    108x more expensive

By enabling async metadata prefetch, metadata API costs can drop by up to 108x and QPS to GCS by up to 200x.
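The cost comparison is straightforward arithmetic over the listed prices. The Python sketch below reproduces it with exact decimal math; api_cost and the scenario names are illustrative, and the prices are the per-1000-operation figures quoted above rather than a live pricing lookup. The call counts give the 1 versus 200 requests behind the QPS figure.

```python
from decimal import Decimal

# List prices per 1000 operations (USD), as quoted above.
PRICE_PER_1000 = {
    ("flat", "A"): Decimal("0.005"),
    ("flat", "B"): Decimal("0.0004"),
    ("hns", "A"): Decimal("0.0065"),
    ("hns", "B"): Decimal("0.0005"),
}


def api_cost(bucket_type: str, class_a: int, class_b: int) -> Decimal:
    """Cost in USD of `class_a` Class A and `class_b` Class B operations."""
    return (class_a * PRICE_PER_1000[(bucket_type, "A")]
            + class_b * PRICE_PER_1000[(bucket_type, "B")]) / 1000


FILES = 100
scenarios = {
    # name: (bucket type used for pricing, Class A ops, Class B ops)
    "prefetch enabled": ("flat", 1, 0),
    "prefetch disabled (HNS)": ("hns", 0, 2 * FILES),
    "prefetch disabled (flat, implicit dirs)": ("flat", FILES, FILES),
}
costs = {name: api_cost(*args) for name, args in scenarios.items()}
calls = {name: a + b for name, (_, a, b) in scenarios.items()}
baseline = costs["prefetch enabled"]

for name, cost in costs.items():
    print(f"{name}: ${float(cost * 1000):.3f} x 10^-3, "
          f"{float(cost / baseline):.0f}x cost, {calls[name]} calls")
```

Like the table, the sketch prices the single list call at the flat-bucket Class A rate; pricing it at the HNS rate of $0.0065 per 1000 instead gives roughly 15x for the HNS comparison.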

3. Scalability

As training clusters grow to hundreds of nodes, the system must handle massive concurrency without succumbing to “list storms” or out-of-memory (OOM) errors. Async metadata prefetch includes a “Targeted Mode” and global concurrency throttling, which keep the system stable even under extreme metadata density.
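Global concurrency throttling means a single cap on in-flight background list calls shared by every directory, rather than a cap per directory. The Python sketch below illustrates that idea with a shared semaphore; ThrottledPrefetcher is a hypothetical class, the sleep stands in for a ListObjects round trip, and the cap of 10 simply mirrors the documented default of metadata-prefetch-max-workers. It does not model the unlimited (-1) setting.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 10  # same value as the documented metadata-prefetch-max-workers default


class ThrottledPrefetcher:
    """Hypothetical sketch: one global cap on concurrent background list calls."""

    def __init__(self, max_workers: int):
        self._slots = threading.BoundedSemaphore(max_workers)  # shared by all directories
        self._lock = threading.Lock()
        self.in_flight = 0
        self.peak = 0

    def prefetch(self, directory: str) -> str:
        with self._slots:  # waits while max_workers prefetches are already running
            with self._lock:
                self.in_flight += 1
                self.peak = max(self.peak, self.in_flight)
            time.sleep(0.01)  # stand-in for one ListObjects round trip
            with self._lock:
                self.in_flight -= 1
        return directory


prefetcher = ThrottledPrefetcher(MAX_WORKERS)
# A burst of first-time accesses to 200 directories from 64 reader threads.
with ThreadPoolExecutor(max_workers=64) as pool:
    done = list(pool.map(prefetcher.prefetch, [f"dir-{i:03d}/" for i in range(200)]))
print(f"prefetched {len(done)} directories, peak concurrency {prefetcher.peak}")
```

However many readers hit new directories at once, the number of concurrent list calls stays bounded, which is what prevents a burst of first accesses from turning into a list storm.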

Controlling the behaviour of async metadata prefetch

Customers can control the behaviour of async metadata prefetch by adjusting the following parameters when mounting a bucket with GCSFuse:

  • enable-metadata-prefetch: Enables async prefetch of metadata. Possible values: true or false. Default: false.
  • metadata-prefetch-entries-limit: The maximum number of metadata entries to prefetch per directory. Values above 5000 result in multiple sequential Cloud Storage list calls. Possible values: an integer between -1 and 2147483647; set to -1 to prefetch all entries. Default: 5000.
  • metadata-prefetch-max-workers: The maximum number of concurrent background workers allowed to perform metadata prefetching across all directories. Possible values: an integer between -1 and 2147483647; set to -1 for unlimited workers. Default: 10.
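When mounts are scripted, it can help to validate these values before invoking gcsfuse. The Python sketch below builds an argument list and checks the documented ranges. gcsfuse_prefetch_command is a hypothetical helper, and only the --enable-metadata-prefetch flag spelling appears in the sample command in this post; the other two flag spellings are assumed to mirror the parameter names above.

```python
import shlex

INT32_MAX = 2147483647  # upper bound quoted for both integer parameters


def gcsfuse_prefetch_command(bucket, mount_point, enable=True,
                             entries_limit=5000, max_workers=10):
    """Build a gcsfuse argv list; -1 means all entries / unlimited workers."""
    for flag, value in (("metadata-prefetch-entries-limit", entries_limit),
                        ("metadata-prefetch-max-workers", max_workers)):
        if not -1 <= value <= INT32_MAX:
            raise ValueError(f"{flag} must be between -1 and {INT32_MAX}, got {value}")
    return [
        "gcsfuse",
        f"--enable-metadata-prefetch={'true' if enable else 'false'}",
        # Assumed flag spellings: they mirror the parameter names listed above.
        f"--metadata-prefetch-entries-limit={entries_limit}",
        f"--metadata-prefetch-max-workers={max_workers}",
        bucket,
        mount_point,
    ]


cmd = gcsfuse_prefetch_command("my-bucket", "/path/to/mount", max_workers=20)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a host where gcsfuse is installed; returning a list instead of a single string avoids shell quoting issues with bucket names and paths.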

Sample mount configuration

Async metadata prefetch is available in GCSFuse versions later than 3.8.0 and in GKE starting from version 1.35.2-gke.1852000. To get started, refer to the GCSFuse documentation on mounting a Cloud Storage bucket.

gcsfuse Command

gcsfuse --enable-metadata-prefetch=true my-bucket /path/to/mount

GKE Fuse CSI driver

# Inside volumeAttributes of the gcsfuse.csi.storage.gke.io CSI volume
mountOptions: "metadata-cache:enable-metadata-prefetch:true"

Great insights on reducing metadata lookup latency, Trinadh! I am currently developing a multi-agent drone swarm simulation for search and rescue. Since I am building this out with a limited budget and compute resources, I rely heavily on local virtual environments to simulate my ML data pipelines and predictive models. The directory-based async prefetch seems like a fantastic way to optimize even resource-constrained, local-to-cloud workflows.

Quick question: If we set metadata-prefetch-entries-limit to -1 (unlimited) for a massive flat directory, how much client-side memory overhead should we anticipate, specifically when running the training phase on a resource-constrained virtual machine?

The metadata cache typically uses about 1,700 bytes per file. If you set the limit to -1, estimate how much memory that could consume for your directory sizes and make sure a large flat structure won't push the VM into OOM.

Sorry for the delayed response.
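The estimate suggested in the reply can be sketched in a few lines of Python. It uses the approximate 1,700 bytes per file quoted there; metadata_cache_bytes, fits_in_memory, and the 25% budget_fraction are hypothetical choices for illustration, not GCSFuse settings. One million cached entries come to about 1.7 GB (roughly 1.6 GiB), while ten million would need about 17 GB.

```python
BYTES_PER_ENTRY = 1700  # approximate per-file figure quoted in the reply


def metadata_cache_bytes(num_files: int, bytes_per_entry: int = BYTES_PER_ENTRY) -> int:
    """Rough metadata-cache footprint if every file's entry is cached."""
    return num_files * bytes_per_entry


def fits_in_memory(num_files: int, vm_memory_bytes: int,
                   budget_fraction: float = 0.25) -> bool:
    # budget_fraction is a hypothetical share of VM memory set aside for the cache,
    # leaving the rest for the training process, data loaders, and the OS.
    return metadata_cache_bytes(num_files) <= vm_memory_bytes * budget_fraction


GIB = 1024 ** 3
for files in (100_000, 1_000_000, 10_000_000):
    size_gib = metadata_cache_bytes(files) / GIB
    print(f"{files:>12,} files -> {size_gib:6.2f} GiB; "
          f"fits a 16 GiB VM budget: {fits_in_memory(files, 16 * GIB)}")
```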