Authors:
Anna Whelan, AI Strategic Cloud Engineer, Google Cloud
Kamal Kishore, Software Engineer, Google
The world of real-time analytics demands speed and efficiency, especially when dealing with massive datasets. To empower users with cutting-edge solutions, it’s crucial to understand the performance capabilities of core services like Google Cloud Feature Store under various conditions.
This post shares the results of a comprehensive benchmarking effort focused on key metrics like scalability, cost efficiency, and, most importantly, latency. We will explore how different networking endpoints and architectural configurations impact the Feature Store’s performance. This involves a detailed analysis of:
- Optimized Feature Store with a Public Endpoint: Evaluating performance with standard public network access.
- Optimized Feature Store with Private Service Connect (PSC): Assessing the benefits of secure, private connectivity.
- Direct Read from Bigtable: Measuring the new capability that reads feature data directly from Bigtable for enhanced performance.
Under the hood: Engineering the benchmarking solution
To accurately measure performance, we built a scalable and configurable benchmarking framework using Google Cloud’s native services. The solution is designed to simulate a real-world, high-throughput scenario.
The benchmarking architecture
The core of our solution is an Apache Beam pipeline that runs on Dataflow. This pipeline is responsible for generating a massive load of read requests against the Feature Store. Here’s how the components work together:
- Source Data: A text file in Google Cloud Storage (GCS) contains millions of entity IDs, which act as the keys for our feature lookups.
- Dataflow Pipeline: The pipeline reads these IDs, divides them across many workers, and makes parallel calls to the Feature Store’s online serving endpoint.
- Feature Store: This is the service we are benchmarking. We test instances with different endpoint configurations (Public, Private Service Connect).
- BigQuery: For every feature lookup, the Dataflow pipeline logs key metrics—most importantly, the end-to-end latency—into a BigQuery table for later analysis.
This architecture allows us to generate significant load and capture precise performance data for every single request.
Configuring the Dataflow pipeline
With the architecture in place, the first step is to define the pipeline and its configuration options. Using a custom PipelineOptions
interface in Apache Beam allows us to easily parameterize the job, making it simple to switch between different test configurations (like public vs. private endpoints) without changing the code.
Here are the custom options we defined for our benchmarking pipeline:
public interface BenchmarkFS extends PipelineOptions {

  @Description("The GCP project ID")
  @Default.String("gcp-project-id")
  String getProject();
  void setProject(String project);

  @Description("Feature Store ID")
  @Default.String("feature-store-id")
  String getFeaturestoreId();
  void setFeaturestoreId(String fsId);

  @Description("A file containing entity IDs")
  @Default.String("gs://bucket/path/to/entity_ids.txt")
  String getEntityIdFile();
  void setEntityIdFile(String file);

  @Description("Is this a private endpoint?")
  @Default.Boolean(false)
  Boolean getIsPrivateEndpoint();
  void setIsPrivateEndpoint(Boolean isPrivate);

  @Description("The private endpoint address")
  @Default.String("10.11.12.13:443")
  String getEndpointAddr();
  void setEndpointAddr(String addr);

  @Description("BQ Table to write results")
  @Default.String("project.dataset.table_name")
  String getBqResultsTable();
  void setBqResultsTable(String table);
}
The main pipeline code then uses these options to build and run the job. The process starts by reading the entity IDs line-by-line from the source text file in Google Cloud Storage using Beam’s TextIO.read()
transform. Each entity ID is then passed downstream to be processed by the core logic of our pipeline.
public static void main(String[] args) {

  BenchmarkFS options = PipelineOptionsFactory.fromArgs(args)
      .withValidation()
      .as(BenchmarkFS.class);

  Pipeline p = Pipeline.create(options);

  // Read entity IDs from a GCS file and pass them to the main processing DoFn
  p.apply("ReadEntityIds", TextIO.read().from(options.getEntityIdFile()))
      .apply("FetchFeatures", ParDo.of(new FetchFeatures(options)));

  p.run();
}
The core logic: Fetching features and measuring latency
The heart of the benchmark is a custom DoFn (a Beam processing function) called FetchFeatures. Each worker in the Dataflow job runs this function. For every entity ID it receives, this DoFn is responsible for connecting to the Feature Store, sending a read request, and logging the performance.
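The snippets below reference a few fields of FetchFeatures (the client, the endpoint settings, and the feature view to read) without showing the class declaration itself. Here is a minimal sketch of what that skeleton could look like; the hard-coded region, the featureViewName field, and the "benchmark_view" feature view ID are illustrative assumptions rather than values from the original pipeline:
// A minimal sketch of the FetchFeatures skeleton that the following snippets assume.
// Plain values are copied out of the options at graph-construction time so the DoFn
// stays serializable; "benchmark_view" and the hard-coded region are placeholders.
public class FetchFeatures extends DoFn<String, TableRow> {

  private final Boolean isPrivateEndpoint;
  private final String endpointAddr;
  private final String region = "us-central1"; // keep in sync with the --region flag
  private final String featureViewName;        // full FeatureView resource name
  private transient FeatureOnlineStoreServiceClient client;

  public FetchFeatures(BenchmarkFS options) {
    this.isPrivateEndpoint = options.getIsPrivateEndpoint();
    this.endpointAddr = options.getEndpointAddr();
    this.featureViewName = String.format(
        "projects/%s/locations/%s/featureOnlineStores/%s/featureViews/%s",
        options.getProject(), region, options.getFeaturestoreId(), "benchmark_view");
  }

  // The @Setup, @ProcessElement, and @Teardown methods are shown below.
}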
First, in the @Setup
method, which runs once per DoFn instance, we initialize the FeatureOnlineStoreServiceClient. This is crucial for efficiency, as it prevents creating a new client for every single request. This is also where we implement the logic to connect to either a public endpoint or a private one using Private Service Connect (PSC).
@Setup
public void setup() throws IOException {
  // This logic creates the correct client based on the pipeline options
  if (isPrivateEndpoint) {
    // Connect to the private IP and port
    ManagedChannel channel =
        ManagedChannelBuilder.forTarget(endpointAddr).usePlaintext().build();
    FeatureOnlineStoreServiceSettings settings =
        FeatureOnlineStoreServiceSettings.newBuilder()
            .setTransportChannelProvider(
                FixedTransportChannelProvider.create(GrpcTransportChannel.create(channel)))
            .setCredentialsProvider(NoCredentialsProvider.create()) // PSC handles auth
            .build();
    this.client = FeatureOnlineStoreServiceClient.create(settings);
  } else {
    // Connect to the regional public endpoint
    String apiEndpoint = region + "-aiplatform.googleapis.com:443";
    FeatureOnlineStoreServiceSettings settings =
        FeatureOnlineStoreServiceSettings.newBuilder().setEndpoint(apiEndpoint).build();
    this.client = FeatureOnlineStoreServiceClient.create(settings);
  }
}
Next, the @ProcessElement
method executes for each entity ID. The logic is straightforward:
- Record the start time.
- Build a FetchFeatureValuesRequest for the given entity ID.
- Execute the request inside a try-catch block.
- Calculate the end-to-end latency in milliseconds.
- Create a BigQuery TableRow object containing the latency, the entity ID, and a timestamp.
- Output the TableRow for the final stage of the pipeline.
@ProcessElement
public void processElement(ProcessContext c) {
  String entityId = c.element();
  long startTime = System.currentTimeMillis();

  // Build the request object for the Feature Store online serving API.
  // featureViewName is the full FeatureView resource name, assembled from the pipeline options.
  FetchFeatureValuesRequest request =
      FetchFeatureValuesRequest.newBuilder()
          .setFeatureView(featureViewName)
          .setDataKey(FeatureViewDataKey.newBuilder().setKey(entityId).build())
          .build();

  try {
    // Make the actual API call to fetch features
    client.fetchFeatureValues(request);
  } catch (Exception e) {
    // Log or count failed lookups as needed
  } finally {
    // Calculate latency and create a BQ row
    long latencyInMillis = System.currentTimeMillis() - startTime;
    TableRow row = new TableRow()
        .set("entity_id", entityId)
        .set("latency", latencyInMillis)
        .set("timestamp", Instant.now().toString());
    c.output(row);
  }
}
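Although not shown in the original pipeline, it is also good practice to release the client when Dataflow retires a DoFn instance. A small sketch using Beam's @Teardown hook could look like this:
@Teardown
public void teardown() {
  // Release the underlying gRPC resources when this DoFn instance is discarded.
  if (client != null) {
    client.close();
  }
}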
Finally, the main pipeline takes the PCollection<TableRow> produced by this DoFn and streams the results directly into our BigQuery results table using BigQueryIO.
// (Continuing from the main method...)
p.apply("ReadEntityIds", TextIO.read().from(options.getEntityIdFile()))
.apply("FetchFeatures", ParDo.of(new FetchFeatures(options)))
.apply("WriteToBQ", BigQueryIO.writeTableRows().to(options.getBqResultsTable())
.withCreateDisposition(CreateDisposition.CREATE_NEVER)
.withWriteDisposition(WriteDisposition.WRITE_APPEND));
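One note on the write step: because the sample uses CreateDisposition.CREATE_NEVER, the results table must already exist in BigQuery with a matching schema. If you would rather let the pipeline create it on first run, a minimal alternative sketch (not part of the original pipeline) is to define the schema with the TableSchema and TableFieldSchema model classes from com.google.api.services.bigquery.model and switch the create disposition:
// Alternative sketch: let the pipeline create the results table on first run.
// The field names and types mirror the TableRow built in processElement.
TableSchema resultsSchema = new TableSchema().setFields(Arrays.asList(
    new TableFieldSchema().setName("entity_id").setType("STRING"),
    new TableFieldSchema().setName("latency").setType("INTEGER"),    // milliseconds
    new TableFieldSchema().setName("timestamp").setType("TIMESTAMP")));

p.apply("ReadEntityIds", TextIO.read().from(options.getEntityIdFile()))
    .apply("FetchFeatures", ParDo.of(new FetchFeatures(options)))
    .apply("WriteToBQ", BigQueryIO.writeTableRows().to(options.getBqResultsTable())
        .withSchema(resultsSchema)
        .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND));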
From execution to analysis
This section covers the final steps: launching the Dataflow job and querying the results in BigQuery to understand the performance characteristics.
Running the benchmark job
You can execute the pipeline from your local machine using a Maven command. The command passes all the configuration parameters we defined in the BenchmarkFS
options interface, such as the project ID, GCS file path, and target BigQuery table.
To run the job, use the following command structure, filling in your specific project details:
mvn compile exec:java \
-Dexec.mainClass=com.example.BenchmarkFS \
-Dexec.args=" \
--runner=DataflowRunner \
--project=your-gcp-project \
--region=us-central1 \
--featurestoreId=your_fs_id \
--entityIdFile=gs://your-bucket/path/to/entity_ids.txt \
--bqResultsTable=your-gcp-project:your_dataset.results_table \
--isPrivateEndpoint=false \
--tempLocation=gs://your-bucket/temp/ \
--workerMachineType=n2-standard-4 \
--maxNumWorkers=50"
To test with a private endpoint, you would simply change the flags to --isPrivateEndpoint=true and provide the PSC address with --endpointAddr=10.11.12.13:443.
Analyzing the results in BigQuery
Once the Dataflow job completes, the results_table
in BigQuery will contain millions of rows, each logging the latency for a single feature lookup. Now, you can run SQL queries to aggregate this data and calculate meaningful performance statistics.
While an average latency is useful, percentile latencies (p50, p90, p95, p99) provide a much deeper understanding of the user experience. The following query uses BigQuery’s APPROX_QUANTILES
function to efficiently calculate these values:
SELECT
-- Calculate latency percentiles. p99 is the latency that 99% of requests fall under.
APPROX_QUANTILES(latency, 100)[OFFSET(50)] AS p50_latency_ms,
APPROX_QUANTILES(latency, 100)[OFFSET(90)] AS p90_latency_ms,
APPROX_QUANTILES(latency, 100)[OFFSET(95)] AS p95_latency_ms,
APPROX_QUANTILES(latency, 100)[OFFSET(99)] AS p99_latency_ms
FROM
`your-gcp-project.your_dataset.results_table`
Running this query for each benchmarked configuration (Public, PSC, etc.) will give you the precise data needed to compare their performance and make an informed architectural decision.
Conclusion: Key takeaways
This guide provides a complete, reusable framework for benchmarking the Google Cloud Feature Store at scale. By leveraging Dataflow, you can simulate high-throughput workloads and use BigQuery to capture and analyze granular performance data, allowing you to make informed, data-driven decisions for your MLOps platform.
The key takeaway from our own internal benchmarking is that the direct-read Feature Store (backed by Bigtable) offers a transformative leap in performance, consistently delivering the lowest latencies required for demanding, real-time applications.
You now have the tools to validate these findings for yourself. By using the provided pipeline code and analysis queries, you can test different endpoint configurations and machine types to find the optimal balance of performance and cost for your unique use case. In the world of real-time AI, this ability to precisely measure and optimize feature serving is not just an advantage—it’s a necessity.