Gemini-Embedding-001 Latency Considerations

I’m facing considerable latency issues when using Gemini-Embedding-001 via the Vertex AI API.
The code below takes between 11 and 12 seconds per call.
Warming up, changing the batch size, and so on have no effect on these numbers.
I wanted to ask whether this is expected behavior or whether I’m missing something.
Latencies like this essentially double my pipeline’s runtime.

This is how I initialize and instantiate my client:

import logging
from google.oauth2.service_account import Credentials
from Configurations import configurations
from google.genai import Client
from google.cloud import storage

class GoogleClients:

    def __init__(self) -> None:
        try:
            # Service-account credentials scoped for the full Cloud Platform
            credentials = Credentials.from_service_account_file(filename = configurations.CREDENTIALS_PATH, scopes = ['https://www.googleapis.com/auth/cloud-platform'])
            # google-genai client in Vertex AI mode, pointed at the 'global' location
            self.VertexAIClient = Client(vertexai = True, project = configurations.GOOGLE_PROJECT_ID, location = 'global', credentials = credentials)
            self.storageClient = storage.Client(project = configurations.GOOGLE_PROJECT_ID, credentials = credentials)
            logging.info('GoogleClients initialization completed successfully.')

        except Exception as error:
            raise RuntimeError(f'Failed to initialize GoogleClients. Aborting application: {error}.') from error

clients = GoogleClients()

And this is how I call the embedding model:

import logging
import asyncio
import time
from GoogleClients import clients
from Configurations import configurations
from typing import Optional
from tqdm.asyncio import tqdm_asyncio

class DenseEmbedder:

    def __init__(self) -> None:
        try:
            self.vertexAIClient = clients.VertexAIClient
            self.semaphore = asyncio.Semaphore(configurations.ASYNCIO_SEMAPHORE)
            logging.info('DenseEmbedder initialization completed successfully.')

        except Exception as error:
            raise RuntimeError(f'Failed to initialize DenseEmbedder. Aborting application: {error}.') from error

    async def embedBatch(self, batch: list[str], taskType: str) -> Optional[list[list[float]]]:
        try:
            async with self.semaphore:
                logging.info(f'Sending batch of {len(batch)} item(s) to Vertex AI embedding API.')
                startTime = time.perf_counter()
                response = await self.vertexAIClient.aio.models.embed_content(model = configurations.DENSE_EMBEDDING_MODEL, contents = batch, config = {'task_type': taskType})
                endTime = time.perf_counter()
                duration = endTime - startTime
                logging.info(f'Vertex AI API call for batch of {len(batch)} completed in {duration:.4f} seconds.')
                return [embedding.values for embedding in response.embeddings]

        except Exception as error:
            logging.warning(f'Failed to generate embeddings for a batch: {error}.')
            return None

    async def embedBatches(self, texts: list[str], taskType: str) -> list[list[float]]:
        try:
            allEmbeddings = []
            batchSize = configurations.DENSE_EMBEDDING_BATCH_SIZE
            tasks = [self.embedBatch(texts[i:i + batchSize], taskType) for i in range(0, len(texts), batchSize)]
            batchResults = await tqdm_asyncio.gather(*tasks, desc = 'Generating dense embeddings...')
            for batchEmbeddings in batchResults:
                if batchEmbeddings:
                    allEmbeddings.extend(batchEmbeddings)

            logging.info('Dense embedding generation completed successfully.')
            return allEmbeddings

        except Exception as error:
            raise RuntimeError(f'Failed to generate dense embeddings: {error}') from error

logging.basicConfig(level = logging.INFO, format = '%(asctime)s - %(levelname)s - %(message)s', force = True)
denseEmbedder = DenseEmbedder()
dummyText = ["This is a sample text for embedding."]
embeddings = asyncio.run(denseEmbedder.embedBatch(batch = dummyText, taskType = 'RETRIEVAL_DOCUMENT'))

Below is indicative logging output.

2025-11-01 22:12:08,790 - INFO - DenseEmbedder initialization completed successfully.
2025-11-01 22:12:08,790 - INFO - Sending batch of 1 item(s) to Vertex AI embedding API.
2025-11-01 22:12:20,587 - INFO - Vertex AI API call for batch of 1 completed in 11.7834 seconds.

3 Likes

I also have this issue; something is wrong. I can see a 10 s delay when embedding a query for a RAG search, when it should take well under 1 s.

I can’t find any info on how to reduce this. Query embeddings need very low latency, otherwise there is no point in using them.

For example, using the Vertex multimodal embedding model for text only takes 1.2 s:

import time

import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

vertexai.init()

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
image = Image.load_from_file(
    "gs://cloud-samples-data/vertex-ai/llm/prompts/landmark1.png"
)

start_time = time.time()
embeddings = model.get_embeddings(
    # image=image,
    contextual_text="Colosseum",
    dimension=1408,
)
end_time = time.time()
print(f"Time taken: {end_time - start_time} seconds")

And for text only it takes 1.1 s:

import time

from vertexai.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("gemini-embedding-001")

start_time = time.time()
embeddings = model.get_embeddings(
    texts=["Colosseum"],
)
end_time = time.time()
print(f"Time taken: {end_time - start_time} seconds")
1 Like

Hello @MeanValueTheorem,

Things to try:

  • Do not use global as the location; use a region close to where you run the code
  • Ideally, run the code in Cloud Run in the same region as Vertex AI
  • Lower the output dimensionality, after checking that quality is still acceptable at the smaller size (see the sketch at the end of this post)

Embedding computation can take a while, and returning a huge array of 1,400+ floats over HTTP is not optimal.

I have run an embedding pipeline from a Cloud Run Job at 280k embeddings per minute; the farther away the Vertex AI region, the higher the latency. I remember reducing the dimension size too, which helped a lot.
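
A minimal sketch of both changes, using the same google-genai SDK as the original snippet. The project ID and the europe-west1 region are placeholders (use whatever region sits closest to your workload), and auth here falls back to application default credentials:

import time

from google.genai import Client
from google.genai.types import EmbedContentConfig

# Regional endpoint instead of 'global'; pick the region closest to the caller.
client = Client(vertexai=True, project='your-project-id', location='europe-west1')

start = time.perf_counter()
response = client.models.embed_content(
    model='gemini-embedding-001',
    contents=['This is a sample text for embedding.'],
    config=EmbedContentConfig(
        task_type='RETRIEVAL_DOCUMENT',
        output_dimensionality=768,  # smaller payload than the default 3072 floats
    ),
)
print(f'Call took {time.perf_counter() - start:.4f} s')

Dropping from 3,072 to 768 dimensions cuts the response payload to a quarter, on top of the routing win from the regional endpoint.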

2 Likes

You are absolutely right, @LeoK, thank you for the quick response. I changed the region from global to eu west and latency went from 11 s to 0.7 s. I didn’t know region had that big of an effect; I thought global would be a good default. Apparently not. Thank you again.
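
For reference, the change in my GoogleClients class was a single argument (the exact region name below is illustrative, not necessarily the one I used):

self.VertexAIClient = Client(
    vertexai = True,
    project = configurations.GOOGLE_PROJECT_ID,
    location = 'europe-west1',  # previously 'global'
    credentials = credentials,
)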

3 Likes

That makes sense; I also thought global would be better. Will keep it in mind.

1 Like
