Vertex AI RAG loading document titles but not contents from GCS

Hi,

I have created a Vertex AI RAG Engine in us-east4 (I have tried other regions also). I have imported documents from GCS into the RAG database. I have granted all appropriate permissions, including making the service accounts owners, just in case.

What happens is that the file names are obtained from GCS, but if I check the Vertex AI file sizes or index sizes, they are all 0 bytes.

I have tried many permutations of this: different files, different parsers, and even a single plain UTF-8 file to make sure the ingestion parser wasn't hanging.

Nothing works. I always end up with a Vertex AI RAG Engine and agent that in theory work (I can communicate with them and ask them questions, and they normally respond, although sometimes with an error), but they never respond with anything from my corpus, which is not surprising, since they have not actually ingested the data.

It appears that the data ingestion process is silently failing. This is easy to see just by creating a RAG corpus and then asking it to tell you what it contains (see the program below).
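To make the symptom concrete: the corpus lists every file by name, but every indexed size is 0 bytes. A tiny local helper (my own illustration, not part of the SDK; the `(display_name, size_bytes)` tuple format is an assumption) shows the check the sanity-check program is effectively doing:

```python
def find_unindexed(files):
    """Given (display_name, size_bytes) pairs from listing a corpus's
    files, return the names whose indexed size is 0 bytes."""
    return [name for name, size in files if size == 0]

# The failure mode described above: names are listed, nothing is ingested.
listed = [("report.pdf", 0), ("notes.txt", 0)]
print(find_unindexed(listed))  # ['report.pdf', 'notes.txt']
```

If this list is non-empty after an import supposedly finished, the ingestion silently failed.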

What I do not understand is that this is the Vertex AI equivalent of “Hello World”. If it won’t index the data, nothing at all will work. So, how is anyone getting a Vertex AI RAG agent to work?

The program below gives me a list of each RAG corpus under my project and each file name, along with the file and index sizes (which are always 0 bytes).

Any insight into what is going on would be greatly appreciated.

Thanks,
James


from vertexai.preview import rag
from google.cloud.aiplatform import initializer
import humanize  # You might need to install it: pip install humanize

# This program lists the Vertex AI RAG corpora under PROJECT_ID and the files
# and index sizes, as a sanity check that indexing is working.

# --- Your Project Details ---
PROJECT_ID = "[YOUR-ID]"
LOCATION = "[YOUR-LOCATION]"


def check_all_corpuses_health(project_id: str, location: str):
    """
    Connects to the RAG service, finds ALL corpuses,
    and lists all indexed files for each one, providing a total count and size.
    """
    initializer.global_config.init(project=project_id, location=location)
    print(f"--- Checking Index Health for project {project_id} (location: {location}) ---")

    try:
        corpuses_pager = rag.list_corpora()
        corpuses_list = list(corpuses_pager)
        if not corpuses_list:
            print("No corpuses found.")
            return

        # Loop through every corpus found.
        for corpus in corpuses_list:
            print("\n" + "=" * 40)
            print(f"Found Corpus: {corpus.display_name} ({corpus.name})")
            print("=" * 40)
            print("--- Listing Indexed Files ---")

            files = rag.list_files(corpus_name=corpus.name)
            file_count = 0
            total_bytes = 0
            for file in files:
                file_count += 1
                total_bytes += file.size_bytes
                print(f"- {file.display_name} ({humanize.naturalsize(file.size_bytes)})")

            print("-" * 30)
            print(f"Total Indexed Documents: {file_count}")
            print(f"Total Indexed Size: {humanize.naturalsize(total_bytes)}")
            print("-" * 30)

    except Exception as e:
        print(f"An error occurred: {e}")


if __name__ == "__main__":
    check_all_corpuses_health(PROJECT_ID, LOCATION)

This has been fixed using the Python SDK and/or curl. It still may not work if you do the document import from GCS from within Google's UI when setting up the Vertex AI RAG Engine (I am not sure), but as long as it works some way, that's good enough.

Thank you!

Here is the working code in case anyone wants it:

from vertexai import rag
from vertexai.generative_models import GenerativeModel, Tool
import vertexai

# Create a RAG corpus, import files, and generate a response.
# This works, as does the curl version on the other computer, found at the ticket:
# https://console.cloud.google.com/support/cases/detail/v2/62191532?hl=en&inv=1&invt=Ab55Vw&project=book-1-gog&supportedpurview=project&rapt=AEjHL4OG_6Uajt1VyH49mV9UK6V7TexKAhLvwivbIx29bWV_hDgdiuxtAOoa3ZQ1919K5feuWIGayjc7srUVmiCqw90zK0YUNuwGE9d3EQAENj3Yv1nXItw

# PROJECT_ID = "your-project-id"
# display_name = "test_corpus"
paths = ["gs://buck-002"]  # Supports Google Cloud Storage and Google Drive links

# Initialize the Vertex AI API once per session
vertexai.init(project="book-1-gog", location="us-east4")

# Create the RagCorpus.
# Configure the embedding model, for example "text-embedding-005".
embedding_model_config = rag.RagEmbeddingModelConfig(
    vertex_prediction_endpoint=rag.VertexPredictionEndpoint(
        publisher_model="publishers/google/models/text-embedding-005"
    )
)

rag_corpus = rag.create_corpus(
    display_name="rag-corpus1",
    backend_config=rag.RagVectorDbConfig(
        rag_embedding_model_config=embedding_model_config
    ),
)

# Import files into the RagCorpus
rag.import_files(
    rag_corpus.name,
    paths,
    # Optional
    transformation_config=rag.TransformationConfig(
        chunking_config=rag.ChunkingConfig(
            chunk_size=512,
            chunk_overlap=100,
        ),
    ),
    max_embedding_requests_per_min=1000,  # Optional
)

# Direct context retrieval
rag_retrieval_config = rag.RagRetrievalConfig(
    top_k=1,  # Optional
    filter=rag.Filter(vector_distance_threshold=0.5),  # Optional
)

response = rag.retrieval_query(
    rag_resources=[
        rag.RagResource(
            rag_corpus=rag_corpus.name,
            # Optional: supply IDs from `rag.list_files()`.
            # rag_file_ids=["rag-file-1", "rag-file-2", ...],
        )
    ],
    text="What is RAG and why is it helpful?",
    rag_retrieval_config=rag_retrieval_config,
)
print(response)

# Enhance generation: create a RAG retrieval tool
rag_retrieval_tool = Tool.from_retrieval(
    retrieval=rag.Retrieval(
        source=rag.VertexRagStore(
            rag_resources=[
                rag.RagResource(
                    rag_corpus=rag_corpus.name,  # Currently only 1 corpus is allowed.
                    # Optional: supply IDs from `rag.list_files()`.
                    # rag_file_ids=["rag-file-1", "rag-file-2", ...],
                )
            ],
            rag_retrieval_config=rag_retrieval_config,
        ),
    )
)

# Create a Gemini model instance with the retrieval tool attached
rag_model = GenerativeModel(
    model_name="gemini-2.0-flash-001", tools=[rag_retrieval_tool]
)

# Generate a response grounded in the corpus
response = rag_model.generate_content("Who is Kenji?")
print(response.text)