Hi,
I have created a Vertex AI RAG Engine in us-east4 (I have tried other regions also). I have imported documents from GCS into the RAG database. I have granted all appropriate permissions, including making the service accounts owners, just in case.
What happens is that the file names are obtained from GCS, but if I check the Vertex AI file sizes or index sizes, they are all 0 bytes.
I have tried many permutations of this, with different files and different parsers, and I have even tried a single plain UTF-8 file to make sure the ingestion parser wasn't hanging.
Nothing works. I always end up with a Vertex AI RAG Engine and agent that work in theory (I can ask them questions and they usually respond, though sometimes with an error), but they never answer with anything from my corpus, which is not surprising, since the data has not actually been ingested.
It appears that the data ingestion process is silently failing. This is easy to see by creating a RAG corpus and then asking it to tell you what it contains (see the program below).
What I do not understand is that this is the Vertex AI equivalent of “Hello World”. If it won’t index the data, nothing at all will work. So, how is anyone getting a Vertex AI RAG agent to work?
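For context, the import step I am running is essentially the sketch below (corpus ID, bucket path, and chunking values are placeholders, and I am assuming the `chunk_size`/`chunk_overlap` keyword form of `rag.import_files` from the preview SDK). My expectation is that the imported/skipped counts in the response would reveal any failure:

```python
from vertexai.preview import rag

# Placeholder resource name and GCS path -- substitute real values.
CORPUS_NAME = "projects/[YOUR-ID]/locations/[YOUR-LOCATION]/ragCorpora/[CORPUS-ID]"

response = rag.import_files(
    corpus_name=CORPUS_NAME,
    paths=["gs://[YOUR-BUCKET]/docs/"],
    chunk_size=512,      # placeholder chunking values
    chunk_overlap=100,
)

# The response reports per-file outcomes; a silently dropped file
# should show up as skipped rather than imported.
print(f"Imported: {response.imported_rag_files_count}")
print(f"Skipped:  {response.skipped_rag_files_count}")
```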
The program below gives me a list of each RAG corpus in my project and each file name, along with the file size and index size (which are always 0 bytes).
Any insight into what is going on would be greatly appreciated.
Thanks,
James
from vertexai.preview import rag
import vertexai
import humanize  # pip install humanize

# This program lists the Vertex AI RAG corpora under PROJECT_ID, along with
# their files and index sizes, as a sanity check that indexing is working.

# --- Your project details ---
PROJECT_ID = "[YOUR-ID]"
LOCATION = "[YOUR-LOCATION]"


def check_all_corpora_health(project_id: str, location: str):
    """
    Connects to the RAG service, finds all corpora, and lists the indexed
    files for each one, with a total count and size.
    """
    vertexai.init(project=project_id, location=location)
    print(f"--- Checking index health for project {project_id} (location: {location}) ---")
    try:
        corpora = list(rag.list_corpora())
        if not corpora:
            print("No corpora found.")
            return
        # Loop through every corpus found, not just the first.
        for corpus in corpora:
            print("\n" + "=" * 40)
            print(f"Found corpus: {corpus.display_name} ({corpus.name})")
            print("=" * 40)
            print("--- Listing indexed files ---")
            file_count = 0
            total_bytes = 0
            for file in rag.list_files(corpus_name=corpus.name):
                file_count += 1
                total_bytes += file.size_bytes
                print(f"- {file.display_name} ({humanize.naturalsize(file.size_bytes)})")
            print("-" * 30)
            print(f"Total indexed documents: {file_count}")
            print(f"Total indexed size: {humanize.naturalsize(total_bytes)}")
            print("-" * 30)
    except Exception as e:
        print(f"An error occurred: {e}")


if __name__ == "__main__":
    check_all_corpora_health(PROJECT_ID, LOCATION)