Vertex AI RAG Engine file metadata and metadata filtering

I’m trying to import files into RAG Engine from GCS. I can see in the docs for RAG Engine import that I can set rag_file_metadata_config. I want to use inline_metadata_schema_source and inline_metadata_source, but I can’t find any examples or docs for what they should look like. Similarly, I can’t find any docs on what metadata.json should look like if I wanted to use a metadata file in GCS instead, though I’d prefer just using the inline options.

And for context retrieval filtering in RAG queries (rag_retrieval_config.filter.metadata_filter), are there docs on the metadata filter syntax? I tried CEL and got errors for tenant_id = "1", but a success for tenant_id == "1" (although no files were retrieved, because none of my files actually have metadata yet, since I haven’t figured out how to add it).

Is filtering my query in the way I want even possible? Should I be using Vertex AI Search instead? Or Gemini File Search? Why are there so many similar but slightly different RAG APIs?


Hello! You’ve hit a common pain point. The documentation for Vertex AI RagEngine’s metadata configuration is indeed quite sparse at the moment.

To answer your specific questions:

1. Inline Metadata Schema & Source

When using rag_file_metadata_config, the inline_metadata_source expects a mapping that matches your inline_metadata_schema_source.

  • Schema Source: Define the fields and their types (e.g., STRING, NUMBER).

  • Source: This is where you provide the actual key-value pairs.

Example Structure (JSON/Python SDK style):

rag_file_metadata_config = {
    "inline_metadata_schema_source": {
        "metadata_schema": {
            "fields": {
                "tenant_id": {"type": "STRING"},
                "category": {"type": "STRING"}
            }
        }
    },
    "inline_metadata_source": {
        "metadata_map": {
            "tenant_id": "1",
            "category": "legal"
        }
    }
}
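Since the schema and the values have to agree, it can help to build both from one dict and check them locally before calling import. This is only a minimal sketch in plain Python: it assumes the field layout shown above (which I have not verified against the official API reference), the helper names build_metadata_config and undeclared_keys are made up for illustration, and it treats every value as a STRING.

```python
import json


def build_metadata_config(values: dict[str, str]) -> dict:
    """Build a config dict in the shape sketched above (assumed, not verified).

    Every key in `values` is declared as a STRING field in the schema.
    """
    schema_fields = {key: {"type": "STRING"} for key in values}
    return {
        "inline_metadata_schema_source": {
            "metadata_schema": {"fields": schema_fields}
        },
        "inline_metadata_source": {"metadata_map": dict(values)},
    }


def undeclared_keys(config: dict) -> set[str]:
    """Return metadata keys that have a value but no schema entry."""
    declared = config["inline_metadata_schema_source"]["metadata_schema"]["fields"]
    provided = config["inline_metadata_source"]["metadata_map"]
    return set(provided) - set(declared)


config = build_metadata_config({"tenant_id": "1", "category": "legal"})
print(json.dumps(config, indent=2))
```

If undeclared_keys(config) returns a non-empty set, a value is being sent for a field the schema never declared, which is worth fixing before debugging anything on the service side.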

Note: If you use the GCS metadata.json method, it should be a JSONL file where each line corresponds to a file path in your bucket with its associated metadata.

2. Filter Syntax (CEL)

You are correct: RagEngine uses Common Expression Language (CEL). The reason tenant_id = "1" failed while == worked is that CEL uses C-style comparison operators: == tests equality, and a single = is not a valid operator in a CEL expression.

  • Correct Syntax: metadata.tenant_id == "1" (Ensure you prefix with metadata. if the engine requires the namespace).
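To avoid quoting mistakes when the filter value comes from user input, the filter string can be built with a small helper. This is a rough sketch in plain Python, not part of any SDK: cel_equals and cel_all are hypothetical names, and whether the metadata. prefix is needed is still an open question, so the namespace is left as an optional argument.

```python
import json


def cel_equals(field: str, value: str, namespace: str = "") -> str:
    """Return a CEL equality expression such as: tenant_id == "1"."""
    if not field.isidentifier():
        raise ValueError(f"not a simple field name: {field!r}")
    lhs = f"{namespace}.{field}" if namespace else field
    # json.dumps produces a double-quoted literal with embedded quotes and
    # backslashes escaped, which is also valid as a CEL string literal.
    return f"{lhs} == {json.dumps(value, ensure_ascii=False)}"


def cel_all(*clauses: str) -> str:
    """Combine clauses with CEL's logical AND operator."""
    return " && ".join(f"({clause})" for clause in clauses)


print(cel_equals("tenant_id", "1"))
print(cel_equals("tenant_id", "1", namespace="metadata"))
print(cel_all(cel_equals("tenant_id", "1"), cel_equals("category", "legal")))
```

The resulting string is what would go into rag_retrieval_config.filter.metadata_filter; trying both the prefixed and unprefixed form against a file that is known to have metadata is probably the quickest way to find out which one the engine expects.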

3. RagEngine vs. Vertex AI Search vs. Gemini File Search

This is the “billion-dollar question.” Here’s how I categorize them at Whitecyber Data Science Lab:

  • Gemini File Search (File API): Great for quick, short-lived sessions (up to 20MB per file). It’s “RAG-in-a-box” but lacks enterprise control.

  • Vertex AI Search (formerly Discovery Engine): The “Gold Standard” for enterprise RAG. It has built-in connectors, website indexing, and sophisticated UI. Use this if you want a managed, high-performance search experience.

  • RagEngine: Think of this as the “Lower Level” API. It gives you more control over the underlying vector database (like Vertex AI Vector Search) while still being managed. Use this if you need to integrate RAG deeply into a custom application and require specific metadata filtering that Vertex AI Search might abstract too much.

Recommendation:
If you are struggling with RagEngine’s metadata complexity, Vertex AI Search is often the smoother path because its metadata handling (via Schema mapping) is more mature and better documented.

Hope this helps you unblock your implementation!

Did this answer solve your problem?

@lk213 I was wondering if you were able to get metadata filtering working in RAG Engine and, if so, how did you do it?

Context: I’ve tried the suggestions in this thread up to the part where I run the RAG queries, but I get no results, and I have not found any material on how to check whether the metadata fields are correctly set, or even how to set them properly. It is worth mentioning that my RAG Engine is using RagManagedDb as a backend and not Vector Search.