Subject: Unable to Programmatically Access Dataplex Universal Catalog Term-to-Column Links and Aspects/Tags via API

Hello,

We are building a custom metadata solution by programmatically exporting metadata from our Dataplex Universal Catalog into BigQuery using Python. We have had great success with parts of the process, but are completely blocked on retrieving the links between glossary terms and BigQuery columns, as well as any attached Aspects (Tags).

Our Goal: To create a complete mirror image of our data catalog in BigQuery. This involves three stages:

  1. Exporting the full Business Glossary (terms, categories, descriptions).

  2. Exporting the full technical asset inventory (Lakes, Zones, BigQuery Datasets, Tables, and Column Schemas).

  3. Exporting the links that connect a specific Glossary Term to a specific BigQuery column, along with any custom Aspects (Tags) attached to these assets (a sketch of the target links table follows this list).
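To make Stage 3 concrete, here is a minimal sketch of the BigQuery links table we plan to populate; the table and column names are our own working choice, not anything prescribed by Dataplex:

from google.cloud import bigquery

# Target table for Stage 3 (names are our own working convention).
bq = bigquery.Client(project="your-main-project-id")
links_table = bigquery.Table(
    "your-main-project-id.governance_assets_dataset.term_column_links",
    schema=[
        bigquery.SchemaField("term_full_path", "STRING"),
        bigquery.SchemaField("bq_project", "STRING"),
        bigquery.SchemaField("bq_dataset", "STRING"),
        bigquery.SchemaField("bq_table", "STRING"),
        bigquery.SchemaField("bq_column", "STRING"),
        bigquery.SchemaField("aspect_or_tag", "STRING"),
    ],
)
bq.create_table(links_table, exists_ok=True)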

What Works: :+1: Using a hybrid approach of the Dataplex REST API and the BigQuery API, we can successfully complete the first two stages (a rough sketch of these calls follows the list below):

  • We can list all glossaries and their terms via the Dataplex REST API (.../glossaries and .../terms endpoints).

  • We can discover all Lakes, Zones, and BigQuery Dataset assets via the Dataplex REST API.

  • We can extract detailed column schemas for each table using BigQuery’s INFORMATION_SCHEMA.

  • We successfully load all of this data into corresponding tables in BigQuery.
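For reference, a minimal sketch of what the working Stage 1 and Stage 2 calls look like on our side (project, dataset, and location names are placeholders; the glossary endpoints reflect the Preview REST surface we are calling):

import google.auth
from google.auth.transport.requests import AuthorizedSession
from google.cloud import bigquery

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

# Stage 1: list glossaries and their terms via the Dataplex REST API.
parent = "projects/your-main-project-id/locations/global"
glossaries = session.get(
    f"https://dataplex.googleapis.com/v1/{parent}/glossaries"
).json().get("glossaries", [])
for glossary in glossaries:
    terms = session.get(
        f"https://dataplex.googleapis.com/v1/{glossary['name']}/terms"
    ).json().get("terms", [])

# Stage 2: column schemas via BigQuery INFORMATION_SCHEMA.
bq = bigquery.Client(project="your-main-project-id")
rows = bq.query(
    "SELECT table_name, column_name, data_type "
    "FROM `your-source-data-project-id.your_dataset.INFORMATION_SCHEMA.COLUMNS`"
).result()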

What Doesn’t Work: :-1: The crucial third stage fails. We are unable to programmatically find any links or Aspects, even though we have created them in the Dataplex UI and can see them there.

For example, in the UI, we can clearly see a Glossary Term “linea-biznesowa” linked to the campaign column in the AV_Meta_Ads_Raw.campaign table.

Our script, however, reports zero links found and zero tags found.

Our Approach & What We’ve Tried: Our script’s final stage attempts to find these links using the google-cloud-datacatalog library (v1beta1), as suggested in other community posts. We have tried multiple search strategies (sketched after the list below), all of which return empty results for links/tags:

  1. A broad search for all BigQuery assets, followed by a lookup_entry for each to check for a business_context.

  2. A more precise, term-by-term search using the query business_term:"<full_term_path>".

  3. A search for any asset with a tag using has:tag.
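For reference, strategies 1 and 3 look roughly like the sketch below (strategy 2 is shown in the full script at the end of this post; project IDs are placeholders):

from google.cloud import datacatalog_v1beta1 as datacatalog

client = datacatalog.DataCatalogClient()
scope = datacatalog.SearchCatalogRequest.Scope(
    include_project_ids=["your-main-project-id", "your-source-data-project-id"]
)

# Strategy 1: broad search for BigQuery entries, then look up each entry
# so we can inspect it for any glossary/business context.
for result in client.search_catalog(scope=scope, query="system=bigquery"):
    entry = client.lookup_entry(
        request={"linked_resource": result.linked_resource}
    )
    # The entries come back fine, but we find no term links on them.

# Strategy 3: find any entry that carries a tag, then enumerate its tags.
for result in client.search_catalog(scope=scope, query="has:tag"):
    for tag in client.list_tags(parent=result.relative_resource_name):
        print(result.relative_resource_name, tag.template)
# In our runs this search returns no results at all.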

We have also confirmed the following:

  • The Service Account has Dataplex Administrator and Data Catalog Viewer roles on the main project, and BigQuery Data Viewer on the source data project. Permissions should not be an issue.

  • All necessary APIs (Dataplex, Data Catalog, BigQuery) are enabled.

  • The code runs in a clean Python virtual environment with up-to-date libraries.

  • A direct lookup_entry on a table works, but it fails on a column, as expected per the documentation (see the sketch after this list).
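A minimal sketch of that lookup (resource names are taken from the example above):

from google.cloud import datacatalog_v1beta1 as datacatalog

client = datacatalog.DataCatalogClient()

# Looking up the table entry by its BigQuery resource name works...
table_entry = client.lookup_entry(
    request={
        "linked_resource": (
            "//bigquery.googleapis.com/projects/your-source-data-project-id"
            "/datasets/AV_Meta_Ads_Raw/tables/campaign"
        )
    }
)

# ...but there is no separate per-column resource to look up, so the
# term-to-column links never surface through this call.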

Our Question for the Google Team: Given that the links and Aspects are visible in the Dataplex UI but are undiscoverable via the search_catalog and list_tags methods in the v1beta1 API, could you please clarify:

  1. Is this a known limitation or a gap in the public API for the Dataplex Universal Catalog, which is currently in Preview?

  2. Is there an alternative, documented programmatic method to retrieve these term-to-column links and Aspects that we have missed?

  3. Is there an estimated timeline for when this functionality will be fully supported in the public API and client libraries?

Thank you for your help and insights.

Below is a sanitized version of our final script for reference:

import requests
import json
import pandas as pd
import google.auth
from google.auth.transport.requests import AuthorizedSession
from google.cloud import bigquery
from google.cloud import datacatalog_v1beta1 as datacatalog

# --- Configuration ---
MAIN_PROJECT_ID = "your-main-project-id"  # Project with Dataplex and metadata BQ dataset
SOURCE_DATA_PROJECT_ID = "your-source-data-project-id"
BIGQUERY_LOCATION = "EU"
LAKE_RESOURCE_LOCATION = "europe-central2"
GLOSSARY_LOCATION = "global"
BIGQUERY_DATASET = "governance_assets_dataset"
# ... table names ...
# --------------------

try:
    credentials, project = google.auth.default(
        scopes=['https://www.googleapis.com/auth/cloud-platform']
    )
    authed_session = AuthorizedSession(credentials)
except Exception as e:
    print(f"Authentication failed: {e}")
    credentials = None

def load_glossary_to_bigquery():
    # This stage works perfectly using the Dataplex REST API.
    # It fetches all glossaries and terms and loads them to a BQ table.
    print("STAGE 1: GETTING GLOSSARY - SUCCESS")
    # ... code for stage 1 ...
    all_terms = [] # Returns a list of all found term dicts
    return all_terms

def load_asset_inventory_to_bigquery():
    # This stage works perfectly using Dataplex REST API and BQ INFORMATION_SCHEMA.
    # It discovers all lakes, zones, and BQ datasets, then gets all table/column schemas.
    print("STAGE 2: GETTING TECHNICAL ASSETS - SUCCESS")
    # ... code for stage 2 ...

def load_term_links_to_bigquery(all_terms):
    """
    ATTEMPT to find links between terms and assets.
    THIS IS THE STAGE THAT FAILS by returning zero results.
    """
    print(f"\nSTAGE 3: ATTEMPTING TO FIND TERM-ASSET LINKS")
    if not all_terms:
        print("--- No terms to check. Skipping Stage 3. ---")
        return

    catalog_client = datacatalog.DataCatalogClient(credentials=credentials)
    scope = datacatalog.SearchCatalogRequest.Scope(
        include_project_ids=[MAIN_PROJECT_ID, SOURCE_DATA_PROJECT_ID]
    )
    all_links = []

    print(f"--- Checking {len(all_terms)} terms for any linked assets... ---")
    for term in all_terms:
        term_path = term['term_full_path']
        # Precise query for each term
        query = f'business_term:"{term_path}"'
        
        search_results = catalog_client.search_catalog(scope=scope, query=query)
        
        for result in search_results:
            # This loop is never entered because search_results is always empty
            if result.search_result_subtype == 'entry.column':
                # ... logic to parse and collect links ...
                all_links.append({...})

    if not all_links:
        print("--- RESULT: No links found between terms and BigQuery columns. ---")
        return

    # ... code to load links to BQ ...

if __name__ == "__main__":
    if credentials:
        all_terms_list = load_glossary_to_bigquery()
        load_asset_inventory_to_bigquery()
        load_term_links_to_bigquery(all_terms_list)

Hi ArkadyZagdan,

Here’s a response to your questions:

  1. Is this a known limitation or a gap in the public API for the Dataplex Universal Catalog, which is currently in Preview?

    While this isn’t explicitly mentioned in the documented limitations, it appears that the relationships between specific glossary terms and BigQuery columns are not yet fully exposed through the API.

  2. Is there an alternative, documented programmatic method to retrieve these term-to-column links and Aspects that we have missed?

    Currently there is no documented programmatic method to retrieve these term-to-column links. However, you can export metadata through a metadata export job and store the results directly in Google Cloud Storage (a rough sketch follows at the end of this reply).

  3. Is there an estimated timeline for when this functionality will be fully supported in the public API and client libraries?

    You can monitor the release notes for new features and improvements, or consider filing a feature request, but note that there is no specific timeline for when this functionality will be available.
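As a rough sketch only (based on the export-metadata documentation; the exact field names and Preview behavior may differ), a metadata export job is created by POSTing a MetadataJob resource to the metadataJobs collection, with the job sent as the request body itself rather than wrapped in another object:

import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

project = "your-main-project-id"
location = "europe-central2"
url = (
    f"https://dataplex.googleapis.com/v1/projects/{project}/locations/{location}"
    "/metadataJobs?metadataJobId=export-job-001"
)

# The request body is the MetadataJob resource itself (no wrapper object);
# the output path and scope below are placeholders.
body = {
    "type": "EXPORT",
    "export_spec": {
        "output_path": "gs://your-export-bucket/metadata/",
        "scope": {"projects": [f"projects/{project}"]},
    },
}

response = session.post(url, json=body)
print(response.status_code, response.json())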

I am trying to programmatically trigger a Dataplex metadata export job for one of my Lakes, as described in the official documentation: https://cloud.google.com/dataplex/docs/export-metadata

However, every attempt to do this via the documented methods (gcloud and REST API) is failing, and I’m looking for clarification on the correct procedure.

My Goal: To create a script that programmatically triggers a full metadata export for a specific Dataplex Lake to a GCS bucket. The ultimate goal is to process these export files to get a complete metadata catalog, including the term-to-column links that are visible in the Dataplex UI but not available via the Data Catalog search API.

What I’ve Tried & The Errors:

I’ve been working inside the Google Cloud Shell to eliminate any local environment issues. My Lake (lakeid) is in the europe-central2 region.

  1. Attempt 1: gcloud dataplex metadata-jobs create

    • Following the documentation, I tried to use the gcloud dataplex metadata-jobs create command.

    • Error: The command fails with an unrecognized arguments error for flags like --lake and --gcs-uri. The gcloud tool appears to expect a different command structure than the documentation implies for this job type.

  2. Attempt 2: Dataplex REST API (.../lakes/{lake-id}:exportMetadata)

    • I tried making a direct REST POST request to the custom action endpoint.

    • Error: The API returns a 404 Not Found error, indicating that this endpoint does not exist for my Lake in the europe-central2 region.

  3. Attempt 3: Dataplex REST API (Creating a metadataJob resource)

    • I also tried creating a metadataJob resource directly via a POST request to the .../metadataJobs endpoint.

    • Error: This fails with a 400 Invalid JSON payload error. Depending on how I structure the body, the API first complains about missing required fields and then reports that the wrapper object (metadata_job) is an unknown field.

My Question:

Given that the UI does not have an “Export” button for my Lake, and the gcloud and REST API methods are failing with fundamental errors, what is the correct, currently supported programmatic method for triggering a full metadata export from a Dataplex Lake?

It appears there may be a discrepancy between the documentation and the actual behavior of the API/gcloud for this Preview feature. Any clarification on the correct gcloud command syntax or REST API endpoint and body structure would be greatly appreciated.

I’ve also tried using Gemini 2.5 in Cloud Shell. It was an interesting and fun attempt, though not a fruitful one.