Hello,
We are building a custom metadata solution by programmatically exporting metadata from our Dataplex Universal Catalog into BigQuery using Python. We have had great success with parts of the process, but are completely blocked on retrieving the links between glossary terms and BigQuery columns, as well as any attached Aspects (Tags).
Our Goal: To create a complete mirror image of our data catalog in BigQuery. This involves three stages:
- Exporting the full Business Glossary (terms, categories, descriptions).
- Exporting the full technical asset inventory (Lakes, Zones, BigQuery Datasets, Tables, and Column Schemas).
- Exporting the links that connect a specific Glossary Term to a specific BigQuery column, along with any custom Aspects (Tags) attached to these assets.
What Works:
Using a hybrid approach of the Dataplex REST API and the BigQuery API, we can successfully complete the first two stages:
- We can list all glossaries and their terms via the Dataplex REST API (the .../glossaries and .../terms endpoints), as sketched below.
- We can discover all Lakes, Zones, and BigQuery Dataset assets via the Dataplex REST API.
- We can extract detailed column schemas for each table using BigQuery's INFORMATION_SCHEMA.
- We successfully load all of this data into corresponding tables in BigQuery.
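For context, here is a minimal sketch of how stages 1 and 2 reduce to API calls. It is illustrative rather than our production code: it assumes the v1 list endpoints named above with standard list-response field names, ignores pagination, and uses my_dataset as a placeholder dataset name.

# Sketch of Stage 1 (glossary via Dataplex REST) and Stage 2 (schemas via
# BigQuery INFORMATION_SCHEMA). Pagination and error handling omitted.
import google.auth
from google.auth.transport.requests import AuthorizedSession
from google.cloud import bigquery

credentials, _ = google.auth.default(
    scopes=['https://www.googleapis.com/auth/cloud-platform']
)
session = AuthorizedSession(credentials)

BASE = "https://dataplex.googleapis.com/v1"
parent = "projects/your-main-project-id/locations/global"

# Stage 1: every glossary under the parent, then every term in each glossary.
# Assumes the list responses use the standard "glossaries"/"terms" fields.
glossaries = session.get(f"{BASE}/{parent}/glossaries").json().get("glossaries", [])
for glossary in glossaries:
    terms = session.get(f"{BASE}/{glossary['name']}/terms").json().get("terms", [])
    for term in terms:
        print(term["name"], term.get("displayName", ""))

# Stage 2: column schemas straight from INFORMATION_SCHEMA.COLUMNS.
bq = bigquery.Client(project="your-source-data-project-id")
sql = """
    SELECT table_name, column_name, data_type
    FROM `your-source-data-project-id.my_dataset.INFORMATION_SCHEMA.COLUMNS`
"""
for row in bq.query(sql).result():
    print(row.table_name, row.column_name, row.data_type)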
What Doesn’t Work:
The crucial third stage fails. We are unable to programmatically find any links or Aspects, even though we have created them in the Dataplex UI and can see them there.
For example, in the UI, we can clearly see a Glossary Term “linea-biznesowa” linked to the campaign column in the AV_Meta_Ads_Raw.campaign table.
Our script, however, reports zero links found and zero tags found.
Our Approach & What We’ve Tried: Our script’s final stage attempts to find these links using the google-cloud-datacatalog library (v1beta1), as suggested in other community posts. We have tried multiple search strategies, all of which return empty results for links/tags:
- A broad search for all BigQuery assets, followed by a lookup_entry for each to check for a business_context (sketched below).
- A more precise, term-by-term search using the query business_term:"<full_term_path>".
- A search for any asset with a tag using has:tag.
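In condensed form, the three strategies look like this (same v1beta1 client as the full script below; the business_context check mirrors the field we look for on the returned entry, and the term path placeholder stands in for the real glossary term resource name):

from google.cloud import datacatalog_v1beta1 as datacatalog

client = datacatalog.DataCatalogClient()
scope = datacatalog.SearchCatalogRequest.Scope(
    include_project_ids=["your-main-project-id", "your-source-data-project-id"]
)

# Strategy 1: broad search for BigQuery entries, then lookup_entry on each
# and inspect it for a business_context. It is never populated for us.
for result in client.search_catalog(scope=scope, query="system=bigquery"):
    if not result.linked_resource:
        continue  # some results carry no linked resource to look up
    entry = client.lookup_entry(request={"linked_resource": result.linked_resource})
    print(entry.name, getattr(entry, "business_context", None))

# Strategy 2: precise, term-by-term search. Always returns zero results.
term_path = "<full_term_path>"  # the term resource name from Stage 1
for result in client.search_catalog(scope=scope, query=f'business_term:"{term_path}"'):
    print(result.relative_resource_name)

# Strategy 3: any entry carrying a tag. Also returns zero results.
for result in client.search_catalog(scope=scope, query="has:tag"):
    print(result.relative_resource_name)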
We have also confirmed the following:
- The Service Account has the Dataplex Administrator and Data Catalog Viewer roles on the main project, and BigQuery Data Viewer on the source data project, so permissions should not be the issue.
- All necessary APIs (Dataplex, Data Catalog, BigQuery) are enabled.
- The code runs in a clean Python virtual environment with up-to-date libraries.
- A direct lookup_entry on a table works, but fails on a column (as expected per the documentation), as illustrated below.
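Concretely, this minimal snippet shows the contrast (the linked-resource strings follow the standard //bigquery.googleapis.com format; the /columns/... suffix is just our attempt, since columns are not standalone Data Catalog entries):

from google.cloud import datacatalog_v1beta1 as datacatalog

client = datacatalog.DataCatalogClient()
table_resource = (
    "//bigquery.googleapis.com/projects/your-source-data-project-id"
    "/datasets/AV_Meta_Ads_Raw/tables/campaign"
)

# Works: tables are first-class Data Catalog entries.
entry = client.lookup_entry(request={"linked_resource": table_resource})
print(entry.name)

# Fails: columns are schema fields inside the table entry, not entries of
# their own, so there is nothing for lookup_entry to resolve here.
try:
    client.lookup_entry(request={"linked_resource": table_resource + "/columns/campaign"})
except Exception as e:
    print(f"Column lookup failed as expected: {e}")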
Our Question for the Google Team: Given that the links and Aspects are visible in the Dataplex UI but are undiscoverable via the search_catalog and list_tags methods in the v1beta1 API, could you please clarify:
- Is this a known limitation or a gap in the public API for the Dataplex Universal Catalog, which is currently in Preview?
- Is there an alternative, documented programmatic method to retrieve these term-to-column links and Aspects that we have missed?
- Is there an estimated timeline for when this functionality will be fully supported in the public API and client libraries?
Thank you for your help and insights.
Below is a sanitized version of our final script for reference:
import requests
import json
import pandas as pd
import google.auth
from google.auth.transport.requests import AuthorizedSession
from google.cloud import bigquery
from google.cloud import datacatalog_v1beta1 as datacatalog

# --- Configuration ---
MAIN_PROJECT_ID = "your-main-project-id"  # Project with Dataplex and metadata BQ dataset
SOURCE_DATA_PROJECT_ID = "your-source-data-project-id"
BIGQUERY_LOCATION = "EU"
LAKE_RESOURCE_LOCATION = "europe-central2"
GLOSSARY_LOCATION = "global"
BIGQUERY_DATASET = "governance_assets_dataset"
# ... table names ...
# --------------------

try:
    credentials, project = google.auth.default(
        scopes=['https://www.googleapis.com/auth/cloud-platform']
    )
    authed_session = AuthorizedSession(credentials)
except Exception as e:
    print(f"Authentication failed: {e}")
    credentials = None


def load_glossary_to_bigquery():
    # This stage works perfectly using the Dataplex REST API.
    # It fetches all glossaries and terms and loads them into a BQ table.
    print("STAGE 1: GETTING GLOSSARY - SUCCESS")
    # ... code for stage 1 ...
    all_terms = []  # Returns a list of all found term dicts
    return all_terms


def load_asset_inventory_to_bigquery():
    # This stage works perfectly using the Dataplex REST API and BQ INFORMATION_SCHEMA.
    # It discovers all lakes, zones, and BQ datasets, then gets all table/column schemas.
    print("STAGE 2: GETTING TECHNICAL ASSETS - SUCCESS")
    # ... code for stage 2 ...


def load_term_links_to_bigquery(all_terms):
    """
    ATTEMPT to find links between terms and assets.
    THIS IS THE STAGE THAT FAILS by returning zero results.
    """
    print("\nSTAGE 3: ATTEMPTING TO FIND TERM-ASSET LINKS")
    if not all_terms:
        print("--- No terms to check. Skipping Stage 3. ---")
        return

    catalog_client = datacatalog.DataCatalogClient(credentials=credentials)
    scope = datacatalog.SearchCatalogRequest.Scope(
        include_project_ids=[MAIN_PROJECT_ID, SOURCE_DATA_PROJECT_ID]
    )

    all_links = []
    print(f"--- Checking {len(all_terms)} terms for any linked assets... ---")
    for term in all_terms:
        term_path = term['term_full_path']
        # Precise query for each term
        query = f'business_term:"{term_path}"'
        search_results = catalog_client.search_catalog(scope=scope, query=query)
        for result in search_results:
            # This loop is never entered because search_results is always empty
            if result.search_result_subtype == 'entry.column':
                # ... logic to parse and collect links ...
                all_links.append({...})

    if not all_links:
        print("--- RESULT: No links found between terms and BigQuery columns. ---")
        return
    # ... code to load links to BQ ...


if __name__ == "__main__":
    if credentials:
        all_terms_list = load_glossary_to_bigquery()
        load_asset_inventory_to_bigquery()
        load_term_links_to_bigquery(all_terms_list)