From Data to Context: Launching the Automated Metadata Generation API
In today’s data-driven world, understanding your data is more critical than ever. As data engineers, analysts, and stewards, you are constantly working with vast amounts of data in BigQuery. But how well do you truly understand it? How much time do you spend writing documentation, crafting ad-hoc queries for exploration, or simply trying to figure out what a table is all about?
What if you could automate this process?
Introducing the new Data Documentation scan type in Dataplex’s Data Scan API. This powerful feature, part of the Data Insights functionality in Dataplex and BigQuery, helps you automatically generate valuable metadata for your BigQuery tables. Think insightful SQL queries for data exploration, clear schema descriptions, and concise table-level summaries – all generated for you.
A Brief Overview of the DataScans API
The DataScans API comprises two key concepts: DataScans and DataScan Jobs. Think of them as the “what” and the “when” of data analysis.
DataScan: The Blueprint for Your Analysis
A DataScan is a resource that defines the configuration of a data analysis task. It’s essentially a blueprint that specifies:
- What to scan: The specific data asset to be analyzed, which can be a BigQuery table or dataset.
- How to scan: The type of analysis to be performed.
- When to scan: A DataScan can run on a recurring schedule (e.g., daily, weekly) or on demand. For the insights scans, only on-demand triggering is supported.
DataScan Job: The Execution of the Plan
A DataScan Job is the actual execution of a DataScan. Every time a DataScan is triggered, either by its schedule or manually, a new DataScan Job is created. This job represents a single, specific run of the analysis defined in the DataScan.
Key characteristics of a DataScan Job include:
- A unique identity: Each job has its own ID, allowing you to track its progress and results individually.
- A status: A job will progress through various states, such as running, succeeded, failed, or cancelled.
- Results: Once a job completes successfully, it produces a detailed report of its findings.
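For illustration, a completed job resource returned by the API looks roughly like this (an abridged sketch; the exact field set is defined by the Dataplex API reference):

{
  "name": "projects/your-project/locations/your-location/dataScans/your-scan-id/jobs/your-job-id",
  "type": "DATA_DOCUMENTATION",
  "state": "SUCCEEDED",
  "startTime": "2025-01-01T00:00:00Z",
  "endTime": "2025-01-01T00:05:00Z"
}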
Getting Started: Your 3-Step Guide
Ready to generate your first data documentation scan? Here’s a streamlined workflow to get you started.
Step 1: Prerequisites
Before you begin, make sure you have the following enabled:
- Enable the APIs: You’ll need to enable the Dataplex API and the Gemini API. To do this, you’ll need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin).
- Grant the necessary IAM roles: To create and manage scans, you’ll need the Dataplex DataScan Editor or Admin role. You’ll also need the BigQuery Data Viewer and BigQuery Job User roles on the tables you want to scan.
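If you prefer the command line, the setup can be sketched with gcloud. Note that the exact service name for the Gemini API is an assumption here (cloudaicompanion.googleapis.com, i.e., Gemini for Google Cloud), so verify it in the documentation:

# Enable the Dataplex API and the Gemini service (the Gemini service name is an assumption; verify in the docs).
gcloud services enable dataplex.googleapis.com cloudaicompanion.googleapis.com

# Grant the DataScan Editor role to the user who will create and manage scans.
gcloud projects add-iam-policy-binding your-project \
  --member="user:you@example.com" \
  --role="roles/dataplex.dataScanEditor"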
Pro-Tip: For best results, run a Dataplex data profile scan on your tables before generating documentation scans. This helps ground the generated content in the actual values present in your data, preventing hallucinations or approximations by the LLM.
Step 2: Create and Run Your Scan
Now for the fun part! You’ll use curl commands to interact with the API.
- Create the DataScan: This command creates the DataScan object, which acts as a template for your scan jobs.
alias gcurl='curl -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json"'

gcurl -X POST "https://dataplex.googleapis.com/v1/projects/your-project/locations/your-location/dataScans?dataScanId=your-scan-id" -d '{
  "data": {
    "resource": "//bigquery.googleapis.com/projects/your-project/datasets/your-dataset/tables/your-table"
  },
  "executionSpec": {
    "trigger": {
      "onDemand": {}
    }
  },
  "type": "DATA_DOCUMENTATION",
  "dataDocumentationSpec": {}
}'
- Run the Scan Job: Once the DataScan is created, you can trigger a job run.
gcurl -X POST https://dataplex.googleapis.com/v1/projects/your-project/locations/your-location/dataScans/your-scan-id:run
- This will return a unique job ID that you can use to track the status of the scan.
- Check the Job Status: You can check the status of your job run using the job ID; a sketch of the command follows this list.
- The job is complete when its state is SUCCEEDED or FAILED.
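Here’s a minimal sketch of that status check, assuming the job ID returned by the run call above (this uses the dataScans.jobs.get method):

# Fetch the job resource to inspect its state (e.g., RUNNING, SUCCEEDED, FAILED).
gcurl https://dataplex.googleapis.com/v1/projects/your-project/locations/your-location/dataScans/your-scan-id/jobs/your-job-id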
Step 3: View and Publish Your Insights
Once the job succeeds, you can retrieve the results.
- Get the Scan Results: Fetch the DataScan object, which now contains the results from the latest job run; a sketch of the command follows this list.
- The results, including the generated SQL queries and descriptions, will be in the dataDocumentationResult field.
- [Optional] Publish to BigQuery: To make the results visible in the BigQuery UI, you can publish them by attaching three labels to your BigQuery table (see the example after this list):
- dataplex-data-documentation-published-scan: <datascan_id>
- dataplex-data-documentation-published-project: <project_id>
- dataplex-data-documentation-published-location: <location>
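Here’s a sketch of the retrieval step, using a GetDataScan call with the FULL view so the response includes the latest job’s results:

# Fetch the DataScan; view=FULL includes the dataDocumentationResult from the latest job run.
gcurl "https://dataplex.googleapis.com/v1/projects/your-project/locations/your-location/dataScans/your-scan-id?view=FULL"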
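And for the optional publishing step, the labels can be attached with the bq CLI; the values below are illustrative placeholders:

# Attach the three publishing labels, one --set_label flag per label.
bq update \
  --set_label dataplex-data-documentation-published-scan:your-scan-id \
  --set_label dataplex-data-documentation-published-project:your-project \
  --set_label dataplex-data-documentation-published-location:your-location \
  your-project:your-dataset.your-table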
Why this matters for Data Practitioners
The Data Documentation scan is more than just a cool feature; it’s a powerful tool that can transform how you work with data.
- For Data Engineers: Automate the tedious process of documenting tables. Spend less time writing basic documentation and more time building robust data pipelines.
- For Data Analysts: Quickly get up to speed on new datasets. The generated queries provide a great starting point for your analysis, helping you uncover insights faster.
- For Data Stewards: Improve data governance and discoverability. With clear, automated documentation, you can ensure that your data assets are well-understood and properly used across the organization.
Start Exploring Your Data Today!
The Dataplex Data Insights API is a game-changer for anyone working with BigQuery. By automating the generation of data documentation, it frees you up to focus on what you do best: unlocking the value of your data.
Ready to give it a try? Head over to the documentation at https://cloud.google.com/bigquery/docs/data-insights to learn more and start generating insights for your BigQuery tables today.