Insights and suggestions for determining the last consumption-based access date (i.e., the last time a table was read by a query) for BigQuery tables, excluding insertions or creations:
1. Use the audit log
The audit log (Cloud Audit Logs) is a comprehensive record of activity in your BigQuery project. You can track consumption-based access to your tables by filtering the Data Access log for the following entries:
- entries with a protoPayload.methodName of jobservice.jobcompleted whose completed job is a query job (in the legacy AuditData format, jobCompletedEvent.eventName = "query_job_completed")
Once you have filtered the audit log down to query jobs, each entry's timestamp tells you when the access happened, and the tables the query read are listed in the job's statistics (referencedTables), so the most recent matching entry per table gives you its last consumption-based access date.
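As a minimal sketch (one way among several), the snippet below pulls matching Data Access entries with the google-cloud-logging client library; it assumes the legacy AuditData log format and credentials that can read the project's audit logs:
from google.cloud import logging as cloud_logging

# Connect to Cloud Logging for the current project
client = cloud_logging.Client()

# Completed query jobs recorded in BigQuery's Data Access audit log
log_filter = (
    'resource.type="bigquery_resource" '
    'AND protoPayload.methodName="jobservice.jobcompleted" '
    'AND protoPayload.serviceData.jobCompletedEvent.eventName="query_job_completed"'
)

# Newest entries first; each entry's timestamp is one query-based access, and
# the tables that query read are inside the entry payload under
# jobCompletedEvent.job.jobStatistics.referencedTables.
for entry in client.list_entries(filter_=log_filter, order_by=cloud_logging.DESCENDING, page_size=100):
    print(entry.timestamp)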
2. Use a custom script
You can also write a custom script to extract the last consumption-based access date for your tables. The Python sketch below shows one way to do this, assuming your audit logs are exported to a BigQuery dataset (here called audit_logs) via a log sink:
import pandas as pd  # to_dataframe() below requires pandas to be installed
from google.cloud import bigquery

# Connect to BigQuery
client = bigquery.Client()

# NOTE: this assumes Cloud Audit Logs are exported to BigQuery by a log sink,
# into a dataset called audit_logs with the default date-sharded tables named
# cloudaudit_googleapis_com_data_access_YYYYMMDD; adjust to match your export.
audit_table = f"{client.project}.audit_logs.cloudaudit_googleapis_com_data_access_*"

# For every table referenced by a completed query job in the past 30 days,
# take the most recent log entry timestamp.
query = f"""
SELECT
  CONCAT(ref.projectId, '.', ref.datasetId, '.', ref.tableId) AS table_name,
  MAX(timestamp) AS last_accessed
FROM
  `{audit_table}`,
  UNNEST(protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.referencedTables) AS ref
WHERE
  protopayload_auditlog.methodName = 'jobservice.jobcompleted'
  AND protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.eventName = 'query_job_completed'
  AND timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY
  table_name
"""

# Run the query and load the results into a dataframe
df = client.query(query).to_dataframe()

# Last consumption-based access date for each table
last_consumption_date_per_table = df.set_index('table_name')['last_accessed']

# Print the results
print(last_consumption_date_per_table)
This code will print output similar to the following (with your own table names and timestamps):
table_name
my_project.my_dataset.my_table    2023-10-10 11:33:00+00:00
Name: last_accessed, dtype: datetime64[ns, UTC]
3. Use a third-party tool
There are also a number of third-party tools that can help you to track consumption-based access to your BigQuery tables. These tools typically provide a more user-friendly interface and additional features, such as the ability to set alerts and generate reports.
Recommendation:
I recommend using the audit log to determine the last consumption-based access date for your tables. This is the most reliable and comprehensive method. You can either write a custom script to extract the data from the audit log, or you can use a third-party tool.
Once you have determined the last consumption-based access date for your tables, you can use this information to deprecate data pipelines generating tables that haven’t been accessed in the past X months.
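For example, a small check along these lines can produce the candidate list. It is only a sketch: it assumes the last_consumption_date_per_table series produced by the script above, a hypothetical 6-month threshold, and that you widen the script's 30-day window to cover the full look-back period:
import pandas as pd

# Hypothetical threshold: no query-based access in the past 6 months
cutoff = pd.Timestamp.now(tz="UTC") - pd.DateOffset(months=6)

# Tables whose most recent access is older than the cutoff are deprecation candidates
stale_tables = last_consumption_date_per_table[last_consumption_date_per_table < cutoff]
print(stale_tables.sort_values())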
Additional considerations:
- Keep in mind that the audit log can be large and expensive to query, especially for large projects. You may want to consider sampling the audit log, limiting how much of it each query scans (see the sketch after this list), or using a third-party tool that can pre-process the data for you.
- When deprecating data pipelines, be sure to carefully consider the impact on your downstream applications. You may need to notify users of the change and provide them with a migration plan.
- You may also want to consider implementing a data retention policy to ensure that your BigQuery tables are not deleted prematurely.
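On the first point, one way to keep the cost in check, assuming the same hypothetical date-sharded audit_logs export used in the script above, is to prune the shards scanned with a _TABLE_SUFFIX condition and dry-run the query to see how many bytes it would process before actually running it:
from google.cloud import bigquery

client = bigquery.Client()

# Scan only the last 30 daily shards of the exported Data Access log
# (assumes the same hypothetical audit_logs export as the script above)
query = f"""
SELECT timestamp
FROM `{client.project}.audit_logs.cloudaudit_googleapis_com_data_access_*`
WHERE _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
"""

# A dry run returns immediately with the estimated bytes processed, at no cost
job = client.query(query, job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False))
print(f"This query would scan {job.total_bytes_processed / 1e9:.2f} GB")
The same _TABLE_SUFFIX condition can be added to the WHERE clause of the script in section 2 so that it only scans the days you actually need.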