Hi,
Goal: To export Google Analytics 4 events data to the GCS bucket on daily basis.
Approach: Exporting Table Data to BigQuery
Example:
EXPORT DATA
OPTIONS (
uri = 'gs://bucket/folder/*.csv',
format = 'CSV',
overwrite = true,
header = true,
field_delimiter = ';')
AS (
SELECT field1, field2
FROM mydataset.table1
ORDER BY field1
);
, as the GA data is nested the file format would be either Parquet or Avro.
Problem: As per the official document of [GA4] BigQuery Export schema, In the events table, “Not all devices on which events are triggered send their data to Analytics on the same day the events are triggered. To account for this latency, Analytics will update the daily tables (events_YYYYMMDD) with events for those dates for up to three days after the dates of the events. Events will have the correct time stamp regardless of arriving late. Events that arrive after that three-day window are not recorded”
I’ve checked the same in our analytics dataset - for today i.e 15th Feb 2023 the tables events_20230214, events_20230213, and events_20230212 the last modified date is of 15th Feb 2023, so GA is still appending data into the previous three days events table. In order to schedule a job that exports data from BQ to GCS in APPEND mode into respective date-wise folders, I am thinking of using the change history feature of BQ, only scanning data INSERTED on 15th Feb 2023
SELECT
*,
DATE(_CHANGE_TIMESTAMP) AS change_date
FROM
APPENDS(TABLE `project.dataset.events_20230214`, "2023-02-15 00:00:00.000000 UTC", "2023-02-15T23:59:59.999999 UTC")
UNION ALL
SELECT
*,
DATE(_CHANGE_TIMESTAMP) AS change_date
FROM
APPENDS(TABLE `project.dataset.events_20230213`, "2023-02-15 00:00:00.000000 UTC", "2023-02-15T23:59:59.999999 UTC")
UNION ALL
SELECT
*,
DATE(_CHANGE_TIMESTAMP) AS change_date
FROM
APPENDS(TABLE `project.dataset.events_20230212`, "2023-02-15 00:00:00.000000 UTC", "2023-02-15T23:59:59.999999 UTC")
GCS URI will be like
gs://bucket/table=events_20230212/change_date=20230215/*
gs://bucket/table=events_20230212/change_date=20230214/*
gs://bucket/table=events_20230212/change_date=20230213/*
Question: Is the above approach the right way to move ahead or is there any better alternate way?
Thanks