GOV.UK GA4 BigQuery data processing
We process the raw GOV.UK GA4 data we export from Google Analytics 4 to BigQuery to better meet data users’ needs.
As well as transforming the data for specific products, we have created a partitioned nested version and a partitioned flattened version of this raw data. These tables, particularly the flattened one, should be easier and more efficient to query.
Partitioning and flattening GOV.UK GA4 data
The partitioning and flattening of the GOV.UK GA4 data takes place in the `gds-bq-reporting` Google Cloud project and uses Dataform.
The `gds-bq-reporting` project was used to ensure that this processing did not count towards the quotas we have set on the primary GA4 Analytics project (`ga4-analytics-352613`).
A single Dataform repository contains all of the code for the partitioning and flattening of GOV.UK GA4 data, as well as for the generation of a table of summary metrics. The workflow is scheduled to run six times a day - at every hour between 8am and 12pm, and then again at 2pm. This ensures that yesterday’s raw data is processed promptly, regardless of when it arrives, and means that any historic gaps in the data are backfilled quickly (the code processes one day of data each time it runs).
The full code can be viewed in GitHub.
Partitioning
This process takes the raw date-sharded GOV.UK GA4 table (with shards suffixed ‘YYYYMMDD’) and duplicates the data into a new partition in the `ga4-analytics-352613.analytics_330577055.partitioned_events` table.
Partitioning is done on `event_date`, which means that this field is transformed from a string into a date. The table is also clustered on `event_name`.
This data is available to query and use, but is still nested and so not as easy to use or efficient to query as the flattened data.
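For example, a query like the following (the date and event name are illustrative) filters on the `event_date` partitioning column so that BigQuery only scans a single day’s partition, and on the `event_name` clustering column to cut the bytes scanned further:

```sql
-- Count one event type on one day; the partition and cluster filters
-- keep the amount of data scanned (and so the query cost) down
SELECT
  event_name,
  COUNT(*) AS event_count
FROM `ga4-analytics-352613.analytics_330577055.partitioned_events`
WHERE event_date = DATE '2024-01-01'
  AND event_name = 'page_view'
GROUP BY event_name;
```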
Flattening
The flattened GA4 data is created from the partitioned nested table.
This is also partitioned on `event_date` and clustered on `event_name`, and is saved as `ga4-analytics-352613.flattened_dataset.partitioned_flattened_events`.
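As the flattened table has no nested fields, you can query it without any `UNNEST` calls. A minimal sketch (the date and event name are illustrative; check the table’s schema in BigQuery for the full list of flattened columns):

```sql
-- Preview one day of flattened page_view events; no UNNEST needed
SELECT *
FROM `ga4-analytics-352613.flattened_dataset.partitioned_flattened_events`
WHERE event_date = DATE '2024-01-01'
  AND event_name = 'page_view'
LIMIT 100;
```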
Summary tables
Tables of high-level summary metrics are also generated from the `partitioned_flattened_events` table, and saved in the `ga4-analytics-352613.flattened_dataset.page_summaries` table.
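The summary metrics themselves are not documented here, so a safe first step is to preview a few rows and inspect the table’s schema:

```sql
-- Have a quick look at what the summary table contains
SELECT *
FROM `ga4-analytics-352613.flattened_dataset.page_summaries`
LIMIT 10;
```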
Making changes to the processing
Manually running the pipeline
The partitioning and flattening workflow is scheduled to run several times a day, but there may be occasions when you need to force it to run.
To perform a manual execution, you should click on the ‘Start execution’ button next to the ‘Release configurations’ heading under the ‘Releases and scheduling’ tab in the Dataform repository. You will need to select the ‘production’ release configuration and ‘All actions’ to run.
Updating the partitioned table schema
Periodically there may be changes required to the schema of the partitioned table or the partitioned flattened table. This could be due to Google changing the schema of the data it sends to BigQuery, or due to changes to our GOV.UK GA4 implementation having an impact on the fields we want in the dataset. These changes will cause the processing to fail.
Simple changes to the schema (adding new, non-nested fields) can be achieved in the BigQuery interface. More complicated changes may require use of the command line.
Should you need to update the schema of the `partitioned_events` table (which duplicates the schema of the raw nested data), you will have to produce a copy of the raw nested schema, edit it, and save it as the schema of the partitioned nested table.
This command produces a schema file based on the latest raw nested GOV.UK GA4 data (you will need to change YYYYMMDD to reflect yesterday’s date):
```sh
bq show --schema --format=prettyjson ga4-analytics-352613:analytics_330577055.events_YYYYMMDD > new_schema.json
```
You will then need to manually change the type of `event_date` from `STRING` to `DATE` (as this needs to be a date for the partitioning) and save the modified file as the schema of the partitioned nested table:
```sh
bq update ga4-analytics-352613:analytics_330577055.partitioned_events new_schema.json
```
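Before re-running anything, you can check that the type change has been applied by querying the dataset’s `INFORMATION_SCHEMA` (assuming you have read access to the dataset):

```sql
-- Confirm event_date is now a DATE in the partitioned table
SELECT column_name, data_type
FROM `ga4-analytics-352613.analytics_330577055.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'partitioned_events'
  AND column_name = 'event_date';
```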
The whole pipeline can then be manually re-run, as above, to confirm these changes have worked.
Updating the code
Periodically, changes may also be required to the flattening code - for example, when new custom dimensions are added and need flattening.
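The exact change will depend on the Dataform models in the repository, but flattening a new event-scoped custom dimension from the nested `event_params` array generally follows the standard GA4 unnesting pattern. A minimal sketch - the parameter name `my_new_dimension` is a placeholder, not a real GOV.UK field:

```sql
-- Extract a single event parameter from the nested event_params array;
-- use value.int_value or value.double_value for numeric parameters
SELECT
  event_date,
  event_name,
  (
    SELECT value.string_value
    FROM UNNEST(event_params)
    WHERE key = 'my_new_dimension'
  ) AS my_new_dimension
FROM `ga4-analytics-352613.analytics_330577055.partitioned_events`
WHERE event_date = DATE '2024-01-01';  -- illustrative date
```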
There are several steps to follow when making changes to the SQL in the Dataform workspace:
- Make your change to the SQL in the ‘Code’ tab of the Dataform development workspace
- Commit your change using the button in the top left of the ‘Code’ tab. Make sure to add a useful commit message. This saves your changes to the code in Dataform, but the code still needs to be synced with the GitHub repository before it can be run
- To send your changes to GitHub, push them, again using the button in the top left of the ‘Code’ tab in Dataform
- In the GitHub repository, create a pull request using the branch you have just pushed your changes to (the branch will likely be called ‘workspace’). You should get someone else to check and approve your changes before merging
- In Dataform, pull your changes from the default branch (‘main’), again using the button in the top left of the ‘Code’ tab
- Before you can execute the code, you will need to re-compile it. This can be done using the ‘New compilation’ button in the release configuration details, which you can find by clicking on the name of the release configuration (in this case, ‘production’) in the ‘Releases and scheduling’ tab. If you don’t need to manually execute the code now, you can skip this step, as the code is scheduled to re-compile every day at 7am before the regular workflow runs
- If you wish to manually re-run the code, you can now manually trigger a run using the ‘Start execution’ button next to the ‘Release configurations’ heading under the ‘Releases and scheduling’ tab
Other products
The data is also transformed in a variety of other ways for particular products (for example, for the Content Data app). More information on how these work can be found in the product pages.