Skip to main content

GOV.UK GA4 (BigQuery) backfill processes

Each morning we receive yesterday’s GOV.UK GA4 data in BigQuery.

The sending of this raw data is administered by Google, and on occasion the data can arrive relatively late in the day. When this happens, several processes we use to transform the data fail, and the transformed data needs to be backfilled when it does arrive.

This page aims to summarise the backfilling processes undertaken by the GOV.UK Insights and Analytics team when GOV.UK GA4 data arrives late.

Partitioned tables

The two partitioned tables built from GOV.UK GA4 data have been set up to automatically backfill, even if the raw data arrives late.

If previously processed data is incorrect and needs to be reprocessed, all that is required is for the partitions (days) including data requiring re-processing to be deleted. The backfill will then run automatically to reprocess the missing dates. If a large number of days need to be reprocessed, the cron schedule of the workflow in Dataform can be adjusted to that it runs more frequently and so that more days are processed in a set time.

More information on how these tables are generated can be read on the GOV.UK GA4 BigQuery data processing page, and the queries used can be viewed in Github.

Pogo-sticking dashboard

The data for the pogo-sticking dashboard relies on the flattened GOV.UK GA4 data arriving on time. If this is late, then it will need to be backfilled. Only steps 8, 9 and 10 need to be backfilled for the dashboard to function.

The process for backfilling the data for this dashboard is as follows:

  • First, delete the affected empty tables from steps 8, 9 and 10 in the gds-bq-data.GA4_PogoSticking dataset
  • Run the below queries (changing to the correct date at the beginning) and save the results with the correct table names to their corresponding tables in the aforementioned dataset. For example, if you were backfilling the data for step 9 for the 2nd of September 2024, you would run the backfill query and save the results to the ‘gds-bq-data’ project, the ‘GA4_PogoSticking’ dataset, and your table name would be ‘PogoStickStepNine_20240902’

GA4 flattened backfill process - replacing incorrectly processed data

This section outlines the process which we (the GOV.UK Insights and Analytics team) will follow in order to backfill our flattened GA4 production data. We are fully backfilling the data as opposed to selectively backfilling specific fields. This is following a comprehensive investigation into the feasibility and pros and cons of selectively backfilling only specific fields in the data. The investigation determined that there were more arguments in favour of completely reprocessing days containing incorrect data - the principal reasons being lower cost and reduced complexity.

The process itself

  • Delete the required data from the flattened partitioned table as required, in 60 day chunks. For example, using DML as below:
DELETE FROM `ga4-analytics-352613.flattened_dataset.partitioned_flattened_events`
WHERE event_date BETWEEN "2023-01-01" and "2023-03-02"
  • Set the scheduled frequency for the flattening Dataform pipeline to execute five times per hour. This can be done with the following custom schedule to run the pipeline on the hour, and at 15, 30, 45 and 50 minutes past, every hour: 0,15,30,40,50 * * * *

  • Run the below query to check that the data has been reprocessed, and that the row counts look correct for each day. If the dates have not been reprocessed, or the counts look incorrect, investigate before proceeding. The table this query checks against is a precalculated count of events per day in the flattened data. We would want to ensure the reprocessed counts are the same.

DECLARE backfill_start_date_check DATE DEFAULT '2023-01-01';
DECLARE backfill_end_date_check DATE DEFAULT '2024-03-02';

with cte_1 as (

SELECT event_date, count(*) as event_count from `ga4-analytics-352613.flattened_dataset.partitioned_flattened_events`
where event_date BETWEEN backfill_start_date_check AND backfill_end_date_check GROUP BY 1)

SELECT a.event_date, 
b.event_date as backfilled_date, 
a.event_count, 
b.event_count as backfilled_event_count, 
CASE WHEN a.event_date = b.event_date AND a.event_count = b.event_count 
THEN "Success!" ELSE "Error" 
END AS match_check
FROM `gds-bq-reporting.ga4_data_checks.flat_dates_event_counts` a 
JOIN cte_1 b on a.event_date = b.event_date 
WHERE a.event_date = b.event_date
ORDER BY a.event_date
  • Make a note of the latest dates to be reprocessed according to a progress tracker spreadsheet and continue the process

  • Once all backfilling has been completed, revert the flattening Dataform pipeline’s schedule back to the custom Cron schedule of: 0 8,9,10,11,12,14 * * *

This page was last reviewed on 11 April 2025. It needs to be reviewed again on 11 October 2025 .