GA4 data deletions
This page is a work in progress.
This page details the policy and process we follow when deleting production GA4 data.
When do we delete GA4 data?
We will delete GA4 data when it meets any of the following criteria:
- When we believe the data was collected without adequate user consent
- When data collected contains personally identifiable information and/or we are advised by our data protection and privacy specialists that it should be deleted
- When data collected is ‘wrong’, inaccurate or otherwise deemed to be unhelpful to data users, and it is: a. A significant enough number of events to be worth deleting or partially deleting (1 million events a day, or 10 million events in total), and b. Possible to delete in the GA4 interface and in BigQuery without deleting other accurate and useful data
With regards criterion 3a, we believe the number of events is a useful indicator of the impact the presence of this incorrect data is likely to have on data users (more events are more noticeable and more likely to appear in reporting, so probably more worth deleting) and of the costs we will be paying to store this data in BigQuery. We are currently using 1 million events a day, or 10 million events in total, as useful rules of thumb for what might count as a significant number of events. This is only to be used as a guide or supporting piece of information when considering data deletions however, and is not a strict threshold.
In the case of criterion 3b, we acknowledge the limitations of the GA4 data deletion feature, which means that it is not always possible to delete only a specific subset of our GA4 data. Depending on the ‘wrong’ nature of the data, it may be preferable to keep it if the alternative involves also deleting a large swathe of accurate and useful data. We also note that many products and other datasets rely on our GA4 BigQuery data, so in some cases it may be that the deletion of data in BigQuery is not worth it if it will not be hugely impactful (if it is data that is not heavily used) or, at the opposite end of the spectrum, if it will have a disruptive impact on other tools.
If the unwanted data collected is merely ‘wrong’ (criteria 3), we would likely not delete data from our raw GOV.UK GA4 dataset in BigQuery, only from our flattened and other processed datasets. This is because the risks associated with deleting raw data are not worth the benefits of removing incorrect data.
GA4 data deletion process
- Unwanted data is spotted and analysts/Analytics team are notified
- A fix is put in place to stop the unwanted data from being collected, if this hasn’t already been done
- Analysts figure out which datasets are impacted by the unwanted data collection, and determine using the criteria above whether it is possible and desirable to delete the unwanted data
- Relevant owners and/or Leads are informed of the recommendation on data deletion and asked to approve/feedback
- Unwanted data is deleted from the GA4 Data API using the GA4 data deletion feature, if this is possible and desirable
a. Data deletion request is scheduled. Data scheduled to be deleted must be at least 12 days old so the team may need to wait a fortnight to action this
b. Data deletion is checked during the preview/grace period
c. Successful data deletion is checked and confirmed after the deletion request shows as processed. A data deletion request can take anything from 7 to 63 days, so this may take some time - Unwanted data is deleted from the relevant BigQuery datasets, if this is possible and desirable, using a soft deletion or quarantine process. Note that data will only be deleted from the raw GA4 dataset if absolutely necessary (if the data was collected without consent or has to be deleted for data protection or privacy reasons) a. The data being deleted is moved out of the original source dataset into a quarantine dataset, and then deleted from the source dataset b. If the data deletion has been checked and confirmed to be successful then the quarantine dataset is also deleted
- A notice on the data deletion is shared with the analyst community/data users and/or added to the GA4 data annotations table
Owners and staff members’ responsibilities
For GOV.UK GA4 Production data:
- The Lead Performance Analyst in Data will own GA4 API data deletions and be responsible for signing off any data deletion requests
- The Lead Data Engineer in Data will own GA4 BigQuery data deletions and be responsible for signing off any data deletions
- Members of Data Services responsible for the upkeep and maintenance of GOV.UK GA4 will action data deletion requests on the GOV.UK GA4 data and communicate the impact to the wider analyst community