Looker Studio alert processes
There are a number of ways we monitor the quality and integrity of the GOV.UK GA4 data. This page details the data alerts process which sends GA4 admins a Looker Studio dashboard to alert them to the presence of unwanted data in GOV.UK GA4.
Once a week an email is generated by Looker Studio with a link to a dashboard containing 7 days of data containing potential PII. This email is sent to the GOV.UK GA4 Alerts mailbox.
How has the alerts process been set up?
There is a scheduled query in the gds-bq-reporting project of BigQuery which runs every afternoon. This query checks a number of fields in the GA4 flattened table for potential PII.
The result of this query is appended daily to a table called pii_checks which sits within the ga4_data_checks dataset in the gds-bq-reporting project.
A Looker Studio dashboard is connected to this table and a separate page of the dashboard displays the last 7 days of data. A scheduled email has been set up from this page.
Scheduled query
Below is the code for the scheduled query which is running every day and appending data to a table:
WITH
  combined AS (
  SELECT
    event_date,
    unique_session_id,
    page_location,
    page_title,
    page_referrer,
    query_string,
    search_term,
    LOWER(CONCAT(page_location, ' ', page_title, ' ', page_referrer, ' ', query_string, ' ', search_term)) AS combined_text
  FROM
    `ga4-analytics-352613.flattened_dataset.partitioned_flattened_events`
  WHERE
    event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) )
SELECT
  event_date,
  unique_session_id,
  page_location,
  page_title,
  page_referrer,
  query_string,
  search_term
FROM
  combined
WHERE
  REGEXP_CONTAINS(combined_text, r'\bmy (name|email( address)?|address|postcode|(contact |phone |tel |telephone |mobile )?number|utr( number)?|passport number|date of birth|dob|d\.o\.b)\s?(is|:|-)\b')
  OR REGEXP_CONTAINS(combined_text, r'\b07\d{9}\b')
  OR REGEXP_CONTAINS(combined_text, r'\b0[1246]\d{8,9}\b')
  OR REGEXP_CONTAINS(combined_text, r'(@\s*|%40\s*)(gmail\.com|outlook\.com|yahoo\.com|icloud\.com)')
ORDER BY
  event_date
The code identifies potential PII that may have found its way into the page_location, page_title, page_referrer, query_string or search_term. There are a number of triggers for a row to be flagged such as ‘my name is’, something that looks like a UK mobile or landline number and personal emails that may have escaped the normal redaction process. In the case of telephone numbers, non-geographic numbers are not included as these are likely to be publically available numbers and not linked to an individual.
Checks were done on phone numbers detected by this REGEXP by comparing them against phone numbers on the GOV.UK website. To do this phone numbers were compared against a Knowledge Graph table in the public dataset called phone_numbers. The result of these checks was that most of the numbers being detected were not publically available on GOV.UK.
Looker Studio dashboard
The BigQuery table created by the scheduled query is used as a data source for the Looker Studio dashboard. The dashboard is filtered to the last 7 days on a separate page and this is sent to govuk-ga4-alerts@digital.cabinet-office.gov.uk once a week.
Investigations and actions
Once the email alert has been issued, the content of the Looker Studio dashboard can be evaluated to see whether the REGEXP has identified PII. If it has, then action can be taken to delete the data from GA4 and the BigQuery tables.