GOV.UK GA4 data quality
Google Analytics 4 (GA4) is used to collect data on the usage of GOV.UK.
Information on how to understand and use GOV.UK GA4 data can be found in the Analysis section of this site.
There are a few known issues with this data, which are detailed below.
Data quality notes and annotations
The GOV.UK GA4 data annotations Looker Studio report contains notes on issues with the GOV.UK GA4 data collection and outside events that may impact GOV.UK usage. More information on this report can be found on the GOV.UK GA4 annotations report page.
Data quality notes can be added by GDS staff using the GOV.UK GA4 annotations form.
Known issues with the GOV.UK GA4 data
Some larger scale issues which impact our data collection are not detailed in the above annotations reporting. Shorter term bugs and issues with our tracking can be found in the above Looker Studio.
Data quality variance and content over time
Data was first collected into this property on 23/09/22. The events captured have changed significantly over time, and so early data quality may be patchy.
The GOV.UK GA4 data annotations Looker Studio report contains information on when GOV.UK GA4 events were created.
Bot traffic
In GA4, traffic from known bots is automatically excluded.
We do not know how much other bot traffic is being recorded in our data.
Smokey tests
Some smoke tests are run on Production GOV.UK, and sometimes the smokey test bot triggers analytics data collection.
These can be found and removed from the data by using the ‘Test data filter name’ dimension, which will be populated with ‘Smokey’ if the data is from a smokey test IP address.
Users we miss
We know a chunk of our users do not accept analytics cookies so we do not collect any data from them.
Some users may also be using ad blockers which inhibit Google Analytics from functioning.
Incorrect event tracking
Assumptions in navigation tracking
In order to capture users clicking on links in ways other than the usual left or primary click, we set up navigation (link) tracking so that all clicks on links trigger a navigation event to fire.
This may mean some false positives in the data, as some right (secondary) clicks or other clicks may not necessarily lead to the user navigating through the link in question. For example, a user might be right clicking on a link to copy it or inspect it. If data users would like to exclude these clicks from their dataset, they can do so using the ‘method’ custom dimension.
This is also why some duplicate tracking may be occuring with the navigation and copy tracking, as users who right click and select to ‘Copy’ a link will trigger both navigation and copy events.
‘Error’ in form_complete tracking
form_complete events will fire on any view of the ‘results’ page. This means that multiple form_complete events can be triggered by the same user if they refresh the page or navigate away from it and back to it. Also, it is possible that some users could trigger a form_complete event without having actually responded to or submitted the form themselves, if they are sent the link to a results page. This is because we were not technically able to limit form_complete events to firing conditionally depending on users’ activity on other pages.
Further information on how and when form events are used can be found in our information on the GOV.UK GA4 data structure.
Truncated ecommerce (view_item_list) tracking on search results pages
Each ecommerce event (view_item_list) can have up to 200 items sent with it. When searches return more than 200 items, the items sent with each event will be truncated.
In some cases, we have had to limit the number of items sent with each event further, as the number of bytes being sent with each event was too high and was triggering error 413 (Payload Too Large). The maximum amount of bytes we can send to GA4 per event appears to be 16KB, so we have implemented a limit to cut off the ecommerce items array at 15KB, to ensure our events are small enough to send (the additional KB may be needed for other information sent with the event).
Incorrect information in dimensions
Parameter character count limits
GA4 limits all parameters (custom dimensions) to a maximum length of 500 characters (for 360 properties).
The only exception to this is the page_location parameter which must be 1,000 characters or fewer.
Inconsistencies in ‘outbound’ values in attachment events
On pages with attachment links, clicks on different links to download files come through with the outbound
dimensions equalling ‘true’ as would be expected (as these files are hosted on https://assets.publishing.service.gov.uk).
However, clicks on the preview link (to ‘View online’) come through with the outbound
value of ‘false’ even through the preview is also hosted on https://assets.publishing.service.gov.uk.
This is because in the source HTML the second link only has the page path, and is being redirected to the assets domain.
Issues with users accessing GOV.UK in different languages
We capture what language a page was written in in the content_language
dimension. The majority of pages on GOV.UK are written in English, although there are a few pages in Ukrainian, Russian, and other languages.
Separately, the user’s browser language is captured in the language
dimension.
However, there are a number of ways users can translate page content - for example, using browser add-ons.
If the browser is translating the page content after the page_view
event is sent, then the page_view
will be sent with details in the original language (in most cases, English), though the text and other dimensions sent with subsequent interactions on that page might be translated.
There are a few dimensions that we have hard-coded in English to make them easier to analyse, for example the ‘section’ value on related content link clicks, but in most cases this was not possible due to the way content is surfaced via the Content API on GOV.UK.
Issues with link domain information in navigation tracking on pages with mistyped URLs
When there is an extra / in the URL, the ‘Link domain’ information is incorrect, coming through as the first part of the path instead of the domain.
This can be seen on the live site, for example if you interact with the Contents links on the https://www.gov.uk//guidance/cost-of-living-payment#low-income-benefits-and-tax-credits-cost-of-living-payment-eligibility
page.
Strictly, this URL should not be valid, but it (and many other incorrect URLs) do work to load content on GOV.UK.
Use of URLs like this is rare and so this should not cause too much of a data quality issue.
Issues with publication and update dates
The first_published_at
, public_updated_at
and updated_at
dates sent with page view events may be misleading.
This is particularly likely to be the case for content items published between 11pm and 1am (an hour either side of midnight) depending on whether the item is published during Greenwich Mean Time (GMT) or British Summer Time (BST). This is because to extract the date, which we record in the custom dimension, we strip out the time information from the Content API timestamp to leave the date in YYYY-MM-DD format.
Previous work looking into timestamps associated with Whitehall Publisher CSVs identified that the Content API timestamps are in GMT, so for example an item timestamped as ‘2014-08-31 23:00:00’ is actually displayed on the page as ‘1 September 2014’ (published at midnight BST).
We have not yet investigated whether this has an impact on these GA4 dimensions.