Publishing GA4 Processed Form Data

Publishing form data

Our publishing apps collect a lot of form data. Each time a document is created, withdrawn, published, or edited in any way, multiple form submissions are involved. We collect the data from these form submissions across all of our publishing apps - provided they have been migrated to the design system and have implemented GA4 form tracking according to the specification in Publishing’s Implementation Guide. From this data we can extract lots of useful information, including when major publication actions occurred (e.g. publications, withdrawals), and we can put together a document’s change history.

In its raw format, the form data is extensive and complex. Because of this, we have created a dataform pipeline which processes the data from our partitioned publishing GA4 data, govuk-publishing.dataform.partitioned_flattened_events, and produces an output which compiles a document’s major publishing milestones and its change history. The output is split by several dimensions by which it can be examined, for example government department and a document’s content type.

Location

The outputs from the form processing pipeline can be found in the govuk-publishing.form_processing dataset in BigQuery which lives in the govuk-publishing GCP project.

There are 17 tables in total - one for each stage of the pipeline - but the most useful is the final output, which is used to power a number of Looker Studio dashboards: govuk-publishing.form_processing.17_output_joined_locale_counts.

If you would like to use/view this data but do not have access, please contact a GOV.UK Publishing performance analyst.

Output and using the data

The pipeline’s final output table, govuk-publishing.form_processing.17_output_joined_locale_counts, is based on publishing’s GA4 data and GOV.UK’s Knowledge Graph data. At the moment, the pipeline contains data from the Whitehall publishing app only.

The data can be used for a wide range of analysis tasks - from simply counting how many documents were edited within a given period (by counting content_id values), to exploring a document’s full change history using the publication_action_with_edits field.

It also supports more detailed investigations, such as identifying which specific parts of a document were changed in each edit submission via the change_note field. If you need to analyse validation errors, you can use the error_message field to see the exact text shown to users when an error is logged, or the error_message_grouped dimension, which categorises errors into broader groups for easier interpretation (e.g. link errors, missing topic tags).

All of these fields can be combined with various useful dimensions, depending on how you want to slice the data. For example, you can examine any of the above by department (user_organisation_name), document_type and sub_type, or by user_id.
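For example, a simple count of distinct documents edited per department over a given period could look something like the query below (a sketch only - the date range is illustrative and the ‘edit’ value follows the publication_action_with_edits description in the schema below):

SELECT
  user_organisation_name,
  COUNT(DISTINCT content_id) AS documents_edited
FROM `govuk-publishing.form_processing.17_output_joined_locale_counts`
WHERE event_date BETWEEN DATE '2025-11-01' AND DATE '2025-11-30'
  AND publication_action_with_edits = 'edit'
GROUP BY user_organisation_name
ORDER BY documents_edited DESC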

Output table schema

See the table’s schema below for a description of all of its fields.

field name type description
event_date DATE The date when the event was logged, formatted as YYYYMMDD
unique_session_id STRING The user_pseudo_id concatenated with the ga_session_id to create a unique session ID
user_organisation_name STRING The user’s organisation name (department) as inherited from their Signon account
event_name STRING The name of the event
page_location STRING The full URL with the protocol, hostname, page path and query string
event_text STRING Identifying text for the specific item the user interacted with
event_type STRING The type of feature the user interacted with e.g. accordion, footer etc
user_id STRING A unique, hashed UID assigned to an individual user. Inherited from Signon
event_timestamp TIMESTAMP The time (in microseconds, UTC) when the event was logged
publication_action STRING A field describing major document publishing actions based on form and flash notice event data - e.g. publications, withdrawals, attachments added
change_note STRING Records the changes made to a document as a comma-separated list for ‘Edit’ events. This is determined using window functions and comparing the submitted form values with the document’s previously submitted values
error_message STRING An event’s ‘text’ value when the publication_action is ‘Error message shown’
url STRING A document’s public URL (if it’s live) from the knowledge graph search.page table
publication_action_with_edits STRING Preferred over the publication_action field. This is the publication_action field, but where it previously returned NULL and the event_type was ‘edit’, it now returns ‘edit’
first_saved_timestamp TIMESTAMP A document’s earliest seen timestamp where the publication_action value is ‘First saved’
first_other_recorded_timestamp TIMESTAMP A document’s earliest seen timestamp where the publication_action value is not ‘Published’ or ‘Force published’, i.e. the earliest recorded timestamp seen in the data where the change to the document was not its publication
latest_published_timestamp TIMESTAMP A document’s latest recorded timestamp where the publication_action value is either ‘Published’ or ‘Force published’
content_id STRING A document’s content_id value. Used to identify individual documents. Stays consistent across editions; note that documents translated into different locales will also return the same value here
document_type STRING A document’s type (content type) as determined in a lookup table based on values contained within the page_location field. This is attached to a list of content_ids
sub_type STRING A document’s sub type as determined in a lookup table based on an event’s event_text. Also attached to a list of content_ids
locale_count INTEGER A count of a document’s locales, from the knowledge graph’s publishing_api.documents table. Use in conjunction with the content_id field to know whether you may be looking at edits to more than one ‘document’
multiple_locales STRING If locale_count is greater than 1, this will be ‘Yes’, else ‘No’
error_message_grouped STRING Groups error messages into bigger buckets, e.g. link errors, topic tags need adding

Example of use with additional GA4 usage data

This final output can also be combined with other GA4 usage data. For example, we have combined it with usage data which calculates the average time each user spends on Whitehall, creating a new dimension which classifies each user as a ‘Light’ or ‘Proficient’ user so that the data can be examined through that lens. This was done through the following Looker Studio custom query.

with data as (
 SELECT
 event_date,
  event_timestamp,
  user_id,
  unique_session_id,
  event_name,
  page_location
FROM
  `govuk-publishing.dataform.partitioned_flattened_events`
WHERE
  event_date >= PARSE_DATE('%Y%m%d', @DS_START_DATE) and event_date <= PARSE_DATE('%Y%m%d', @DS_END_DATE)
  AND user_id IS NOT NULL
  AND hostname LIKE '%whitehall%'
  and event_name = 'page_view'
  ),

-- marks the event order per session using ROW_NUMBER, gets a readable timestamp

  ordered_sessions as (

  SELECT *, 
  TIMESTAMP_MICROS(event_timestamp) as timestamp_readable, 
  ROW_NUMBER() OVER(PARTITION BY user_id,unique_session_id ORDER BY event_timestamp) as event_order 
  from data
  ORDER BY event_date, user_id, unique_session_id, event_order),

-- marks the first and last timestamps in each session with a 1 (uses window function for last timestamp)

  min_max_identifier as (

  SELECT 
  *, CASE WHEN event_order = 1 THEN 1 ELSE 0 END AS first_event,
  CASE WHEN event_order = MAX(event_order) OVER(PARTITION BY event_date, user_id,unique_session_id) THEN 1 ELSE 0 END AS last_event
  FROM ordered_sessions
  ORDER BY event_date, user_id, unique_session_id, event_order),


-- get the timestamps for these events now that they have been identified

  first_last_timestamps as (

SELECT
  event_date,
  user_id,
  unique_session_id,
  MAX(CASE WHEN first_event = 1 THEN timestamp_readable END) AS first_event,
  MAX(CASE WHEN last_event = 1 THEN timestamp_readable END) AS last_event
FROM min_max_identifier
GROUP BY event_date, user_id, unique_session_id),

-- use TIMESTAMP_DIFF to calculate the duration of each session

time_difference_calculated as (

SELECT *, 
ROUND(TIMESTAMP_DIFF(last_event,first_event,SECOND) / 60.0,2) AS session_duration_minutes 
from first_last_timestamps),

-- add these values up to get a total number of mins spent per day by user (using user_id)

user_time_by_day as (

SELECT event_date, 
user_id, 
round(sum(session_duration_minutes),2) as time_spent_summed 
from time_difference_calculated
group by event_date, user_id),

-- group the data into weeks rather than days, starting on whatever date you select in the date range

time_over_week as (

  SELECT
    PARSE_DATE('%Y%m%d', @DS_START_DATE)
      + 7 * DIV(
          DATE_DIFF(event_date, PARSE_DATE('%Y%m%d', @DS_START_DATE), DAY),
          7
        ) AS week_starting,
    user_id,
    ROUND(SUM(session_duration_minutes), 2) AS time_spent_summed
  FROM time_difference_calculated
  GROUP BY week_starting, user_id),

-- get a total number of weeks in the data (this is so when you work out the average, you're dividing all time spent by the same number of weeks)

  global_weeks AS (
  SELECT COUNT(DISTINCT week_starting) AS total_weeks
  FROM time_over_week
),

-- divide time spent by number of weeks to get average time per week (grouped by user)

joined as (

SELECT
  t.user_id,
  ROUND(SUM(t.time_spent_summed) / g.total_weeks, 2) AS average_time_per_week
FROM time_over_week t
CROSS JOIN global_weeks g
GROUP BY t.user_id, g.total_weeks),

-- adds a parameter to adjust the threshold (average minutes per week) that the CASE WHEN uses to classify a proficient vs light user

average_time_per_week_output as (

SELECT *, 
CASE WHEN average_time_per_week >= @avg_minutes_on_app_per_week THEN 'Proficient user' ELSE 'Light user' END AS use_level 
from joined)

SELECT a.*, b.use_level, b.average_time_per_week
FROM `govuk-publishing.form_processing.17_output_joined_locale_counts` a
LEFT JOIN average_time_per_week_output b ON a.user_id = b.user_id
WHERE a.event_date >= PARSE_DATE('%Y%m%d', @DS_START_DATE) and a.event_date <= PARSE_DATE('%Y%m%d', @DS_END_DATE)

Processing (dataform)

The pipeline is made up of 17 nominal steps, though there are a few parallel processes which run alongside the main processing thread and feed into the final output. A visual representation of how the data is processed can be found in the dataform ‘Compiled graph’.

Sources

Within the ‘definitions’ folder that contains the .sqlx files which define the pipeline, there is a ‘sources’ folder. Within this sources folder we define three sources which are referenced throughout the pipeline. They are:

  • publishing_ga4_flattened.sqlx - references the partitioned, flattened publishing GA4 events data (govuk-publishing.dataform.partitioned_flattened_events)
  • knowledge_graph_search_table.sqlx - references the knowledge graph’s search.page table, used to attach a document’s public URL
  • knowledge_graph_documents_table.sqlx - references the knowledge graph’s publishing_api.documents table, used to count a document’s locales

Date configuration and initialisation

The pipeline starts by deciding which date it is going to process from the flattened GA4 data. It does this in the 0_date_selector.sqlx file, which gets the maximum date from the GA4 flattened partitioned table that hasn’t already been processed (by looking at step 12’s output). It also sets the minimum date as 7 October 2025, as this is the earliest date from which we can rely on this data being consistent post-implementation.
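As a rough illustration, that selection logic amounts to something like the query below (a sketch only - the step 12 table name and column names are assumptions based on the descriptions in this page, not the contents of 0_date_selector.sqlx):

-- Sketch: pick the latest unprocessed date, never earlier than 7 October 2025
SELECT MAX(event_date) AS date_to_process
FROM `govuk-publishing.dataform.partitioned_flattened_events`
WHERE event_date >= DATE '2025-10-07'
  AND event_date NOT IN (
    -- dates already present in step 12's output (table name assumed)
    SELECT DISTINCT event_date
    FROM `govuk-publishing.form_processing.12_final_output_joined_data`
  )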

The pipeline is scheduled to run several times per day. This means that if the GA4 raw data for the previous day is late and the first code execution fails, the most recently missing date will always be backfilled first on a later run.

From this point, the pipeline runs in essentially three parallel streams which converge at step 10 (10_unioned_events.sqlx). The three streams are:

  • Core form_response events stream
  • Workflow-related form_response events
  • Non form_response event data

Core form_response events stream

1_raw_form_event_data.sqlx

The 1_raw_form_event_data.sqlx runs a pre-operation to determine a date value from the output of the 0_date_selector.sqlx file. It then extracts core form_response events and related events (flash notices, page views, form errors) from the publishing_ga4_flattened.sqlx reference file created earlier.

It also handles content_id propagation for newly created documents where the initial form_response event will have a NULL content_id after submission (because the content_id doesn’t exist until after the document has been created, and is only visible in subsequent events).
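A minimal sketch of how that propagation can be expressed with a window function is shown below (the real logic lives in 1_raw_form_event_data.sqlx and may differ):

-- Sketch: fill a NULL content_id with the first non-NULL value seen later in the same session
COALESCE(
  content_id,
  FIRST_VALUE(content_id IGNORE NULLS) OVER (
    PARTITION BY unique_session_id
    ORDER BY event_timestamp
    ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
  )
) AS content_id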

2_raw_form_response_columns.sqlx

Next, we have the 2_raw_form_response_columns.sqlx file which takes the output from 1_raw_form_event_data.sqlx and then parses JSON form field values found in the concatenated event_text parameters of each form_submission event. It extracts specific form fields into separate columns for change detection, for example:

JSON_VALUE(text_concat, '$."Worldwide organisations"') AS worldwide_organisations_form_value

3_add_missing_content_id.sqlx (incremental)

The next step, 3_add_missing_content_id.sqlx, is designed to catch any form_response events which are still missing their content_ids. It takes the output from 2_raw_form_response_columns.sqlx, adds missing content_ids for newly created documents, and filters the results to form_response events only.

Importantly, this step is an incremental table, which means that each time the pipeline is run, a new day of data gets added to the end of the output instead of being replaced like the other tables so far.
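For context, an incremental table in dataform generally takes the following shape (an illustrative sketch, not the pipeline’s actual file):

config {
  type: "incremental"
}

SELECT *
FROM ${ref("2_raw_form_response_columns")}
-- on incremental runs, only append dates that are not already in this table
${when(incremental(), `WHERE event_date > (SELECT MAX(event_date) FROM ${self()})`)}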

4_form_answer_differences.sqlx

The previous step needs to be an incremental table because step four (4_form_answer_differences.sqlx) detects field changes between different form submissions. Because differences can occur across days, a LAG window function partitioned by content_id (rather than unique_session_id or event_date) is used to detect the change.

It detects field changes by comparing each field value to its previous value. The window functions are written as:

CASE
 WHEN COALESCE(closing_date_date_required_day_form_value, "0") != LAG(COALESCE(closing_date_date_required_day_form_value, "0"))
      OVER(PARTITION BY content_id ORDER BY event_timestamp)
 THEN 'Closing date - Date (required) - Day' END AS closing_date_date_required_day_change_check

5_form_change_note.sqlx

Step 5 takes the output from step 4 and builds a change_note field by concatenating all the ‘changed field’ columns (e.g. closing_date_date_required_day_change_check as above) into a comma-separated change note so that you are able to identify in one column what the altered form fields were for a specific event.
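A sketch of that concatenation is shown below, using a handful of change-check columns (the first two column names are hypothetical; the real file covers every tracked form field):

-- Sketch: collapse the non-NULL change-check columns into one comma-separated note
ARRAY_TO_STRING(
  ARRAY(
    SELECT changed_field
    FROM UNNEST([
      title_change_check,                          -- hypothetical column
      summary_change_check,                        -- hypothetical column
      closing_date_date_required_day_change_check  -- as in the step 4 example
    ]) AS changed_field
    WHERE changed_field IS NOT NULL
  ),
  ', '
) AS change_note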

6_distinct_form_change_note.sqlx

This combines the results from 5_form_change_note.sqlx and 5_2_workflow_related_events.sqlx and returns the distinct results of the union of these two tables.
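In BigQuery terms the shape of this step is roughly as follows (a sketch using dataform-style references):

-- UNION DISTINCT removes duplicate rows from the union of the two tables
SELECT * FROM ${ref("5_form_change_note")}
UNION DISTINCT
SELECT * FROM ${ref("5_2_workflow_related_events")}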

Workflow-related form_response events

1_2_editorial_form_response_event_data.sqlx

Similar to 1_raw_form_event_data.sqlx, this step runs a pre-operation to determine a date value from the output of the 0_date_selector.sqlx file. It then extracts workflow-related form_response events (publish, unpublish, delete, etc.) from the publishing_ga4_flattened.sqlx GA4 reference file created earlier.

5_2_workflow_related_events.sqlx (incremental)

This step is also built as an incremental table. This is necessary because its output is later unioned with the output of 5_form_change_note.sqlx in 6_distinct_form_change_note.sqlx. Since 5_form_change_note.sqlx ultimately depends on 3_add_missing_content_id.sqlx, which is itself incremental, 5_2_workflow_related_events.sqlx must contain the same range of days to ensure the union operates on a consistent dataset.

It processes workflow-related events from step 1_2_editorial_form_response_event_data.sqlx and identifies publication actions. It handles publishing, unpublishing, deletion, and other editorial workflow actions, for example:

CASE WHEN content_id IS NOT NULL AND event_type = 'confirm_unwithdraw' AND event_name = 'form_response' THEN 'Document unwithdrawn'
ELSE NULL END AS publication_action,


Non form_response event data

7_additional_event_data.sqlx

Again, this step runs a pre-operation to determine a date value from the output of the 0_date_selector.sqlx file. It then selects additional event information used to track changes to a document, covering event types that are mainly not form_response events: form_response, page_view, flash_notice, side_bar_error, form_error and flash_message.

8_publication_actions_check.sqlx

This step takes data from the previous step (step 7) and adds labels under a field called publication_action to determine what kind of change/edit was made to a document, through a series of CASE WHENs. For example:

CASE WHEN event_name = 'flash_notice' and event_text LIKE '%uploaded%' AND event_action = 'success_alerts' AND page_title LIKE '%Attachments%' THEN 'Attachment uploaded'
   WHEN event_name = 'form_response' and event_section = 'Create new edition' THEN 'Create new edition'
   ELSE NULL END AS publication_action

9_filtered_publication_actions.sqlx

This step is another incremental table. This is because it gets unioned in step 10 alongside the results from step 6 (which is itself a combination of the results from 5_form_change_note.sqlx and 5_2_workflow_related_events), and so must contain the same range of days to ensure the union operates on a consistent dataset.

In this step, the results from 8_publication_actions_check.sqlx are taken and filtered to remove any NULL publication_action values. Error messages are also set so that the actual event_text is returned:

CASE WHEN publication_action =  'Error message shown' THEN event_text ELSE NULL END AS error_message

Final steps (Step 10 onwards)

10_unioned_events.sqlx

The results of all three streams end up being unioned in this step - so the core form_response events are combined with workflow-related form events, as well as the additional event data which isn’t necessarily a form_response event.

11_content_url_knowledge_graph.sqlx

This step simply gets the content_id and url of published documents from the knowledge_graph_search_table.sqlx table source we created in our sources folder.

12_final_output_joined_data.sqlx

Joins the unioned event data in 10_unioned_events.sqlx with the knowledge graph data in 11_content_url_knowledge_graph.sqlx to ensure that we are able to pull through the page URL (live on GOV.UK) of any document, if the document is live. A final attempt is also made here to propagate missing content_id values (this is to account for the new configurable document interstitial forms). It also adds three new timestamp fields, each calculated per content_id using window functions (a sketch follows the list):

  • first_saved_timestamp: the earliest time the document was “First saved”
  • first_other_recorded_timestamp: the earliest time the document had any recorded action other than “Published” / “Force published”
  • latest_published_timestamp: the most recent time the document was Published or Force published
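A sketch of those three window calculations is shown below (field names follow the output schema; treating rows with no publication_action as ‘not a publication’ is an assumption):

MIN(CASE WHEN publication_action = 'First saved'
         THEN event_timestamp END)
  OVER (PARTITION BY content_id) AS first_saved_timestamp,
-- assumption: rows with a NULL publication_action also count as 'not a publication'
MIN(CASE WHEN COALESCE(publication_action, '') NOT IN ('Published', 'Force published')
         THEN event_timestamp END)
  OVER (PARTITION BY content_id) AS first_other_recorded_timestamp,
MAX(CASE WHEN publication_action IN ('Published', 'Force published')
         THEN event_timestamp END)
  OVER (PARTITION BY content_id) AS latest_published_timestamp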

13_document_type_identifier.sqlx (incremental)

This is an incremental table. It takes the output from 12_final_output_joined_data.sqlx and identifies a content_id's document type and sub type from submitted form values, or from values within the page_location field for other events belonging to the same document.
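Purely as a hypothetical illustration of the page_location approach (the real pattern matching and lookup live in 13_document_type_identifier.sqlx):

-- Hypothetical pattern: pull a document type candidate out of an admin URL path segment
REGEXP_EXTRACT(page_location, r'/government/admin/([a-z_-]+)/') AS document_type_candidate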

14_document_types_processed.sqlx

This step takes the output from 13_document_type_identifier.sqlx and resolves document type information so that each content_id returns one definitive document type and sub-type. It first fills empty document type values across a content_id window using the latest known type. It then selects a single row per content_id, prioritising rows that contain a sub_type, and if multiple exist, choosing the most recent event. The result is effectively a lookup table of document_types and sub_types which is joined later on.
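A sketch of that resolution, using BigQuery window functions and QUALIFY (an illustration of the approach described above, not the file’s exact SQL):

SELECT
  content_id,
  -- fill empty values with the latest known document type across the content_id window
  LAST_VALUE(document_type IGNORE NULLS) OVER (
    PARTITION BY content_id
    ORDER BY event_timestamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS document_type,
  sub_type
FROM ${ref("13_document_type_identifier")}
-- keep one row per content_id, preferring rows with a sub_type, then the most recent event
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY content_id
  ORDER BY IF(sub_type IS NOT NULL, 0, 1), event_timestamp DESC
) = 1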

15_output_joined_doc_types.sqlx

Here the 12_final_output_joined_data.sqlx is augmented by joining it with the document_type and sub_type values from the lookup table created in 14_document_types_processed.sqlx. It is joined on a.content_id = b.content_id.

16_locale_counts.sqlx

In parallel, this step gets a count of locales by content_id. This is pulled from the knowledge_graph_documents_table.sqlx file, which references the publishing_api.documents table we set up in our sources folder.
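The underlying query is roughly of this shape (a sketch, assuming publishing_api.documents holds one row per content_id and locale):

SELECT
  content_id,
  COUNT(DISTINCT locale) AS locale_count
FROM ${ref("knowledge_graph_documents_table")}
GROUP BY content_id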

17_output_joined_locale_counts.sqlx

The final step of the pipeline joins the output from 15_output_joined_doc_types.sqlx (which is itself a join of 12_final_output_joined_data.sqlx and 14_document_types_processed.sqlx) with the locale_count field created in step 16_locale_counts.sqlx.

Following this, a multiple_locales field is created, which is simply a ‘Yes’ or ‘No’ for each content_id depending on whether more than one locale has been counted.

Finally, the error_message field is grouped into the error_message_grouped dimension, which categorises errors into broader groups for easier interpretation (e.g. link errors, missing topic tags).
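The grouping is a set of CASE WHEN pattern matches along these lines (the patterns here are hypothetical; the real categories are defined in the final step’s SQL):

CASE
  WHEN error_message LIKE '%link%'  THEN 'Link errors'
  WHEN error_message LIKE '%topic%' THEN 'Topic tags need adding'
  WHEN error_message IS NOT NULL    THEN 'Other errors'
END AS error_message_grouped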

This page was last reviewed on 4 December 2025. It needs to be reviewed again on 4 June 2026.