Using the GA4 Page Path Tool
The GA4 Page Path Tool provides insights into user journeys on GOV.UK, allowing you to analyse navigation patterns from specific start and/or end page URLs. The tool is organised into two tabs: the Session Path Tool and the Page Sequence Tool. Built using GA4 ‘page_view’ event data in BigQuery, it updates each morning to include data from the previous day.
Session Path Tool Tab
The Session Path Tool displays complete session journeys, starting from the user’s entry page through to their exit page.
Each row represents a unique session journey. On the right, a count of the number of sessions following each specific journey is displayed.
You are able to use the tool in the reverse - by only submitting a final page URL to view journeys which lead to the final page of user sessions, regardless of starting page.
Additional filters enable you to view data by device type, session length (in terms of page count), and include an intermediary URL if you want to ensure a particular page appears in the session journey.
Page Sequence Tool Tab
The Page Sequence Tool focuses on the most common page sequences originating from a specific URL, regardless of its position in the session.
Each row shows a different sequence of pages visited, starting from your specified page. A count of the sessions that followed each sequence is displayed.
You are able to use the tool in the reverse - by only submitting a final page URL to view journeys which lead to the final page of a page sequence, regardless of starting page.
Similar to the Session Path Tool, this tab includes controls for filtering by device type, session length, and an intermediary URL, ensuring that specified pages appear within the sequence.
Caveats and interpreting the outputs
The GA4 Page Path Tool is designed to indicate the relative frequency of popular session journeys and sequences between specified pages rather than to provide exact counts of session occurrences. Due to the data processing required to build the tool, some sessions may be omitted from the results. This means that the outputs reflect common patterns and trends but are not a complete representation of all sessions.
Other caveats to bear in mind when using the tool are:
- Sessions with user_id values set to ‘false’ are removed from the data
- Sessions containing smart answer journeys are omitted
- To clean up the data, both tools look at consecutive distinct page_view events within a session, ignoring duplicate consecutive page views.
- Only sessions with 5 or fewer page views are included in the analysis for the Session Path tool. The Page Sequence tool looks at sessions of all lengths.
The code & how it has been built
Find the code below which each morning processes the data for the previous day, creating structured outputs for both the Session Path Tool and the Page Sequence Tool.
Session Path Tool Code
WITH
page_view_sessions AS (
SELECT
event_date,
unique_session_id,
cleaned_page_location,
event_timestamp,
event_name,
category as device
FROM
`ga4-analytics-352613.flattened_dataset.partitioned_flattened_events`
WHERE
event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
AND event_name = 'page_view'
AND unique_session_id NOT LIKE '%false%'),
page_view_counter as (
SELECT
unique_session_id,
COUNT(cleaned_page_location) AS page_views_per_session
FROM
page_view_sessions
GROUP BY
unique_session_id),
smart_answer_check as (
SELECT *,
CASE WHEN cleaned_page_location LIKE '%/y/%' THEN 'smart_answer_session' END AS smart_answer,
FROM
page_view_sessions
),
five_pages_only as (
SELECT unique_session_id from page_view_counter
WHERE page_views_per_session <= 5),
ordered as (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY unique_session_id ORDER BY event_timestamp) AS ordered_page_views
FROM page_view_sessions
WHERE unique_session_id IN (SELECT unique_session_id FROM five_pages_only)
AND NOT EXISTS (
SELECT 1
FROM smart_answer_check sac
WHERE sac.unique_session_id = page_view_sessions.unique_session_id
AND sac.smart_answer IS NOT NULL
)
ORDER BY event_date, unique_session_id, ordered_page_views),
no_duplicates as (
SELECT
event_date,
unique_session_id,
device,
cleaned_page_location,
event_timestamp,
ROW_NUMBER() OVER (PARTITION BY unique_session_id ORDER BY event_timestamp ASC) AS row_num,
LAG(cleaned_page_location) OVER (PARTITION BY unique_session_id ORDER BY event_timestamp ASC) AS prev_page
FROM ordered),
ordered_pages as (
SELECT
event_date,
unique_session_id,
device,
cleaned_page_location AS page_1,
LEAD(cleaned_page_location, 1) OVER (PARTITION BY unique_session_id ORDER BY event_timestamp ASC) AS page_2,
LEAD(cleaned_page_location, 2) OVER (PARTITION BY unique_session_id ORDER BY event_timestamp ASC) AS page_3,
LEAD(cleaned_page_location, 3) OVER (PARTITION BY unique_session_id ORDER BY event_timestamp ASC) AS page_4,
LEAD(cleaned_page_location, 4) OVER (PARTITION BY unique_session_id ORDER BY event_timestamp ASC) AS page_5,
row_num
FROM no_duplicates
WHERE cleaned_page_location != prev_page OR prev_page IS NULL
ORDER BY unique_session_id, event_timestamp),
pre_pruning as (
SELECT * FROM
ordered_pages
WHERE row_num = 1
),
page_pruning as (
SELECT DISTINCT * FROM pre_pruning
WHERE page_2 IS NOT NULL
),
penultimate_cte as (
SELECT *,
CASE
WHEN page_1 IS NOT NULL AND page_2 IS NOT NULL AND page_3 IS NOT NULL AND page_4 IS NOT NULL AND page_5 IS NOT NULL THEN 5
WHEN page_1 IS NOT NULL AND page_2 IS NOT NULL AND page_3 IS NOT NULL AND page_4 IS NOT NULL THEN 4
WHEN page_1 IS NOT NULL AND page_2 IS NOT NULL AND page_3 IS NOT NULL THEN 3
WHEN page_1 IS NOT NULL AND page_2 IS NOT NULL THEN 2
WHEN page_1 IS NOT NULL THEN 1
ELSE 0
END AS how_many_pages_in_journey,
CASE
WHEN page_5 IS NOT NULL THEN page_5
WHEN page_4 IS NOT NULL THEN page_4
WHEN page_3 IS NOT NULL THEN page_3
WHEN page_2 IS NOT NULL THEN page_2
WHEN page_1 IS NOT NULL THEN page_1
ELSE NULL
END AS exit_page
from page_pruning)
SELECT DISTINCT * from penultimate_cte
Page Sequence Tool Code
WITH
page_view_sessions AS (
SELECT
category as device,
event_date,
unique_session_id,
cleaned_page_location,
event_timestamp,
event_name
FROM
`ga4-analytics-352613.flattened_dataset.partitioned_flattened_events`
WHERE
event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
AND event_name = 'page_view'
AND unique_session_id NOT LIKE '%false%'
),
smart_answer_check as (
SELECT *,
CASE WHEN cleaned_page_location LIKE '%/y/%' THEN 'smart_answer_session' END AS smart_answer,
FROM
page_view_sessions
),
ordered as (
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY unique_session_id ORDER BY event_timestamp) AS ordered_page_views
FROM page_view_sessions
WHERE NOT EXISTS (
SELECT 1
FROM smart_answer_check sac
WHERE sac.unique_session_id = page_view_sessions.unique_session_id
AND sac.smart_answer IS NOT NULL
)
ORDER BY event_date, unique_session_id, ordered_page_views
),
no_duplicates as (
SELECT
event_date,
unique_session_id,
device,
cleaned_page_location,
event_timestamp,
ROW_NUMBER() OVER (PARTITION BY unique_session_id ORDER BY event_timestamp ASC) AS row_num,
LAG(cleaned_page_location) OVER (PARTITION BY unique_session_id ORDER BY event_timestamp ASC) AS prev_page
FROM ordered),
Ordered_pages as (
SELECT event_date, device, unique_session_id, cleaned_page_location as page_1,
LEAD (cleaned_page_location,1) OVER (PARTITION BY unique_session_id ORDER BY event_timestamp asc) as page_2,
LEAD (cleaned_page_location,2) OVER (PARTITION BY unique_session_id ORDER BY event_timestamp asc) as page_3,
LEAD (cleaned_page_location,3) OVER (PARTITION BY unique_session_id ORDER BY event_timestamp asc) as page_4,
LEAD (cleaned_page_location,4) OVER (PARTITION BY unique_session_id ORDER BY event_timestamp asc) as page_5,
ROW_NUMBER() OVER (PARTITION BY unique_session_id ORDER BY event_timestamp ASC) AS row_num
FROM no_duplicates
WHERE cleaned_page_location != prev_page OR prev_page IS NULL
ORDER BY unique_session_id, event_timestamp),
two_pages AS (
SELECT
event_date,
unique_session_id,
device,
CAST(page_1 AS STRING) AS page_1,
CAST(page_2 AS STRING) AS page_2,
CAST(NULL AS STRING) AS page_3,
CAST(NULL AS STRING) AS page_4,
CAST(NULL AS STRING) AS page_5,
FROM ordered_pages
),
three_pages AS (
SELECT
event_date,
unique_session_id,
device,
CAST(page_1 AS STRING) AS page_1,
CAST(page_2 AS STRING) AS page_2,
CAST(page_3 AS STRING) AS page_3,
CAST(NULL AS STRING) AS page_4,
CAST(NULL AS STRING) AS page_5,
FROM ordered_pages
),
four_pages AS (
SELECT
event_date,
unique_session_id,
device,
CAST(page_1 AS STRING) AS page_1,
CAST(page_2 AS STRING) AS page_2,
CAST(page_3 AS STRING) AS page_3,
CAST(page_4 AS STRING) AS page_4,
CAST(NULL AS STRING) AS page_5,
FROM ordered_pages
),
five_pages AS (
SELECT
event_date,
unique_session_id,
device,
CAST(page_1 AS STRING) AS page_1,
CAST(page_2 AS STRING) AS page_2,
CAST(page_3 AS STRING) AS page_3,
CAST(page_4 AS STRING) AS page_4,
CAST(page_5 AS STRING) AS page_5,
FROM ordered_pages
),
-- Combine all paths into a single table
combined_cte as (
SELECT * FROM two_pages
UNION ALL
SELECT * FROM three_pages
UNION ALL
SELECT * FROM four_pages
UNION ALL
SELECT * FROM five_pages),
page_pruning as (
SELECT DISTINCT * FROM combined_cte
WHERE page_1 IS NOT NULL
),
page_pruning_2 as (
SELECT DISTINCT * FROM page_pruning
WHERE page_2 IS NOT NULL
),
penultimate_cte as (
SELECT *,
CASE
WHEN page_1 IS NOT NULL AND page_2 IS NOT NULL AND page_3 IS NOT NULL AND page_4 IS NOT NULL AND page_5 IS NOT NULL THEN 5
WHEN page_1 IS NOT NULL AND page_2 IS NOT NULL AND page_3 IS NOT NULL AND page_4 IS NOT NULL THEN 4
WHEN page_1 IS NOT NULL AND page_2 IS NOT NULL AND page_3 IS NOT NULL THEN 3
WHEN page_1 IS NOT NULL AND page_2 IS NOT NULL THEN 2
WHEN page_1 IS NOT NULL THEN 1
ELSE 0
END AS how_many_pages_in_journey,
CASE
WHEN page_5 IS NOT NULL THEN page_5
WHEN page_4 IS NOT NULL THEN page_4
WHEN page_3 IS NOT NULL THEN page_3
WHEN page_2 IS NOT NULL THEN page_2
WHEN page_1 IS NOT NULL THEN page_1
ELSE NULL
END AS exit_page
from page_pruning_2)
SELECT DISTINCT * from penultimate_cte
To summarise, the following steps outline the approach taken for each tab:
Session Path Tool data process
- Retrieves page views by session, ordered by event timestamp
- Arranges these page views into sequential columns, ensuring that they follow the correct chronological order
- Returns only full session journeys, where each row represents one complete session. This enables the tool to count how many unique sessions followed identical journeys starting from a shared page and ending at a shared page
Page Sequence Tool data process
- Retrieves page views by session, ordered by event timestamp
- Arranges the page views in sequential columns, with each “shift” in the session represented in a new row
- Breaks each journey into smaller, distinct sequences. Each session is divided into chunks, with each chunk given a sequence number; for example, a session with three page views would contain one 3-page sequence and two 2-page sequences. By identifying the endpoint of each sequence, we can count the number of unique sessions that followed specific sequences between two pages