Forward and reverse path tools
The forward and reverse path tools provide information on users’ journeys across GOV.UK, showing the pages a user visits before (reverse) or after (forward) visiting a page of interest on GOV.UK.
These tools were developed within Data Services for use by analysts within GDS, and can be found in the ‘Path tools’ folder in the ‘Performance and Data Analysts Community’ shared Google Drive.
The tools have 4 outputs:
- a CSV file with the count and proportion of user sessions visiting distinct, subsetted journeys
- a CSV file with the count of user sessions visiting page paths at each step, regardless of the other pages in the subsetted journey
- a Plotly visualisation, known as a Sankey diagram, summarising the top 10 user journeys and all other journeys
- a Google Sheet summarising the top 10 and all other journeys
Using the forward and reverse path tools
Get access to the forward and reverse path tools
These tools can be found in the ‘Path tools’ folder in the ‘Performance and Data Analysts Community’ shared Google Drive.
To use the forward or reverse page path tools, you should create your own copy of the page path tool notebook in your drive, and run that version.
Running the tools
Running the notebook and authenticating access
To run the notebook, you must do the following.
- Authenticate your access
- Set the query parameters
- Run the cells in order
When running cell 2, auth.authenticate_user(), you will see a pop-up asking you to authenticate.
- Follow the on-screen prompts, selecting your account and Allow when prompted
- The cell will show as successfully run, with a small green tick, when you have successfully authenticated
Setting the query parameters
You set the query parameters in the Set query parameters cell.
You must set the following mandatory query parameters:
- start and end dates for the analysis
- desired page path of interest
- the number of pages or events the user journeys will be subsetted by
- whether to include page and/or event hits
- whether to use the first or last hit to the desired page in the session for the subsetted journey
- device categories to include
You can set the following optional query parameters on whether to:
- remove query strings from the page path of interest
- append event-associated page paths with an [E]
- flag journeys that include the entrance page
- flag journeys that include the exit page
- remove refreshes of the page of interest
- have search pages only show the search content type and search keywords
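The notebook's own Set query parameters cell is the authoritative list of names and types. As a rough illustration only, the parameters above could be grouped and sanity-checked as in the sketch below. DESIRED_PAGE, NUMBER_OF_STAGES, REMOVE_DESIRED_PAGE_REFRESHES, ENTRANCE_PAGE, EXIT_PAGE and the DEVICE_* names appear in this documentation; every other name, and all the defaults, are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class PathToolParams:
    # Mandatory parameters. START_DATE, END_DATE, INCLUDE_PAGES,
    # INCLUDE_EVENTS and USE_FIRST_HIT are hypothetical names for this sketch.
    START_DATE: date
    END_DATE: date
    DESIRED_PAGE: str
    NUMBER_OF_STAGES: int
    INCLUDE_PAGES: bool = True
    INCLUDE_EVENTS: bool = False
    USE_FIRST_HIT: bool = False
    DEVICE_ALL: bool = True
    DEVICE_DESKTOP: bool = False
    DEVICE_MOBILE: bool = False
    DEVICE_TABLET: bool = False
    # Optional flags (defaults are assumptions, not the notebook's defaults).
    REMOVE_DESIRED_PAGE_REFRESHES: bool = False
    ENTRANCE_PAGE: bool = False
    EXIT_PAGE: bool = False

    def problems(self) -> List[str]:
        """Return a list of human-readable problems; empty means usable."""
        found = []
        if self.START_DATE > self.END_DATE:
            found.append("start date is after end date")
        if self.NUMBER_OF_STAGES < 1:
            found.append("NUMBER_OF_STAGES must be at least 1")
        if not (self.INCLUDE_PAGES or self.INCLUDE_EVENTS):
            found.append("include page hits, event hits, or both")
        devices = (self.DEVICE_ALL, self.DEVICE_DESKTOP,
                   self.DEVICE_MOBILE, self.DEVICE_TABLET)
        if not any(devices):
            found.append("select at least one device category")
        return found


params = PathToolParams(date(2024, 1, 1), date(2024, 1, 7), "/browse/benefits", 3)
print(params.problems())
```

Checking the parameters before running the remaining cells avoids paying for a query that was set up incorrectly.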
Once you are happy with the query parameters, select the Set query parameters cell, then select Runtime in the top menu, and then Run after.
The notebook will run each cell after the Set query parameters cell one at a time.
Viewing the outputs
There are 4 outputs from running the notebook:
- a raw CSV data file
- a CSV data file summarising the most popular pages at each step
- a Plotly visualisation of a Sankey diagram
- a Google Sheets table of the top 10 page path tool results
After you select Runtime and then Run after, the notebook should automatically scroll to the cell that starts with Initialise a Google BigQuery client, and define the query parameters.
Manually scroll to this cell if the notebook does not automatically scroll to it.
This cell first estimates the number of gigabytes read by the query and shows you the amount.
If you are happy to run the query, enter “yes” into the user input box.
If you leave the input box blank or enter something other than “yes”, the query will not run.
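The confirmation step can be pictured with the minimal sketch below. It assumes a strict match on "yes" and a gigabyte of 10**9 bytes; the function names are hypothetical and the notebook may differ on both points. In the notebook the response would come from the user input box rather than a function argument.

```python
def bytes_to_gigabytes(total_bytes: int) -> float:
    """Convert a byte count to gigabytes (assumed here to be 10**9 bytes)."""
    return total_bytes / 1e9


def user_confirmed(response: str) -> bool:
    """Return True only when the response is exactly "yes"."""
    return response == "yes"


def confirm_query(total_bytes: int, response: str) -> str:
    """Describe whether the query would run for a given estimate and response."""
    estimate = bytes_to_gigabytes(total_bytes)
    if user_confirmed(response):
        return f"Running query (estimated {estimate:.2f} GB read)"
    return f"Query not run (estimated {estimate:.2f} GB read)"


print(confirm_query(2_500_000_000, "yes"))
print(confirm_query(2_500_000_000, ""))
```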
The raw data file, summary, and Sankey diagram
The query first generates the raw CSV data file and a CSV file of the most popular pages at each step, regardless of the subsetted journey.
The query then downloads those files into the Downloads folder on your local machine.
If the query does not generate these CSV files, check the end of the URL search bar. If you see a download icon with a red cross, select the icon and change the option to Always allow…, and then select Done.
Finally, running the query also creates the Sankey diagram, which is a visualisation of the forward/reverse page path from the page of interest.
To download the Sankey diagram, select the camera icon in the top right of the diagram, labelled Download plot as a png. This will download a PNG file to your Downloads folder.
The top 10 path tool results
The Presenting results in Google sheets cells create a Google Sheet of the top 10 forward/reverse page path journeys.
Select and run the cell that starts Compile a message, and flag to the user for a response; if not “yes”, terminate execution.
You will see a box asking if you are happy to create the Google Sheet.
If you are happy to create the Google Sheet, enter “yes” into the user input box. If you leave the input box blank or enter something other than “yes”, the query will not run.
The query will create the Google Sheet and provide a link to this spreadsheet under the Create google sheet in Product and Technology Directorate/Data Services/Data Products/16 User Journey tools/Path tools: google sheet result tables cell.
Forward and reverse path tool data definitions
A subsetted journey is a part of a user’s journey rather than the entire journey.
A distinct journey is a unique journey that is not the same as any other journey that user has taken.
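To make these definitions concrete, the sketch below counts sessions per distinct subsetted journey (with proportions) and per page at each step, mirroring the two CSV outputs. The data and function names are invented for illustration; the tools themselves do this work in a BigQuery query.

```python
from collections import Counter

# Hypothetical subsetted journeys, one tuple per session, already cut down
# to the steps of interest (step 1 is the page of interest).
subsetted_journeys = [
    ("/page-of-interest", "/a", "/b"),
    ("/page-of-interest", "/a", "/b"),
    ("/page-of-interest", "/a", "/c"),
    ("/page-of-interest", "/d"),
]


def journey_counts(journeys):
    """Count sessions per distinct journey, with each journey's share of sessions."""
    counts = Counter(journeys)
    total = sum(counts.values())
    return [(journey, n, n / total) for journey, n in counts.most_common()]


def step_counts(journeys):
    """Count sessions per page at each step, ignoring the rest of the journey."""
    steps = {}
    for journey in journeys:
        for position, page in enumerate(journey, start=1):
            steps.setdefault(position, Counter())[page] += 1
    return steps


for journey, sessions, proportion in journey_counts(subsetted_journeys):
    print(" >> ".join(journey), sessions, f"{proportion:.0%}")
```

Note that "/a" is counted 3 times at step 2 even though it belongs to 2 different distinct journeys, which is why the two outputs can tell different stories.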
How the forward and reverse path tools work
Data sources
The forward and reverse path tools use the GOV.UK GA4 data as stored in BigQuery.
The full query is detailed in the tool notebooks, and the original SQL can be found in the Original SQL query cell at the bottom of each notebook.
Assumptions and caveats
This log contains a list of assumptions and caveats used in the forward page path tool analysis.
Assumptions are red-amber-green (RAG) rated according to the following definitions for quality and impact.
RAG rating | Assumption quality | Assumption impact |
---|---|---|
Green | Reliable assumption, well understood and/or documented. Anything up to a validated and recent set of actual data. | Marginal assumptions. Changes to them have no or limited impact on the outputs. |
Amber | Some evidence to support the assumption. May vary from a source with poor methodology to a good source that’s a few years old. | Assumptions with a relevant, even if not critical, impact on the outputs. |
Red | Little evidence to support the assumption. May vary from an opinion to a limited data source with poor methodology. | Core assumptions of the analysis. The output would be extremely affected by a change to them. |
These are Home Office Analytical Quality Assurance team definitions.
Tool only supports exact matches to DESIRED_PAGE
- Quality: Green
- Impact: Amber
The DESIRED_PAGE query parameter cannot be evaluated as a regular expression, so the tool currently only supports exact matches. This is acceptable because users of these tools are used to providing exact matches for analyses.
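As a small illustration of what exact matching means in practice, the sketch below compares it with a regular expression on a few made-up page paths. The regular expression is shown only for contrast and is not something the tool supports.

```python
import re

DESIRED_PAGE = "/browse/benefits"  # illustrative page path

page_paths = [
    "/browse/benefits",
    "/browse/benefits/manage-your-benefit",
    "/browse",
]

# Exact matching, as the tool does: only the identical path is kept.
exact = [p for p in page_paths if p == DESIRED_PAGE]

# A regular expression (not supported by the tool) would also catch child pages.
pattern = re.compile(r"^/browse/benefits(/.*)?$")
regex = [p for p in page_paths if pattern.match(p)]

print(exact)
print(regex)
```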
Tool only uses the first or last visit to DESIRED_PAGE
- Quality: Green
- Impact: Amber
Depending on user input, the subsetted journey considers the first or last visit to the DESIRED_PAGE as the goal location (that is, the first step).
Therefore, the tool ignores any other visits to the DESIRED_PAGE. We decided to do this because the main aim of the tool is to understand which pages the user visits before the DESIRED_PAGE. Users also requested the ability to subset the journeys by the first hit.
However, it is important that the user of the tool is aware of this assumption, as it will impact the subsetted journey output.
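The effect of choosing the first or last visit can be sketched as follows. This is an illustration in plain Python, not the tool's SQL. It assumes NUMBER_OF_STAGES counts the desired page as step 1, and the function and argument names are hypothetical.

```python
def subset_journey(pages, desired_page, number_of_stages,
                   use_first_visit, forward=False):
    """Return the subsetted journey with desired_page as step 1, or None."""
    if desired_page not in pages:
        return None
    if use_first_visit:
        anchor = pages.index(desired_page)
    else:
        # Index of the last occurrence of desired_page.
        anchor = len(pages) - 1 - pages[::-1].index(desired_page)
    if forward:
        # Forward tool: the desired page, then the pages visited after it.
        return pages[anchor:anchor + number_of_stages]
    # Reverse tool: the desired page, then earlier pages working backwards.
    start = max(0, anchor - number_of_stages + 1)
    return pages[start:anchor + 1][::-1]


session = ["/a", "/x", "/b", "/c", "/x", "/d"]
print(subset_journey(session, "/x", 3, use_first_visit=False))
print(subset_journey(session, "/x", 3, use_first_visit=True))
```

With the last visit, the reverse journey is /x, /c, /b. With the first visit it is /x, /a, which is a different and shorter journey from the same session.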
If REMOVE_DESIRED_PAGE_REFRESHES is TRUE, tool only uses the first visit in a series of sequential visits (page refreshes) to DESIRED_PAGE to determine which is the last visit
- Quality: Green
- Impact: Green
Therefore, if REMOVE_DESIRED_PAGE_REFRESHES is TRUE, the tool will only use the first visit in a series of sequential visits to DESIRED_PAGE of hit type PAGE. Other earlier visits to DESIRED_PAGE will remain, as will any earlier desired page refreshes.
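One reading of this behaviour is sketched below: only the final run of consecutive visits is collapsed when locating the last visit, and earlier runs are left alone. The function name is hypothetical, and the sketch treats every entry as a page hit, so it ignores the hit type condition.

```python
def last_visit_index(pages, desired_page, remove_refreshes):
    """Index of the hit treated as the last visit to desired_page, or None."""
    if desired_page not in pages:
        return None
    index = len(pages) - 1 - pages[::-1].index(desired_page)
    if remove_refreshes:
        # Walk back through the final run of consecutive visits (refreshes)
        # and stop at the first visit in that run.
        while index > 0 and pages[index - 1] == desired_page:
            index -= 1
    return index


session = ["/x", "/x", "/a", "/x", "/x", "/x", "/b"]
print(last_visit_index(session, "/x", remove_refreshes=False))
print(last_visit_index(session, "/x", remove_refreshes=True))
```

In this session the earlier refreshes at positions 0 and 1 are untouched. Only the run at positions 3 to 5 affects which hit counts as the last visit.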
Tool always includes journeys shorter than the number of desired stages (NUMBER_OF_STAGES)
- Quality: Green
- Impact: Amber
While journeys shorter than the number of desired stages are always included, journeys longer than the number of desired stages are not accurately represented.
For example, if NUMBER_OF_STAGES = 2, and the journey consists of 3 stages, then the tool will not provide the user with the full journey (that is, the 3rd page path).
Therefore, the tool will combine this journey with other journeys that consist of the same 2 page paths (NUMBER_OF_STAGES = 2), even if further page paths differ.
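The merging described above can be seen in a few lines. The journeys are invented and the truncation is a plain slice, which is a simplification of the tool's query.

```python
from collections import Counter

NUMBER_OF_STAGES = 2

# Hypothetical full journeys, with the page of interest as step 1.
full_journeys = [
    ("/x", "/a", "/b"),  # 3 stages
    ("/x", "/a", "/c"),  # 3 stages, differs only at step 3
    ("/x", "/a"),        # exactly 2 stages
    ("/x",),             # shorter than NUMBER_OF_STAGES, still included
]

# Truncating to NUMBER_OF_STAGES merges journeys that differ later on.
subsetted = Counter(journey[:NUMBER_OF_STAGES] for journey in full_journeys)

for journey, sessions in subsetted.items():
    print(journey, sessions)
```

The first 3 journeys are reported as a single journey with 3 sessions, while the 1-stage journey is kept as its own row.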
Tool assumes GOV.UK search page paths have the format /search/{TYPE}?keywords={KEYWORDS}{...}
- Quality: Green
- Impact: Green
{TYPE} is the GOV.UK search content type. {KEYWORDS} are the search keywords, where each keyword is separated by +. {...} are any other parts of the search query that are not keyword-related (if they exist).
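A minimal sketch of reading a path in this format is shown below, using Python's standard library rather than the tool's own SQL. The function name and the example path are illustrative.

```python
from urllib.parse import parse_qs, urlsplit


def parse_search_path(page_path):
    """Return (search type, keywords) for a /search/{TYPE}?keywords=... path, else None."""
    parts = urlsplit(page_path)
    segments = parts.path.strip("/").split("/")
    if len(segments) != 2 or segments[0] != "search":
        return None
    query = parse_qs(parts.query)
    if "keywords" not in query:
        return None
    # parse_qs decodes "+" as a space, so splitting on whitespace
    # recovers the individual keywords.
    keywords = query["keywords"][0].split()
    return segments[1], keywords


print(parse_search_path("/search/all?keywords=universal+credit&order=relevance"))
```

Paths that do not follow the assumed format return None, so they can be left unchanged.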
Tool assumes GOV.UK search page titles have the format {KEYWORDS} - {TYPE} - GOV.UK
- Quality: Green
- Impact: Green
{TYPE} is the GOV.UK search content type, and {KEYWORDS} are the search keywords.
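Under this assumption, a title can be split from the right so that keywords containing " - " are kept whole. The sketch below is illustrative only. The function name and example titles are made up.

```python
def parse_search_title(page_title):
    """Return (search type, keywords) for a '{KEYWORDS} - {TYPE} - GOV.UK' title, else None."""
    # Split from the right so " - " inside the keywords is preserved.
    parts = page_title.rsplit(" - ", 2)
    if len(parts) != 3 or parts[2] != "GOV.UK":
        return None
    keywords, search_type = parts[0], parts[1]
    return search_type, keywords


print(parse_search_title("universal credit - Search - GOV.UK"))
```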
If ENTRANCE_PAGE is FALSE, each journey contains both instances where the entrance page is included and instances where it is not
- Quality: Green
- Impact: Red
Therefore, if ENTRANCE_PAGE is TRUE, these 2 instances (where the entrance page is included and where it is not) will be considered 2 separate journeys.
This is a better representation of the journey, as a TRUE flag indicates that the journey had more page paths than NUMBER_OF_STAGES.
The user must have a good understanding of what the ENTRANCE_FLAG represents, as it can change the output substantially.
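The sketch below shows how the flag changes the grouping, using invented sessions where the flag value is supplied as data rather than derived. The function name is hypothetical.

```python
from collections import Counter

# (subsetted journey, entrance page included?) per session; invented data.
sessions = [
    (("/x", "/a"), True),
    (("/x", "/a"), True),
    (("/x", "/a"), False),
    (("/x", "/b"), False),
]


def count_journeys(sessions, entrance_page):
    """Count sessions per journey, splitting by the flag only when entrance_page is True."""
    if entrance_page:
        return Counter(sessions)
    return Counter(journey for journey, _ in sessions)


print(count_journeys(sessions, entrance_page=False))
print(count_journeys(sessions, entrance_page=True))
```

With the flag off, /x then /a is one journey with 3 sessions. With the flag on, it becomes 2 journeys with 2 sessions and 1 session, which can reorder the top 10.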
If EXIT_PAGE is FALSE, each journey contains both instances where the exit page is included and instances where it is not
- Quality: Green
- Impact: Red
If EXIT_PAGE is TRUE, these 2 instances (where the exit page is included and where it is not) will be considered 2 separate journeys.
This is a better representation of the journey, as a TRUE flag indicates that the journey had more page paths than NUMBER_OF_STAGES.
The user must understand what the EXIT_FLAG represents, as it can change the output substantially.
If the user selects DEVICE_ALL in combination with DEVICE_DESKTOP, DEVICE_MOBILE, and/or DEVICE_TABLET, then the analysis will use DEVICE_ALL and ignore all other arguments.
- Quality: Green
- Impact: Red
By default, the tool will implement the DEVICE_ALL argument instead of DEVICE_DESKTOP, DEVICE_MOBILE, and DEVICE_TABLET.
Therefore, if the user accidentally selects DEVICE_ALL, the query will ignore the desired arguments DEVICE_DESKTOP, DEVICE_MOBILE, and/or DEVICE_TABLET even if the user selects them as well. This will change the expected output.
The output CSV file flags which device category(ies) the tool used, which should mitigate errors related to interpreting the data.
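The precedence rule can be sketched as a small function. The function name and the lower-case category labels are assumptions made for this illustration, and the notebook may represent device categories differently.

```python
def resolve_device_categories(device_all, device_desktop, device_mobile, device_tablet):
    """Return the device categories the analysis would use."""
    if device_all:
        # DEVICE_ALL takes precedence over any individual selection.
        return ["desktop", "mobile", "tablet"]
    selected = []
    if device_desktop:
        selected.append("desktop")
    if device_mobile:
        selected.append("mobile")
    if device_tablet:
        selected.append("tablet")
    return selected


# DEVICE_ALL left on by accident alongside DEVICE_MOBILE: mobile-only is ignored.
print(resolve_device_categories(True, False, True, False))
print(resolve_device_categories(False, False, True, False))
```

Comparing the resolved categories with the device flag in the output CSV file is a quick way to confirm the analysis used the categories you intended.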