Spider diagram tool
The spider diagram tool is a visualisation tool that shows which pages users:
- come from before visiting a page of interest
- go to after visiting a page of interest
This data is over a date range and is broken down by:
- device category
- internal/external links
- individual entry/exit pages paired with the count and proportion of page views
This tool was developed within Data Services for use by analysts within GDS.
Using the spider diagram tool
Get access to the spider diagram tool
The spider diagram tool can be found in the ‘Path tools’ folder in the ‘Performance and Data Analysts Community’ shared Google Drive.
Running the spider diagram tool
To use the spider diagram tool, you must do the following:
- Make a copy of the spider diagram tool notebook.
- Open the tool notebook in Google Colab.
- Run the tool notebook.
- View the outputs and save a local copy of the data.
To run the notebook, you must do the following:
- Authenticate your access.
- Set the input variables.
- Run the query cells.
- Run the Execute queries cell.
- Create and download the visualisation.
To authenticate your access, you should:
- Run the cells in order. When running cell 1 -
auth.authenticate_user()
- you will see a pop-up asking you to authenticate - Follow the on screen prompts, selecting your account and Allow when prompted
- The cell will show as successfully run - with a small green tick - when you have successfully authenticated
If you receive a warning message saying “The notebook was not authored by Google”, select Run Anyway.
Set the input variables
You must complete and run the cells in the Input Variables section.
- Complete the Set Input Variables Form by entering the following fields:
start_date
andend_date
- the date range you want to look at, inYYYY/MM/DD
formatpage_path
- the URL for the page of interest which always starts with/
, for example/brexit
path_or_title
- whether the visualisation will show the page paths or page titles, for example/find-a-job
or Find a jobremove_query_string
- check this box if you want to remove the query string from the URL, for example if you’ve input an answer into a smart answer field- device categories - you must select at least one device type from Desktop, Mobile or Tablet, otherwise the code will not run and will raise a
ValueError
You do not need to change the following fields:
project_id
- this is alwaysgds-bq-reporting
ga_dataset
- this is alwaysanalytics_330577055
- the GOV.UK GA4 Production dataset
- Select Runtime and then Run after to run all the cells in the Input Variables section.
Run the query cells
Run the following cells in the following order.
- Query – Previous Page Path.
- Query – Acquisition Source.
- Query – Next Page Path.
Run the Execute queries cell
The Execute queries cell estimates and shows you the amount of data read by the query in gigabytes.
If you’re happy to run the query, enter “yes” into the user input box.
If you leave the input box blank or type in something other than “yes”, the query will not run.
Create and download the visualisation
Once the Input Variables query and Execute queries cells have finished running, the notebook generates the interactive Plotly figure visualisation.
To download the plot as a .png, hover your cursor over the figure and select the camera icon located in the top right-hand menu.
Save a local copy of the data
Once the Input Variables query and Execute queries cells have finished running, the notebook downloads the following files to your local machine:
- a CSV file of the entry and exit data, including the number and proportion of page views
- an HTML file of the Plotly figure
- a text file of the metadata for the executed SQL queries
If none or only some of the files are downloaded, check the end of the URL search bar. If you see a download icon with a red cross, select the icon and change the option to Always allow…, and then select Done.
How the spider diagram tool works
Data sources
The spider diagram tool uses the GOV.UK GA4 data as stored in BigQuery.
The full query is detailed in the tool notebook, and the original SQL can be found in the Original SQL query cell at the bottom of the notebook.
Assumptions and caveats
This log contains a list of assumptions and caveats used in the Forward Path tool analysis.
Assumptions are Red-Amber-Green (RAG)-rated according to the following definitions for quality and impact.
RAG rating | Assumption quality | Assumption impact |
---|---|---|
Green | Reliable assumption, well understood and/or documented. Anything up to a validated and recent set of actual data. | Marginal assumptions that their changes have no or limited impact on the outputs. |
Amber | Some evidence to support the assumption. May vary from a source with poor methodology to a good source that’s a few years old. | Assumptions with a relevant, even if not critical, impact on the outputs. |
Red | Little evidence to support the assumption. May vary from an opinion to a limited data source with poor methodology | Core assumptions of the analysis is that the output would be drastically affected by their change. |
Thank you to the Home Office Analytical Quality Assurance team for these definitions.
Group all pages that have less than 2.5% of sessions to (other)
- Quality: Green (for visualisation purposes)
- Impact: Green (for visualisation purposes)
Pages that contain less than 2.5% of overall sessions are difficult to see in the diagram both in terms of their volume and any associated labels.
To mitigate this, all pages with less than 2.5% of overall sessions are aggregated into a separate category, (other)
. This happens immediately after the tool has executed the SQL queries, but before the tool has rendered the visualisations and outputs.
As this is strictly for visualisation purposes, it is acceptable. However, any downstream analysis of the outputs is obviously limited due to this aggregation. As such, outputs of this tool should not be used for further quantitative analysis without a thorough understanding of this assumption’s implications.
Notebook tool excludes URL query parameters and anchors from page paths
- Quality: Green
- Impact: Green
The notebook tool has removed all URL query parameters and anchors from page paths so that page views are associated with the general page path URL, rather than specific query parameters and anchors. The tool removes the URL parameters and anchors during SQL execution.
The tool removes the URL parameters and anchors as the overall aim of the tool is to provide an understanding of which page paths have been viewed, regardless of query parameters and anchors. This is in line with Google Analytics, which also excludes query parameters and anchors from the page path.