Assumptions and caveats
This contains a list of assumptions and caveats.
Assumptions are red-amber-green (RAG) rated according to the following definitions for quality and impact.
RAG rating | Assumption quality | Assumption impact |
---|---|---|
Green | Reliable assumption, well understood and/or documented. Anything up to a validated and recent set of actual data. | Marginal assumptions that their changes have no or limited impact on the outputs. |
Amber | Some evidence to support the assumption. May vary from a source with poor methodology to a good source that’s a few years old. | Assumptions with a relevant, even if not critical, impact on the outputs. |
Red | Little evidence to support the assumption. May vary from an opinion to a limited data source with poor methodology | Core assumptions of the analysis is that the output would be extremely affected by their change. |
These definitions are from the Home Office Analytical Quality Assurance team.
Page paths with fragment
- Quality: GREEN
- Impact: GREEN
Page paths may contain a fragment, for example https://www.example.com/content#section
, which indicates that a user has jumped to a specific section
of the page.
In these cases, for all the IUNIS components, we ignore the fragment and only consider the root page path. In this example, the root page path is https://www.example.com/content
.
This means we do not make any assumptions about which parts of the page the user is actually consuming.
Pre-processing of hyperlinks
- Quality: GREEN
- Impact: AMBER
We have removed links to cross-domain services and external websites as these are not included in our user journeys data or content data.
We have kept the step-by-step ID as part of the URL.
We have kept the search parameters in search URLs.
However, we need to discuss and update these assumptions as part of the code review process as they may affect compatibility with other data sources and the algorithm’s performance.
Pages omitted from the topology matrix
The topology matrix only includes pages that exist in the GOV.UK mirror.
This excludes broken links, typos, pages on subdomains of GOV.UK, and smart answers.
If the final segment of a page URL is preview
, y
, results
, and questions
, this page is part of a smart answer, so the content of the page in the mirror does not resemble content shown to real users.
Fragments in the topology matrix
- Quality: AMBER
- Impact: RED
We have omitted fragments from URLs in the topology matrix of all GOV.UK pages, because it is difficult to include fragments.
The problem with using fragments is that, when using the GOV.UK mirror, each page is represented by a single HTML file. This HTML file is scanned for links elsewhere, so the topology matrix would only include links from a URL without a fragment to a URL with a fragment, but not links from URLs with fragments to anywhere.
This is particularly relevant to pages with the detailed_guide
document type, such as this page on travelling abroad during the pandemic.
Pages with the guide
document type are very similar, but the URLs are organised differently so that each section has its own URL. This means users cannot read content beyond the section they have loaded, unless they follow another link.
Self-loops in topology matrix
- Quality: GREEN
- Impact: AMBER
We have kept self-loops only for those pages that contain an explicit link to themselves. We ignore links to headings that use fragments (gov.uk/something#fragment
).
As a consequence, we are not accounting for the fact that any page can be reloaded. If we want to reflect this, the topology matrix should be modified to have 1’s along the whole diagonal.
If we do not want to account for self-loops as they are not informtive for user intent detection, then we need to add 0’s to whole diagonal and remove self-loops from user journey data.
TF-IDF in user-page access matrix
- Quality: GREEN
- Impact: GREEN
In the user-page access matrix, the TF-IDF is the product of: - the term frequency, which is the access frequency of the page by the given user - the inverse document frequency, which is the ratio of total users to the number of distinct users who have accessed the given page
If the user does not visit a page, the TF-IDF for that user-page is set to null
.
Scraped pages from GOV.UK
- Quality: GREEN
- Impact: GREEN
To backfill any missing pages, the page access TF-IDF matrix generation process webscrapes GOV.UK for any pages it cannot find on the GOV.UK mirror.
It’s believed these missing pages are due to limitations of the GOV.UK crawler. For example, the crawler cannot check boxes, click radio buttons, or complete forms, which results in missing pages.
As more time passes between when the mirror was created, and when the webscraping is done, there is a risk that the missing content could change and/or missing pages could be removed from GOV.UK.
It is difficult to webscrape on the same day the mirror is generated, as we do not know all the journeys that have taken place on Google BigQuery until the next day, so there will always be a delay.
To mitigate this, a copy of the mirror and the webscraped pages will be stored in an AWS S3 bucket for posterity and reproducibility.
Law of surfing (alpha step function)
- Quality: AMBER
- Impact: AMBER
Chi et al. suggest defining alpha as a step function according to the law of surfing, referring to Huberman, Bernardo & A., Bernardo & Pirolli, Peter & T., Peter & Pitkow, & E., James & Lukose, & M., Rajan. (1998). Strong Regularities in World Wide Web Surfing. Science. 280. 95-. 10.1126/science.280.5360.95. , which observes a distribution of lengths of sessions. In the WUFIS algorithm, the alpha step function derived from the law of surfing is used to estimate the probability that a session will continue to any of the available links. In the IUNIS algorithm, it is used to weight neighbouring pages of pages that have already been activated in previous iterations of the spreading algorithm.
We assume that session lengths are all drawn from the same distribution, which we estimate by calculating the empirical distribution of session lengths from all GOV.UK sessions that occurred on the same dates as the sessions whose intent is being calculated.
This assumption is partly justified by findings in the paper, that the distribution is regular, and is a good predictor of session lengths. However, we have not checked its accuracy for GOV.UK sessions. We have also noticed that the intent vector is sensitive to the definition of alpha (step function vs page access), so it might make a noticeable difference how the step function is derived.
In the implementation in src/make_data/queries/calc_num_pages_visited_per_session.sql, the URLs aren’t cleaned in the same way as they are for other components of the model. This is a minor caveat, given that this step function is already a kind of average of all GOV.UK sessions.
Alpha parameter - smart-answers pre-processing
- Quality: GREEN
- Impact: AMBER
Currently, we are keeping in ALL the smart-answer URLs as they are. Movements between subsequent smart answers are common (as they belong to a deterministic flow), resulting in a rather low alpha value. We need to re-assess this after evaluating the sensitivity of the algorithm as we may want to change the approach. For instance, one alternative option we discussed offline is to only keep the last smart-answer completed URL (and get rid of the preceding ones).
Alpha parameter - search/ URLs pre-processing
- Quality: GREEN
- Impact: AMBER
Currently, we are not cleaning or pre-processing the search/all/*
URLs but we’re keeping them in their original form, thus including parameters and parameter values (i.e., key words searched by users).
Alternative options to be considered in future iterations of the algorithm:
(a) remove all /search/*
pages
(b) clean /search/*
URLs and keep a generic /search
© consider /search
as a break in intents and so two separate sessions.
Implications in alpha:
(a) probably a higher/inflated alpha values because we are creating a transition between two pages (in reality mediated by search).
(b) we are going to have a lot more /search > /whatever-page transition pairs, because there is only one general /search rather than many specific /search/* URLs. So probably a deflated alpha for /search >>> /page as more sessions will contain it.
© it is not confirmed that users use search to change topic of search rather than refine their current information need.
Alpha parameter - alpha value at t(0)
- Quality: GREEN
- Impact: GREEN
Given a user journey as a zero-indexed array of visited pages, where t(0) is the first visited page, etc. and given the set of user sessions within a specified time window (e.g., one day of data):
- for any page P at t(n), with n >= 1, alpha is the proportion of user sessions that visited the same page at t(n-1) but did not continue to P at t(n), regardless of the value of n.
- for pages at t(0), alpha is currently defined separately as 1 (as for the orginal paper).
An alternative approach for 2. is to use the proportion of user sessions that did not start on that page could be used.
Alpha parameter - removing self-loops
- Quality: GREEN
- Impact: AMBER
Currently, when calculating the alpha value for the movement between two pages A >>> B
,
we are first removing “self loops”, i.e., movement between the same page A >>>> A
.
These self-loops could be: - users refreshing the page; - users clicking on a link/button that leads to the same page; - users leaving the page to go to an external site or a cross-domain service and than comiing back to the same page on GOVUK (we do not track these “external” movements).
Alpha parameter - specific to curated set of artificial user journeys
- Quality: GREEN
- Impact: GREEN
At the moment, this is an implemention for the curated set of artificial user journeys. As they are artificial, some movements between pages may not have been observed in the real data and so no alpha value has been produced for them. To deal with this, we are treating these cases as if the user was the only one doing that “rare” transition, and so the destination page is assigned an alpha value of 0 (i.e., alpha = 1 - (1/1) = 0).
All other unobserved session-movements combinations will have alpha as NULL (rather than 0).
Alpha parameter - when a page is visited several times in the same session
- Quality: RED
- Impact: RED
We need to figure out how to deal with pages that are visited more than once within the same session. At the moment, we are not addressing this issue as pages are not re-visited in the PoC curated set. The original algorithm does not provide an approach but seems to assume each page will only be visited once. User journeys are in fact represented as a vector of indexed pages: each page corresponds to a unique position in the vector and is assigned a value according to its position in the journey, with most recently-visited pages weighted most heavily. This assumes that a page can only occur once, which is problematic for us.
GOVUK mirror - pre-processing for page-term TF-IDF
Quality: Green Impact: Amber
In terms of HTML files (i.e., page paths), we have: - removed the fragment part from URLs and only keep the root URL - excluded (quick) smart-answer intermediate steps - excluded finder/checkers intermediate steps - excluded local_transaction URLs - excluded internal search-api pages (both user-defiined and pre-packaged) - excluded redirects to non-HTML files (pdf/png docs) NOTE: this also removed foreign-language files - excluded previews
NOTE 1: These cleaning steps have been applied when processing the mirror to create the Topology matrix. NOTE 2: These cleaning steps will need to be applied to user journey (BigQuery) data.
In terms of the text in HTML pages, we have stripped away irrelevant content. Irrelevant content is defined as: - header - footer - global cookie message - feedback request form - contextual navigation sidebar (incl. related links) - contextual navigation footer (e.g., explore this topic)
Things we haven’t done at pre-processing stage: remove content from /browse
and step-by-step-nav
hub pages. Rationale: it is easier to identify these pages in both mirror and user journey data via their URLs (for /browse pages) or custom dimension (dim-17 and dim-2 “step-by-step-nav”) rather than HTML elements in the mirror. Thus, we can exclude these pages before the last algorithmic step (dot product between page-term TF-IDF based on mirror and vectors of user journey page scores) - but to be discussed.
Page-term TF-IDF - choice of n-grams and stopwords
Quality: Green Impact: Green
We have removed corpus-specific stowords by excluding all those terms that appear in more than 2% of pages (i.e., ~16,000 pages).
We have also removed terms that only appear in one page.
We then produced a (1-3)gram TF-IDF.
Which sessions to include in a page-level topic model and session-level TF-IDF
Quality: Red Impact: Red
For the proof of concept, we have only included sessions that end on a given page. Sessions that visit a given page and then continue, are excluded. This will have an effect on what the topics represent, because the sessions are ones that have either found what they sought and stopped, or given up searching.
They are not sessions that have either:
-found what they sought, and continued browsing for a new purpose - given up searching, and continued browsing for a new purpose
We exclude these pages in anticipation of a particular use; comparing the intents of groups of users that end their journey on a given page. All users would have the given page in common, so it shouldn’t be part of their intent.
For other uses, such user segmentation for accounts interventions, we might want to include every page of every session.
Inferring intents for the sessions ENDING on a given page
Quality: Amber Impact: Red
Given a page on GOV.UK, intents are computed for the sessions that end on that page.
This is different from the reversed journey tool, where journey to a page are presented regardless of the position of that page in the journey.
User sessions to compute intent embeddings and to cluster
Quality: Green Impact: Amber
Given a destination page, embeddings and clusters are produced only for all those sessions whose
journey path to the destination page is not shared by at least 10% of other sessions.
These are defined as sessions showing unpopular journeys to a page or sessions with complex needs.
To change the 10% threshold, please modify src/make_features/queries/extract_sessions_with_complex_needs.sql
.