Create the page access weight matrix
The page access term frequency-inverse document frequency (TF-IDF) weight matrix gives less weight to pages that are accessed by many users and are not relevant to the user’s information needs, for example the GOV.UK homepage.
Definitions
The following definitions apply to this matrix:
- user is either sessionId or fullVisitorId
- pages are the pages in the curated set of user journeys, otherwise known as target urls
- term frequency (TF) is the access frequency of the page by the given user
- inverse document frequency (IDF) is the ratio of total users to the number of distinct users who have accessed the given page
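The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the repo's implementation: the function name `page_access_tfidf`, the input shape (a list of `(user, page)` pairs) and the use of the natural logarithm are all assumptions, and "total users" is taken to mean every user present in the input.

```python
import math
from collections import Counter


def page_access_tfidf(accesses, use_log=True):
    """Return {(user, page): weight} from a list of (user, page) access pairs.

    Hypothetical helper: TF is the raw access count, IDF is
    total users / distinct users who accessed the page.
    """
    tf = Counter(accesses)  # (user, page) -> number of accesses
    users = {user for user, _ in tf}
    # Keys of `tf` are distinct (user, page) pairs, so this counts
    # the distinct users per page.
    users_per_page = Counter(page for _, page in tf)

    weights = {}
    for (user, page), count in tf.items():
        idf = len(users) / users_per_page[page]
        if use_log:
            idf = math.log(idf)  # assumption: natural log
        weights[(user, page)] = count * idf
    return weights


accesses = [
    ("u1", "/"), ("u2", "/"), ("u3", "/"),  # every user visits the homepage
    ("u1", "/vat"), ("u1", "/vat"),         # only u1 visits /vat, twice
]
raw = page_access_tfidf(accesses, use_log=False)
logged = page_access_tfidf(accesses, use_log=True)
```

In this toy data the homepage is accessed by all 3 users, so its raw IDF is 1 and its log IDF is 0, while the page only one user visited keeps a higher weight.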
Create the matrix
1. Go to the intent-detector repo folder on your local machine.
2. Run python src/make_data/make_user_page_access_matrix.py in your command line.
3. Specify the following arguments in the script:
   - --which-user (optional, default is sessionId) states whether the user in the user-page access matrix is the sessionId or the fullVisitorId
   - --start_date (required) is the start date to query as a YYYYMMDD string, for example 20210428
   - --end_date (required) is the end date to query as a YYYYMMDD string, for example 20210428
   - --tfidf-log (optional, default is 1 : 'yes') states whether to use the log of the IDF in the TF-IDF calculation
For example, to create a visitor page access TF-IDF matrix for 28 April 2021 using the raw IDF, go to the root level of the intent-detector repo and run the following in your command line:

    python -m src.make_data.make_user_page_access_matrix --start-date "20210428" \
        --end-date "20210428" --which-user "fullVisitorId" --tfidf-log 0
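The arguments described above could be parsed along the following lines. This is a hedged sketch using the standard library's argparse, not the script's actual source: `build_parser` and `yyyymmdd` are hypothetical names, the flag spellings follow the example command, and the strict date check is an assumption about what "a YYYYMMDD string" should mean.

```python
import argparse
from datetime import datetime


def yyyymmdd(value):
    """Accept only real calendar dates written as exactly 8 digits."""
    if len(value) != 8 or not value.isdigit():
        raise argparse.ArgumentTypeError(f"not a YYYYMMDD date: {value!r}")
    try:
        datetime.strptime(value, "%Y%m%d")
    except ValueError:
        raise argparse.ArgumentTypeError(f"not a YYYYMMDD date: {value!r}") from None
    return value


def build_parser():
    # Hypothetical parser mirroring the documented arguments.
    parser = argparse.ArgumentParser(description="Create a page access TF-IDF matrix")
    parser.add_argument("--which-user", choices=["sessionId", "fullVisitorId"],
                        default="sessionId")
    parser.add_argument("--start-date", type=yyyymmdd, required=True)
    parser.add_argument("--end-date", type=yyyymmdd, required=True)
    parser.add_argument("--tfidf-log", type=int, choices=[0, 1], default=1)
    return parser


args = build_parser().parse_args([
    "--start-date", "20210428", "--end-date", "20210428",
    "--which-user", "fullVisitorId", "--tfidf-log", "0",
])
```

Validating the dates at parse time means a malformed value such as 2021428 is rejected before any query runs.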
Outputs
When you create the page access TF-IDF matrix, the script outputs 2 .parquet files to the intent-detector repo on your local machine:
- the raw output table of the SQL query, saved in the data/interim/ folder as {start_date}_{end_date}_{chosen_user}_page_access_raw_table.parquet
- the user-target page access TF-IDF matrix, saved in the data/interim/ folder as {start_date}_{end_date}_{chosen_user}_{tfidf_log}_page_access_matrix.parquet
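To locate the two files after a run, the naming pattern above can be reproduced with a small helper. This is only a sketch: `output_paths` is a hypothetical function, and it assumes the `tfidf_log` part of the filename is rendered exactly as the value you pass in, which the pattern does not confirm.

```python
from pathlib import Path


def output_paths(start_date, end_date, chosen_user, tfidf_log, root="data/interim"):
    """Build the expected raw-table and matrix paths (hypothetical helper)."""
    prefix = f"{start_date}_{end_date}_{chosen_user}"
    raw_table = Path(root) / f"{prefix}_page_access_raw_table.parquet"
    # Assumption: tfidf_log appears in the filename as given.
    matrix = Path(root) / f"{prefix}_{tfidf_log}_page_access_matrix.parquet"
    return raw_table, matrix


raw_table, matrix = output_paths("20210428", "20210428", "fullVisitorId", 0)
```

Either path can then be checked with `Path.exists()` before loading the file with a parquet reader of your choice.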