Create the page-term TF-IDF matrix

In information retrieval, term frequency-inverse document frequency (TF-IDF) is a numerical statistic that shows how important a word is to a document.

TF-IDF gives more weight to words that are important to a particular document, and less weight to words that are common across all documents, such as “the” or “and”.
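The weighting described above can be sketched in a few lines of Python. This is a minimal illustration using the textbook definition (term frequency multiplied by the log of the inverse document frequency), not the code from the notebooks. The `tf_idf` function and the sample pages are made up for this example, and libraries often use smoothed variants of the formula that give slightly different numbers.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Uses tf = count / document length and idf = log(N / document frequency).
    """
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical page texts, for illustration only.
pages = [
    "apply for the passport",
    "renew the driving licence",
    "pay the vehicle tax",
]
weights = tf_idf(pages)
```

In this example “the” appears in every page, so its inverse document frequency is log(1) = 0 and its weight is 0. “passport” appears in only one page, so it gets a positive weight.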

How you create the page-term TF-IDF matrix depends on whether you are using AWS SageMaker. You can either:

set the environment variables on your local machine, or
use AWS SageMaker to open a copy of the GOV.UK mirror

Set the environment variables on your local machine

Use direnv to set the DIR_MIRROR environment variable in the .envrc file on your local machine.

When setting the DIR_MIRROR environment variable, use the folder path to your downloaded copy of the GOV.UK mirror.
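Before you open the notebooks, you may want to check that the variable is set correctly. The following is a minimal sketch, assuming the notebooks read `DIR_MIRROR` from the environment. The `get_mirror_dir` function is a hypothetical helper and is not part of the repo.

```python
import os
from pathlib import Path


def get_mirror_dir(environ=os.environ):
    """Return the GOV.UK mirror folder named by DIR_MIRROR, or raise an error."""
    value = environ.get("DIR_MIRROR")
    if not value:
        raise RuntimeError("DIR_MIRROR is not set; check your .envrc file")
    path = Path(value).expanduser()
    if not path.is_dir():
        raise RuntimeError(f"DIR_MIRROR does not point to a folder: {path}")
    return path
```

Calling `get_mirror_dir()` in the shell session where direnv has loaded your .envrc file should return the folder path without raising an error.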

You can now go to the Jupyter notebooks directory.

Use AWS SageMaker to open a copy of the GOV.UK mirror

Open the govuk-intent-detector-page-term-tf-idf AWS instance.

You must have the govuk-datascienceusers AWS IAM role to access this instance. Contact the GOV.UK Data Products team for more information.
There should be a copy of the GOV.UK mirror in the instance.

If there is no copy of the mirror, or the copy is from the wrong date, download a new copy of the GOV.UK mirror to the AWS S3 bucket.
In the AWS instance, select Open Jupyter or Open JupyterLab and run the govuk-intent-detector/startup_script.ipynb notebook.
Once the notebook has finished running, select File > New… > Terminal.

Run the following in the terminal to copy the GOV.UK mirror from the AWS S3 bucket to the AWS instance:

cd /home/ec2-user/SageMaker/govuk-intent-detector
aws s3 sync s3://govuk-data-infrastructure-integration/{YYYYMMDD}-govuk-production-mirror-replica ./govuk-production-mirror-replica

Replace {YYYYMMDD} with the date of the GOV.UK mirror copy in the AWS S3 bucket.
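To avoid typing the date by hand, you could build the S3 location from a date value. This is a small sketch: the bucket name and folder suffix are taken from the command above, while the `mirror_s3_uri` function and the example date are hypothetical.

```python
from datetime import date

BUCKET = "govuk-data-infrastructure-integration"


def mirror_s3_uri(mirror_date):
    """Build the S3 URI of the mirror copy taken on mirror_date."""
    # The format spec turns the date into the YYYYMMDD prefix.
    return f"s3://{BUCKET}/{mirror_date:%Y%m%d}-govuk-production-mirror-replica"


# Example date only; use the date of the copy that is in the bucket.
uri = mirror_s3_uri(date(2022, 3, 25))
print(uri)
```

You can then paste the printed URI into the `aws s3 sync` command.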

You can now go to the Jupyter notebooks directory.

Go to the Jupyter notebooks directory

Go to the root of the govuk-intent-detector repo directory on your local machine.
Run jupyter notebook in your command line to open Jupyter.
In Jupyter, go to the notebooks/page_term_tfidf_matrix directory.

Produce a page-term TF-IDF matrix for the GOV.UK mirror

You can produce a page-term TF-IDF matrix for the GOV.UK mirror, or for the BigQuery data set.

To produce a page-term TF-IDF matrix for the GOV.UK mirror, you must run multiple Python notebooks in a certain order.

Change the DATA_DATE in each notebook to the date you want to look at before you run that notebook.

Open each notebook in turn, and run all of the cells in that notebook.

Run the notebooks in the following order.

Name	Purpose	Output location
`002`	Gets the filepaths to every HTML file from the GOV.UK mirror.	The `govuk-intent-detector/data/interim` folder on your local machine.
`004b`	Parses all the valid HTML pages from notebook `002` to show only visible text. Removes irrelevant HTML such as the header and banner.	The `govuk-intent-detector/data/interim` folder on your local machine.
`005`	Changes the outputs from notebook `004b`: sets text to lowercase, removes numbers and non-alpha symbols, lemmatises output using WordNet parts of speech, and removes English stopwords.	The `govuk-intent-detector/data/interim` folder on your local machine.
`006`	Generates the TF-IDF matrix.	The `govuk-intent-detector/data/processed` folder on your local machine.
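The steps in the table can be sketched with the Python standard library. This is a simplified illustration of the kind of processing described for notebooks `002`, `004b` and `005`, not the notebooks' actual code. The function names, the list of skipped tags and the short stopword list are assumptions, and the sketch leaves out lemmatisation because that needs an external library.

```python
import re
from html.parser import HTMLParser
from pathlib import Path

# Tags whose contents are dropped. This list is an assumption for illustration.
SKIPPED_TAGS = {"script", "style", "head", "header", "footer", "nav"}
# A tiny illustrative stopword list; a real run would use a full English list.
STOPWORDS = {"a", "an", "and", "for", "of", "the", "to"}


def find_html_files(mirror_dir):
    """Like `002`: list every .html file under the mirror folder."""
    return sorted(Path(mirror_dir).rglob("*.html"))


class VisibleTextParser(HTMLParser):
    """Like `004b`: collect text that is outside the skipped tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # how many skipped tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text of an HTML page with the skipped tags removed."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def normalise(text):
    """Like `005`, without lemmatisation: lowercase, letters only, no stopwords."""
    words = re.sub(r"[^a-z]+", " ", text.lower()).split()
    return [word for word in words if word not in STOPWORDS]


# A made-up page, for illustration only.
sample = """<html><head><title>Ignored</title></head><body>
<header>GOV.UK banner</header>
<h1>Renew the 10 year passport</h1>
<script>var x = 1;</script>
</body></html>"""
tokens = normalise(visible_text(sample))
```

For the sample page, only the heading text survives the parsing step. The number and the stopword are then removed, leaving the tokens `renew`, `year` and `passport`.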

Produce a page-term TF-IDF matrix for BigQuery

You can produce a page-term TF-IDF matrix for the GOV.UK mirror, or for the BigQuery data set.

You should produce a page-term TF-IDF matrix for the BigQuery data set to cover pages that are not normally in the GOV.UK mirror, such as search API pages.

To produce a page-term TF-IDF matrix for the BigQuery data set, you must run multiple Python notebooks in a certain order.

Change the DATA_DATE in each notebook before you run it.

Open each notebook in turn, and run all of the cells in that notebook.

Run the notebooks in the following order.

Name	Purpose	Output location
`001`	Gets all visited GOV.UK pages from BigQuery for a specific day.	The `govuk-intent-detector/data/interim` folder on your local machine.
`002`	Gets the filepaths to every HTML file from the GOV.UK mirror.	The `govuk-intent-detector/data/interim` folder on your local machine.
`003`	Compares the outputs from `001` and `002` and web-scrapes GOV.UK for any missing pages.	The `govuk-intent-detector/data/interim` folder on your local machine.
`004`	Parses all the valid HTML pages from `001` to show only visible text. If a fragment (#) exists in the page path, the notebook only extracts the visible text at and after this fragment.	The `govuk-intent-detector/data/interim` folder on your local machine.
`005`	Changes the outputs from notebook `004b`: sets text to lowercase, removes numbers and non-alpha symbols, lemmatises output using WordNet parts of speech, and removes English stopwords.	The `govuk-intent-detector/data/interim` folder on your local machine.
`006`	Generates the TF-IDF matrix.	The `govuk-intent-detector/data/processed` folder on your local machine.
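Two ideas from the table are the comparison in `003` and the fragment handling in `004`. The following sketch illustrates both, assuming that visited pages and mirrored pages have already been reduced to comparable page paths. The `missing_pages` and `split_fragment` functions and the example paths are hypothetical, and the notebooks may compare pages differently.

```python
from urllib.parse import urldefrag


def missing_pages(visited_paths, mirrored_paths):
    """Return visited page paths that have no copy in the mirror, sorted."""
    mirrored = set(mirrored_paths)
    # Assumption: compare on the path without its fragment, because a
    # fragment points inside a page rather than to a separate page.
    return sorted({urldefrag(path).url for path in visited_paths} - mirrored)


def split_fragment(page_path):
    """Split '/page#section' into ('/page', 'section'). The fragment may be ''."""
    result = urldefrag(page_path)
    return result.url, result.fragment


# Made-up page paths, for illustration only.
visited = ["/vat-rates", "/vat-rates#reduced-rate", "/search/all?keywords=tax"]
mirrored = ["/vat-rates"]
to_scrape = missing_pages(visited, mirrored)
```

In this example the search page is the only visited page without a copy in the mirror, so it is the only one left to scrape. `split_fragment("/vat-rates#reduced-rate")` separates the page path from the fragment, so that a later step can extract only the text at and after that fragment.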

This page was last reviewed on 25 March 2022. It needs to be reviewed again on 25 September 2022.