
Create the page-term TF-IDF matrix

In information retrieval, term frequency-inverse document frequency (TF-IDF) is a numerical statistic that shows how important a word is to a document.

TF-IDF gives more weight to words that are distinctive to a document, and less weight to words that are common across all documents, such as “the” or “and”.
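The weighting described above can be illustrated with a minimal sketch. It assumes the textbook definition (term frequency multiplied by the natural log of the inverse document frequency) and whitespace tokenisation. The function name `tf_idf` and the sample pages are hypothetical, and the notebooks in this repo may use a different or smoothed variant.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (textbook TF-IDF, an assumption)."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights.append({
            # Term frequency in this document times inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample pages, for illustration only.
pages = [
    "apply for the passport",
    "renew the driving licence",
    "pay the council tax",
]
scores = tf_idf(pages)
```

In this sketch “the” appears in every page, so its weight is zero, while a word such as “passport” that appears in only one page gets a positive weight.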

How you create the page-term TF-IDF matrix depends on whether you use AWS SageMaker. You can either:

  • set the environment variables on your local machine, or
  • use AWS SageMaker to open a copy of the GOV.UK mirror

Set the environment variables on your local machine

Use direnv to set the DIR_MIRROR environment variable in the .envrc file on your local machine.

When setting the DIR_MIRROR environment variable, use the folder path to your downloaded copy of the GOV.UK mirror.
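As a quick check that the variable is set correctly, a notebook or script could read it back before doing any work. This is a minimal sketch: the helper name `get_mirror_dir` is hypothetical and not part of the repo, and the notebooks may read DIR_MIRROR differently.

```python
import os
from pathlib import Path


def get_mirror_dir(environ=os.environ):
    """Return the GOV.UK mirror folder named by DIR_MIRROR, or raise a clear error."""
    value = environ.get("DIR_MIRROR")
    if not value:
        raise RuntimeError(
            "DIR_MIRROR is not set; check your .envrc and run `direnv allow`."
        )
    path = Path(value).expanduser()
    if not path.is_dir():
        raise RuntimeError(f"DIR_MIRROR points to {path}, which is not a folder.")
    return path
```

Calling `get_mirror_dir()` with no arguments reads the real environment. Passing a dictionary instead makes the check easy to try out without changing your shell.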

You can now go to the Jupyter notebooks directory.

Use AWS SageMaker to open a copy of the GOV.UK mirror

  1. Open the govuk-intent-detector-page-term-tf-idf AWS instance.

    You must have the govuk-datascienceusers AWS IAM role to access this instance. Contact the GOV.UK Data Products team for more information.

  2. Check that the instance contains a copy of the GOV.UK mirror.

    If there is no copy of the mirror, or the copy is not the correct date, download a new copy of the GOV.UK mirror to the AWS S3 bucket.

  3. In the AWS instance, select Open Jupyter or Open JupyterLab and run the govuk-intent-detector/startup_script.ipynb notebook.

  4. Once the notebook has finished running, select File > New… > Terminal.

  5. Run the following in the terminal to copy the GOV.UK mirror from the AWS S3 bucket to this AWS instance:

    cd /home/ec2-user/SageMaker/govuk-intent-detector
    aws s3 sync s3://govuk-data-infrastructure-integration/{YYYYMMDD}-govuk-production-mirror-replica ./govuk-production-mirror-replica
    

Replace {YYYYMMDD} with the date of the GOV.UK mirror copy in the AWS S3 bucket.
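If you script this step, the dated location can be built from a date object rather than typed by hand. The following is a minimal sketch: the helper name `mirror_s3_uri` is hypothetical, and the date shown is only an illustration, not a copy known to exist in the bucket.

```python
from datetime import date

# Bucket name taken from the sync command above.
BUCKET = "govuk-data-infrastructure-integration"


def mirror_s3_uri(mirror_date):
    """Build the S3 URI for the mirror copy taken on `mirror_date`."""
    # The :%Y%m%d format spec renders the date as YYYYMMDD.
    return f"s3://{BUCKET}/{mirror_date:%Y%m%d}-govuk-production-mirror-replica"


# Illustrative date only.
print(mirror_s3_uri(date(2022, 3, 25)))
# prints s3://govuk-data-infrastructure-integration/20220325-govuk-production-mirror-replica
```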

You can now go to the Jupyter notebooks directory.

Go to the Jupyter notebooks directory

  1. Go to the root of the govuk-intent-detector repo directory on your local machine.

  2. Run jupyter notebook in your command line to open Jupyter.

  3. In Jupyter, go to the notebooks/page_term_tfidf_matrix directory.

Produce a page-term TF-IDF matrix for the GOV.UK mirror

You can produce a page-term TF-IDF matrix for the GOV.UK mirror, or for the BigQuery data set.

To produce a page-term TF-IDF matrix for the GOV.UK mirror, you must run multiple Python notebooks in a certain order.

Change the DATA_DATE in each notebook to the dates you want to look at before you run those notebooks.

Open each notebook in turn, and run all of the cells in that notebook.

Run the notebooks in the following order.

Name | Purpose | Output location
--- | --- | ---
002 | Gets the filepaths to every HTML file from the GOV.UK mirror. | The govuk-intent-detector/data/interim folder on your local machine.
004b | Parses all the valid HTML pages from notebook 002 to show only visible text. Removes irrelevant HTML such as the header and banner. | The govuk-intent-detector/data/interim folder on your local machine.
005 | Changes the outputs from notebook 004b: sets to lowercase; removes numbers and non-alpha symbols; lemmatises output using WordNet parts of speech; removes English stopwords. | The govuk-intent-detector/data/interim folder on your local machine.
006 | Generates the TF-IDF matrix. | The govuk-intent-detector/data/processed folder on your local machine.
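The text clean-up that notebook 005 performs can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: the function name `clean_text` and the tiny stopword list are hypothetical, and the WordNet lemmatisation step is left out because it needs an external library and data download.

```python
import re

# Tiny illustrative stopword list; an assumption, not the list notebook 005 uses.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "or", "the", "to"}


def clean_text(text):
    """Lowercase the text, keep alphabetic tokens only, and drop stopwords."""
    lowered = text.lower()
    # Replace anything that is not a-z or whitespace with a space.
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    return [token for token in letters_only.split() if token not in STOPWORDS]


print(clean_text("Apply for a Blue Badge in 2022: the guide"))
# prints ['apply', 'blue', 'badge', 'guide']
```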

Produce a page-term TF-IDF matrix for BigQuery

You can produce a page-term TF-IDF matrix for the GOV.UK mirror, or for the BigQuery data set.

You should produce a page-term TF-IDF matrix for the BigQuery data set to cover pages that are not normally in the GOV.UK mirror, such as search API pages.

To produce a page-term TF-IDF matrix for the BigQuery data set, you must run multiple Python notebooks in a certain order.

Change the DATA_DATE in each notebook before you run it.

Open each notebook in turn, and run all of the cells in that notebook.

Run the notebooks in the following order.

Name | Purpose | Output location
--- | --- | ---
001 | Gets all visited GOV.UK pages from BigQuery for a specific day. | The govuk-intent-detector/data/interim folder on your local machine.
002 | Gets the filepaths to every HTML file from the GOV.UK mirror. | The govuk-intent-detector/data/interim folder on your local machine.
003 | Compares the outputs from 001 and 002 and web-scrapes GOV.UK for any missing pages. | The govuk-intent-detector/data/interim folder on your local machine.
004 | Parses all the valid HTML pages from 001 to show only visible text. If a fragment (#) exists in the page path, the notebook only extracts the visible text at and after this fragment. | The govuk-intent-detector/data/interim folder on your local machine.
005 | Changes the outputs from notebook 004: sets to lowercase; removes numbers and non-alpha symbols; lemmatises output using WordNet parts of speech; removes English stopwords. | The govuk-intent-detector/data/interim folder on your local machine.
006 | Generates the TF-IDF matrix. | The govuk-intent-detector/data/processed folder on your local machine.
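The comparison in notebook 003 amounts to finding pages that were visited but have no file in the mirror. A minimal sketch of that idea, assuming both inputs have already been reduced to comparable page paths, is shown here. The function name `find_missing_pages` and the sample paths are hypothetical, and the notebook's actual matching rules are not described on this page.

```python
def find_missing_pages(visited_paths, mirror_paths):
    """Return visited page paths that have no HTML file in the mirror, sorted."""
    return sorted(set(visited_paths) - set(mirror_paths))


# Hypothetical sample data, for illustration only.
visited = ["/vat-rates", "/search/all?keywords=tax", "/browse/benefits"]
in_mirror = ["/vat-rates", "/browse/benefits", "/bank-holidays"]

print(find_missing_pages(visited, in_mirror))
# prints ['/search/all?keywords=tax']
```

The pages returned by a check like this are the ones that would need to be scraped from GOV.UK, such as the search pages that are not normally in the mirror.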
This page was last reviewed on 25 March 2022. It needs to be reviewed again on 25 September 2022.