Create the page term TF-IDF matrix
In information retrieval, term frequency-inverse document frequency (TF-IDF) is a numerical statistic that shows how important a word is to a document.
TF-IDF gives more weight to words that are distinctive to a document, and less weight to words that are common across all documents, such as “the” or “and”.
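As a rough illustration of the idea, the sketch below computes one common textbook form of TF-IDF (term frequency multiplied by log(N / document frequency)) for three made-up pages. The page paths, page text and the `tf_idf` function are hypothetical and only for illustration. The notebooks in this repo may use a different variant of the formula.

```python
import math
from collections import Counter

# Three tiny, made-up "pages" (illustrative data only).
pages = {
    "/apply-passport": "apply for the passport and renew the passport",
    "/pay-tax": "pay the tax and file the tax return",
    "/bank-holidays": "check the bank holidays and the dates",
}


def tf_idf(docs):
    """Return {page: {term: weight}} using tf * log(N / df)."""
    tokenised = {page: text.split() for page, text in docs.items()}
    n_docs = len(tokenised)
    # Document frequency: the number of pages that contain each term.
    df = Counter(term for tokens in tokenised.values() for term in set(tokens))
    weights = {}
    for page, tokens in tokenised.items():
        counts = Counter(tokens)
        weights[page] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights


weights = tf_idf(pages)
# "the" appears on every page, so its weight is 0.
# "passport" appears on one page only, so its weight there is positive.
```

In this toy data, words that appear on every page get a weight of zero, while a word that appears on only one page is weighted by how often it occurs on that page.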
How you create the page term TF-IDF matrix depends on whether you are using AWS SageMaker. You can either:
- set the environment variables on your local machine, or
- use AWS SageMaker to open a copy of the GOV.UK mirror
Set the environment variables on your local machine
Use direnv to set the DIR_MIRROR environment variable in the .envrc file on your local machine.
When setting the DIR_MIRROR environment variable, use the folder path to your downloaded copy of the GOV.UK mirror.
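If you want to confirm that the variable is visible to Python before you open the notebooks, a minimal check could look like the sketch below. The `resolve_mirror_dir` function is a hypothetical helper and is not part of the repo.

```python
import os
from pathlib import Path


def resolve_mirror_dir(environ=os.environ):
    """Return the mirror folder named by DIR_MIRROR, or raise a clear error."""
    value = environ.get("DIR_MIRROR")
    if not value:
        raise RuntimeError(
            "DIR_MIRROR is not set; check your .envrc and run `direnv allow`."
        )
    path = Path(value).expanduser()
    if not path.is_dir():
        raise RuntimeError(f"DIR_MIRROR points to {path}, which is not a folder.")
    return path
```

Calling `resolve_mirror_dir()` with no arguments reads the real environment. Passing a dictionary instead makes the check easy to try out without changing your shell.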
You can now go to the Jupyter notebooks directory.
Use AWS SageMaker to open a copy of the GOV.UK mirror
Open the govuk-intent-detector-page-term-tf-idf AWS instance.
You must have the govuk-datascienceusers AWS IAM role to access this instance. Contact the GOV.UK Data Products team for more information.
There should be a copy of the GOV.UK mirror in the instance.
If there is no copy of the mirror, or the copy is not from the correct date, download a new copy of the GOV.UK mirror to the AWS S3 bucket.
In the AWS instance, select Open Jupyter or Open JupyterLab and run the govuk-intent-detector/startup_script.ipynb notebook.
Once the notebook has finished running, select File > New… > Terminal.
Run the following in the terminal to copy the GOV.UK mirror copy from the AWS S3 bucket to this AWS instance:
cd /home/ec2-user/SageMaker/govuk-intent-detector
aws s3 sync s3://govuk-data-infrastructure-integration/{YYYYMMDD}-govuk-production-mirror-replica ./govuk-production-mirror-replica
Replace {YYYYMMDD} with the date of the GOV.UK mirror copy in the AWS S3 bucket.
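To avoid typing the date by hand, you could build the S3 location from a Python date, as in the sketch below. The `mirror_s3_uri` function and the example date are hypothetical. The bucket name and folder suffix are taken from the command above.

```python
from datetime import date

# Bucket name taken from the `aws s3 sync` command above.
BUCKET = "govuk-data-infrastructure-integration"


def mirror_s3_uri(mirror_date):
    """Build the S3 URI for the mirror copy taken on `mirror_date`."""
    # %Y%m%d gives the YYYYMMDD form used in the folder name.
    return f"s3://{BUCKET}/{mirror_date:%Y%m%d}-govuk-production-mirror-replica"


# The date here is an arbitrary example, not a known mirror date.
print(mirror_s3_uri(date(2021, 3, 9)))
# s3://govuk-data-infrastructure-integration/20210309-govuk-production-mirror-replica
```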
You can now go to the Jupyter notebooks directory.
Go to the Jupyter notebooks directory
Go to the root of the govuk-intent-detector repo directory on your local machine.
Run jupyter notebook in your command line to open Jupyter.
In Jupyter, go to the notebooks/page_term_tfidf_matrix directory.
Produce a page-term TF-IDF matrix for the GOV.UK mirror
You can produce a page-term TF-IDF matrix for the GOV.UK mirror, or for the BigQuery data set.
To produce a page-term TF-IDF matrix for the GOV.UK mirror, you must run multiple Python notebooks in a certain order.
Before you run each notebook, change the DATA_DATE in that notebook to the date you want to look at.
Open each notebook in turn, and run all of the cells in that notebook.
Run the notebooks in the following order.
Name | Purpose | Output location |
---|---|---|
002 | Gets the filepaths to every HTML file from the GOV.UK mirror. | The govuk-intent-detector/data/interim folder on your local machine. |
004b | Parses all the valid HTML pages from notebook 002 to show only visible text. Removes irrelevant HTML such as the header and banner. | The govuk-intent-detector/data/interim folder on your local machine. |
005 | Changes the outputs from notebook 004b: sets to lowercase; removes numbers and non-alpha symbols; lemmatises output using WordNet parts of speech; removes English stopwords. | The govuk-intent-detector/data/interim folder on your local machine. |
006 | Generates the TF-IDF matrix. | The govuk-intent-detector/data/processed folder on your local machine. |
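The first two steps in the table (finding the HTML files, then keeping only the visible text) can be sketched with the Python standard library as follows. This is an illustration and not the notebooks' own code: the function names, the `VisibleTextParser` class and the list of skipped tags are assumptions, and the notebooks may use a different HTML parser.

```python
from html.parser import HTMLParser
from pathlib import Path

# Assumed set of elements whose text is not treated as visible page content.
SKIPPED_TAGS = {"script", "style", "head", "header", "footer", "nav"}


class VisibleTextParser(HTMLParser):
    """Collect text that sits outside the skipped elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # how many skipped elements we are inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def find_html_files(mirror_dir):
    """Return every .html file under the mirror folder, sorted by path."""
    return sorted(Path(mirror_dir).rglob("*.html"))


def visible_text(html):
    """Return the visible text of one HTML page as a single string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

For example, `visible_text(path.read_text())` for each `path` in `find_html_files(mirror_dir)` would give one plain-text string per page. The sketch assumes well-formed HTML, where every skipped element has a closing tag.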
Produce a page-term TF-IDF matrix for BigQuery
You can produce a page-term TF-IDF matrix for the GOV.UK mirror, or for the BigQuery data set.
You should produce a page-term TF-IDF matrix for the BigQuery data set to cover pages that are not normally in the GOV.UK mirror, such as search API pages.
To produce a page-term TF-IDF matrix for the BigQuery data set, you must run multiple Python notebooks in a certain order.
Change the DATA_DATE in each notebook before you run it.
Open each notebook in turn, and run all of the cells in that notebook.
Run the notebooks in the following order.
Name | Purpose | Output location |
---|---|---|
001 | Gets all visited GOV.UK pages from BigQuery for a specific day. | The govuk-intent-detector/data/interim folder on your local machine. |
002 | Gets the filepaths to every HTML file from the GOV.UK mirror. | The govuk-intent-detector/data/interim folder on your local machine. |
003 | Compares the outputs from 001 and 002 and web-scrapes GOV.UK for any missing pages. | The govuk-intent-detector/data/interim folder on your local machine. |
004 | Parses all the valid HTML pages from 001 to show only visible text. If a fragment (#) exists in the page path, the notebook only extracts the visible text at and after this fragment. | The govuk-intent-detector/data/interim folder on your local machine. |
005 | Changes the outputs from notebook 004b: sets to lowercase; removes numbers and non-alpha symbols; lemmatises output using WordNet parts of speech; removes English stopwords. | The govuk-intent-detector/data/interim folder on your local machine. |
006 | Generates the TF-IDF matrix. | The govuk-intent-detector/data/processed folder on your local machine. |
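The text clean-up that the table describes for notebook 005 can be sketched roughly as below. This is a simplified illustration: the `preprocess` function and the short stopword list are hypothetical, and the lemmatisation step is left out because it needs a WordNet-based lemmatiser and its data files.

```python
import re

# A tiny illustrative stopword list; a real English list is much longer.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, drop numbers and non-alpha symbols, then remove stopwords."""
    lowered = text.lower()
    # Replace anything that is not a lowercase letter or whitespace with a space.
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    return [token for token in letters_only.split() if token not in stopwords]


print(preprocess("Apply for a UK passport in 2024: fees & forms"))
# ['apply', 'uk', 'passport', 'fees', 'forms']
```

The order of the steps matters in this sketch: lowercasing first means the letter filter and the stopword check only need to handle lowercase text.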