Use the GOV.UK feedback spam classifier

The GOV.UK feedback spam classifier is a machine learning model that performs binary classification: it labels feedback data as spam or not spam.

To develop this model, we refined a random forest classifier model that uses the scikit-learn sklearn.ensemble API.

Use the following documentation to try the feedback spam classifier on your local machine.

Set up your local machine

Before you can try the feedback spam classifier, you must complete the following steps.

  1. Set up a Python virtual environment on your local machine.

    You can use venv, which is built into Python, or an external tool such as pyenv.

  2. Specify a version of Python to use in the virtual environment. You must use Python 3.6.1 or later.

    How you do this depends on the virtual environment tool you are using.

  3. Make sure your Python virtual environment has pip installed. If your virtual environment does not have pip installed, then install pip manually.

  4. Clone the feedback spam classifier repo to your local machine.

  5. Start your virtual environment and, from inside the cloned repo, use pip to install all the Python packages that the spam classifier needs to run:

    pip install -r requirements.txt
    
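
As a rough illustration of steps 1 to 3, the sketch below checks the Python version and creates a virtual environment with pip using Python's built-in venv module. The function names and the `.venv` directory name are assumptions made for this example and are not part of the repo.

```python
import sys
import venv

MINIMUM_VERSION = (3, 6, 1)  # minimum version stated in this documentation


def version_is_supported(version_info=None):
    """Return True if the given (or running) Python version is 3.6.1 or later."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:3]) >= MINIMUM_VERSION


def create_environment(path=".venv"):
    """Create a virtual environment at `path` with pip installed.

    `.venv` is a hypothetical directory name chosen for this sketch.
    """
    if not version_is_supported():
        raise RuntimeError("Python 3.6.1 or later is required")
    venv.create(path, with_pip=True)
```

Calling `create_environment()` would create the environment, but you would still need to start it from the command line before running pip.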

Load environment variables

You must load environment variables as part of setting up your local machine.

Install direnv

You should use direnv to load environment variables, as this program makes sure you only have project-specific variables loaded when you are inside the project. This prevents accidental conflicts with identically named environment variables.

  1. Run the following in the command line to install direnv using Homebrew:

    brew install direnv
    
  2. Add the shell hooks to your bash profile:

    echo 'eval "$(direnv hook bash)"' >> ~/.bash_profile
    
  3. Check you have added the shell hooks correctly:

    cat ~/.bash_profile
    

    If you have added the shell hooks correctly, the output should include eval "$(direnv hook bash)".

  4. Restart your command line interface to finish installing direnv.
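
If you prefer to check the installation from Python, a minimal sketch might look like the following. It assumes the hook was added to ~/.bash_profile as described above. The function names are hypothetical.

```python
import shutil
from pathlib import Path

HOOK_LINE = 'eval "$(direnv hook bash)"'


def hook_is_present(profile_text):
    """Return True if a line of the profile is exactly the direnv bash hook."""
    return any(line.strip() == HOOK_LINE for line in profile_text.splitlines())


def check_direnv(profile_path=None):
    """Report whether direnv is on PATH and whether the hook is in the profile."""
    if profile_path is None:
        profile_path = Path.home() / ".bash_profile"
    profile_path = Path(profile_path)
    text = profile_path.read_text() if profile_path.exists() else ""
    return {
        "direnv_on_path": shutil.which("direnv") is not None,
        "hook_in_profile": hook_is_present(text),
    }
```

Both values in the result of `check_direnv()` should be True once direnv is installed and the hook has been added.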

Load the secrets environment variable

You must store secrets and credentials in a .secrets file. This file is not version-controlled. Do not commit secrets to GitHub.

  1. Go to the root of the govuk-feedback-spam-classifier directory on your local machine and create a .secrets file:

    touch .secrets
    
  2. Open this .secrets file using your preferred text editor, and add any secrets as environment variables.

    For example, to add a JSON credentials file for Google BigQuery, add the following to the .secrets file:

    export GOOGLE_APPLICATION_CREDENTIALS="<SECRETS-FILE-ABSOLUTE-FILEPATH-AND-FILENAME>"
    
  3. Make sure the .envrc file contains the following line, and that the line is not commented out:

    source_env ".secrets"
    

    This makes sure direnv uses .envrc to load the .secrets file, without the .secrets file itself being version-controlled.
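
To confirm that a secret has been loaded, a script could check the environment before it runs. The following is a minimal sketch, assuming the GOOGLE_APPLICATION_CREDENTIALS example above. The function name and error messages are hypothetical and are not taken from the repo.

```python
import os
from pathlib import Path


def credentials_path(environ=None):
    """Return the credentials file path from the environment, or raise if unusable."""
    if environ is None:
        environ = os.environ
    value = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not value:
        raise RuntimeError(
            "GOOGLE_APPLICATION_CREDENTIALS is not set; check .secrets and run direnv allow"
        )
    path = Path(value)
    if not path.is_absolute():
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS must be an absolute path")
    if not path.is_file():
        raise RuntimeError("No credentials file found at {}".format(path))
    return path
```

Passing a dictionary instead of relying on `os.environ` makes the check easy to try without changing your real environment.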

Allow direnv in your virtual environment

Every time you change .envrc or .secrets, you must allow direnv in your virtual environment.

  1. Go to the govuk-feedback-spam-classifier folder in your virtual environment. You will see the following message:

    direnv: error .envrc is blocked. Run `direnv allow` to approve its content.
    
  2. Run direnv allow to allow direnv in your virtual environment.

You have now set up your local machine to try the feedback spam classifier.

Run the feedback spam classifier Python scripts

To try the feedback spam classifier, run the following 4 Python scripts in order in your virtual environment. These scripts use the default feedback data set.

If you want to try the feedback spam classifier with a different data set, you must:

  • specify where this different data set is in the params.yaml file in the feedback spam classifier repo
  • make sure this different data set is similar in format and structure to the default data set
  • make any necessary changes to the Python scripts to reflect the different data set’s format and structure
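
One way to check whether a different data set has a similar structure is to compare its column names with those of the default data set. The sketch below assumes both data sets are CSV files with a header row, which this documentation does not confirm. The function names are hypothetical.

```python
import csv


def read_header(csv_path):
    """Return the column names from the first row of a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle), [])


def compare_columns(default_columns, new_columns):
    """Return the columns missing from, and extra in, the new data set."""
    default_set, new_set = set(default_columns), set(new_columns)
    return {
        "missing": sorted(default_set - new_set),
        "extra": sorted(new_set - default_set),
    }
```

If `compare_columns(read_header(default_path), read_header(new_path))` reports any missing columns, the Python scripts are likely to need changes before they can use the new data set.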

Run script 001

Run the 001_feature_engineering.py Python script in your virtual environment:

python src/001_feature_engineering.py --config params.yaml

This script:

  • turns the raw data set into a Pandas dataframe and identifies relevant columns in the data set
  • searches the text in the Pandas dataframe for features that may indicate the feedback data is spam (for example, if the feedback text has SQL)
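
The sketch below illustrates the kind of text search this step describes. The patterns and feature names are hypothetical examples, not the features the script actually uses, and simple patterns like these will produce some false positives.

```python
import re

# Illustrative patterns only; the repo's actual features may differ.
SQL_PATTERN = re.compile(
    r"\b(select|insert|update|delete|drop|union)\b.*\b(from|into|table|set|select)\b",
    re.IGNORECASE,
)
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def extract_features(text):
    """Return simple flags that may indicate a piece of feedback is spam."""
    return {
        "has_sql": bool(SQL_PATTERN.search(text)),
        "has_url": bool(URL_PATTERN.search(text)),
    }
```

With Pandas, a function like this could be applied to each row of a text column using `Series.apply`, although the script may take a different approach.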

Run script 002

Run the 002_train.py Python script in your virtual environment:

python src/002_train.py --config params.yaml

This script trains the model and creates visualisations.

The visualisations are:

  • a confusion matrix
  • 3 representations of different calculations of feature importance
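
A confusion matrix counts how often the model's predicted labels match the actual labels. The following is a minimal sketch of that calculation for the spam and not spam labels. It does not reproduce the script's visualisation, and the function name is hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted, positive="spam"):
    """Count true and false positives and negatives for a binary classifier."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must be the same length")
    counts = Counter()
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            counts["true_positive" if truth == positive else "false_positive"] += 1
        else:
            counts["false_negative" if truth == positive else "true_negative"] += 1
    keys = ("true_positive", "false_positive", "false_negative", "true_negative")
    return {key: counts[key] for key in keys}
```

A false positive here is genuine feedback labelled as spam, and a false negative is spam labelled as genuine feedback. scikit-learn provides its own `sklearn.metrics.confusion_matrix` function for the same purpose.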

Run script 003

Run the 003_predict.py Python script in your virtual environment:

python src/003_predict.py --config params.yaml

This script:

  • tests the model on 3 sentences specified in the 003_predict.py script
  • attaches a label of spam or not spam and a confidence probability to each sentence of feedback text
  • outputs the labels and probabilities to the command line
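
The script gets its probabilities from the trained model. The sketch below only illustrates how a spam probability might be turned into a label and a confidence value. The 0.5 threshold and the function name are assumptions for this example.

```python
def label_with_confidence(spam_probability, threshold=0.5):
    """Return a (label, confidence) pair for a probability that text is spam."""
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("spam_probability must be between 0 and 1")
    if spam_probability >= threshold:
        return "spam", spam_probability
    # Confidence in the not spam label is the complement of the spam probability.
    return "not spam", 1.0 - spam_probability
```

For a scikit-learn random forest, probabilities of this kind are available from the classifier's `predict_proba` method.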

Run script 004

Run the 004_predict_new_data.py Python script in your virtual environment:

python src/004_predict_new_data.py --config params.yaml

This script:

  • applies the model to the entire data set
  • attaches a label of spam or not spam and a confidence probability to each sentence of feedback text
  • outputs the data set with the labels and probabilities to the repo data/outputs folder
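
As an illustration of the final step, the sketch below writes labelled feedback to a file and creates the output folder if it does not exist. The CSV format, the column names and the function name are assumptions for this example, and the script's real output may be structured differently.

```python
import csv
from pathlib import Path


def write_predictions(rows, output_path):
    """Write (text, label, probability) rows to a CSV file."""
    output_path = Path(output_path)
    # Create the output folder if it does not already exist.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["text", "label", "probability"])  # hypothetical column names
        writer.writerows(rows)
    return output_path
```

For example, `write_predictions(rows, "data/outputs/predictions.csv")` would write to the data/outputs folder, where predictions.csv is a hypothetical file name.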

You can now view the outputs of the feedback spam classifier model.

This page was last reviewed on 26 July 2022. It needs to be reviewed again on 26 January 2023.