Use the GOV.UK feedback spam classifier
The GOV.UK feedback spam classifier is a machine learning model that performs binary classification: it labels feedback data as spam or not spam.
To develop this model, we refined a random forest classifier model that uses the sklearn.ensemble API.
Use the following documentation to try the feedback spam classifier on your local machine.
Set up your local machine
Before you can try the feedback spam classifier, you must set up your local machine by completing the following steps.
Set up a Python virtual environment on your local machine. You can use venv, which is built into Python, or an external tool such as pyenv.
Specify a version of Python to use in the virtual environment. You must use Python 3.6.1 or later. How you do this depends on the virtual environment tool you are using.
Make sure your Python virtual environment has pip installed. If your virtual environment does not have pip installed, then install pip manually.
Clone the feedback spam classifier repo to your local machine.
Start your virtual environment and, from the root of the cloned repo, use pip to install all the Python packages that the spam classifier needs to run:
pip install -r requirements.txt
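You can check the Python version and pip requirements above from inside your virtual environment. The following is a minimal sketch and is not part of the repo; the function name check_environment is hypothetical.

```python
import importlib.util
import sys

# Minimum Python version stated in this documentation.
MIN_PYTHON = (3, 6, 1)


def check_environment(version_info=sys.version_info):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if tuple(version_info[:3]) < MIN_PYTHON:
        problems.append("Python 3.6.1 or later is required")
    # find_spec returns None when the pip package cannot be imported.
    if importlib.util.find_spec("pip") is None:
        problems.append("pip is not installed in this environment")
    return problems


# Print any problems found for the interpreter running this code.
for problem in check_environment():
    print(problem)
```

Run this with the Python interpreter of your virtual environment. If it prints nothing, both checks passed.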
Load environment variables
You must load environment variables as part of setting up your local machine.
Install direnv
You should use direnv to load environment variables, as this program makes sure you only have project-specific variables loaded when you are inside the project. This prevents accidental conflicts with identically named environment variables.
Run the following in the command line to install direnv using Homebrew:
brew install direnv
Add the shell hooks to your bash profile:
echo 'eval "$(direnv hook bash)"' >> ~/.bash_profile
Check you have added the shell hooks correctly:
cat ~/.bash_profile
If you have added the shell hooks correctly, you should see the following line in the output:
eval "$(direnv hook bash)"
Restart your command line interface to finish installing direnv.
Load the secrets environment variable
You must store secrets and credentials in a .secrets file. This file is not version-controlled, so do not commit secrets to GitHub.
Go to the root of the govuk-feedback-spam-classifier directory on your local machine and create a .secrets file:
touch .secrets
Open this .secrets file using your preferred text editor, and add any secrets as environment variables. For example, to add a JSON credentials file for Google BigQuery, add the following to the .secrets file:
export GOOGLE_APPLICATION_CREDENTIALS="<SECRETS-FILE-ABSOLUTE-FILEPATH-AND-FILENAME>"
Make sure the .envrc file has the following line uncommented:
source_env ".secrets"
This will make sure direnv uses .envrc to load the .secrets file without changing the version of the .secrets file.
Allow direnv in your virtual environment
Every time you change .envrc or .secrets, you must allow direnv in your virtual environment.
Go to the govuk-feedback-spam-classifier folder in your virtual environment. You will see the following message:
direnv: error .envrc is blocked. Run `direnv allow` to approve its content.
Run the following to allow direnv in your virtual environment:
direnv allow
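Once direnv is allowed, you can confirm that a secret has been loaded into your environment. The following is a minimal sketch that uses the GOOGLE_APPLICATION_CREDENTIALS example from this documentation; the function name credentials_problem is hypothetical and the check is not part of the repo.

```python
import os


def credentials_problem(environ=os.environ):
    """Return a description of the problem, or None if the variable looks usable."""
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    # This documentation asks for an absolute file path.
    if not os.path.isabs(path):
        return "GOOGLE_APPLICATION_CREDENTIALS is not an absolute path"
    if not os.path.isfile(path):
        return "GOOGLE_APPLICATION_CREDENTIALS does not point to a file"
    return None


print(credentials_problem() or "Credentials variable looks usable")
```

Run this from inside the govuk-feedback-spam-classifier folder, because direnv only loads the project's variables there.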
You have now set up your local machine to try the feedback spam classifier.
Run the feedback spam classifier Python scripts
You run 4 Python scripts in your virtual environment to try the feedback spam classifier. These scripts use the default feedback data set.
If you want to try the feedback spam classifier with a different data set, you must:
- specify where this different data set is in the params.yaml file in the feedback spam classifier repo
- make sure this different data set is similar in format and structure to the default data set
- make any necessary changes to the Python scripts to reflect the different data set’s format and structure
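One way to check that a different data set has a similar structure is to compare its column names with those of the default data set. The following is a minimal sketch that assumes both data sets are CSV files, which this documentation does not confirm. The column names and function names are hypothetical.

```python
import csv
import io


def read_header(csv_file):
    """Return the column names from the first row of an open CSV file."""
    return next(csv.reader(csv_file))


def compare_columns(default_header, new_header):
    """Return (missing, extra) column names of the new data set relative to the default."""
    missing = [name for name in default_header if name not in new_header]
    extra = [name for name in new_header if name not in default_header]
    return missing, extra


# Hypothetical in-memory files standing in for the two data sets.
default_file = io.StringIO("feedback_id,text,created_at\n1,Thanks,2020-01-01\n")
new_file = io.StringIO("feedback_id,comment,created_at\n7,Hello,2021-05-05\n")

missing, extra = compare_columns(read_header(default_file), read_header(new_file))
print("Missing columns:", missing)
print("Unexpected columns:", extra)
```

Any missing or unexpected columns point to the places where the Python scripts are likely to need changes.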
Run script 001
Run the 001_feature_engineering.py Python script in your virtual environment:
python src/001_feature_engineering.py --config params.yaml
This script:
- turns the raw data set into a Pandas dataframe and identifies relevant columns in the data set
- searches the text in the Pandas dataframe for features that may indicate the feedback data is spam (for example, if the feedback text has SQL)
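To illustrate the kind of feature search described above, the following is a minimal sketch that uses regular expressions. It is not the repo's implementation: the patterns, the feature names (has_sql, url_count, char_count) and the function name are hypothetical, and it works on a single string rather than a Pandas dataframe.

```python
import re

# Hypothetical patterns; the patterns used by the repo's script may differ.
SQL_PATTERN = re.compile(
    r"\b(union\s+select|drop\s+table|insert\s+into|select\s+\*\s+from)\b",
    re.IGNORECASE,
)
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def extract_features(text):
    """Return simple features that may indicate a piece of feedback is spam."""
    return {
        "has_sql": bool(SQL_PATTERN.search(text)),
        "url_count": len(URL_PATTERN.findall(text)),
        "char_count": len(text),
    }


print(extract_features("1; DROP TABLE users"))
print(extract_features("The page was helpful"))
```

Features like these only suggest that feedback may be spam. The model weighs them together when it makes a classification decision.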
Run script 002
Run the 002_train.py Python script in your virtual environment:
python src/002_train.py --config params.yaml
This script:
- uses the specified data set to train the spam classifier model
- creates visualisations to help users assess model performance
- calculates the features that are most important to the model’s classification decisions
- sends the visualisations to the repo’s outputs/visualisations folder
The visualisations are:
- a confusion matrix
- 3 representations of different calculations of feature importance
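A confusion matrix for a binary classifier counts how often each predicted label matches the actual label. The following is a minimal sketch of how those 4 counts are calculated, using made-up labels. It does not reproduce the repo's visualisation code, and the function name is hypothetical.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive="spam"):
    """Count true/false positives and negatives for a binary classifier."""
    counts = Counter()
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            counts["true_positive" if truth == positive else "false_positive"] += 1
        else:
            counts["false_negative" if truth == positive else "true_negative"] += 1
    return counts


# Made-up labels, purely for illustration.
actual = ["spam", "spam", "not spam", "not spam", "spam"]
predicted = ["spam", "not spam", "not spam", "spam", "spam"]
counts = confusion_counts(actual, predicted)
print("True positives:", counts["true_positive"])
print("False positives:", counts["false_positive"])
print("False negatives:", counts["false_negative"])
print("True negatives:", counts["true_negative"])
```

A model that performs well has high counts of true positives and true negatives relative to the other 2 counts.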
Run script 003
Run the 003_predict.py Python script in your virtual environment:
python src/003_predict.py --config params.yaml
This script:
- tests the model on 3 sentences specified in the 003_predict.py script
- attaches a label of spam or not spam and a confidence probability to each sentence of feedback text
- outputs the labels and probabilities to the command line
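The following is a minimal sketch of how a label and a confidence probability can be derived from a model's estimated probability that a sentence is spam. It assumes a threshold of 0.5 and treats confidence as the probability of the chosen label; the repo's script may do this differently. The function name, sentences and probabilities are hypothetical.

```python
def label_with_confidence(spam_probability, threshold=0.5):
    """Turn a spam probability into a (label, confidence) pair."""
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("spam_probability must be between 0 and 1")
    if spam_probability >= threshold:
        return "spam", spam_probability
    # Confidence in "not spam" is the remaining probability.
    return "not spam", 1.0 - spam_probability


# Made-up sentences and probabilities, purely for illustration.
examples = [("Buy cheap pills now", 0.75), ("The form would not submit", 0.25)]
for sentence, probability in examples:
    label, confidence = label_with_confidence(probability)
    print(sentence, "->", label, confidence)
```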
Run script 004
Run the 004_predict_new_data.py Python script in your virtual environment:
python src/004_predict_new_data.py --config params.yaml
This script:
- applies the model to the entire data set
- attaches a label of spam or not spam and a confidence probability to each sentence of feedback text
- outputs the data set with the labels and probabilities to the repo’s data/outputs folder
You can now view the outputs of the feedback spam classifier model.
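If you want to run all 4 scripts in sequence, the following is a minimal sketch that builds the same commands shown above and runs them one after another. It is not part of the repo; the function names build_commands and run_all are hypothetical.

```python
import subprocess
import sys

# Script paths as given in this documentation, in the order they are run.
SCRIPTS = [
    "src/001_feature_engineering.py",
    "src/002_train.py",
    "src/003_predict.py",
    "src/004_predict_new_data.py",
]


def build_commands(config="params.yaml", python=sys.executable):
    """Return one command per script, as argument lists for subprocess."""
    return [[python, script, "--config", config] for script in SCRIPTS]


def run_all(config="params.yaml"):
    """Run each script in order, stopping at the first failure."""
    for command in build_commands(config):
        # check=True raises CalledProcessError if a script exits with an error.
        subprocess.run(command, check=True)
```

Call run_all() from the root of the repo with your virtual environment active, so that sys.executable points to the Python interpreter that has the required packages installed.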