Use the GOV.UK feedback spam classifier
The GOV.UK feedback spam classifier is a binary machine learning classifier that labels feedback data as spam or not spam.
To develop this model, we refined a random forest classifier built with the scikit-learn sklearn.ensemble API.
Use the following documentation to try the feedback spam classifier on your local machine.
Set up your local machine
You must set up your local machine before you can try the feedback spam classifier.
Set up a Python virtual environment on your local machine. You can use venv, which is built into Python, or an external tool such as pyenv.
Specify a version of Python to use in the virtual environment. You must use Python 3.6.1 or later. How you do this depends on the virtual environment tool you are using.
Make sure your Python virtual environment has pip installed. If your virtual environment does not have pip installed, then install pip manually.
Start your virtual environment and use pip to install all the Python packages that the spam classifier needs to run:
pip install -r requirements.txt
Clone the feedback spam classifier repo to your local machine.
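As a quick check of the requirements above, the following minimal sketch reports whether the active interpreter meets the stated minimum version and whether pip can be found. The function names are illustrative and are not part of the repo.

```python
import importlib.util
import sys

# Minimum Python version stated in this guide.
MINIMUM_PYTHON = (3, 6, 1)


def python_is_supported(version_info=None):
    """Return True if the given (or current) version is at least MINIMUM_PYTHON."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:3]) >= MINIMUM_PYTHON


def pip_is_installed():
    """Return True if the pip package can be found in this environment."""
    return importlib.util.find_spec("pip") is not None


if __name__ == "__main__":
    print("Python version supported:", python_is_supported())
    print("pip installed:", pip_is_installed())
```

Run this with the virtual environment's own Python so that the result reflects that environment rather than the system interpreter.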
Load environment variables
You must load environment variables as part of setting up your local machine.
Install direnv
You should use direnv to load environment variables, as this program makes sure you only have project-specific variables loaded when you are inside the project. This prevents accidental conflicts with identically named environment variables.
Run the following in the command line to install direnv using Homebrew:
brew install direnv
Add the shell hooks to your bash profile:
echo 'eval "$(direnv hook bash)"' >> ~/.bash_profile
Check you have added the shell hooks correctly:
cat ~/.bash_profile
If you have added the shell hooks correctly, you should see eval "$(direnv hook bash)" as output.
Restart your command line interface to finish installing direnv.
Load the secrets environment variable
You must store secrets and credentials in a .secrets file. This file is not version-controlled. Do not commit secrets to GitHub.
Go to the root of the govuk-feedback-spam-classifier directory on your local machine and create a .secrets file:
touch .secrets
Open this .secrets file using your preferred text editor, and add any secrets as environment variables.
For example, to add a JSON credentials file for Google BigQuery, add the following to the .secrets file:
export GOOGLE_APPLICATION_CREDENTIALS="<SECRETS-FILE-ABSOLUTE-FILEPATH-AND-FILENAME>"
Make sure the .envrc file has the following line uncommented:
source_env ".secrets"
This makes sure direnv uses .envrc to load the .secrets file without the .secrets file being version-controlled.
Allow direnv in your virtual environment
Every time you change .envrc or .secrets, you must allow direnv in your virtual environment.
Go to the govuk-feedback-spam-classifier folder in your virtual environment. You will see the following message:
direnv: error .envrc is blocked. Run `direnv allow` to approve its content.
Run direnv allow to allow direnv in your virtual environment.
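After running direnv allow, one way to confirm that the variables from .secrets have loaded is to inspect the environment from Python. The sketch below checks the GOOGLE_APPLICATION_CREDENTIALS example used in this guide. The check_credentials function is hypothetical and is not part of the repo.

```python
import os


def check_credentials(env=None, name="GOOGLE_APPLICATION_CREDENTIALS"):
    """Return a short status for a credentials variable: missing, file not found, or ok."""
    if env is None:
        env = os.environ
    path = env.get(name)
    if not path:
        # The variable is unset or empty, so direnv has probably not loaded .secrets.
        return "missing"
    if not os.path.isfile(path):
        # The variable is set, but it does not point to an existing file.
        return "file not found"
    return "ok"


if __name__ == "__main__":
    print("GOOGLE_APPLICATION_CREDENTIALS:", check_credentials())
```

A result of "missing" inside the project folder suggests that direnv has not loaded .secrets. In that case, check the shell hook and run direnv allow again.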
You have now set up your local machine to try the feedback spam classifier.
Run the feedback spam classifier Python scripts
To try the feedback spam classifier, run 4 Python scripts in your virtual environment. These scripts use the default feedback data set.
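If you prefer to run the scripts in one go, the sketch below builds the same 4 commands described in the following sections and runs them one after another. It assumes the scripts are meant to run in numeric order from the root of the repo. The run_all helper is illustrative and is not part of the repo.

```python
import subprocess
import sys

# Script paths as given in this guide, in numeric order.
SCRIPTS = [
    "src/001_feature_engineering.py",
    "src/002_train.py",
    "src/003_predict.py",
    "src/004_predict_new_data.py",
]


def build_commands(config="params.yaml", python=sys.executable):
    """Return one command (as a list of arguments) per script."""
    return [[python, script, "--config", config] for script in SCRIPTS]


def run_all(config="params.yaml"):
    """Run each script in turn and stop at the first one that fails."""
    for command in build_commands(config):
        print("Running:", " ".join(command))
        subprocess.run(command, check=True)
```

Call run_all() from the root of the repo with your virtual environment active. Because check=True is set, a failing script raises subprocess.CalledProcessError and the later scripts do not run.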
If you want to try the feedback spam classifier with a different data set, you must:
- specify where this different data set is in the params.yaml file in the feedback spam classifier repo
- make sure this different data set is similar in format and structure to the default data set
- make any necessary changes to the Python scripts to reflect the different data set’s format and structure
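One way to check that a different data set has a similar structure is to compare its column names with those of the default data set. The sketch below assumes both data sets are CSV files, which this guide does not confirm. The column names and sample rows are invented for illustration.

```python
import csv
import io


def read_header(csv_file):
    """Return the column names from the first row of an open CSV file."""
    return next(csv.reader(csv_file), [])


def compare_headers(default_header, new_header):
    """Return (columns missing from the new data set, columns only in the new data set)."""
    missing = [column for column in default_header if column not in new_header]
    extra = [column for column in new_header if column not in default_header]
    return missing, extra


# Hypothetical sample data standing in for the two files.
default_csv = io.StringIO("feedback_id,text,date\n1,Great page,2021-01-01\n")
new_csv = io.StringIO("feedback_id,comment,date\n7,Buy now,2021-02-01\n")

missing, extra = compare_headers(read_header(default_csv), read_header(new_csv))
print("Missing columns:", missing)
print("Unexpected columns:", extra)
```

To use this with real files, replace the io.StringIO objects with files opened using open(path, newline="").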
Run script 001
Run the 001_feature_engineering.py Python script in your virtual environment:
python src/001_feature_engineering.py --config params.yaml
This script:
- turns the raw data set into a Pandas dataframe and identifies relevant columns in the data set
- searches the text in the Pandas dataframe for features that may indicate the feedback data is spam (for example, if the feedback text has SQL)
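To illustrate the kind of text feature described above, the sketch below flags feedback that looks like SQL and counts URLs. The patterns and feature names are assumptions chosen for illustration. The features that 001_feature_engineering.py actually computes may differ.

```python
import re

# Crude, illustrative patterns. Ordinary sentences that contain both an SQL
# keyword and a word such as "from" can produce false positives.
SQL_PATTERN = re.compile(
    r"\b(select|insert|update|delete|drop|union)\b.*\b(from|into|set|table|select)\b",
    re.IGNORECASE | re.DOTALL,
)
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def extract_features(text):
    """Return a dictionary of simple features for one piece of feedback text."""
    return {
        "has_sql": bool(SQL_PATTERN.search(text)),
        "url_count": len(URL_PATTERN.findall(text)),
        "char_count": len(text),
    }


print(extract_features("SELECT * FROM users"))
print(extract_features("The page was helpful"))
```

Features like these could become extra columns alongside the feedback text before the model is trained.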
Run script 002
Run the 002_train.py Python script in your virtual environment:
python src/002_train.py --config params.yaml
This script:
- uses the specified data set to train the spam classifier model
- creates visualisations to help users assess model performance
- calculates the features that are most important to the model’s classification decisions
- sends the visualisations to the repo’s outputs/visualisations folder
The visualisations are:
- a confusion matrix
- 3 representations of different calculations of feature importance
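For readers unfamiliar with confusion matrices, the sketch below counts the 4 cells of a binary confusion matrix in plain Python, treating spam as the positive class. The labels are made-up examples, and this is not the code that 002_train.py uses to produce its visualisation.

```python
from collections import Counter


def confusion_matrix(actual, predicted, positive="spam"):
    """Count true/false positives and negatives for a binary classifier."""
    counts = Counter(
        (a == positive, p == positive) for a, p in zip(actual, predicted)
    )
    return {
        "tp": counts[(True, True)],    # spam correctly labelled spam
        "fn": counts[(True, False)],   # spam labelled not spam
        "fp": counts[(False, True)],   # not spam labelled spam
        "tn": counts[(False, False)],  # not spam correctly labelled not spam
    }


# Made-up labels for illustration only.
actual = ["spam", "spam", "not spam", "not spam", "spam"]
predicted = ["spam", "not spam", "not spam", "spam", "spam"]
matrix = confusion_matrix(actual, predicted)
print(matrix)
```

A model that performs well has most of its counts in the tp and tn cells.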
Run script 003
Run the 003_predict.py Python script in your virtual environment:
python src/003_predict.py --config params.yaml
This script:
- tests the model on 3 sentences specified in the 003_predict.py script
- attaches a label of spam or not spam and a confidence probability to each sentence of feedback text
- outputs the labels and probabilities to the command line
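The sketch below shows one plausible way to turn a predicted spam probability into a label and a confidence value. It assumes a 0.5 threshold and that confidence means the probability of the chosen label. This guide does not state how 003_predict.py defines either, so treat both as assumptions.

```python
def label_with_confidence(spam_probability, threshold=0.5):
    """Return (label, confidence) for one predicted probability of spam."""
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("spam_probability must be between 0 and 1")
    if spam_probability >= threshold:
        return "spam", spam_probability
    # For "not spam", report the probability of the chosen label.
    return "not spam", 1.0 - spam_probability


# Made-up probabilities for illustration only.
for probability in (0.9, 0.25):
    label, confidence = label_with_confidence(probability)
    print(label, confidence)
```

In scikit-learn, a fitted RandomForestClassifier returns class probabilities from its predict_proba method, which could supply the spam_probability input used here.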
Run script 004
Run the 004_predict_new_data.py Python script in your virtual environment:
python src/004_predict_new_data.py --config params.yaml
This script:
- applies the model to the entire data set
- attaches a label of spam or not spam and a confidence probability to each sentence of feedback text
- outputs the data set with the labels and probabilities to the repo’s data/outputs folder
You can now view the outputs of the feedback spam classifier model.