GOV.UK User Feedback processing
GOV.UK provides multiple channels through which users can provide feedback on their experience or report any issues with the platform. Those channels include the ‘Report a Problem’, ‘Contact’, and ‘Is this page useful?’ forms on GOV.UK pages.
The objective of the GOV.UK Feedback Processing project is to facilitate the analysis of user feedback data by:
- Consolidating feedback data from multiple sources into a single source/format
- Sanitising feedback data by removing or masking potentially offensive, private, or sensitive text
The project meets these objectives by:
- Delivering technical components to enable the sourcing, consolidation, and sanitisation of user feedback data
- Defining ‘workflows’ which coordinate the execution of multiple components
- Implementing the mechanisms used to automate the daily execution of workflows
All resources defined and implemented for the purposes of feedback processing are deployed into Google Cloud.
Technical definitions for components, workflows, and other required cloud computing resources are contained within the govuk-feedback-processing GitHub repository.
The two sources of GOV.UK user feedback data are:
- SmartSurvey: a vendor tool used to capture user feedback submitted via the ‘Is this page useful?’ feedback route
- Support API: a GOV.UK internal API which routes feedback responses submitted via ‘Report a Problem’, ‘Contact’, and other routes to a Postgres database instance
Components
Note that component descriptions contain hyperlinks to component Terraform definitions in the govuk-feedback-processing repository.
BigQuery Table
The BigQuery table to which processed data is saved.
A single user feedback submission may contain multiple user ‘responses’ where a user has been issued more than one ‘prompt’ for feedback or comment.
The BigQuery table uses field nesting to capture all prompts and their associated responses against a single record for a given user submission.
The table is partitioned by feedback record creation date.
Queries against the table must contain a filter on the field containing the record creation date (created).
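A minimal sketch of a query that satisfies the partition filter requirement is shown below. It assumes that created is a TIMESTAMP column and that the caller supplies the fully qualified table reference, since the project and dataset names are not given here. The function name build_feedback_query is hypothetical.

```python
from datetime import date, timedelta


def build_feedback_query(table: str, day: date) -> str:
    """Return a SQL query limited to one day's partition.

    Assumption: `created` is a TIMESTAMP column. `table` is a fully
    qualified reference provided by the caller (hypothetical value).
    """
    next_day = day + timedelta(days=1)
    # A half-open range on the partitioning field lets BigQuery prune partitions.
    return (
        f"SELECT * FROM `{table}` "
        f"WHERE created >= TIMESTAMP('{day.isoformat()}') "
        f"AND created < TIMESTAMP('{next_day.isoformat()}')"
    )
```

The resulting string can be passed to whichever BigQuery client is in use. A query without a filter on created would be rejected by the table.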
Networking Components
A ‘Virtual Private Cloud’ (VPC) is required to enable secure communication between components (such as between a Cloud Function/API and a Cloud SQL instance).
The following network components have been created:
- A compute subnetwork - definition
- A VPC network - definition
- A VPC access connector - definition
- A compute global address - definition
- A private VPC connection - definition
SQL Database Instance
A Postgres instance is used as the basis for enabling access to feedback data contained within the Support API Datastore.
The Support API is a GOV.UK-internal technical component which routes feedback data submitted by GOV.UK users. For the majority of feedback types, the Support API persists feedback in a Postgres database instance hosted in AWS. That database instance is not accessible for the purposes of data analysis.
Backup files for that database instance are taken daily and stored in S3 cloud object storage. Access to this object storage has been granted to enable the contents to be mirrored in equivalent cloud object storage in Google Cloud. The GOVUK S3 Mirror project assumes responsibility for copying backup files between the AWS and Google Cloud environments.
The files use a proprietary, non-human-readable format, so they must be restored into a Postgres database instance before the data they contain can be queried and read.
A Postgres 14 database instance (support-api) has been created in Google Cloud to enable the restoration of Support API Datastore backup files and access to the data they contain.
The instance has been configured to be accessible via a Google Cloud-internal private network only. The instance cannot by default accept incoming connections from external networks (for the purposes of restoring backup files or handling SQL queries).
The following configuration options have been set for the Postgres database instance:
Option | Value | Description |
---|---|---|
Disk size | 30GB | Ensures there is enough usable storage to enable restoration of Support API backup files (~10GB) |
max_wal_size | 3008 | Increased from the default value to improve performance of pg_restore. More information available in the Postgres documentation |
maintenance_work_mem | 2048 | As above. More information available in the Postgres documentation |
Publisher/Subscriber Topic
A publisher/subscriber (pub/sub) topic (support-api-backup-staging) is used to enable notifications relating to the creation of new Support API database backup files to be passed between the Google Cloud Storage bucket where they are kept (govuk-s3-mirror_govuk-integration-database-backups) and the component used to handle them (propogate-support-api).
Support API database backup files are copied from GOV.UK AWS to a Google Cloud Storage bucket (govuk-s3-mirror_govuk-integration-database-backups) by the GOV.UK S3 Mirror. This process happens daily.
The Cloud Storage bucket is configured to publish a notification into the support-api-backup-staging topic when the creation of a new file is completed in the bucket.
These notifications are used to trigger the support-api-workflow workflow.
Workflow Triggers
Cloud Scheduler Trigger
The smart-survey-trigger Cloud Scheduler job is used to automatically invoke the Smart Survey Workflow on a daily basis at 07:00 GMT. The job uses the workflow’s HTTP Uniform Resource Identifier (URI) to trigger an execution of the workflow.
EventArc Trigger
The support-api-workflow-trigger EventArc trigger is used to automatically invoke the Support API Workflow when a new message has been published into the support-api-backup-staging pub/sub topic.
Filter criteria in the trigger ensure that the workflow is only invoked on publication of messages relating to the creation of new files in relevant upstream buckets (S3 Mirror). The trigger refers to the workflow’s internal identifier to trigger an execution of the workflow.
Workflows
Google Cloud Workflows are used to sequence the execution of individual cloud functions as part of a longer ‘pipeline’ process.
Smart Survey Workflow
The smart-survey-workflow is used to:
- Read ‘raw’ Smart Survey data from the Smart Survey API into the staging Cloud Storage bucket
- Process the ‘raw’ Smart Survey data from the bucket
- Save the output records into Google BigQuery
The workflow calls the read-raw-feedback and process-raw-feedback cloud functions in sequence. The process is run daily at 07:00 GMT, and is triggered by Cloud Scheduler.
The workflow creates an internal variable, type, and sets its value to “SmartSurvey”. This variable is passed as input to each called cloud function which requires type as an input.
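The sequencing described above can be modelled in a few lines of Python. This is only an illustrative sketch: the real workflow is a Google Cloud Workflows definition, the cloud functions are represented here as plain callables, and the output_file key used to hand the staged file name to the second function is a hypothetical name.

```python
from typing import Callable, Dict

# A cloud function is modelled as a callable taking and returning a dict.
CloudFunction = Callable[[Dict[str, str]], Dict[str, str]]


def run_smart_survey_workflow(
    read_raw_feedback: CloudFunction,
    process_raw_feedback: CloudFunction,
) -> Dict[str, str]:
    """Call the two functions in sequence, sharing the `type` value."""
    feedback_type = "SmartSurvey"  # the workflow's internal `type` variable
    read_result = read_raw_feedback({"type": feedback_type})
    # Assumption: the first response exposes the staged file name as
    # "output_file" (hypothetical key).
    return process_raw_feedback(
        {"type": feedback_type, "input_file": read_result["output_file"]}
    )
```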
Support API Workflow
The support-api-workflow is used to:
- Fetch the latest Support API Datastore backup file from the S3 Mirror Cloud Storage bucket
- Restore the Support API Datastore backup file to a Google Cloud-hosted Postgres instance
- Read the raw Support API data from the Google Cloud-hosted Postgres instance into the staging Cloud Storage bucket
- Process the raw Support API data from the staging bucket
- Save the output records into Google BigQuery
The workflow calls the propogate-support-api, read-raw-feedback and process-raw-feedback cloud functions in sequence. The process is run each time a new Support API Datastore backup file is created in the S3 Mirror Cloud Storage bucket; typically, this occurs daily at 06:00 GMT.
The workflow creates an internal variable, type, and sets its value to “SupportAPI”. This variable is passed as input to each called cloud function which requires type as an input.
Cloud Functions
Google Cloud Functions are used to execute the logic involved in the restoration of Support API database backups, reading raw feedback data from upstream sources, and processing feedback data.
Propogate Support API Data
The propogate-support-api function is used to:
- Read a Support API Database back-up file from Cloud Storage into the local compute environment
- Use the Postgres pg_restore utility to:
  - Purge the Google Cloud Postgres instance of data
  - Restore the contents of the Support API Database back-up file into the Google Cloud Postgres instance
The propogate-support-api function responds to messages published into the support-api-backup-staging pub/sub topic, and accepts a cloud event object as an input. Messages are published into that topic whenever new Support API datastore backup files are copied into the S3 Mirror bucket as per the pub/sub topic documentation.
Messages published into the support-api-backup-staging topic will contain bucketId and objectId attributes which uniquely identify a Support API Database backup file. The function uses these attributes to read the file from Cloud Storage into the local compute environment.
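As an illustration, the two attributes could be pulled out of the event payload as sketched below. The sketch assumes the payload follows the usual Pub/Sub event shape, with attributes nested under message; the helper name extract_backup_location is hypothetical and is not taken from the repository.

```python
from typing import Mapping, Tuple


def extract_backup_location(event_data: Mapping) -> Tuple[str, str]:
    """Return (bucketId, objectId) from a Pub/Sub event payload.

    Assumption: attributes are nested under event_data["message"]["attributes"].
    """
    attributes = event_data.get("message", {}).get("attributes", {})
    try:
        return attributes["bucketId"], attributes["objectId"]
    except KeyError as missing:
        # Surface a clear error rather than failing later during the download.
        raise ValueError(f"notification is missing attribute {missing}") from None
```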
The function then uses the pg_restore binary executable utility to restore the contents of the Support API Database backup file to the Google Cloud-hosted Postgres instance. The connection between the function and the Postgres instance is facilitated by a Virtual Private Cloud (VPC) connector.
The function depends on the CLOUD_SQL_QUERY_USER and CLOUD_SQL_QUERY_PASS secrets having been set in the project environment. These secrets are the credentials used by the function to authenticate with the Google Cloud-hosted Postgres instance. Failure to access or determine these secret values will result in the function returning an HTTP 500 (internal server) error.
- Required input: cloud event object containing valid bucketId and objectId attributes
- Output: HTTP response (200 for success)
- Max timeout: 60 minutes
- Resource definition
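The credential handling and restore step might look roughly like the sketch below. It assumes the secrets are exposed to the function as environment variables, and the pg_restore options shown (--clean, --if-exists, --no-owner) are one plausible way to purge and restore; the options actually used by the function are not stated here. The helper names, host, and database values are hypothetical.

```python
import os
from typing import Dict, List, Mapping, Tuple


def load_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Read the database credentials, failing loudly if either is absent."""
    try:
        return env["CLOUD_SQL_QUERY_USER"], env["CLOUD_SQL_QUERY_PASS"]
    except KeyError as missing:
        # A caller would translate this into an HTTP 500 response.
        raise RuntimeError(f"secret {missing} is not set") from None


def build_restore_command(
    backup_path: str, host: str, database: str, user: str, password: str
) -> Tuple[List[str], Dict[str, str]]:
    """Return the pg_restore argument list and the extra environment it needs."""
    command = [
        "pg_restore",
        "--clean",      # drop existing objects before recreating them
        "--if-exists",  # avoid errors when an object to drop is absent
        "--no-owner",   # do not restore original object ownership
        f"--host={host}",
        f"--username={user}",
        f"--dbname={database}",
        backup_path,
    ]
    # pg_restore reads the password from PGPASSWORD rather than from a flag.
    return command, {"PGPASSWORD": password}
```

The command could then be run with subprocess.run(command, env={**os.environ, **extra_env}, check=True), so that a failed restore raises an exception instead of passing silently.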
Read Raw Feedback
The read-raw-feedback
function is used to:
- Connect to an upstream ‘system of record’ containing unmodified feedback data
- Fetch all feedback records which were created between a specified start date and end date
- Stage the unmodified, ‘raw’ feedback data in JSON format in a Cloud Storage bucket
Target systems of record include:
- SmartSurvey - accessed via an API
- Google Cloud-hosted Postgres Instance - DB instance containing Support API data accessed via VPC connection
The function depends on the following environment variables/secrets:
Name | Type | Description |
---|---|---|
SMART_SURVEY_API_ENDPOINT | Environment variable | The target Smart Survey API endpoint to query for feedback data |
SMART_SURVEY_API_TOKEN | Secret | Used to authenticate with Smart Survey |
SMART_SURVEY_API_SECRET | Secret | Used to authenticate with Smart Survey |
CLOUD_SQL_QUERY_USER | Secret | Username for authentication with Postgres instance |
CLOUD_SQL_QUERY_PASS | Secret | Password for authentication with Postgres instance |
Failure to access or determine these environment variable/secret values will result in the function returning an HTTP 500 (internal server) error.
- Required input: type - string value indicating the type of feedback to be read. Must be one of “SmartSurvey” or “SupportAPI”
- Optional input: accepts start_date and end_date values in the format “YYYY-MM-DD HH:MM:SS”. If not specified, or if only one of the two is provided, the function defaults these values to the start and end of the prior day
- Output:
  - HTTP response containing a response code (200 for success) and the name of the output file
  - A JSON file in the project staging bucket named using the convention datasource-date; e.g. supportapi-2022-10-01.
- Max timeout: 9 minutes
- Resource definition
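The date defaulting and file naming rules can be sketched as follows. This is an illustration under assumptions: the end of the prior day is taken to be 23:59:59, the date in the file name is taken from the start of the range, and both function names are hypothetical.

```python
from datetime import datetime, time, timedelta
from typing import Optional, Tuple

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def resolve_date_range(
    start_date: Optional[str], end_date: Optional[str], now: datetime
) -> Tuple[str, str]:
    """Return the (start, end) strings to query, defaulting to the prior day."""
    if start_date and end_date:
        # Parse both values to reject anything not in the expected format.
        start = datetime.strptime(start_date, DATE_FORMAT)
        end = datetime.strptime(end_date, DATE_FORMAT)
    else:
        # Neither value, or only one of the two, was provided.
        yesterday = (now - timedelta(days=1)).date()
        start = datetime.combine(yesterday, time.min)
        end = datetime.combine(yesterday, time(23, 59, 59))
    return start.strftime(DATE_FORMAT), end.strftime(DATE_FORMAT)


def staging_file_name(feedback_type: str, start: str) -> str:
    """Build a name following the datasource-date convention."""
    return f"{feedback_type.lower()}-{start[:10]}"
```

Passing now as an argument, rather than reading the clock inside the function, keeps the defaulting logic easy to test.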
Process Raw Feedback
The process-raw-feedback
function is used to:
- Read a specified ‘raw’ feedback JSON file from the staging Cloud Storage bucket
- Process the ‘raw’ feedback data:
- Re-format the data to adhere to the output feedback data schema
- Remove instances of profanity from feedback text
- Remove instances of personally identifiable information (PII)
- Output the data to BigQuery
Processes used to remove profanity and PII are described in the Text processing section.
The function requires that two inputs are provided:
- type: string value indicating the type of feedback being processed. Must be one of “SmartSurvey” or “SupportAPI”
- input_file: string value containing the name of the input file (present in the staging cloud storage bucket) to be processed
If either of these inputs is not provided or is invalid, the function will return an HTTP 400 (invalid request) error.
- Required input: type and input_file string values
- Output:
  - HTTP response (200 for success)
  - New records created in the target BigQuery table (govuk-user-feedback)
- Max timeout: 60 minutes
- Resource definition
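The input checks described above could be expressed as in the sketch below. The function name validate_request and the error messages are hypothetical; only the two input names, the accepted type values, and the 400 status come from the description.

```python
from typing import Mapping, Optional, Tuple

VALID_TYPES = {"SmartSurvey", "SupportAPI"}


def validate_request(payload: Mapping[str, object]) -> Tuple[int, Optional[str]]:
    """Return (status_code, error_message); the message is None when valid."""
    feedback_type = payload.get("type")
    input_file = payload.get("input_file")
    if not isinstance(feedback_type, str) or feedback_type not in VALID_TYPES:
        return 400, "type must be one of 'SmartSurvey' or 'SupportAPI'"
    if not isinstance(input_file, str) or not input_file.strip():
        return 400, "input_file must be a non-empty string"
    return 200, None
```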
Text Processing
The process-raw-feedback Cloud function performs text processing on raw feedback text with the purpose of masking instances of profanity or PII within the text.
Profanity Removal
The better-profanity Python package is used in combination with a pre-compiled list of censored words to:
- Identify instances of profanity in user feedback text
- Replace identified instances of profanity with a mask value (‘****’).
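To illustrate the masking behaviour without depending on the better-profanity package, the sketch below uses a standard-library regular expression as a stand-in. It is not the package's implementation, and the function name mask_profanity is hypothetical.

```python
import re
from typing import Iterable

MASK = "****"


def mask_profanity(text: str, censored_words: Iterable[str]) -> str:
    """Replace whole-word, case-insensitive matches with the mask value."""
    words = [re.escape(word) for word in censored_words if word]
    if not words:
        return text
    # Word boundaries prevent masking inside longer, unrelated words.
    pattern = re.compile(r"\b(?:" + "|".join(words) + r")\b", re.IGNORECASE)
    return pattern.sub(MASK, text)
```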
PII Removal
The objective of PII removal is to remove from feedback text any articles of data - such as phone numbers, names, or National Insurance numbers - which could be used to uniquely identify an individual.
Two methods are used to achieve this:
- Google Cloud DLP: Feedback data is parsed using Google’s Data Loss Prevention (DLP) API. The DLP tools can be configured to scan feedback text for specific types of PII, and will replace identified instances of PII with an appropriate mask, e.g. [‘PERSON_NAME’]
- Text pattern matching: Feedback data is scanned to identify the presence of text combinations which match pre-defined (regular expression) patterns. These patterns may be used to identify instances of PII which follow standard patterns of length and content (such as credit card numbers, driving licence numbers, etc.).
Comprehensive lists of the types of PII which are addressed by each method can be found below.
Cloud DLP Info Types
- Date of birth
- Email address
- Passport
- Person name
- Phone number
- Street address
- UK National Insurance number
- UK passport number
- Credit card number
- IBAN code
- IP address
- Medical terms
- Vehicle identification number
- Scotland community health index number
- UK drivers license number
- UK NHS number
- UK taxpayer reference
- SWIFT code
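The list above might translate into a Cloud DLP de-identification request along the lines of the sketch below. The infoType identifiers are a best-effort mapping of the listed types onto Cloud DLP names and should be checked against the DLP infoType reference. The parent value and the function name are hypothetical, and the project's actual configuration is not shown here.

```python
from typing import Dict, List

# Assumed Cloud DLP identifiers for the types listed above (verify before use).
INFO_TYPES: List[str] = [
    "DATE_OF_BIRTH", "EMAIL_ADDRESS", "PASSPORT", "PERSON_NAME",
    "PHONE_NUMBER", "STREET_ADDRESS", "UK_NATIONAL_INSURANCE_NUMBER",
    "UK_PASSPORT", "CREDIT_CARD_NUMBER", "IBAN_CODE", "IP_ADDRESS",
    "MEDICAL_TERM", "VEHICLE_IDENTIFICATION_NUMBER",
    "SCOTLAND_COMMUNITY_HEALTH_INDEX_NUMBER", "UK_DRIVERS_LICENSE_NUMBER",
    "UK_NATIONAL_HEALTH_SERVICE_NUMBER", "UK_TAXPAYER_REFERENCE", "SWIFT_CODE",
]


def build_deidentify_request(parent: str, text: str) -> Dict:
    """Build a request dict that replaces each finding with its infoType name."""
    return {
        "parent": parent,  # e.g. "projects/<project-id>" (hypothetical)
        "inspect_config": {"info_types": [{"name": name} for name in INFO_TYPES]},
        "deidentify_config": {
            "info_type_transformations": {
                "transformations": [
                    {"primitive_transformation": {"replace_with_info_type_config": {}}}
                ]
            }
        },
        "item": {"value": text},
    }
```

A dict of this shape is what the google-cloud-dlp client's deidentify_content method accepts as its request argument.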
Pattern Matching Types
- National Insurance number
- UK NHS number
- Credit card number
- UK drivers license number
- UK visa application reference number
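Pattern matching of this kind can be sketched with two simplified regular expressions. The patterns below are illustrative assumptions, not the project's actual expressions: the National Insurance pattern does not enforce the official prefix rules, and the NHS pattern will also match other ten-digit sequences. The mask labels are hypothetical.

```python
import re

# Simplified, illustrative patterns (assumptions, not the project's own).
PATTERNS = {
    "NATIONAL_INSURANCE_NUMBER": re.compile(
        r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b", re.IGNORECASE
    ),
    "NHS_NUMBER": re.compile(r"\b\d{3}[\s-]?\d{3}[\s-]?\d{4}\b"),
}


def mask_patterns(text: str) -> str:
    """Replace each pattern match with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```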