Skip to main content

GOV.UK User Feedback processing

GOV.UK provides multiple channels through which users can provide feedback on their experience or report any issues with the platform. Those channels include the ‘Report a Problem’, ‘Contact’, and ‘Is this page useful?’ forms on GOV.UK pages.

The objective of the GOV.UK Feedback Processing project is to facilitate the analysis of user feedback data by:

  1. Consolidating feedback data from multiple sources into a single source/format
  2. Sanitising feedback data by removing or masking potentially offensive, private, or sensitive text

The project meets these objectives by:

  1. Delivering technical components to enable the sourcing, consolidation, and sanitisation of user feedback data
  2. Defining ‘workflows’ which coordinate the execution of multiple components
  3. Implementing the mechanisms used to automate the execution of workflows (daily)

All resources defined and implemented for the purposes of feedback processing are deployed into Google Cloud.

Technical definitions for components, workflows, and other required cloud computing resources is contained within the govuk-feedback-processing GitHub repository.

The two sources of GOV.UK user feedback data are:

  • SmartSurvey: a vendor tool used to capture user feedback submitted via the ‘Is this page useful?’ feedback route
  • Support API: a GOV.UK internal API which routes feedback responses submitted via ‘Report a Problem’, ‘Contact’, and other routes to a Postgres database instance

Components

Note that component descriptions contain hyperlinks to component Terraform definitions in the govuk-feedback-processing repository.

BigQuery Table

The BigQuery table to which processed data is saved.

A single user feedback submission may contain multiple user ‘responses’ where a user has been issued more than one ‘prompt’ for feedback or comment.

The BigQuery table uses field nesting to capture all prompts and their associated responses against a single record for a given user submission.

The table is partitioned by feedback record creation date. Queries against the table must contain a filter on the field containing the record creation date (created).

Networking Components

It is necessary to create ‘Virtual Private Cloud’ (VPC) in order to enable secure communication between components (such as between a Cloud Function/API and a Cloud SQL instance).

The following network components have been created:

SQL Database Instance

A Postgres instance is used as the basis for enabling access to feedback data contained with the Support API Datastore.

The Support API is a GOV.UK-internal technical component which routes feedback data submitted by GOV.UK users. For the majority of feedback types, the Support API persists feedback in a Postgres database instance hosted in AWS. That database instance is not acccessible for the purposes of data analysis.

Backup files for that database instance are taken daily and stored in S3 cloud object storage. Access to this object storage has been granted to enable the contents to be mirrored in equivalent cloud object storage in Google Cloud. The GOVUK S3 Mirror project assumes responsibility for copying backup files between the AWS and Google Cloud environments.

The files use a proprietary, non-human-readable format which necessitates that they are restored into Postgres database instance before the data they contain can be queried and read.

A Postgres14 database instance (support-api) has been created in Google Cloud to enable the restoration of Support API Datastore backup files and access to the data they contain.

The instance has been configured to be accessible via a Google Cloud-internal private network only. The instance cannot by default accept incoming connections from external networks (for the purposes of restoring backup files or handling SQL queries).

The following configuration options have been set for the Posgres database instance:

Option Value Description
Disk size 30GB Ensures there is enough usable storage to enable restoration of Support API backup files (~10GB)
max_wal_size 3008 Increased from the default value to improve performance of pg_restore. More information available in the Postgres documentation
maintenance_work_mem 2048 As above. More information available in the Postgres documentation

Publisher/Subscriber Topic

A publisher/subscriber (pub/sub) topic (support-api-backup-staging) is used to enable notifications relating to the creation of new Support API database backup files to be passed between the Google Cloud Storage bucket where they are kept (govuk-s3-mirror_govuk-integration-database-backups) and the component used to handle them (propogate-support-api).

Support API database backup files are copied from GOV.UK AWS to a Google Cloud Storage bucket (govuk-s3-mirror_govuk-integration-database-backups) by the GOV.UK S3 Mirror. This process happens daily.

The Cloud Storage bucket is configured to publish a notification into the support-api-backup-staging queue when the creation of a new file is completed in the bucket.

These notifications are used to trigger the support-api-workflow workflow.

Workflow Triggers

Cloud Scheduler Trigger

The smart-survey-trigger Cloud Scheduler job is used to automatically invoke the Smart Survery Workflow on a daily basis at 07:00AM GMT. The job uses the workflow’s HTTP Uniform Resource Identifier (URI) to trigger an execution of the workflow.

EventArc Trigger

The support-api-workflow-trigger EventArc trigger is used to automatically invoke the Support API Workflow when a new message has been published into the support-api-backup-staging pub/sub topic.

Filter criteria in the trigger ensure that the workflow is only invoked on publication of messages relating to the creation of new files in relevant upstream buckets (S3 Mirror). The trigger refers to the workflow’s internal identifier to trigger an execution of the workflow.

Workflows

Google Cloud Workflows are used to sequence the execution of individual cloud functions as part of a longer ‘pipeline’ process.

Smart Survey Workflow

The smart-survey-workflow is used to:

  1. Read ‘raw’ Smart Survey data from the Smart Survey API into the staging Cloud Storage bucket
  2. Process the ‘raw’ Smart Survey data from the bucket
  3. Save the output records into Google BigQuery

The workflow calls the read-raw-feedback and process-raw-feedback cloud functions in sequence. The process is run daily at 07:00AM, and is triggered by Cloud Scheduler.

The workflow creates an internal type variable and sets the value of the variable to “SmartSurvey”. This variable is passed as input to each called cloud function which requires type as an input.

Support API Workflow

The support-api-workflow is used to:

  1. Fetch the latest Support API Datastore backup file from the S3 Mirror Cloud Storage bucket
  2. Restore the Support API Datastore backup file to a Google Cloud-hosted Postgres instance
  3. Read the raw Support API data from the Google Cloud-hosted Postres instance into the staging Cloud Storage bucket
  4. Process the raw Support API data from the staging bucket
  5. Save the output records into Google BigQuery.

The workflow calls the propogate-support-api, read-raw-feedback and process-raw-feedback cloud functions in sequence. The process is run each time a new Support API Datastore backup file is created in the S3 Mirror Cloud Storage bucket; typically, this occurs daily at 06:00AM GMT.

The workflow creates an internal type variable and sets the value of the variable to “SupportAPI”. This variable is passed as input to each called cloud function which requires type as an input.

Cloud Functions

Google Cloud Functions are used to execute the logic involved the restoration of Support API database backups, reading raw feedback data from upstream sources, and processing feedback data.

Propogate Support API Data

The propogate-support-api function is used to:

  1. Read a Support API Database back-up file from Cloud Storage into the local compute environment
  2. Use the Postgres pg_restore utility to:
    • Purge the Google Cloud Postgres instance of data
    • Restore the contents of the Support API Database back-up file into the Google Cloud Postgres instance

The propogate-support-api function responds to messages published into the support-api-backup-staging pub/sub topic, and accepts a cloud event object as an input. Messages are published into that topic whenever new Support API datastore backup files are copied into the S3 Mirror bucket as per the pub/sub topic documentation.

Messages published into the support-api-backup-staging topic will contain bucketId and objectId attributes which uniquely identify a Support API Database backup file. The function uses these attributes to read the file from Cloud Storage into the local compute environment.

The function then uses the pg_restore binary executable utility to restore the contents of the Support API Database backup file to the Google Cloud-hosted Postgres instance. The connection between the function and the Postgres instance is facilitated by a Virtual Private Cloud (VPC) connector.

The function depends on the CLOUD_SQL_QUERY_USER and CLOUD_SQL_QUERY_PASS secrets having been set in the project environment. These secrets are the credentials used by the function to authenticate with the Google Cloud-hosted Postgres instance. Failure to access/determine these secret values will result in the function returning an HTTP 500 (internal server) error.

  • Required input: cloud event object containing valid bucketId and objectId attributes
  • Output: HTTP response (200 for success)
  • Max timeout: 60 minutes
  • Resource definition

Read Raw Feedback

The read-raw-feedback function is used to:

  1. Connect to an upstream ‘system of record’ containing unmodified feedback data
  2. Fetch all feedback records which were created between a specified start date and end date
  3. Stage the unmodified, ‘raw’ feedback data in JSON format in a Cloud Storage bucket

Target systems of record include:

  • SmartSurvey - accessed via an API
  • Google Cloud-hosted Postgres Instance - DB instance containing Support API data accessed via VPC connection

The function depends the following environment variables/secrets:

Name Type Description
SMART_SURVEY_API_ENDPOINT Environment variable The target Smart Survey API endpoint to query for feedback data
SMART_SURVEY_API_TOKEN Secret Used to authenticate with Smart Survey
SMART_SURVEY_API_SECRET Secret User to authenticate with Smart Survey
CLOUD_SQL_QUERY_USER Secret Username for authentication with Postgres instance
CLOUD_SQL_QUERY_PASS Secret Password for authentication with Postgres instance

Failure access/determine these environment variable/secret values will result in the function returning an HTTP 500 (internal server) error.

  • Required input: type - string value indicating the type of feedback to be read. Must be one of “SmartSurvey” or “SupportAPI”
  • Optional input: accepts start_date and end_date values in the format “YYYY-MM-DD HH:MM:SS”. If not specified, or if only one of the two is provided, the function defaults these values to the start and end of prior day
  • Output:
    • HTTP response containing a response code (200 for success) and the name of the output file
    • A JSON file in the project staging bucket named using the convention datasource-date; e.g. supportapi-2022-10-01.
  • Max timeout: 9 minutes
  • Resource definition

Process Raw Feedback

The process-raw-feedback function is used to:

  1. Read a specified ‘raw’ feedback JSON file from the staging Cloud Storage bucket
  2. Process the ‘raw’ feedback data:
    • Re-format the data to adhere to the output feedback data schema
    • Remove instances of profanity from feedback text
    • Remove instances of personally identifiable information (PII)
  3. Output the data to BigQuery

Processes used to remove profanity and PII are described in the Text processing section.

The function requires that two inputs are provided:

  • type: string value indicating the type of feedback being processed. Must be one of “SmartSurvey” or “SupportAPI”
  • input_file: string value containing the name of the input file (present in the staging cloud storage bucket) to be processed

If either of these inputs are not provided or are invalid, the function will return an HTTP 400 (invalid request) error.

  • Required input: type and input_file string values
  • Output:
    • HTTP response (200 for success)
    • New records created in the target BigQuery table (govuk-user-feedback)
  • Max timeout: 60 minutes
  • Resource definition

Text Processing

The process-raw-feedback Cloud function performs text processing on raw feedback text with the purpose of masking instances of profanity or PII within the text.

Profanity Removal

The better-profanity Python package is used in combination with a pre-compiled list of censored words to:

  1. Identify instances of profanity in user feedback text
  2. Replace identified instances of profanity with a mask value (‘****’).

PII Removal

The objective of PII removal is to remove from feedback text any articles of data - such as phone numbers, names, or National Insurance numbers - which could be used to uniquely identify an individual.

Two methods are used to achieve this:

  1. Google Cloud DLP: Feedback data is parsed using Google’s Data Loss Prevention (DLP) API. The DLP tools can be configured to scan feedback text for specific types of PII, and will replace identified instances of PII with an appropriate mask, e.g. [‘PERSON_NAME’]
  2. Text pattern matching: Feedback data is scanned to identify the presence of text combinations which match pre-defined (regular expression) patterns. These patterns may be used to identify instances of PII which meet standard patterns relating to their length and content (such as credit card numbers, driving license numbers etc).

Comprehensive lists of the types of PII which are addressed by each method can be found below.

Cloud DLP Info Types

  • Date of birth
  • Email address
  • Passport
  • Person name
  • Phone number
  • Street address
  • UK National Insurance number
  • UK passport number
  • Credit card number
  • IBAN code
  • IP address
  • Medical terms
  • Vehicle identification number
  • Scotland community health index number
  • UK drivers license number
  • UK NHS number
  • UK taxpayer reference
  • SWIFT code

Pattern Matching Types

  • National Insurance number
  • UK NHS number
  • Credit card number
  • UK drivers license number
  • UK visa application reference number
This page was last reviewed on 24 April 2024. It needs to be reviewed again on 24 October 2024 .
This page was set to be reviewed before 24 October 2024. This might mean the content is out of date.