Skip to main content

GOV.UK S3 Mirror

The GOV.UK S3 Mirror project in GCP (Google Cloud Platform) copies data from some AWS S3 buckets into GCP Cloud Storage buckets.

The GOV.UK S3 Mirror allows you to:

  • access data from the GOV.UK programme by using GCP instead of AWS
  • trigger workflows in GCP projects when new data becomes available from GOV.UK

Data sources

The GOV.UK S3 Mirror has 1 data source:

Data products

The GOV.UK S3 Mirror has 1 data product:

Currently, the bucket contains backup files of four databases:

  • the production Content Store
  • the draft Content Store
  • the Publishing API
  • the Support API

More could be included from the S3 bucket, and from other S3 buckets, when they are required.

How to access the data

People who are in the following Google Groups have read access to the govuk-s3-mirror_govuk-integration-database-backups Cloud Storage bucket. The specific access permission is roles/storage.objectViewer.

  • group:data-engineering@digital.cabinet-office.gov.uk

This is configured in the file terraform/transfer.tf.

Individual access can be given by adding an email address to that file. For example, "user:firstname.lastname@digital.cabinet-office.gov.uk".

A service account can be given access by adding its email address to that file. For example, "serviceAccount:accountname@projectname.iam.gserviceaccount.com".

Notifications of new files

The govuk-s3-mirror_govuk-integration-database-backups Cloud Storage bucket can publish a notification to a Pub/Sub topic in any project when a new file is created. Other actions can then be triggered by that Pub/Sub topic. An example implementation is configured by the govuk-knowledge-graph-gcp repository.

To create a notification from the bucket, add a terraform configuration to the file terraform/pubsub.tf in the govuk-s3-mirror repository, such as follows.

# Reference the topic that it in your project
resource "google_pubsub_topic" "a_name" {
  project                    = "your-project-name"
  name                       = "your-topic-name"
  message_retention_duration = "604800s" # 604800 seconds is 7 days
  message_storage_policy {
    allowed_persistence_regions = [
      var.region,
    ]
  }
}

# Notify the topic from the bucket
resource "google_storage_notification" "another_name" {
  bucket         = google_storage_bucket.govuk-integration-database-backups.name
  payload_format = "JSON_API_V1"
  topic          = google_pubsub_topic.a_name.id
  event_types    = ["OBJECT_FINALIZE"]
}

Create the topic in your own project as follows.

resource "google_pubsub_topic" "some_name" {
  name                       = "your-topic-name"
  message_retention_duration = "604800s" # 604800 seconds is 7 days
  message_storage_policy {
    allowed_persistence_regions = [
      var.region,
    ]
  }
}

# Allow the govuk-s3-mirror project's bucket to publish to this topic
data "google_iam_policy" "some_other_name" {
  binding {
    role = "roles/pubsub.publisher"
    members = [
      "serviceAccount:service-384988117066@gs-project-accounts.iam.gserviceaccount.com"
    ]
  }
}

resource "google_pubsub_topic_iam_policy" "yet_another_name" {
  topic       = google_pubsub_topic.some_name.name
  policy_data = data.google_iam_policy.some_other_name.policy_data
}

To trigger things from the Pub/Sub topic, such as Cloud Functions or Workflows, you must use EventArc as an intermediary between the Pub/Sub topic and other resources.

Cloud infrastructure

The GOV.UK S3 Mirror runs entirely in GCP. Terraform is used to deploy the infrastructure. The Terraform configuration is in a GitHub repository. The repository is the most accurate and up-to-date representation of the infrastructure. This section gives an overview.

Storage Transfer Service

The GCP Storage Transfer Service is used to copy data from the govuk-integration-database-backups S3 bucket into the govuk-s3-mirror_govuk-integration-database-backups Cloud Storage bucket. Every hour, on the hour, the service:

  • compares the contents of each bucket
  • copies any files that are in the S3 bucket but not in the Cloud Storage bucket into the Cloud Storage bucket.
  • deletes any files from the Cloud Storage bucket that don’t exist in the S3 bucket.

This means that the Cloud Storage bucket contains exactly the same contents as the S3 bucket. GOV.UK stores only a few recent database backup files in the S3 bucket, so only those files are available in the Cloud Storage bucket.

Not all the contents of the S3 bucket are currently copied to the Cloud Storage bucket, because not all are required. If more are required, then they can be included by editing the include_prefixes field in the file terraform/transfer.tf in the GitHub repository.

Cloud Storage

The main bucket is the govuk-s3-mirror_govuk-integration-database-backups Cloud Storage bucket. The contents of this bucket are maintained to be the same as the contents of the S3 bucket. The name of the Cloud Storage bucket incorporates the name of the S3 bucket.

If mirrors of S3 buckets are required, then each one should be given a corresponding Cloud Storage bucket, following the naming convention.

This page was last reviewed on 14 September 2022. It needs to be reviewed again on 14 March 2023 .
This page was set to be reviewed before 14 March 2023. This might mean the content is out of date.