GOV.UK S3 Mirror
The GOV.UK S3 Mirror project in GCP (Google Cloud Platform) copies data from some AWS S3 buckets into GCP Cloud Storage buckets.
The GOV.UK S3 Mirror allows you to:
- access data from the GOV.UK programme by using GCP instead of AWS
- trigger workflows in GCP projects when new data becomes available from GOV.UK
Data sources
The GOV.UK S3 Mirror has 1 data source:
- the
govuk-integration-database-backups
S3 bucket, which contains nightly backups of databases, such as the Content Store, the Publishing API, and the Support API
Data products
The GOV.UK S3 Mirror has 1 data product:
Currently, the bucket contains backup files of four databases:
- the production Content Store
- the draft Content Store
- the Publishing API
- the Support API
More could be included from the S3 bucket, and from other S3 buckets, when they are required.
How to access the data
People who are in the following Google Groups have read access to the
govuk-s3-mirror_govuk-integration-database-backups
Cloud Storage
bucket.
The specific access permission is
roles/storage.objectViewer
.
- group:data-engineering@digital.cabinet-office.gov.uk
This is configured in the file
terraform/transfer.tf
.
Individual access can be given by adding an email address to that file. For
example, "user:firstname.lastname@digital.cabinet-office.gov.uk"
.
A service account can be given access by adding its email address to that file.
For example, "serviceAccount:accountname@projectname.iam.gserviceaccount.com"
.
Notifications of new files
The govuk-s3-mirror_govuk-integration-database-backups
Cloud Storage
bucket
can publish a notification to a Pub/Sub topic in any project when a new file is
created. Other actions can then be triggered by that Pub/Sub topic. An example
implementation is configured by the govuk-knowledge-graph-gcp
repository.
To create a notification from the bucket, add a terraform configuration to the
file terraform/pubsub.tf
in the govuk-s3-mirror
repository, such as follows.
# Reference the topic that it in your project
resource "google_pubsub_topic" "a_name" {
project = "your-project-name"
name = "your-topic-name"
message_retention_duration = "604800s" # 604800 seconds is 7 days
message_storage_policy {
allowed_persistence_regions = [
var.region,
]
}
}
# Notify the topic from the bucket
resource "google_storage_notification" "another_name" {
bucket = google_storage_bucket.govuk-integration-database-backups.name
payload_format = "JSON_API_V1"
topic = google_pubsub_topic.a_name.id
event_types = ["OBJECT_FINALIZE"]
}
Create the topic in your own project as follows.
resource "google_pubsub_topic" "some_name" {
name = "your-topic-name"
message_retention_duration = "604800s" # 604800 seconds is 7 days
message_storage_policy {
allowed_persistence_regions = [
var.region,
]
}
}
# Allow the govuk-s3-mirror project's bucket to publish to this topic
data "google_iam_policy" "some_other_name" {
binding {
role = "roles/pubsub.publisher"
members = [
"serviceAccount:service-384988117066@gs-project-accounts.iam.gserviceaccount.com"
]
}
}
resource "google_pubsub_topic_iam_policy" "yet_another_name" {
topic = google_pubsub_topic.some_name.name
policy_data = data.google_iam_policy.some_other_name.policy_data
}
To trigger things from the Pub/Sub topic, such as Cloud Functions or Workflows, you must use EventArc as an intermediary between the Pub/Sub topic and other resources.
Cloud infrastructure
The GOV.UK S3 Mirror runs entirely in GCP. Terraform is used to deploy the infrastructure. The Terraform configuration is in a GitHub repository. The repository is the most accurate and up-to-date representation of the infrastructure. This section gives an overview.
Storage Transfer Service
The GCP Storage Transfer
Service is used to copy data
from the govuk-integration-database-backups
S3
bucket
into the govuk-s3-mirror_govuk-integration-database-backups
Cloud Storage
bucket. Every hour, on the hour, the service:
- compares the contents of each bucket
- copies any files that are in the S3 bucket but not in the Cloud Storage bucket into the Cloud Storage bucket.
- deletes any files from the Cloud Storage bucket that don’t exist in the S3 bucket.
This means that the Cloud Storage bucket contains exactly the same contents as the S3 bucket. GOV.UK stores only a few recent database backup files in the S3 bucket, so only those files are available in the Cloud Storage bucket.
Not all the contents of the S3 bucket are currently copied to the Cloud Storage
bucket, because not all are required. If more are required, then they can be
included by editing the include_prefixes
field in the file
terraform/transfer.tf
in the GitHub repository.
Cloud Storage
The main bucket is the govuk-s3-mirror_govuk-integration-database-backups
Cloud Storage
bucket.
The contents of this bucket are maintained to be the same as the contents of the
S3 bucket. The name of the Cloud Storage bucket incorporates the name of the S3 bucket.
If mirrors of S3 buckets are required, then each one should be given a corresponding Cloud Storage bucket, following the naming convention.