GovSearch
GovSearch is a tool to help content designers find content on GOV.UK. It collects most of the content on www.gov.uk from multiple data sources and indexes it to provide an advanced search engine.
The production version of the app is available at https://gov-search.service.gov.uk for Signon users.
Technology
GovSearch runs on GCP (Google Cloud Platform). Most of the data is obtained from GOV.UK via the GOV.UK S3 Mirror, which fetches nightly GOV.UK database backups and makes them available in a GCP bucket.
The website is written in Typescript, using the Express framework. It fetches search results from BigQuery.
Source code
The source code of the GCP configuration and data preparation is alphagov/govuk-knowledge-graph-gcp.
The source code of the website is alphagov/govuk-knowledge-graph-search.
Data pipeline
Every day:
New backups of the GOV.UK Content Store and the Publishing API become available in the govuk-s3-mirror_govuk-integration-database-backups Bucket, as a result of the S3 Mirror copy process.
A GCP Workflow (defined in the govuk-knowledge-graph project) is triggered by the mirror’s Pub/Sub topic and creates two virtual machines, one running MongoDB to recreate an instance of the content store, and one running Postgres to recreate an instance of the Publishing API.
From those two databases, a process creates CSV files, uploads them to a bucket, imports them into BigQuery and executes some queries in BigQuery to derive other tables.
Scheduled queries in BigQuery run once a day to derive tables for the GovSearch app to use directly.
GovSearch: available at https://gov-search.service.gov.uk. GovSearch is available 24/7 but might be missing some data for a few minutes at about 6am when BigQuery tables are deleted and repopulated.
While Gov Search is the normal way to use the service, the Neo4j frontend allows for more complex searches using the Cypher language.
See the documentation of the Knowledge Graph Data Pipeline for more information about the batch data updates.
Hosting
GovGraph is hosted in the govuk-knowledge-graph google cloud project. It uses the following services.
- Load balancer to manage the domain names and user authentication
- Cloud run to run the front-end
- BigQuery to hold the data
- Cloud compute to run the Neo4j, mongo and postgres virtual machines
- Workflows, Pub/Sub and Eventarc to run the batch data updates
The terraform configuration describes how each service is set up.
Front-end
The user-facing application is hosted as a Cloud Run service. Its source code is in a separate repository.
Architecture
It is written in TypeScript, and consists of:
- an API, an ExpressJS server which fetches the search results from BigQuery
- a single-page app (SPA), which takes user input, queries the API, and displays the search results.
The software architecture, as well as instructions on how to run and develop it, are explained in the repository’s README file.
Authorisation
See the GovSearch Signon page.
Hosting and deployment
The front-end runs as a Cloud Run service and is therefore run as a Docker container. The Dockerfile is in the repository. There is also a script to deploy it to production using gcloud run deploy
. Deploying in the staging or dev environments is a matter of changing the --project
command-line parameter in the script.
Analytics
Google Analytics and Google Tag Manager are enabled for GovSearch in the production environment.
Networking
The Google Cloud project makes use of a load balancer that handles the networking configuration.
Domain names
The public-facing production domain, gov-search.service.gov.uk
is delegated from the team managing service.gov.uk
. We manage the mapping to the service’s public IP address.
We use the following domains internally.
govgraphsearch.dev
for the production environmentgovgraphsearchstaging.dev
for the staging environmentgovgraphsearchdev.dev
for the development environment
Those domains and their corresponding TLS certificates are managed using Cloud Domains.