Skip to main content

GovSearch

GovSearch is a tool to help content designers find content on GOV.UK. It collects most of the content on www.gov.uk from multiple data sources and indexes it to provide an advanced search engine.

The production version of the app is available at https://gov-search.service.gov.uk for Signon users.

Technology

GovSearch runs on GCP (Google Cloud Platform). Most of the data is obtained from GOV.UK via the GOV.UK S3 Mirror, which fetches nightly GOV.UK database backups and makes them available in a GCP bucket.

The website is written in Typescript, using the Express framework. It fetches search results from BigQuery.

Source code

The source code of the GCP configuration and data preparation is alphagov/govuk-knowledge-graph-gcp.

The source code of the website is alphagov/govuk-knowledge-graph-search.

Data pipeline

Every day:

  • New backups of the GOV.UK Content Store and the Publishing API become available in the govuk-s3-mirror_govuk-integration-database-backups Bucket, as a result of the S3 Mirror copy process.

  • A GCP Workflow (defined in the govuk-knowledge-graph project) is triggered by the mirror’s Pub/Sub topic and creates two virtual machines, one running MongoDB to recreate an instance of the content store, and one running Postgres to recreate an instance of the Publishing API.

  • From those two databases, a process creates CSV files, uploads them to a bucket, imports them into BigQuery and executes some queries in BigQuery to derive other tables.

  • Scheduled queries in BigQuery run once a day to derive tables for the GovSearch app to use directly.

  • GovSearch: available at https://gov-search.service.gov.uk. GovSearch is available 24/7 but might be missing some data for a few minutes at about 6am when BigQuery tables are deleted and repopulated.

While Gov Search is the normal way to use the service, the Neo4j frontend allows for more complex searches using the Cypher language.

See the documentation of the Knowledge Graph Data Pipeline for more information about the batch data updates.

Hosting

GovGraph is hosted in the govuk-knowledge-graph google cloud project. It uses the following services.

  • Load balancer to manage the domain names and user authentication
  • Cloud run to run the front-end
  • BigQuery to hold the data
  • Cloud compute to run the Neo4j, mongo and postgres virtual machines
  • Workflows, Pub/Sub and Eventarc to run the batch data updates

The terraform configuration describes how each service is set up.

Front-end

The user-facing application is hosted as a Cloud Run service. Its source code is in a separate repository.

Architecture

It is written in TypeScript, and consists of:

  • an API, an ExpressJS server which fetches the search results from BigQuery
  • a single-page app (SPA), which takes user input, queries the API, and displays the search results.

The software architecture, as well as instructions on how to run and develop it, are explained in the repository’s README file.

Authorisation

See the GovSearch Signon page.

Hosting and deployment

The front-end runs as a Cloud Run service and is therefore run as a Docker container. The Dockerfile is in the repository. There is also a script to deploy it to production using gcloud run deploy. Deploying in the staging or dev environments is a matter of changing the --project command-line parameter in the script.

Analytics

Google Analytics and Google Tag Manager are enabled for GovSearch in the production environment.

Networking

The Google Cloud project makes use of a load balancer that handles the networking configuration.

Domain names

The public-facing production domain, gov-search.service.gov.uk is delegated from the team managing service.gov.uk. We manage the mapping to the service’s public IP address.

We use the following domains internally.

  • govgraphsearch.dev for the production environment
  • govgraphsearchstaging.dev for the staging environment
  • govgraphsearchdev.dev for the development environment

Those domains and their corresponding TLS certificates are managed using Cloud Domains.

This page was last reviewed on 24 April 2024. It needs to be reviewed again on 24 October 2024 .
This page was set to be reviewed before 24 October 2024. This might mean the content is out of date.