Use the GOV.UK publishing database

The GOV.UK publishing database is a PostgreSQL database that keeps a copy of every version of almost every page ever drafted or published on GOV.UK.

You can connect to and use the publishing database instead of downloading the GOV.UK mirror to your local machine.

If you only want to analyse the most recent version of every page that is currently published, you can use the Content Store.

Data Services does not currently have access to the production environment of the publishing database, so we access the database using the integration environment.

Instead of connecting to the integration environment database, you can download a backup of the publishing database to your local machine if you need to:

  • run resource-intensive queries on the database, as these queries may affect live services that also use the database
  • create your own tables within the database, as you only have read access to the integration environment database

Connect to the publishing database

Before you start, you must:

  • have access to AWS
  • install the GDS command line tools
  • connect to the GDS VPN
  • get Secure Shell (SSH) access to the integration environment

See the Get started on GOV.UK developer documentation for more information on how to do this.

  1. Run the following in the command line to connect to an AWS instance in the integration environment:

    USER=<FIRSTNAMELASTNAME>; gds govuk connect --environment integration ssh publishing_api_db_admin
    

    <FIRSTNAMELASTNAME> must be the same user that you created to SSH into integration.

    You do not need to include USER=<FIRSTNAMELASTNAME>; if the username for your machine is the same as the SSH integration user.

  2. Open a PostgreSQL terminal on the publishing-api-postgres host as the aws_db_admin user, and connect to the publishing API database:

    sudo psql -U aws_db_admin -h publishing-api-postgres --no-password -d publishing_api_production
    

    This database is named publishing_api_production, but it is in the integration environment, not the production environment.

You can now run queries on the integration environment publishing database.

Download query results from the publishing database

Once you have run queries on the publishing database, download the results of those queries to your local machine by either:

  • printing the query results in the command line, and sending those results to a file on your local machine
  • writing the query results to a file on the db_admin server, compressing the file, and then downloading it to your local machine

Run the following in the command line to print a query’s results and send those results to a file on your local machine:

USER=<FIRSTNAMELASTNAME>; gds govuk connect --environment integration ssh db_admin sudo "psql -U aws_db_admin -h postgresql-primary --no-password -d publishing_api_production -c \"<SQL-QUERY>\"" > <FILENAME>

where:

  • <SQL-QUERY> is a SQL query
  • <FILENAME> is the name of a file to write the query results to
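When you redirect the output of `psql -c` like this, the file contains psql's default "aligned" table layout rather than plain CSV. The following is a minimal Python sketch, under the assumption that the output uses that default format (a header row, a dashed separator, data rows, then an "(N rows)" footer), of turning such a file into structured rows; the `base_path` and `title` columns in the sample are only illustrative.

```python
# A minimal sketch, assuming psql's default "aligned" output format:
# a header row, a dashed separator, data rows, then an "(N rows)" footer.

def parse_psql_output(text):
    """Parse psql's aligned table output into a list of row dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # The separator under the header row is made only of '-' and '+'.
    sep = next(i for i, ln in enumerate(lines)
               if set(ln.strip()) <= set("-+"))
    headers = [h.strip() for h in lines[sep - 1].split("|")]
    rows = []
    for ln in lines[sep + 1:]:
        if ln.lstrip().startswith("("):  # the "(N rows)" footer
            break
        rows.append(dict(zip(headers, (c.strip() for c in ln.split("|")))))
    return rows

sample = """\
 base_path | title
-----------+--------
 /search   | Search
(1 row)
"""
print(parse_psql_output(sample))  # [{'base_path': '/search', 'title': 'Search'}]
```

Alternatively, asking psql for CSV directly (for example with `\copy ... WITH CSV HEADER`, as in the next section) avoids the need to parse this layout at all.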

Write query results to a server file

The gds govuk connect command enables you to fetch a file from a server and save it locally.

  1. Run the following in the command line to execute the query, and save the results to a file on the db_admin server:

    USER=<FIRSTNAMELASTNAME>; gds govuk connect --environment integration ssh db_admin:1 sudo "psql -U aws_db_admin -h postgresql-primary --no-password -d publishing_api_production -c \"\\copy ($(<<QUERY-FILE>)) TO '/tmp/<FILENAME>.csv' WITH CSV HEADER\""
    

    where:

    • <QUERY-FILE> is a file on your local machine that contains an SQL query. Keep the first < in <<QUERY-FILE>, for example <myfile.sql.
    • <FILENAME> is a file on the server that contains the query results.

    The db_admin server does not run the database itself.

  2. Compress the results file by running:

    gds govuk connect --environment integration ssh db_admin zip "/tmp/<FILENAME>.zip /tmp/<FILENAME>.csv"
    

    where <FILENAME> is the same as in the previous step: the file on the server that contains the query results.

  3. Use scp-pull to fetch the file to your local machine.

    gds govuk connect --environment integration scp-pull db_admin:1 /tmp/<FILENAME>.zip <DOWNLOAD-LOCATION>
    

    where:

    • <FILENAME> is the same as in the previous step: the file on the server that contains the query results.
    • <DOWNLOAD-LOCATION> is the location on your local machine to download the file to.

    The :1 in db_admin:1 guarantees that scp-pull will connect to the same machine as ssh.

    The 1 chooses the first available machine. There is only one machine in the integration environment, so 1 is guaranteed to exist.

Upload the downloaded results into BigQuery

When you have downloaded your query results from the publishing database, you can then upload those results into BigQuery for analysis.

  1. Run the following Python code to convert the downloaded file to a parquet file using the Pandas package:

    import pandas as pd
    df = pd.read_csv("<FILENAME>.zip")
    
    # gzip compression is slower than snappy (default) but creates a much smaller file (123MB rather than 302MB)
    df.to_parquet("<FILENAME>.parquet", compression='gzip')
    

    where <FILENAME> is the name of the zipped CSV file and of the parquet file that you want to convert it into.

  2. Upload the parquet file to BigQuery using the Google Cloud SDK command bq in the command line:

    bq load --source_format=PARQUET <DATASET>.<TABLE> <FILENAME>.parquet
    

    where:

    • <FILENAME> is the name of the zipped CSV file and of the parquet file to convert the CSV file into
    • <DATASET> and <TABLE> are where to put the data in BigQuery

You have now uploaded your query results into BigQuery for analysis.

Download a backup of the publishing database

Instead of connecting to the integration environment database, you can download a backup of the publishing database to your local machine if you need to:

  • run resource-intensive queries on the database, as these queries may affect live services that also use the database
  • create your own tables within the database, as you only have read access to the integration environment database

Before you start, you must:

  • have access to AWS
  • install the GDS command line tools

You do not need the GDS VPN or Secure Shell (SSH) access to the integration environment.

Download the database backup

  1. The filename of the latest database backup in AWS S3 varies, depending on the date and time the backup was created. Run the following command to find out the filename:

    USER=<FIRSTNAMELASTNAME>; gds aws govuk-integration-readonly aws s3 ls s3://govuk-integration-database-backups/publishing-api-postgres/
    
  2. Download the backup of the publishing database:

    USER=<FIRSTNAMELASTNAME>; gds aws govuk-integration-readonly aws s3 cp s3://govuk-integration-database-backups/publishing-api-postgres/<FILENAME> <DOWNLOAD-LOCATION>
    

    where:

    • <FILENAME> is the filename of the backup, which contains a <YYYY-MM-DD>T<HH:MM:SS> timestamp
    • <DOWNLOAD-LOCATION> is the location you want to download the backup file to on your local machine
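Because backups are timestamped, the newest one can be picked out of the `aws s3 ls` listing programmatically. The following Python sketch assumes the standard AWS CLI listing format (date, time, size, key, one object per line); the example filenames are hypothetical.

```python
# A sketch of picking the newest backup from `aws s3 ls` output.
# Assumes the standard AWS CLI listing format: date, time, size, key.
# The example filenames below are hypothetical.

def latest_backup(listing):
    """Return the key of the most recently modified object in a listing."""
    entries = []
    for line in listing.strip().splitlines():
        date, time, _size, key = line.split(maxsplit=3)
        entries.append((date + " " + time, key))
    # ISO-style "YYYY-MM-DD HH:MM:SS" timestamps sort chronologically as text.
    return max(entries)[1]

listing = """\
2022-05-11 03:12:45  1073741824 2022-05-11T03:00:01-publishing_api_production.gz
2022-05-12 03:11:02  1073741824 2022-05-12T03:00:01-publishing_api_production.gz
"""
print(latest_backup(listing))  # 2022-05-12T03:00:01-publishing_api_production.gz
```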

Open the database on your local machine

To work with the database backup on your local machine, you must open the backup on a database system.

The following content assumes you’re using PostgreSQL as your database system.

  1. Run the following command to create a PostgreSQL database user called publishing_api:

    sudo -u postgres createuser -s publishing_api
    
  2. Load the data from the backup database into a new PostgreSQL database called publishing_api_production:

    pg_restore --verbose --create --jobs=4 --dbname=postgres <DOWNLOAD-LOCATION>
    

    where:

    • --jobs=4 is the number of CPU threads used (you can change this to suit your device)
    • <DOWNLOAD-LOCATION> is the location that you downloaded the backup file to on your local machine

When the command prompt returns and you can enter commands again, you have successfully created a database named publishing_api_production that contains the publishing backup data.

Run queries in the database

  1. Run the following command to start using the database.

    psql -d publishing_api_production
    

    You should see the following output:

    psql (13.6)
    Type "help" for help.
    
    publishing_api_production=#
    
  2. Enter a query to start using the database. For example:

    SELECT title FROM editions LIMIT 5;
    

You can now work with this database in PostgreSQL. To learn more about using PostgreSQL, read the PostgreSQL documentation.

Find content that is not included in the publishing database

The GOV.UK publishing database contains records of URLs that users visit.

These URL records correspond to GOV.UK pages. However, the publishing database does not include all GOV.UK pages.

Some GOV.UK pages are only partially represented by a URL record, and some are not represented by a URL record at all.

You can run queries on these records to understand what the publishing database does not include.

Run queries on pages partially represented by a URL record

Some pages are represented by URL records in the database, but those pages’ content is not included in those records.

The most common example of this type of page is a page with dynamic content.

Dynamic content is created automatically in response to information provided by the user.

For example, the search results page at https://www.gov.uk/search/all.

Using this example, you can run the following query on the search results page to understand what information is included in the URL record and what is not included:

select * from editions where base_path = '/search' limit 1;

Run queries on pages not represented by a URL record

The following types of pages are not represented by a URL record:

  • pages with dynamic content
  • pages with no content
  • pages that do not exist

Pages with dynamic content

Dynamic content is created automatically in response to information provided by the user.

An example of this type of page is https://gov.uk/towing-rules/y/car-or-light-vehicle, which is a smart answer.

In this example, you can run the following query on the smart answer page to see that the publishing database does not include this page:

select * from editions where base_path = '/towing-rules/y/car-or-light-vehicle' limit 1;

Pages with no content

Pages with no content are not represented by a URL record.

An example of this type of page is https://www.gov.uk/random, which has no content and redirects to a page chosen at random.

In this example, you can run the following query on a page with no content to see that the publishing database does not include this page:

select * from editions where base_path = '/random' limit 1;

Pages that do not exist

Pages that do not exist are not represented by a URL record.

URLs of non-existent pages often appear in Google Analytics and server logs. Sometimes these pages are typos.

We have used some common typos to create pages that redirect to the correctly spelled equivalent page.

For example, you can run the following query on a misspelled URL to see that the publishing database does not include this page:

select * from editions where base_path = '/view-driving-license' limit 1;
This page was last reviewed on 13 May 2022. It was due to be reviewed again by 13 November 2022, so the content might be out of date.