Use the GOV.UK publishing database
The GOV.UK publishing database is a PostgreSQL database that keeps a copy of every version of almost every page ever drafted or published on GOV.UK.
You can connect to and use the publishing database instead of downloading the GOV.UK mirror to your local machine.
If you only want to analyse the most recent version of every page that is currently published, you can use the Content Store.
Data Services does not currently have access to the production environment of the publishing database, so we access the database using the integration environment.
Instead of connecting to the integration environment database, you can download a backup of the publishing database to your local machine if you need to:
- run resource-intensive queries on the database, as these queries may affect live services that also use the database
- create your own tables within the database, as you only have read access to the integration environment database
Connect to the publishing database
Before you start, you must:
- have access to AWS
- install the GDS command line tools
- connect to the GDS VPN
- get Secure Shell (SSH) access to the integration environment
See the Get started on GOV.UK developer documentation for more information on how to do this.
Run the following in the command line to connect to an AWS instance in the integration environment:
USER=<FIRSTNAMELASTNAME>; gds govuk connect --environment integration ssh publishing_api_db_admin
<FIRSTNAMELASTNAME> must be the same user that you created to SSH into integration. You do not need to include USER=<FIRSTNAMELASTNAME>; if the username for your machine is the same as the SSH integration user.
Open a PostgreSQL terminal to the postgresql-primary host as the aws_db_admin user, and connect to the publishing API database:
sudo psql -U aws_db_admin -h publishing-api-postgres --no-password -d publishing_api_production
This database is named production, but it is not in the production environment.
You can now run queries on the integration environment publishing database.
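For example, to check that the connection works, you can run a simple query at the psql prompt. This is a minimal sketch; it counts the rows in the editions table, the same table that the other example queries on this page use:
SELECT count(*) FROM editions;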
Download query results from the publishing database
Once you have run queries on the publishing database, download the results of those queries to your local machine by either:
- printing the query results in the command line, and sending those results to a file on your local machine
- writing the query results to a file on the db_admin server, compressing the file, and then downloading it to your local machine
Print query results in the command line
Run the following in the command line to print a query’s results and send those results to a file on your local machine:
USER=<FIRSTNAMELASTNAME>; gds govuk connect --environment integration ssh db_admin sudo "psql -U aws_db_admin -h postgresql-primary --no-password -d publishing_api_production -c \"<SQL-QUERY>\"" > <FILENAME>
where:
- <SQL-QUERY> is an SQL query
- <FILENAME> is the name of a file to write the query results to
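For example, with a hypothetical SSH user janedoe and a hypothetical output file titles.txt (both illustrative placeholders, not real values), the full command might look like this:
# Hypothetical worked example: print the first five edition titles and save them locally
USER=janedoe; gds govuk connect --environment integration ssh db_admin sudo "psql -U aws_db_admin -h postgresql-primary --no-password -d publishing_api_production -c \"SELECT title FROM editions LIMIT 5;\"" > titles.txt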
Write query results to a server file
The gds govuk connect command enables you to fetch a file from a server and save it locally.
Run the following in the command line to execute the query and save the results to a file on the db_admin server:
USER=<FIRSTNAMELASTNAME>; gds govuk connect --environment integration ssh db_admin:1 sudo "psql -U aws_db_admin -h postgresql-primary --no-password -d publishing_api_production -c \"\\copy ($(<<QUERY-FILE>)) TO '/tmp/<FILENAME>.csv' WITH CSV HEADER\""
where:
- <QUERY-FILE> is a file on your local machine that contains an SQL query. Keep the first < in <<QUERY-FILE>, for example <myfile.sql.
- <FILENAME> is a file on the server that will contain the query results.
The db_admin server does not run the database itself.
Compress the results file by running:
gds govuk connect --environment integration ssh db_admin zip "/tmp/<FILENAME>.zip /tmp/<FILENAME>.csv"
where <FILENAME> is the same as in the previous step, a file on the server that contains the query results.
Use scp-pull to fetch the file to your local machine:
gds govuk connect --environment integration scp-pull db_admin:1 /tmp/<FILENAME>.zip <DOWNLOAD-LOCATION>
where:
- <FILENAME> is the same as in the previous step, a file on the server that contains the query results
- <DOWNLOAD-LOCATION> is the location on your local machine to download the file to
The :1 in db_admin:1 guarantees that scp-pull will connect to the same machine as ssh. The 1 chooses the first available machine; there is only one machine in the integration environment, so 1 is guaranteed to exist.
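Putting the three steps together, a hypothetical end-to-end run might look like the following sketch, where janedoe, my_query.sql, results and the download location are all illustrative placeholders:
# 1. Run the query in my_query.sql and write the results to a CSV file on the server
USER=janedoe; gds govuk connect --environment integration ssh db_admin:1 sudo "psql -U aws_db_admin -h postgresql-primary --no-password -d publishing_api_production -c \"\\copy ($(<my_query.sql)) TO '/tmp/results.csv' WITH CSV HEADER\""
# 2. Compress the CSV file on the server
gds govuk connect --environment integration ssh db_admin zip "/tmp/results.zip /tmp/results.csv"
# 3. Fetch the compressed file to your local machine
gds govuk connect --environment integration scp-pull db_admin:1 /tmp/results.zip ~/Downloads/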
Upload the downloaded results into BigQuery
When you have downloaded your query results from the publishing database, you can then upload those results into BigQuery for analysis.
Run the following Python code to convert the downloaded file to a parquet file using the pandas package:
import pandas as pd

df = pd.read_csv("<FILENAME>.zip")

# gzip compression is slower than snappy (the default) but creates a much smaller file (123MB rather than 302MB)
df.to_parquet("<FILENAME>.parquet", compression='gzip')
where <FILENAME> is the name of the zipped CSV file and of the parquet file that you want to convert it into.
Upload the parquet file to BigQuery using the Google Cloud SDK command bq in the command line:
bq load --source_format=PARQUET <DATASET>.<TABLE> <FILENAME>.parquet
where:
- <FILENAME> is the name of the parquet file that you converted the CSV file into
- <DATASET> and <TABLE> are where to put the data in BigQuery
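For example, a hypothetical load of titles.parquet into a dataset named publishing and a table named editions_sample (all illustrative placeholders) would look like this:
# Hypothetical worked example of the bq load step
bq load --source_format=PARQUET publishing.editions_sample titles.parquet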
You have now uploaded your query results into BigQuery for analysis.
Download a backup of the publishing database
Instead of connecting to the integration environment database, you can download a backup of the publishing database to your local machine if you need to:
- run resource-intensive queries on the database, as these queries may affect live services that also use the database
- create your own tables within the database, as you only have read access to the integration environment database
Before you start, you must:
- have access to AWS
- install the GDS command line tools
You do not need the GDS VPN or Secure Shell (SSH) access to the integration environment.
Download the database backup
The filename of the latest database backup in AWS S3 varies, depending on the date and time the backup was created. Run the following command to find out the filename:
USER=<FIRSTNAMELASTNAME>; gds aws govuk-integration-readonly aws s3 ls s3://govuk-integration-database-backups/publishing-api-postgres/
Download the backup of the publishing database:
USER=<FIRSTNAMELASTNAME>; gds aws govuk-integration-readonly aws s3 cp s3://govuk-integration-database-backups/publishing-api-postgres/<FILENAME> <DOWNLOAD-LOCATION>
where:
- <FILENAME> is the name of the backup file, which starts with the date and time (<YYYY-MM-DD>T<HH:MM:SS>) the backup was created
- <DOWNLOAD-LOCATION> is the location you want to download the backup file to on your local machine
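Because the backup filenames start with the date and time, you can also pick out the most recent backup automatically. The following is a sketch that assumes the bucket layout shown above and that the gds wrapper passes the aws s3 ls output through unchanged:
# List the backups, take the lexicographically last (most recent) filename, then download it
LATEST=$(gds aws govuk-integration-readonly aws s3 ls s3://govuk-integration-database-backups/publishing-api-postgres/ | sort | tail -n 1 | awk '{print $4}')
gds aws govuk-integration-readonly aws s3 cp s3://govuk-integration-database-backups/publishing-api-postgres/$LATEST <DOWNLOAD-LOCATION>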
Open the database on your local machine
To work with the database backup on your local machine, you must open the backup on a database system.
The following content assumes you’re using PostgreSQL as your database system.
Run the following command to create a PostgreSQL database user called publishing_api:
sudo -u postgres createuser -s publishing_api
Load the data from the backup database into a new PostgreSQL database called publishing_api_production:
pg_restore --verbose --create --jobs=4 --dbname=postgres <DOWNLOAD-LOCATION>
where:
- --jobs=4 is the number of CPU threads used (you can change this to suit your device)
- <DOWNLOAD-LOCATION> is the location that you downloaded the backup file to on your local machine
When your command line becomes responsive and you can enter commands again, you have successfully created a database named publishing_api_production that contains the publishing backup data.
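To check that the restore worked, you can list the tables in the new database. psql's -c flag runs a single command and exits, and the \dt meta-command lists tables:
psql -d publishing_api_production -c "\dt"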
Run queries in the database
Run the following command to start using the database:
psql -d publishing_api_production
You should see the following output:
psql (13.6)
Type "help" for help.

publishing_api_production=#
Enter a query to start using the database. For example:
SELECT title FROM editions LIMIT 5;
You can now work with this database in PostgreSQL. To learn more about using PostgreSQL, read the PostgreSQL documentation.
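For example, before writing your own queries, you can inspect the columns of the editions table with psql's \d meta-command:
\d editions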
Find content that is not included in the publishing database
The GOV.UK publishing database contains records of URLs that users visit. These URL records correspond to GOV.UK pages, but the publishing database does not include all GOV.UK pages: some pages are only partially represented by a URL record, and some are not represented at all.
You can run queries on these records to understand what the publishing database does not include.
Run queries on pages partially represented by a URL record
Some pages are represented by URL records in the database, but those pages’ content is not included in those records.
The most common example of this type of page is a page with dynamic content.
Dynamic content is created automatically in response to information provided by the user.
For example, the search results page at https://gov.uk/search/all.
Using this example, you can run the following query on the search results page to see what information is included in the URL record and what is not:
select * from editions where base_path = '/search' limit 1;
Run queries on pages not represented by a URL record
The following types of pages are not represented by a URL record:
- pages with dynamic content
- pages with no content
- pages that do not exist
Pages with dynamic content
Dynamic content is created automatically in response to information provided by the user.
An example of this type of page is https://gov.uk/towing-rules/y/car-or-light-vehicle, which is a smart answer.
In this example, you can run the following query on the smart answer page to see that the publishing database does not include this page:
select * from editions where base_path = '/towing-rules/y/car-or-light-vehicle' limit 1;
Pages with no content
Pages with no content are not represented by a URL record.
An example of this type of page is https://www.gov.uk/random, which has no content and redirects to a page chosen at random.
In this example, you can run the following query on a page with no content to see that the publishing database does not include this page:
select * from editions where base_path = '/random' limit 1;
Pages that do not exist
Pages that do not exist are not represented by a URL record.
URLs of non-existent pages often appear in Google Analytics and server logs. Sometimes these URLs are typos of real page URLs.
We have used some common typos to create pages that redirect to the correctly spelled equivalent page.
For example, you can run the following query on a URL with a typo to see that the publishing database does not include this page:
select *, redirects from editions where base_path = '/view-driving-license' limit 1;