GovGraph (GOV.UK Knowledge Graph)
This page is a work in progress.
The GOV.UK Knowledge Graph, or GovGraph, tables hold GOV.UK content data in BigQuery for analytical workloads.
Access
Access to the BigQuery dataset is limited to GDS staff.
For access, contact the #data-engineering community.
Location
The data is located in BigQuery in the govuk-knowledge-graph
project.
Set-up
The GovGraph Google Cloud Projects also include infrastructure for the infrastructure for the GovSearch app, which uses data from GovGraph.
Read the documentation in the GitHub repository.
Schema
Tables
The schemas of some of the more heavily used Knowledge Graph tables are detailed below.
search.page
The govuk-knowledge-graph.search.page
table includes everything that is in GovSearch.
In this table, different sections or chapters of a content item are kept as separate ‘pages’, instead of being grouped as content items under the contentId
.
‘Gone’ and ‘redirect’ pages are excluded from this table.
More detailed documentation can be found on GitHub.
field name | mode | type | description | notes |
---|---|---|---|---|
url | NULLABLE | STRING | URL of a page | Includes hostname and protocol, e.g. https://www.gov.uk/government/publications/low-pay-commission-research-2024
|
documentType | REQUIRED | STRING | The kind of thing that a page is about | |
contentId | REQUIRED | STRING | The ID of the content item of a page | Multiple URLs can have the same contentId |
locale | REQUIRED | STRING | The ISO 639-1 two-letter code of the language of an edition on GOV.UK | |
publishing_app | REQUIRED | STRING | The application that published a content item on GOV.UK | |
first_published_at | NULLABLE | TIMESTAMP | The date that a page was first published. Automatically determined by the publishing-api, unless overridden by the publishing application. | |
public_updated_at | NULLABLE | TIMESTAMP | When a page was last significantly changed (a major update). Shown to users. Automatically determined by the publishing-api, unless overridden by the publishing application. | |
publisher_updated_at | NULLABLE | TIMESTAMP | When a page was last changed in the Publisher app. More meaningful than ‘updated_at’ in the Publishing API and Content API, which is polluted by editions that are created for techy reasons rather than editing reasons, and editors of mainstream pages tend not to use ‘public_updated_at’. | |
withdrawn_at | NULLABLE | TIMESTAMP | The date the page was withdrawn. | |
withdrawn_explanation | NULLABLE | STRING | The explanation for withdrawing a page | |
page_views | NULLABLE | INTEGER | Number of page views from GA4 over 7 recent days | |
title | NULLABLE | STRING | The title of a page | |
description | NULLABLE | STRING | Description of a page | |
text | NULLABLE | STRING | The content of the page as plain text extracted from the HTML | Null for certain document types, such as contact pages, due to the way the content is generated |
taxons | REPEATED | STRING | Array of titles of taxons that the page is tagged to, and their ancestors | |
primary_organisation | NULLABLE | STRING | Title of the primary organisation that published the page | |
organisations | REPEATED | STRING | Array of titles of organisations that published the page | |
people | REPEATED | STRING | Array of names of people who are associated with the page | |
organisations_ancestry | REPEATED | STRING | Array of titles of organisations (and any parent organisations) that published the page | |
hyperlinks | REPEATED | RECORD | Array of hyperlinks from the body of the page | |
phone_numbers | REPEATED | STRING | Array of phone numbers from the body and metadata of the page | |
is_political | NULLABLE | BOOLEAN | Indicator of whether the page is political. Pages where this is true, and that were published by a previous government, are displayed in ‘history mode’ with a prominent message drawing attention to the fact. | |
government | NULLABLE | STRING | Title of the government that published the page, if the page is political. | |
hyperlinks.link_url | STRING | Link URL | ||
hyperlinks.link_type | STRING | Type of link |
public.publishing_api_editions_current
The govuk-knowledge-graph.public.publishing_api_editions_current
table includes one record per ‘document’ as it currently appears on the GOV.UK website and in the Content API.
A ‘document’ is here defined as a content item in a locale - each content item has one or more “documents” but at most one document per locale (unique key: content_id
, locale
).
In this table, different sections or chapters of a content item are grouped together as content items under the content_id
and common base_path
.
More detailed documentation can be found on GitHub.
field name | mode | type | description | notes |
---|---|---|---|---|
content_id | NULLABLE | STRING | ||
locale | NULLABLE | STRING | ||
id | NULLABLE | INTEGER | ||
title | NULLABLE | STRING | ||
public_updated_at | NULLABLE | TIMESTAMP | ||
publishing_app | NULLABLE | STRING | ||
rendering_app | NULLABLE | STRING | ||
update_type | NULLABLE | STRING | ||
phase | NULLABLE | STRING | ||
analytics_identifier | NULLABLE | STRING | ||
updated_at | NULLABLE | TIMESTAMP | ||
document_type | NULLABLE | STRING | ||
schema_name | NULLABLE | STRING | ||
first_published_at | NULLABLE | TIMESTAMP | ||
base_path | NULLABLE | STRING | The page path (not including hostname, protocol, or any query strings). Note that different chapters of a guide will have the same base path in this table, although they appear under different URLs on the site | |
document_id | NULLABLE | INTEGER | ||
description | NULLABLE | STRING | ||
published_at | NULLABLE | TIMESTAMP | ||
details | NULLABLE | JSON | ||
routes | NULLABLE | JSON | ||
redirects | NULLABLE | JSON | ||
unpublishing_type | NULLABLE | STRING |