BigQuery
This page is a work in progress.
Google BigQuery (often shortened to ‘BQ’) is a cloud-based SQL data warehouse.
BigQuery is used by GDS for the storage and querying of a number of web analytics resources.
Resources
Understanding tables, datasets, and projects
In BigQuery, data is stored in tables, which are grouped together inside datasets. Datasets are organised into projects.
For example, our GOV.UK Universal Analytics data is sent to the govuk-bigquery-analytics.87773428.ga_sessions_intraday_YYYYMMDD
table three times a day. Here,
govuk-bigquery-analytics
is the project ID87773428
is the dataset namega_sessions_intraday_YYYYMMDD
is the table name
There are different types of tables that can be used to store data in BigQuery.
Our raw Google Analytics data, for example, is stored in a sharded table.
The name of each shard ends with the suffix YYYYMMDD
, representing the date in year-month-day format.
Partitioned tables are in use in other datasets, such as the flattened GOV.UK GA4 data.
Access
All GDS staff have access to core Google Analytics datasets by default.
Access to other projects and datasets will be granted to users when required.
To request access, please contact the #data-engineering Slack channel, stating what project or data you need access to, and what roles or permissions you require.
Roles
A role is a set of permissions. Users should only have the specific role or permissions they need to use the Google Cloud Platform.
Contact the Data Engineering community to ask for a role, permission or service account.
Common permissions we use for BigQuery access include:
- BigQuery Data Viewer
- BigQuery Job User
- BigQuery Read Session User
BigQuery Data Viewer allows the user to access and view data. When granted at the project level, this means the user can see all the data held within a project. This does not allow the user to query the data from within that project - a user granted only Data Viewer permissions to a given project would need to query the data from a different project. This permission can also be applied at dataset level.
BigQuery Job User grants the user the ability to run queries from the project in question. The data being queried could be held in a different project but the query cost is allocated to the querying project.
BigQuery Read Session User permissions are required if the Storage Read API is used when querying.
More information on roles and IAM permissions can be found in the Google Cloud documentation.
Quotas
Several projects have quotas in place to limit the amount of data that can be queried. This limits our exposure to high costs from large queries and provides a safety net for those working with the data, as you can only wrack up so much in query costs. Details of specific quotas can be found under the project name on the GCP page.
The aim of these quotas is not to inhibit use of the data we store. Please contact the Data Engineering community if quotas appear to be getting in the way of your work.
Sometimes, it appears that users can trigger queries which surpass quotas in place. We believe that this happens when a number of queries (which individually do not pass the quota, but do together) are added to the queue and run concurrently by BigQuery. BigQuery uses dynamic concurrency, so a number of queries can be run simultaneously. We think that the quota check occurs before a query is moved to ‘Active’.
Quotas and Looker Studio
Looker Studio, as well as other reporting or visualisation tools, seems to be very ‘data hungry’.
One issue occurs when setting up new data connections.
When Looker Studio connects to a sharded table for the first time, it runs a series of queries that references all shards.
The queries are for a small number fields each (4 or 5), and these queries are run multiple times until all fields have been retrieved - essentially running SELECT * FROM * LIMIT 100
.
This does not appear to be the case when using an existing data connection. When using an existing data connection, Looker Studio respects the table_suffix and default date range within the report.
For this reason, we recommend that where possible shared pre-existing data connections are used in Looker Studio.
Some best practice for using Looker Studio with BigQuery data can be found on the Looker Studio best practice page (written for those using GA4 data, but the advice is likely to hold true for other large datasets).