Skip to main content

Use BigQuery

This page is a work in progress.

Google BigQuery (often shortened to ‘BQ’) is a cloud-based SQL data warehouse.

Access

When requesting access please state what data you need access to. If in doubt try the #data-engineering Slack channel.

Understanding tables, datasets, and projects

In BigQuery, data is stored in tables, which are grouped together inside datasets. Datasets are organised into projects.

For example, our GOV.UK Universal Analytics data is sent to the govuk-bigquery-analytics.87773428.ga_sessions_intraday_YYYYMMDD table three times a day. Here,

  • govuk-bigquery-analytics is the project ID
  • 87773428 is the dataset name
  • ga_sessions_intraday_YYYYMMDD is the table name

There are different types of tables that can be used to store data in BigQuery.

Our raw Google Analytics data, for example, is stored in sharded tables. These table names end with the suffix YYYYMMDD, representing the date in year-month-day format.

Partitioned tables are in use in other datasets.

Quotas

Several projects have quotas in place to limit the amount of data that can be queried. This limits our exposure to high costs from large queries and provides a safety net for those working with the data, as you can only wrack up so much in query costs.

The aim of these quotas is not to inhibit use of the data we store. Please post in the #data-engineering Slack channel or contact the Analytics team to discuss your needs if quotas appear to be getting in the way of your work.

Details of specific quotas can be found under the project name on the GCP page.

How did I query more data than permitted by my quota?

Sometimes, users can trigger queries which surpass quotas in place.

We believe that this happens when a number of queries (which individually do not pass the quota, but do together) are added to the queue and run concurrently by BigQuery. BigQuery uses dynamic concurrency, so a number of queries can be run simultaneously. We think that the quota check occurs before a query is moved to ‘Active’.

How did I hit a quota so quickly when using Looker Studio?

Looker Studio, alongside other reporting or visualisation tools, seems to be very ‘data hungry’.

One issue occurs when setting up new data connections. When Looker Studio connects to a sharded set of tables for the first time, it runs a series of queries that references all shards. The queries are for a small number fields each (4 or 5), and these queries are run multiple times until all fields have been retrieved - essentially running SELECT * FROM * LIMIT 100.

This does not appear to be the case when using an existing data connection. When using an existing data connection, Looker Studio respects the table_suffix and default date range within the report.

For this reason, we recommend that where possible shared pre-existing data connections are used in Looker Studio.

Some best practice for using Looker Studio with BigQuery data can be found on the Looker Studio best practice page (written for those using GA4 data, but the advice is likely to hold true for other large datasets).

Roles

A role is a set of permissions. Users should only have the specific role or permissions they need to use the Google Cloud Platform.

Contact the Data Engineering community on Slack to ask for a role, permission or service account.

Common permissions we use for BigQuery access include:

  • BigQuery Data Viewer
  • BigQuery Job User
  • BigQuery Read Session User

BigQuery Data Viewer allows the user to access and view data. When granted at the project level, this means the user can see all the data held within a project. This does not allow the user to query the data from within that project - a user granted only Data Viewer permissions to a given project would need to query the data from a different project. This permission can also be applied at dataset level.

BigQuery Job User grants the user the ability to run queries from the project in question. The data being queried could be held in a different project but the query cost is allocated to the querying project.

BigQuery Read Session User permissions are required if the Storage Read API is used when querying.

More information on roles and IAM permissions can be found in the Google Cloud documentation.

This page was last reviewed on 8 June 2024. It needs to be reviewed again on 8 December 2024 .
This page was set to be reviewed before 8 December 2024. This might mean the content is out of date.