Skip to main content

Use BigQuery

This page is a work in progress.

Google BigQuery (often shortened to ‘BQ’) is a cloud-based SQL data warehouse.

Access

When requesting access please state what data you need access to. If in doubt try the #data-engineering Slack channel.

Understanding tables, datasets, and projects

In BigQuery, data is stored in tables, which are grouped together inside datasets. Datasets are organised into projects.

For example, our GOV.UK Universal Analytics data is sent to the govuk-bigquery-analytics.87773428.ga_sessions_intraday_YYYYMMDD table three times a day. Here,

  • govuk-bigquery-analytics is the project ID
  • 87773428 is the dataset name
  • ga_sessions_intraday_YYYYMMDD is the table name

Our Google Analytics data is stored in sharded tables. These table names end with the suffix YYYYMMDD, representing the date in year-month-day format.

Intraday tables

We have set up our Google Analytics properties to export GOV.UK data several times a day. This day is temporarily stored in intraday tables.

At the end of the day, BigQuery automatically moves the data in the intraday table to a date table (suffixed YYYYMMDD) and deletes the intraday tables in question. New intraday tables are created and added to throughout the next day.

Quotas

Several projects have quotas set up to limit the amount of data that can be queried. The aim of this is twofold. Firstly, this limits our exposure to unlimited costs from queries. Secondly, this provides a safety net for those working with the data, as you can only wrack up so much in query costs.

The aim is not to be a hindrance to the need to use the data we store so please contact the #data-engineering Slack channel to discuss your needs if the quotas are getting in the way of work.

Specific quotas can be found detailed under the project name on the GCP page.

How did I exceed my quota?

This is a very good question that we are still investigating.

Our current thinking is that the dynamic concurrent query queue could be loaded with queries before the quota is breached. We think that the quota check occurs before a query is moved to Active.

Google documentation on query queues

Why is Looker Studio so data hungry?

Another good question that is under investigation.

Currently Looker Studio does not cache results from BQ, meaning the query has to be run every time. Caching results is due for general availability release by the end of 2023 (see the Google documentation).

In addition to not caching results it seems that when creating a new data connection, Looker Studio connects to a sharded set of tables for the first time it runs a series of queries that reference all shards. The queries are for a small number fields each (4 or 5) and then query all shards and then include a LIMIT 100 and this is done multiple times untli all fields are done. This is essentially a SELECT * FROM * LIMIT 100.

This does not appear to be the case when using an existing data connection. In these cases, Looker Studio respects the table_suffix and default date range within the report.

Where possible, use shared pre-existing data connections in Looker Studio.

Roles

Google Cloud Platform permissions (IAM) can be a mysterious thing.

BigQuery Data Viewer at the Project level allows the person to see all the data held within a project, it can also be applied at a dataset level. It does not provide the ability to query the data from that project, they would need to do so from a different project.

BigQuery Job User grants the user the ability to run queries from the project. The data could be held in a different project but the query cost it allocated to the querying project.

BigQuery Read Session User is needed if the Storage Read API is used when querying.

This page was last reviewed on 25 April 2024. It needs to be reviewed again on 25 October 2024 .
This page was set to be reviewed before 25 October 2024. This might mean the content is out of date.