Reference information
If you're a new starter in Data Services, the following reference information may be of use.
GDS wiki
The GDS Wiki contains GDS-specific information. Some useful pages include:
- GDS induction
- guidance on performance management
- learning and development, including mandatory training, and using the learning and development budget
Cabinet Office Intranet (CabWeb)
You must sign into the GDS Virtual Private Network (VPN) before you can access CabWeb. See the guidance on signing into the GDS VPN using your Google credentials for more information.
Once you’re signed into the VPN, you can access CabWeb. Some useful pages include:
- the Cabinet Office Analysis hub
- human resources (HR) guidance on the HR hub
- SOP guidance on MyHub
Single Operating Platform (SOP)
SOP is a Cabinet Office-wide platform for most human resource functions, including editing your personal information, accessing your payslip, logging expenses, and requesting special leave. For more information on how to use SOP, see the SOP guidance on CabWeb.
To access SOP:
GDS Business Operations Tool (GBOT)
HR requests are made using the GDS Business Operations Tool (GBOT). For more information on GBOT, ask your line manager or your business operations team.
The GDS Way
The GDS Way is a public-facing website that documents the specific technology, tools, and processes used at GDS and CDDO.
Although it is aimed at software developers, some useful pages include:
- style guides for different programming languages, including the GDS Python style guide
- tracking and managing third-party software dependencies
- building accessible services and understanding WCAG 2.1, compliance with which is a legal responsibility for public sector websites
- best practice on using version control
The GDS Data Audit
The Government Digital Service (GDS) is digital by default, producing and collecting various data. Under the new Government Cyber Security Strategy and the associated Cyber Assessment Framework (CAF), there is a requirement to manage our data sources (also referred to as data assets) effectively. The Data Protection Act (the UK's implementation of the General Data Protection Regulation (GDPR)) also requires us to understand the data we hold and how it is used.
Some of these data sources are captured in the GDS data management audit conducted in 2022 by GDS Data Services. The premise was to create a GitHub repository of data sources with named owners, in the hope that a federated approach would create accountable owners to maintain it. Each data source has an associated markdown document describing the characteristics of the data; these documents are organised by GDS team.
If you create a data source or dataset during your time at GDS, you are obliged (under the CAF and the Data Protection Act) to document it in this repository:
1. Fill in a copy of the data management template for your dataset.
2. Place the resulting document in the organisation folder in this repository, under a new folder for the team responsible for the data. If the team's folder does not exist, create it.
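The filing steps above can be sketched as follows. All file and folder names here are illustrative, not the repository's real ones — check the audit repository itself for the actual template location and your team's folder name.

```python
from pathlib import Path
import shutil

# Illustrative stand-in for the real data management template.
template = Path("data-management-template.md")
template.write_text("# Data management template\n")

# Create your team's folder under the organisation folder if it is missing,
# then file the completed document there.
team_dir = Path("organisation") / "example-team"
team_dir.mkdir(parents=True, exist_ok=True)
shutil.copy(template, team_dir / "example-dataset.md")
```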
This data is itself useful for your colleagues across GDS and also for the Information Assurance team.
GOV.UK Confluence
Ask your delivery manager for access to the GOV.UK Confluence workspace.
Once you have access to GOV.UK Confluence, go to the Analytics on GOV.UK Confluence page. This page has definitions of the custom dimensions used in Google BigQuery analytics tables.
The Aqua book for maintaining analytical quality assurance (AQA)
Our analytical work can have far-reaching implications, including impacting individuals and their livelihoods. The Aqua book provides high-level guidance on producing quality analysis for government, termed analytical quality assurance (AQA). It sets out how departments should ensure their work is fit for purpose through verification and validation.
These checks apply to anything that can be loosely defined as a “model”. If your work takes an input, processes it, and produces an output, this comes under the scope of AQA. This includes but is not limited to visualisations, spreadsheets, machine learning models, and even back-of-napkin-type calculations.
The Aqua book establishes four principles:
Proportionality of response
The extent of the analytical quality assurance effort should be proportionate to the risks associated with the intended use of the analysis. These risks include financial, legal, operational and reputational impacts. In addition, analysis that is frequently used to support a decision-making process may require a more comprehensive analytical quality assurance response.
Assurance throughout development
Quality assurance considerations should be taken into account throughout the life cycle of the analysis and not only at the end. Effective communication is crucial when understanding the problem, designing the analytical approach, conducting the analysis and relaying the outputs.
Verification and validation
Analytical quality assurance is more than checking that the analysis is error-free and satisfies its specification (verification). It must also include checks that the analysis is appropriate, that is, fit for the purpose for which it is being used (validation).
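As a toy sketch of this distinction, consider a trivial "model" (invented here for illustration) that computes a cost per user:

```python
def cost_per_user(total_cost_pounds: float, users: int) -> float:
    """A minimal 'model': takes inputs, processes them, produces an output."""
    return total_cost_pounds / users

# Verification: the code is error-free and satisfies its specification.
assert cost_per_user(1000.0, 40) == 25.0

# Validation is a separate question that no amount of testing can answer:
# is cost per user actually the right measure for the decision being made,
# and are the input figures themselves appropriate?
```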
Analysis with RIGOUR
Quality analysis needs to be:
- repeatable (R)
- independent (I)
- grounded in reality (G)
- objective (O)
- with uncertainty understood and managed (U)
- robust, so that the results address the initial question (R)
In particular, it is important to accept that uncertainty is inherent within the inputs and outputs of any piece of analysis. It is important to establish how much we can rely upon the analysis for a given problem.
These principles must be considered when undertaking any work involving data or models.
Note that AQA is not just about software quality assurance: it can also include dealing with ethical considerations, the reasons for choosing a particular method or technique, and validating analytical assumptions and caveats.
Further information
Additional Aqua book resources are available, and the Government Analytical Function, Government Data Quality Hub, and other departments have also produced:
- guides to ensure your work is fit for purpose when working to very tight deadlines
- guides to ensure your data is fit for purpose
- a curriculum around quality assurance, validation, and data linkage
GOV.UK developer documentation
See the documentation on document types on GOV.UK for information on the various document types used across the site. Note that this list may be incomplete.
The documentation on the analytics (GA4) implementation on GOV.UK may also be of interest, providing information on how we collect GA4 data.
The GOV.UK GA4 implementation record documents the dataLayer pushes implemented on GOV.UK, which provide the majority of the information sent to GA4.
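As a rough sketch of the concept (in Python for consistency with the other examples here; the real pushes are JavaScript, and the field names below are invented, not the actual GOV.UK schema):

```python
# A dataLayer is essentially a shared list of event objects that the
# analytics tag reads; window.dataLayer.push(...) appends to that list.
data_layer: list[dict] = []

def push(event: dict) -> None:
    """Mimic window.dataLayer.push, for illustration only."""
    data_layer.append(event)

# An illustrative page-view push with made-up field names.
push({"event": "page_view", "page_title": "Example page"})
```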
View the JSON of a GOV.UK page
You can view the JSON of a GOV.UK page by either:
- using the GOV.UK Toolkit browser extension for Chrome and Firefox
- adding /api/content into the page URL, for example, changing https://www.gov.uk/browse/benefits/disability to https://www.gov.uk/api/content/browse/benefits/disability
Using either of these methods lets you view the A and B versions of a page.
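The URL change above is mechanical, so it can be scripted. A small sketch (the function name is our own, not part of any GOV.UK tooling):

```python
from urllib.parse import urlparse, urlunparse

def to_content_api_url(page_url: str) -> str:
    """Prefix a GOV.UK page path with /api/content to get its JSON endpoint."""
    parts = urlparse(page_url)
    return urlunparse(parts._replace(path="/api/content" + parts.path))

print(to_content_api_url("https://www.gov.uk/browse/benefits/disability"))
# https://www.gov.uk/api/content/browse/benefits/disability
```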
Code examples
The following content has code examples for different data sources.
Google BigQuery code examples
Google BigQuery code examples are available in the govuk-data-labs-onboarding GitHub repo.
Content Store code examples
Downloading the Content Store may take some time.
If you need to use the Content Store in a project, you can instead use the:
- JavaScript files in the govuk-intent-detector GitHub repo
- PyMongo Jupyter notebook in the define-content-schemas branch of the govuk-intent-detector GitHub repo
GOV.UK mirror code examples
Downloading the GOV.UK mirror may take some time.
If you need to use the GOV.UK mirror in a project, you can instead use the page term TF-IDF matrix notebooks in the govuk-intent-detector GitHub repo.
Learning and development resources
| Free? | Materials | Notes |
| --- | --- | --- |
| No | O’Reilly ebooks through ACM membership | O’Reilly Media publishes technology-oriented books, with an associated app for reading them on the go. Useful books and videos include: |
| No | Standard individual licence for Pluralsight | Pluralsight provides online courses that lean towards software development and engineering. Some useful courses include: |
| Yes | Advanced NLP with spaCy | Free online course by the creators of spaCy on natural language processing, including exercises, slides, videos, multiple-choice questions, and interactive, browser-based coding practice. |
| Depends | Coursera | Coursera hosts a number of courses on data science. You can “audit” courses for free, but you cannot complete certain assignments or obtain a completion certificate. It’s generally not worth paying for the courses. Good courses include: |
| Yes | fast.ai | Online courses on deep learning using fast.ai, practical data ethics, computational linear algebra, and natural language processing. |
| Yes | Interpretable Machine Learning | Accessible book on interpretable machine learning, covering interpretable models as well as model-agnostic methods for interpretability. |
| Yes | The Illustrated Word2vec, on Jay Alammar’s GitHub Pages | An illustrated guide to word2vec. The author, Jay Alammar, also has a whole host of other illustrated guides. |
| Yes | Causal Inference for The Brave and True | A light-hearted yet rigorous approach to learning impact estimation and sensitivity analysis. |
| Yes | Datasheets for Datasets | A paper proposing how to document datasets. |
| Yes | Managing Python Environments | Short blog post by Pluralsight summarising how to manage Python environments. |
| Yes | Hypermodern Python | A review of current best practice for Python projects. |
| Yes | Mathematics for Machine Learning | A book on the mathematical skills needed to interpret other, more advanced machine learning books. |
| Yes | huggingface/datasets | The largest hub of ready-to-use natural language processing datasets for machine learning models, with fast, easy-to-use and efficient data manipulation tools. |
| Yes | ONS Best Practice and Impact - Quality Assurance of Code for Analysis and Research | Cross-government guidance on best practice for analysis and research. |
| Yes | ethen8181/machine-learning | Machine learning tutorials. |
| Yes | ageron/handson-ml2 | Complementary code for the Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow O’Reilly book. |
| Yes | awesomedata/awesome-public-datasets | A topic-centric list of high-quality open datasets. |
| Yes | Made with ML | Machine learning operations and engineering courses. |
| Yes | ikatsov/tensor-house | A collection of reference machine learning and optimisation models for enterprise operations, including marketing, pricing, and supply chain. |
| Yes | jghoman/awesome-apache-airflow | Resources for Apache Airflow. |
| Yes | aws/amazon-sagemaker-examples | Amazon SageMaker examples - these are automatically loaded into SageMaker instances. |
| Yes | Chris-Engelhardt/data_sci_guide | A community-curated list of data science courses, including direct, free replacements for paid options. |
| No | Introduction to Statistical Learning: With Applications in R | An accessible primer on machine learning - recommended for newcomers to data science, and as a refresher. |
| Yes | datastacktv/data-engineer-roadmap | A roadmap for those wishing to study data engineering. |
| Yes | alastairrushworth/free-data-science | Resources and learning materials across a broad range of popular data science topics, arranged thematically. |