Start a new data science project

When starting a new data science project, you should:

  • store your project in GitHub
  • set up your project using govcookiecutter
  • organise your workload using Trello

Store your projects in GitHub

You must store all your projects in GitHub repositories (repos).

See the GOV.UK developer documentation on setting up your GitHub account for information on how to get access to GitHub.

You should follow best practice when storing projects in GitHub:

  • store GOV.UK projects in alphagov organisation repos
  • store non-GOV.UK projects in ukgovdatascience organisation repos
  • do not store repos under your GitHub username
  • set your repo to public where possible as the GOV.UK Data Products team likes to code in the open
  • when naming your repo, use hyphens instead of underscores to separate words
  • to make sure colleagues have access to the repo if the repo owner leaves, grant the following GitHub teams access to your alphagov repos:
    • admin access: gov-uk-data
  • for ukgovdatascience repos, grant the following GitHub teams access:
    • write access: gdsdatascience

By default, GitHub organisation owners have admin access to all repos. If you cannot access a repo, speak to the owner for access.

See the GitHub documentation on access permissions for more information.
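
As an illustration of the naming guidance above, the following is a minimal Python sketch that turns a working title into a hyphen-separated repo name and checks an existing name. The function names `to_repo_name` and `uses_hyphens` are hypothetical helpers written for this example and are not part of any GOV.UK tooling.

```python
import re


def to_repo_name(title: str) -> str:
    """Convert a working title into a hyphen-separated repo name.

    Hypothetical helper: it only swaps whitespace and underscores for
    hyphens, and leaves letter case and other characters untouched.
    """
    # Collapse each run of whitespace or underscores into one hyphen.
    name = re.sub(r"[\s_]+", "-", title.strip())
    # Remove any hyphens left at the start or end of the name.
    return name.strip("-")


def uses_hyphens(repo_name: str) -> bool:
    """Return True if the name contains no underscores or whitespace."""
    return re.search(r"[\s_]", repo_name) is None


print(to_repo_name("my_new data_science project"))
print(uses_hyphens("my_project"))
```

You could run a check like `uses_hyphens` before creating a repo, so that naming problems are caught before the name is shared with colleagues.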

Add branch protection rules

You must add branch protection rules to the main branch of your repo to prevent unwanted actions on this branch. At a minimum, require every pull request to have at least one approving review.

  1. Go to your GitHub repo.
  2. Select Settings and then select Branches in the left-hand navigation.
  3. In Branch protection rules, select Add rule.
  4. Enter “main” into the Branch name pattern field.
  5. In Protect matching branches, select the Require pull request reviews before merging checkbox. By default this requires one approving review.
  6. Change any other options as necessary.
  7. Select Save changes.
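
If you manage several repos, you may prefer to apply the same rule with a script. The following is a minimal sketch, not an official method: it assumes you use the GitHub REST API "update branch protection" endpoint and a personal access token with admin rights on the repo. The repo name `my-example-repo` and the token value are placeholders, and the function names are hypothetical.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def branch_protection_payload(approvals: int = 1) -> dict:
    """Build a request body that only requires approving reviews.

    The endpoint expects all four top-level keys to be present;
    None (JSON null) leaves that kind of protection switched off.
    """
    return {
        "required_status_checks": None,
        "enforce_admins": None,
        "required_pull_request_reviews": {
            "required_approving_review_count": approvals,
        },
        "restrictions": None,
    }


def build_request(owner: str, repo: str, token: str,
                  branch: str = "main",
                  approvals: int = 1) -> urllib.request.Request:
    """Prepare, but do not send, the PUT request for one branch."""
    url = f"{API_ROOT}/repos/{owner}/{repo}/branches/{branch}/protection"
    body = json.dumps(branch_protection_payload(approvals)).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # Placeholder values: replace with your own repo and token.
    request = build_request("alphagov", "my-example-repo", token="YOUR_TOKEN")
    print(request.get_method(), request.full_url)
    # To apply the rule, send the request:
    # urllib.request.urlopen(request)
```

The sketch only builds and prints the request, so you can inspect it before anything changes on GitHub. Uncomment the final line to send it.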

Set up your project using govcookiecutter

You should use govcookiecutter to set up your data science project.

govcookiecutter is a template for data science projects in HM Government and the wider public sector.

The GDS Data Science team designed govcookiecutter to:

  • make sure data science projects use Agile approaches to adhere to analytical quality assurance (AQA) standards
  • prevent data leakage by enforcing data version control and cleaning Jupyter notebook outputs
  • centralise documentation local to code so documentation is kept up to date and changes made are visible to reviewers
  • make sure secrets and credentials are usable locally, but kept out of version control
  • implement a consistent folder structure to reduce onboarding time

If you use govcookiecutter to set up a data science project, it automatically covers most of these needs.

For more information, see the govcookiecutter documentation.

Organise your workload using Trello

GOV.UK Data Products adopts Agile ways of working. We use Trello to maintain a team-wide Kanban-style way of working, although specific projects may adopt other methodologies such as Scrum.

To make sure all projects are visible to the team, you should write and update cards on Trello. GOV.UK Data Products has two Trello boards.

For new epics, create an epic using the Epic template card and summarise the epic.

For new and existing projects, create stories using the Story template card.

Stories should cover a specific task that has valid acceptance criteria. Acceptance criteria are your “definition of done”. You should do the minimum work needed to meet these criteria.

Draft any further work over and above these acceptance criteria as new stories.

Label and estimate Trello cards

For both epics and stories, make sure you label the Trello cards appropriately. If relevant, you should also try to label stories with a time estimate.

Try to include time for colleagues to review your work. If stories are sufficiently small and discrete, reviewing should take a reasonable length of time.

You should include time for colleagues to review because:

  • explicitly providing review time gives reviewers dedicated time for reviews
  • not planning sufficient review time may lead to poorer-quality reviews because of pressures unrelated to the story
  • any unused review time is never lost as reviewers can start other stories

Try to keep your stories small enough to complete within a sprint (usually two weeks). This keeps the story completion rate (also known as velocity) high enough, and keeps each story targeted at a specific need.

If a story is not well defined, split it into smaller stories that are easier to complete.

If it is difficult to estimate how long a story should take, consider running that story as a spike. A spike is a fixed period of time in which you explore a task and report back on it.

This page was last reviewed on 9 September 2021. It needs to be reviewed again on 9 March 2022.