Start a new data science project
When starting a new data science project, you should:
- store your project in GitHub
- (optional) set up your project using govcookiecutter
- organise your workload using Trello
Store your projects in GitHub
You must store all your projects in GitHub repositories (repos).
See the GOV.UK developer documentation on setting up your GitHub account for information on how to get access to GitHub.
You should follow best practice when storing projects in GitHub:
- store GOV.UK projects in alphagov organisation repos - do not store repos under your GitHub username
- set your repo to public where possible as the GOV.UK Data Products team likes to code in the open
- when naming your repo, use hyphens instead of underscores to separate words
- to make sure colleagues have access to the repo if the repo owner leaves, grant the following GitHub teams access to your alphagov repositories:
  - admin access: gov-uk-data
By default, GitHub organisation owners have admin access to all repos. If you cannot access a repo, speak to the owner for access.
See the GitHub documentation on access permissions for more information.
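The naming and team access guidance above could also be applied with a script. The following is a minimal sketch, assuming the GitHub REST API endpoint for adding or updating team repository permissions. It also assumes that the team slug matches the team name gov-uk-data. The repo name and token are hypothetical placeholders, and the request is only built, not sent.

```python
import json
import re
import urllib.request

API_ROOT = "https://api.github.com"


def hyphenate_repo_name(name: str) -> str:
    """Replace runs of underscores or whitespace with a single hyphen."""
    return re.sub(r"[_\s]+", "-", name.strip())


def team_access_request(org, team_slug, repo, permission, token):
    """Build (but do not send) a request that grants a team access to a repo."""
    url = f"{API_ROOT}/orgs/{org}/teams/{team_slug}/repos/{org}/{repo}"
    return urllib.request.Request(
        url,
        data=json.dumps({"permission": permission}).encode("utf-8"),
        method="PUT",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


if __name__ == "__main__":
    # "my_data_project" and "YOUR_TOKEN" are hypothetical placeholders.
    repo = hyphenate_repo_name("my_data_project")
    request = team_access_request(
        "alphagov", "gov-uk-data", repo, "admin", "YOUR_TOKEN"
    )
    print(request.get_method(), request.full_url)
    # To apply the change, send it with urllib.request.urlopen(request).
```

Building the request separately from sending it lets you check the URL and permission level before changing anything in the organisation.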
Add branch protection rules
You must add branch protection rules to the main branch of your repo to prevent unwanted changes to this branch. At a minimum, require every pull request to have at least one approving review.
- Go to your GitHub repo.
- Select Settings and then select Branches in the left-hand navigation.
- In Branch protection rules, select Add rule.
- Enter “main” into the Branch name pattern field.
- In Protect matching branches, select the Require pull request reviews before merging checkbox. By default this requires one approving review.
- Change any other options as necessary.
- Select Save changes.
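If you manage several repos, you may prefer to set the same rule through the GitHub REST API rather than the web interface. The following is a minimal sketch, assuming the branch protection endpoint (PUT /repos/{owner}/{repo}/branches/{branch}/protection) and a token with admin rights on the repo. The repo name and token are hypothetical placeholders, and the request is only built, not sent.

```python
import json
import urllib.request


def branch_protection_payload(required_approvals: int = 1) -> dict:
    """Return a payload that only requires approving pull request reviews."""
    return {
        # The API expects these keys to be present; None leaves them disabled.
        "required_status_checks": None,
        "enforce_admins": None,
        "required_pull_request_reviews": {
            "required_approving_review_count": required_approvals
        },
        "restrictions": None,
    }


def branch_protection_request(owner, repo, branch, token, required_approvals=1):
    """Build (but do not send) the request that protects a branch."""
    url = f"https://api.github.com/repos/{owner}/{repo}/branches/{branch}/protection"
    body = json.dumps(branch_protection_payload(required_approvals))
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="PUT",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


if __name__ == "__main__":
    # "my-data-project" and "YOUR_TOKEN" are hypothetical placeholders.
    request = branch_protection_request(
        "alphagov", "my-data-project", "main", "YOUR_TOKEN"
    )
    print(request.get_method(), request.full_url)
    # To apply the rule, send it with urllib.request.urlopen(request).
```

This sketch covers only the minimum rule described above. Add any other options your project needs to the payload.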
Set up your project using govcookiecutter
You can use govcookiecutter to set up your data science projects.
govcookiecutter is a template for data science projects in HM Government and the wider public sector.
The GDS Data Science team designed govcookiecutter to:
- make sure data science projects use Agile approaches to adhere to AQA standards
- prevent data leakage by enforcing data version control and cleaning Jupyter notebook outputs
- centralise documentation local to code so documentation is kept up to date and changes made are visible to reviewers
- make sure secrets and credentials are usable locally, but kept out of version control
- implement a consistent folder structure to reduce onboarding time
If you use govcookiecutter to set up a data science project, it automatically covers most of these needs.
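As a rough illustration, you could generate a new project from the template with the cookiecutter command line tool. The following is a minimal sketch, assuming cookiecutter is installed and that the template lives at the URL shown. The URL is an assumption, so check the govcookiecutter README before using it.

```python
import shutil
import subprocess

# Assumed template location; confirm it in the govcookiecutter README.
TEMPLATE_URL = "https://github.com/best-practice-and-impact/govcookiecutter.git"


def cookiecutter_command(template_url: str = TEMPLATE_URL) -> list[str]:
    """Return the command that generates a new project from the template."""
    return ["cookiecutter", template_url]


def create_project(template_url: str = TEMPLATE_URL) -> None:
    """Run cookiecutter interactively; it prompts for the project details."""
    if shutil.which("cookiecutter") is None:
        raise RuntimeError(
            "cookiecutter is not installed; try 'pip install cookiecutter'"
        )
    subprocess.run(cookiecutter_command(template_url), check=True)
```

Calling create_project() from the folder where you keep your repos starts the interactive prompts and creates the project folder there.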
For more information, see the:
Organise your workload using Trello
GOV.UK Data Services adopts Agile ways of working. We use Trello to maintain a team-wide Kanban-style way of working, although specific projects may adopt other methodologies such as Scrum.
To make sure all projects are visible to the team, you should write and update cards on Trello.
Stories should cover a specific task that has valid acceptance criteria. Acceptance criteria are your “definition of done”. You should make the minimum developments needed to meet these criteria.
Draft any further work over and above these acceptance criteria as new stories.
Label and estimate Trello cards
For both epics and stories, make sure you label the Trello cards appropriately. If relevant, you should also try to label stories with a time estimate.
Try to include time for colleagues to review your work. If stories are sufficiently small and discrete, reviewing should take a reasonable length of time.
You should include time for colleagues to review because:
- explicitly providing review time gives reviewers dedicated time to carry out reviews
- not planning sufficient review time may lead to poorer quality reviews because of pressures unrelated to the story
- any unused review time is never lost as reviewers can start other stories
Try to make your stories manageable, and doable within a sprint (usually two weeks). This makes sure the story completion rate (otherwise known as velocity) is fast enough, and means that your stories are targeted to meet a need.
If a story is not well defined, split up that story into smaller stories to make it easier to complete them.
If it is difficult to estimate how long a story should take, consider running that story as a spike. A spike is a set time interval to explore and report back on a task.