14. Operate a reliable service

This guidance will help you apply standard point 14.

Everyone is responsible for meeting the Service Standard. This standard point is most relevant to:

  • Data architects
  • Delivery managers
  • Product managers
  • Frontend developers
  • Software developers

Why it's important

Users expect to be able to use DfE services when they need to.

Some DfE services are only available at certain times of the year. These services should still operate reliably outside those times, for example, if a school needs to access a service in an emergency.

All phases

Things to consider:

  • maximise uptime and speed of response for your service
  • make sure you can deploy software changes to your service regularly, without significant downtime. For example, by minimising the effort involved in creating new environments and populating pre-production environments with test data
  • create continuous integration and continuous delivery pipelines from an early phase of the project
  • carry out regular quality assurance and performance testing, overseen by the team rather than left to automated tools alone
  • test your service in an environment that protects users' privacy and that's as similar to live as possible
  • monitor the status of your service, and have a proportionate, sustainable plan to respond to the problems that monitoring identifies
  • agree recovery point objectives and recovery time objectives with the service owner
  • capture appropriate service-level agreements (SLAs) relevant to your service and document them in the architecture decision record (ADR)
  • make sure that runbooks and operational processes are captured, for example, how incident requests will be handled
  • ensure the team's approach to incident management is defined and documented
  • include outcomes for users and ethical issues, such as bias, when monitoring your service. These also need to be reflected in how you measure performance
  • plan who will be responsible for supporting your service once live
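
When agreeing SLAs and recovery objectives, it can help to turn percentages into time. The following is a minimal sketch in Python, not an official DfE tool. The function names, the 99.9% target and the incident figures are hypothetical examples, not DfE requirements.

```python
from datetime import timedelta


def allowed_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Return the downtime budget implied by an availability target over a period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period * (1 - availability_percent / 100)


def meets_recovery_objectives(
    data_loss: timedelta,
    time_to_restore: timedelta,
    rpo: timedelta,
    rto: timedelta,
) -> bool:
    """Check one incident against the agreed recovery objectives.

    data_loss is the age of the most recent recoverable data when the
    incident started (compared with the recovery point objective).
    time_to_restore is how long the service was unavailable (compared
    with the recovery time objective).
    """
    return data_loss <= rpo and time_to_restore <= rto


# Hypothetical example: a 99.9% availability target over a 30-day month.
budget = allowed_downtime(99.9, timedelta(days=30))
print(f"Downtime budget: {budget.total_seconds() / 60:.1f} minutes")

# Hypothetical incident: 10 minutes of data lost, 3 hours to restore,
# checked against an assumed RPO of 15 minutes and RTO of 4 hours.
ok = meets_recovery_objectives(
    data_loss=timedelta(minutes=10),
    time_to_restore=timedelta(hours=3),
    rpo=timedelta(minutes=15),
    rto=timedelta(hours=4),
)
print(f"Incident within objectives: {ok}")
```

A 99.9% target over 30 days leaves a budget of roughly 43 minutes of downtime. Seeing the figure as time rather than a percentage can make it easier to discuss with the service owner whether the target is proportionate.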