We design digital government services to be secure and available for users 24 hours a day, all year round. However, despite our best efforts, there will inevitably be a time when a service faces a major incident and needs to be taken temporarily offline. In this blog post we share how GOV.UK Verify is prepared for such an event, as required under Point 11 of the Digital by Default Service Standard, which says:
Make a plan for the event of the digital service being taken temporarily offline.
Designing a service for high availability
Monitoring and alerting are key components of developing any service designed for high availability. If there is criminal activity on GOV.UK Verify we want to know immediately. We want to know if any part of GOV.UK Verify is about to fail before it does. If something fails suddenly, or something outside our control fails that impacts our service, we want to know before - or as soon as - our users do.
As such, we monitor our application, servers, network and storage to see how they are responding. We look to see who is making changes on the system and whether someone is trying to piggy-back on legitimate users to drop malware on our servers.
The output of our monitoring is graphed on screens in our office so that we are constantly aware of what is going on, and alerts are sent to on-call phones out of hours if anything indicates a major issue.
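To make the routing described above concrete, here is a minimal sketch in Python. It is not GOV.UK Verify's actual monitoring code: the metric name, the thresholds, the office hours and the destination labels are all assumptions chosen for illustration. It shows one way to graph every reading while sending only major, out-of-hours issues to the on-call phone.

```python
from dataclasses import dataclass
from datetime import datetime, time

# Hypothetical office hours; the post does not say what the real ones are.
OFFICE_OPEN = time(9, 0)
OFFICE_CLOSE = time(18, 0)


@dataclass
class Reading:
    metric: str      # illustrative name, e.g. "disk_used_percent"
    value: float
    warn_at: float   # early-warning threshold ("about to fail")
    major_at: float  # threshold treated as a major issue


def severity(reading: Reading) -> str:
    """Classify a reading as "ok", "warning" or "major"."""
    if reading.value >= reading.major_at:
        return "major"
    if reading.value >= reading.warn_at:
        return "warning"
    return "ok"


def in_office_hours(now: datetime) -> bool:
    """True on weekdays between the assumed opening and closing times."""
    return now.weekday() < 5 and OFFICE_OPEN <= now.time() < OFFICE_CLOSE


def route(reading: Reading, now: datetime) -> list[str]:
    """Every reading is graphed; major issues out of hours also page on-call."""
    destinations = ["dashboard"]
    if severity(reading) == "major" and not in_office_hours(now):
        destinations.append("on_call_phone")
    return destinations
```

In this sketch a warning-level reading only appears on the office screens, which matches the aim of spotting a likely failure before it happens without waking anyone up for it.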
Planning for a major incident
Having developed the service for security and availability, we then plan for disaster.
Our incident response plan starts with a technical question: how do we organise around an incident and restore GOV.UK Verify as quickly as possible? If the issue is security related, the aim is to restore GOV.UK Verify in a secure fashion, isolating the breach and carefully preserving the evidence in case we manage to track down the perpetrator. The difficulty is that when an incident starts it may not be immediately clear what the root cause is. Therefore, we have a single procedure that covers any kind of incident.
It’s essential that everyone knows what to do when an incident occurs and that roles and responsibilities are clear. To instil this in the team, we run game days in which we manufacture an event without warning the team, and then monitor the response. Once the scenario has been played through, we hold a retrospective, just as we would with a real incident, to identify what happened, looking at what worked well and where we could have done things better. This is fed back into the documentation and the next game day.
Communicating in a crisis
An equally important aspect of managing an incident is how we communicate what is going on to our various stakeholders. As GOV.UK Verify is part of a wider federation, there’s a range of user needs we’ve considered in our planning.
GOV.UK Verify’s users need to access a government service. If GOV.UK Verify is not available, they need to know what they can do (for example, wait until a certain time to try again or use an alternative way into the service). Our certified companies and all government services connected to GOV.UK Verify need GOV.UK Verify to interact with their services as usual or, if not, they need to know what is happening and what they can do.
We want to share accurate information with our stakeholders in a timely fashion. Sharing premature, late or simply incorrect information can be damaging, as so many companies have found to their cost over the years.
The GOV.UK Verify technical team are at the coalface uncovering what has actually happened and why, and they need to concentrate on that and on restoring the service. At the same time, they are uncovering details of the incident that need to be communicated onwards appropriately.
To manage this flow of information, the incident response team comprises: a technical lead, who organises the investigation team and directs the recovery; a documenter, who is responsible for recording the actions the team takes and keeping a timeline; and an incident manager, who ensures the team has everything they need and passes status information to the Head of Operations.
The Head of Operations works with the management team, which includes the Service Delivery Manager, the Programme Manager and the Head of Communications. Other members of the team will be drawn in as appropriate, such as the Legal Advisor and Head of Security. This group is responsible for external communications and escalation. In a major incident, they decide what to tell whom, and when.
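The flow of information between these roles can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names, fields and status format are hypothetical and do not describe the tooling the GOV.UK Verify team uses. The documenter appends to the timeline, and the incident manager derives a short status line from it to pass upwards.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class TimelineEntry:
    at: datetime
    action: str


@dataclass
class Incident:
    # The three roles named in the post; the values are just people's names.
    technical_lead: str
    documenter: str
    incident_manager: str
    timeline: list[TimelineEntry] = field(default_factory=list)

    def record(self, action: str, at: Optional[datetime] = None) -> TimelineEntry:
        """Documenter's job: note each action taken, with a timestamp."""
        entry = TimelineEntry(at or datetime.now(timezone.utc), action)
        self.timeline.append(entry)
        return entry

    def status_update(self) -> str:
        """Incident manager's job: summarise the latest action for escalation."""
        if not self.timeline:
            return "No actions recorded yet."
        latest = self.timeline[-1]
        return (
            f"{latest.at:%H:%M} UTC: {latest.action} "
            f"[timeline entries: {len(self.timeline)}]"
        )
```

Keeping the timeline separate from the status summary reflects the split described above: the technical team keeps working and recording, while the management group decides what is communicated externally.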
Reporting GOV.UK Verify’s service availability
We have Service Level Agreements around incident reporting and there is good practice to follow in the event of a security incident. Currently we notify events via verifystatus.digital.cabinet-office.gov.uk. We’d be really interested to know if you think reporting using other channels - such as Twitter - would be more useful to you. Let us know in the comments below.