Microsoft Azure UK South Region outage on 14th September 2020
During the outage, 70% of our backup clusters in the UK South region of Microsoft’s Azure went offline.
What went wrong
The impact statement is published below, but what isn’t clear is how Microsoft deals with these issues.
They simply unplug or turn off their Hyper-V hosts when disaster strikes. All of our affected VMs were just switched off (evident from the warning on reboot). It would have been easy to save or gracefully shut down the guest VMs (or at least to try). The extra time the hosts stayed powered up wouldn’t have added much heat to the environment in this instance.
Our VMs were using LRS redundancy in UK South (London) and were commissioned over a period of 3 years. We have basically been there since Azure went live in the UK.
We don’t have PVLANs at Azure and presume the VMs sprawl across the DC wherever MS sees fit. MS said only ‘a subset of customers in the UK South’ had been affected, but 70% of our VMs went offline.
Posted from Azure’s status page
Summary of impact: This incident is now mitigated. Between 00:40 UTC on 15 Sep 2020 and 06:06 UTC on 15 Sep 2020, you were identified as a customer using Virtual Machines in UK South who may have experienced connection failures when trying to access some Virtual Machines hosted in the region. These Virtual Machines may have also restarted unexpectedly.
Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.
RCA – Connectivity Issues – UK South (Tracking ID CSDC-3Z8)
Summary of Impact: Between 13:30 UTC on 14 Sep and 00:41 UTC on 15 Sep 2020, a subset of customers in the UK South may have encountered issues connecting to Azure services hosted in this region. Customers leveraging Availability Zones and configured for zone redundancy would not have experienced a loss in service availability. In some instances, the ability to perform service management would have been impacted. Zone Redundant Storage (ZRS) remained available throughout the incident.
Root Cause and Mitigation: On 14th September 2020, a customer impacting event occurred in a single datacenter in UK South due to a cooling plant issue. The issue occurred when a maintenance activity that was being performed at our facility had the site shut down the water tower makeup pumps via their Building Automation System (BAS). This was shut down in error and was noticed at approximately 13:30 UTC when our teams began to inspect the plant.
By this time, the issue had begun to impact downstream mechanical systems resulting in the electrical infrastructure that supports the mechanical systems shutting down. Microsoft operates its datacenters with 2N design meaning that we operate with a fully redundant, mirrored system. The 2N design is meant to protect against interruptions which could cause potential downtime; however, in this case, the cascading failures impacted both sides of the electrical infrastructure that supports mechanical systems. When the thermal event was detected by our internal systems, automation began to power down various resources of the Network, Storage, and Compute infrastructure to protect hardware and data durability. There were portions of our infrastructure that could not be powered down automatically (for example due to connectivity issues); some of these were shut down via manual intervention.
It took approximately 120 minutes for the team to diagnose the root cause and begin to remediate the mechanical plant issues, with cooling being restored at 15:45 UTC. By 16:30 UTC temperatures across the affected parts of the data center had returned to normal operational ranges.
Networking recovery began at approximately 16:30 UTC by beginning power-cycling network switches to recover them from the self-preservation state they entered when overheated. The recovery order was prioritized to first bring Azure management infrastructure, Storage clusters, and then Compute clusters online. When network switches providing connectivity to a set of resources were power-cycled and started to show health, engineers began recovering the other classes of resources. Network recovery was completed at 23:32 UTC. Shortly after this, any impacted Storage and Compute clusters regained connectivity, and engineers took further steps to bring any remaining unhealthy servers back online.
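To put the RCA's timestamps side by side, here is a small Python sketch of our own (not something Microsoft published) that works out how long after 13:30 UTC each milestone happened. The event labels and helper names are our assumptions for illustration; the times are the ones quoted in the RCA above.

```python
from datetime import datetime, timezone


def utc(day, hhmm):
    """Build a UTC timestamp in September 2020 from a day number and 'HH:MM'."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2020, 9, day, hour, minute, tzinfo=timezone.utc)


def minutes_between(start, end):
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


# Timestamps as quoted in the RCA; the labels are our own shorthand.
events = [
    ("issue noticed", utc(14, "13:30")),
    ("cooling restored", utc(14, "15:45")),
    ("network recovery begins", utc(14, "16:30")),
    ("network recovery complete", utc(14, "23:32")),
    ("end of stated impact window", utc(15, "00:41")),
]

start = events[0][1]
# Minutes elapsed since the issue was first noticed, keyed by label.
elapsed = {label: minutes_between(start, when) for label, when in events}

# Duration of the network recovery phase on its own.
network_recovery = minutes_between(events[2][1], events[3][1])

for label, mins in elapsed.items():
    print(f"{label}: +{mins // 60}h {mins % 60:02d}m")
print(f"network recovery took {network_recovery // 60}h {network_recovery % 60:02d}m")
```

On these figures, cooling came back 2h 15m after the problem was noticed, but power-cycling the network switches took a further 7h 02m, which accounts for most of the 11h 11m impact window stated in the RCA.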
Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
- Review the logs and alarms from all affected mechanical and electrical gear to help ensure there was no damage or failed components. This is complete.
- Review and update Operational Procedure and Change Management to help ensure that the correct checks are in place and system changes via commands across systems are validated visually prior to commencement of work or return to a normal state.
- Validate and update the discrimination study for the Mechanical and Electrical systems.