
Microsoft Azure entire UK South Region outage on 4th July 2019 explained

On 4th July 2019 at 07:23 we noticed that one of our backup clusters wasn’t responding.

We contacted Azure priority technical support, who were initially unaware of the outage; no errors were showing on our Azure portal.
Within five minutes they confirmed that other customers were reporting the same issue.

The Azure status page showed that all services had failed, although we had only lost one cluster. The cluster in question had previously existed as an on-premises server which we had replicated to Azure, and because of that architecture its data was stored on disks rather than in blob storage. Presumably the disks are connected to the VPS via a switch, and it is most likely that device which failed. One of the features we like about Azure is that it shuts down VPSs if they can’t connect to their storage within two minutes.
Our other clusters were still online, presumably because their storage was held in blobs rather than on disks.

I agree that Microsoft’s policy of shutting down VPSs within two minutes of a storage outage is a good one. We didn’t lose any data either.
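
As a rough illustration of how that shutdown behaviour surfaces on the customer side, the sketch below polls a VM’s instance view with the Azure Python SDK and raises an alert once the platform has stopped it. This is a minimal sketch rather than our actual monitoring: the subscription, resource group and VM names are placeholders, and the one-minute polling loop is an assumption of ours.

# Minimal sketch: spot when Azure has shut a VM down (for example after the
# 120-second storage connectivity timeout described above).
# Assumes the azure-identity and azure-mgmt-compute packages; all names below
# are placeholders, not our real resources.
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
VM_NAME = "<backup-cluster-vm>"

client = ComputeManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

def power_state(resource_group, vm_name):
    # Returns the VM's power state code, e.g. "PowerState/running".
    view = client.virtual_machines.instance_view(resource_group, vm_name)
    codes = [s.code for s in view.statuses if (s.code or "").startswith("PowerState/")]
    return codes[0] if codes else "PowerState/unknown"

while True:
    state = power_state(RESOURCE_GROUP, VM_NAME)
    if state != "PowerState/running":
        # The platform has stopped the VM, most likely because it lost
        # connectivity to its disks.
        print(f"ALERT: {VM_NAME} is {state}")
    time.sleep(60)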

So far with Azure we have had 100% data durability and more than 99.9% uptime.

*************
Microsoft’s response

Mon 08/07/2019 11:35
This is regarding the high-severity regional platform issue that impacted the entire UK South Region.

As per the internal records, the issue is now mitigated and the root cause has been identified as a storage host that malfunctioned due to high levels of resource utilization on a single storage scale unit.

*************
The network status we published during the outage

4th July 2019: 07:23

We are experiencing an issue with our B23 cluster on Microsoft’s cloud.
This appears to be a Microsoft cloud issue and we are communicating with their support team now. The error is shown below.

At 07:22 AM, Thursday, 04 July 2019 UTC, your virtual machine became unavailable because it lost connectivity to its remote disk. Azure platform continuously monitors reads and writes (IO transactions) from your virtual machine to Azure Storage service where your virtual machine’s disks reside. If transactions do not complete successfully within 120 seconds (inclusive of retries), connectivity is considered to be lost and your virtual machine is shut down temporarily. This is done to preserve data integrity and prevent corruption of your virtual machine. Once Azure platform detects that connectivity to Storage service is restored, your virtual machine is automatically restarted. RDP and SSH connections to your virtual machine, or requests to any services running inside your virtual machine may have failed during this time. We apologize for any inconvenience this may have caused you. We are continuously working on improving the Azure platform to reduce availability incidents.

We apologise for any inconvenience.

————
09:02
Microsoft Support has confirmed this is a regional issue and other customers have reported the outage.

————
09:12
The service has been restored and we continue to monitor this.

————
09:35
The service has failed again.

To prevent any false reporting, we will only clear the error 30 minutes after it has been marked as resolved.

————
12:35
Azure have updated their status page: https://status.azure.com/en-us/status

————
14:36
This issue has been resolved. Please open a new ticket if you are experiencing issues.

BOBcloud NOC
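
For anyone wanting to cross-check an incident like this, the platform-initiated shutdown and restart described in the 07:22 message above are also recorded in the subscription’s Activity Log. The snippet below is a rough, assumed sketch of pulling those entries with the Azure Python SDK; the subscription ID, resource group and time window are placeholders rather than our own.

# Rough sketch: list Activity Log entries around the outage window so the
# platform-initiated VM shutdown/restart events can be reviewed afterwards.
# Assumes the azure-identity and azure-mgmt-monitor packages.
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"

client = MonitorManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Standard Activity Log OData filter covering the morning of the outage.
log_filter = (
    "eventTimestamp ge '2019-07-04T07:00:00Z' and "
    "eventTimestamp le '2019-07-04T15:00:00Z' and "
    f"resourceGroupName eq '{RESOURCE_GROUP}'"
)

for event in client.activity_logs.list(filter=log_filter):
    # Health and availability events appear alongside normal operations.
    print(event.event_timestamp,
          event.operation_name.localized_value,
          event.status.localized_value)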

 


Andreas,
BOBcloud CTO

 

