Last week, Microsoft's Azure customers in North Europe were unable to use or manage resources hosted in a regional data center for roughly seven hours, after a minor accident left storage unavailable. The mishap triggered a chain of automated safety precautions and thermal complications that stretched recovery to about seven hours. The incident illustrates the complexity of operating cloud hosting services, where a small error can ripple outward to affect thousands of customers.
For many hours, the unavailability of a portion of the storage scale unit affected a variety of Azure services that depend on it, including virtual machines, Azure Backup, app services, Azure Cache, Azure Monitor, Azure Functions, Time Series Insights, Azure Analytics, HDInsight, Azure Data Factory, Azure Scheduler, and Azure Site Recovery.
Microsoft has released a detailed report summarizing exactly what happened, along with a commitment to prevent such events in the future. According to the report, it all started during periodic maintenance of the data center's fire suppression system, when workers accidentally released the suppression agent. The release of the fire suppressant caused an automatic shutdown of the Air Handler Units (AHUs) as a safety precaution.
When the AHUs were restarted 35 minutes later, isolated areas of the impacted zone recorded ambient temperatures above operational thresholds, the report says. This in turn led the unit's internal thermal health monitoring system to carry out automatic shutdowns to prevent systems from overheating and to ensure data durability. Because isolated areas of the impacted zone experienced variable temperatures, some servers and storage resources were unable to shut down in a controlled manner, the report notes. Due to this complication, it took engineers almost seven hours to troubleshoot and recover all resources and return the unit to normal operation.
The company has initiated an investigation to analyze the fire suppression system maintenance procedure and narrow down the root cause. The report also notes how customers could have reduced the impact of the mishap on their own workloads: “Implementation of virtual machines in availability sets with managed disks would have provided resiliency against significant service impact for VM based workloads,” it states. The company has also announced that it is working to improve recovery times for storage resources and will give another update by October 13.
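The recommendation quoted above, running VMs in an availability set with managed disks, can be sketched with the Azure CLI. This is an illustrative sketch, not taken from Microsoft's report; the resource group, availability set, and VM names below are hypothetical placeholders.

```shell
# Create an availability set ("myResourceGroup" and "myAvailabilitySet" are
# hypothetical names). Fault domains spread member VMs across racks with
# separate power and networking; update domains stagger planned maintenance.
az vm availability-set create \
  --resource-group myResourceGroup \
  --name myAvailabilitySet \
  --platform-fault-domain-count 2 \
  --platform-update-domain-count 5

# Create two VMs inside the set. Managed disks are the default with
# `az vm create`, and Azure places the disks of availability-set members
# on separate storage scale units, so the failure of a single unit (as in
# this incident) should not take down every instance of the workload.
for i in 1 2; do
  az vm create \
    --resource-group myResourceGroup \
    --name "myVM$i" \
    --availability-set myAvailabilitySet \
    --image Ubuntu2204 \
    --admin-username azureuser \
    --generate-ssh-keys
done
```

An availability set only protects against failures within a data center; traffic still needs a load balancer or application-level failover to reach the surviving VM.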