Microsoft’s Australia East region suffered an issue over the past few days, with degraded performance for a number of users accessing services reliant on storage. The initial investigation seemed to centre on the network, since the symptom was higher-than-usual latency, as reported in The Register.
The resolution seemed to take some time in coming. With the network initially cleared, the finger almost seemed to point at service providers, with the following update: “Engineers have now confirmed that there are no issues with the Azure Networking infrastructure and are now engaged with partners in the region who are working towards mitigation.”
Eventually the problem was resolved, with Microsoft putting the root cause down to a single underlying Storage Scale Unit reaching its safe utilization thresholds. This resulted in increased latency and timeouts across some services. While this might have been hard to track down, one assumes that Microsoft actually has alerts to tell it when thresholds are breached.
Microsoft also informed customers of what it did to rectify the problem: “Storage resources have been rebalanced to ensure consistent performance across the entire region.” This is an obvious step, and one wonders why the platform did not do it automatically. However, the statement issued about next steps is probably the most concerning: “Continue to monitor the performance of the storage system to ensure the issue does not reappear.”
This seems incredibly reactive and hardly helps stop the issue from happening again. Perhaps it is more about the language used. “Continue” implies that they were already doing this, and yet the problem has already occurred once. Surely it would be better to review their threshold protocols, possibly lowering the thresholds, and set up better alerts so that when those thresholds are breached, mitigating action is taken.
For customers in Australia who suffered from the degraded service from the 23rd to the 29th, this must be frustrating. Incidents like this do happen with cloud vendors, but one would expect an answer to the issue to be found in less than seven days. Furthermore, Microsoft’s subsequent remedial actions seem a little thin.
Problems such as these are luckily not daily occurrences on Azure. In fact, it is likely that only a small subset of customers was affected by the issue. What is disappointing, though, is that Microsoft’s response suggests more will not be done to ensure the problem has far less impact in the future, or is prevented from happening at all.
Lowering the thresholds to the point where, when they are breached, remedial action can be taken before there is any impact on customers seems like a sensible move. The issue may be that this would involve higher costs for Microsoft. The scalability of the cloud is one of its attractions, and if companies believe they can be assured of better performance on-premises, then for some companies the migration to the cloud may slow down.
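The proactive approach argued for here can be sketched in a few lines. The figures, names, and the “rebalance” trigger below are purely illustrative assumptions, not Azure’s actual limits or internal logic:

```python
# Hypothetical sketch: a lower early-warning threshold triggers
# rebalancing before the hard safe-utilization limit is reached.
# Both values are made up for illustration, not Azure's real limits.

SAFE_UTILIZATION = 0.90   # assumed hard limit where latency degrades
EARLY_WARNING = 0.75      # lower threshold that prompts action first

def check_scale_unit(utilization: float) -> str:
    """Return the action to take for a storage scale unit's utilization."""
    if utilization >= SAFE_UTILIZATION:
        return "incident"     # customers already impacted
    if utilization >= EARLY_WARNING:
        return "rebalance"    # shift load before any customer impact
    return "ok"

# A unit at 80% utilization would be rebalanced proactively,
# rather than waiting until the 90% limit is breached.
print(check_scale_unit(0.80))  # rebalance
```

The point of the sketch is simply that acting on the lower threshold turns a customer-visible incident into routine background maintenance, at the cost of holding more headroom in reserve.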
Whenever these incidents occur, cloud providers need to do more to ensure that the issue is not only resolved faster but also, where possible, prevented from recurring. Seven days of degradation is no laughing matter for a business (the Microsoft blog noted that the incident ran from 23 to 29 February), and if this impacted the bottom line of some businesses, it is even more surprising that no apology of any sort has been issued.