Xero suffered a two-hour outage. While the company is unsure of the exact cause, the Xero blog indicates that it is “related to a software issue (triggered by the hardware failure) in the switch that corrupted both the primary and redundant network infrastructure”. It is possible that a hardware failure occurred in a load-balancing server and the software then failed to operate properly, so that when one side failed the other did not kick in. Xero will continue to investigate the matter with the hardware supplier and will no doubt improve its processes to ensure that there is a faster way to recover in the future.
This can be one of the problems for cloud-based companies without a physical presence in the data centre. Companies that used to run their own equipment could run to their own computer room and “fix” the problem manually. Nowadays the distances involved mean that this is not so simple. While it is possible to reboot devices remotely, some fixes need more subtlety and someone on site.
Customers in both Australia and New Zealand were affected, but not those in the UK or US. An outage at the hosting company is the worst nightmare for any cloud software company. It is outside their control, and while all the preparation and testing in the world can try to ensure that your solutions do not fail, at some point the unexpected will happen.
The reactions amongst customers were surprisingly mixed, with many taking the more positive view that this hadn’t happened for several years. For one accountant, Fin Adnam, it provided the opportunity for other pursuits, as he commented: “Thanks for the update, earning no money is not ideal, but the swell is running, so I’m off for a surf!” There were many similar comments, and while a few customers showed more annoyance than acceptance, Xero did try to keep people informed with a series of updates which tried to explain what had happened and apologised for the outage.
Critical incidents like this are rare and it will be interesting to see what Xero learn from the episode. While Xero has a failover solution for its hardware, one questions why this was not invoked sooner, unless it is held within the same data centre. If this is the case, then perhaps Xero needs to consider a second data centre with technology that allows the service to switch across instantly. While it is easy to point fingers and make suggestions from the armchair, these are questions that customers in all regions will want answered.
The outage, according to IsitDownrightnow, started at 11.52 and was resolved exactly two hours later at 13.52. For some businesses this meant a long lunch; for others, especially any in hospitality, this would have been a nightmare. While questions have been asked of Xero, businesses also need to ask themselves what their own contingency plans are.
For accountants who rely on the software this is tricky, as their business revolves around the born-in-the-cloud company, but for the small businesses they support there should be some plan that can be carried out during downtime. Too often companies rely on IT services without contemplating what it would mean to be without them, and for many, having a plan that can cover such an eventuality is sensible.
Who was affected most?
Xero will have suffered the embarrassment of its first outage for a while, and its competitors, or at least their salesmen, will no doubt use this in the near future. It now seems likely that Xero will accelerate its plans, first announced in May 2015, to migrate its hosting environment to AWS.
This will not be the last outage that users will suffer, but it appears that Xero communicated throughout the process, which is, to be fair, all it could do. What Xero now needs to do is ensure that its failover processes will work in a similar event. It may be that the company is waiting until its services have moved to AWS, which is all the more reason for that project to be accelerated. We requested a comment from Gary Turner, CEO UK at Xero, who responded: “We’re obviously very disappointed that Xero was unavailable to our customers for just over two hours last night, particularly since this is the longest period of unscheduled downtime in six years. We’re also grateful to our customers for their patience while we worked to resolve the issue.”
Hopefully Xero will come back with something more constructive that sets out what it has done, or will be doing, to prevent this kind of event from happening again. While it is unlikely the incident will be replicated exactly, Xero will need to prove that its disaster recovery solution works better than it did last night.