Friday 6 May 2011

Amazon EC2 Issues: Observations and Cloud Disaster Recovery

Being a daily user of Hootsuite, I was first alerted to the Amazon EC2 outage by a “Service Down” message, strangely in Japanese, on the Hootsuite website at 8am on Thursday 21/4. A quick Google search later, the size of the outage became apparent. The impact and the sheer number of companies affected were a very prominent illustration of how popular cloud application hosting has become.

At the time of writing, the root cause of the issue hasn’t been posted; however, it appears that there was a major network failure. The main problems for customers seemed to arise when EBS (Elastic Block Store) volume replication was disrupted, and the speculation is that a significant re-mirroring process potentially overloaded the EBS plane.

I am certainly not going to blog about the misfortune of Amazon and its customers; however, I would be naive not to think that what has been termed a “plane crash” in cloud computing terms would be playing heavily on the minds of our customers. Looking at the constructive comments to come out of the issue (in the absence of a root cause analysis), there are two main threads:

1) A need for improved communication.

A big part of the frustration with Amazon was the lack of information and the poor flow of communication following the outage. Quantix is a smaller company, and during any issue our lines of communication run through the account management team, but I believe all service providers should be looking for improvements. I for one will be sitting down with my team and making suggestions for additional channels.


2) A need for customers to build their own resilience across multiple Amazon availability zones.

Amazon’s high levels of availability to date may have caused complacency among its customers, leaving them unprepared and caught out by an unforeseen failure scenario. I’m sure these latest issues will make a lot of Amazon customers bulk up their resilience across multiple availability zones.
As a provider of Managed Cloud Services, Quantix mitigates this risk by taking on the responsibility for our customers. This means that every server running on the Quantix cloud has full failover capability from our primary to our secondary DC, without any intervention from the customer. Please see this link for more information on our cloud topology: http://www.quantix-uk.com/about-quantix/quantix-cloud-platform/cloud-technology-topology


The cloud metaphor has been very useful for describing a way of consuming IT; however, it lacks any clarity about what goes on “under the bonnet”. Therefore, if you need any clarification on the way Quantix operates its managed cloud, please feel free to contact me directly and I’d be more than happy to talk it through at length.