Saturday, June 30, 2012

(Real) Storm Crushes Amazon Cloud, Knocks out Netflix, Pinterest, Instagram

A storm in Virginia ruined Friday night movie-watching in California. Welcome to the Cloud. (Photo: Flickr/Mike Miley)
Can Amazon handle its fast-growing cloud?
Hurricane-like storms knocked an Amazon data center in Ashburn, Virginia, offline last night, and a chunk of the Internet felt it. The six-hour incident temporarily cut off a number of popular Internet services, including Netflix, Pinterest, Heroku, and Instagram.
The outage was the second for this particular Amazon data center in the past month. It’s bad news for a cloud computing platform that’s sold as a more reliable alternative to traditional data centers.
In theory, big outages like this aren’t supposed to happen. Amazon is supposed to keep the data centers up and running (something it has become very good at), and customers like Netflix, freed from that drudgery, are supposed to be able to focus on cooking up compelling new web applications such as video streaming.
In reality, though, Amazon data centers have outages all the time. In fact, Amazon tells its customers to plan for this to happen, and to be ready to roll over to a new data center whenever there’s an outage.
That’s what was supposed to happen at Netflix on Friday night. But it didn’t work out that way. According to Twitter messages from Netflix Director of Cloud Architecture Adrian Cockcroft and Instagram engineer Rick Branson, it looks like Amazon’s Elastic Load Balancing (ELB) service, designed to spread Netflix’s processing loads across data centers, failed during the outage. Without ELB working properly, the Netflix and Pinterest services hosted by Amazon crashed.
Friday’s outage wasn’t nearly as severe as the one that took out Amazon in April 2011. Then, a botched network update rolled across several data centers, causing widespread outages on the Amazon cloud.
“We lost a much bigger proportion of just one [Amazon data center] than the last power outage, and the ELBs didn’t route around it,” said Netflix’s Cockcroft, via Twitter.
So on Saturday, there are two big questions that need to be answered. First, why did Amazon’s Ashburn data center fail? A storm shouldn’t have taken out Amazon’s backup generators. Second, why were companies like Netflix so drastically affected by a single data center outage?
So far, Amazon isn’t saying a lot. “Severe thunderstorms caused us to lose primary and backup generator power to an Availability Zone in our east region overnight,” said Amazon spokeswoman Tera Randall on Saturday morning. “We have restored service to most of our impacted customers and continue to work to restore service for our remaining impacted customers.”
The powerful storms cut power to nearly a million customers, said Dominion Virginia Power. Winds reached 80 miles per hour, and the storms killed at least six people in Virginia, according to reports.
At Netflix, services were offline for about three hours (between 8 pm and 11 pm Pacific), according to company spokesman Joris Evers. “We’re actively working to analyze the cause and understand what happened,” he said Saturday.
Netflix doesn’t use Amazon to actually stream its video, so customers who were in the middle of watching movies wouldn’t have been interrupted. But Amazon powers virtually all of the back end services on Netflix.com, so the outage made connecting and starting up new movies impossible for customers.
The full explanation of what actually went wrong is sure to be complex.
When asked via Twitter whether he blamed Amazon or felt Instagram should have been better prepared for this outage, Branson said, “lol go troll someone else. I work for a living.”
Amazon has promised to say more about exactly what happened in the week ahead, and so has Netflix. Cloud-watchers are waiting.