The Washington Post

Amazon’s cloud-computing outage on Wednesday was triggered by effort to boost system’s capacity

Efforts to add new computer servers created a cascading set of errors that hobbled Web-connected security cameras, robotic vacuums and publishing sites, the company acknowledged in a lengthy postmortem Saturday morning

The lobby of Amazon's New York offices. The company acknowledged technical errors in its cloud-computing network that took down large swaths of the Web on Wednesday. (Mark Lennihan/AP)

SEATTLE — The addition of new servers to Amazon’s dominant cloud-computing network triggered a cascading set of errors that took down large swaths of the Web on Wednesday, the company acknowledged.

Amazon said in a lengthy and technical blog post Saturday morning that a massive computing network in Northern Virginia began to fail after the company started to make “a relatively small addition of capacity” to the system just before 6 a.m. Eastern time on Wednesday. But because of “an operating system configuration,” the new capacity set off a series of errors that overwhelmed Amazon’s network of servers.

Within a few hours, the malfunctions began hitting customers of Amazon Web Services, the company’s cloud-computing unit. Customers of the Amazon-owned Ring security camera service couldn’t log in or watch video. Users struggled to operate their iRobot vacuum cleaners because the outage affected the iRobot Home app. And media companies, including The Washington Post (owned by Amazon founder and chief executive Jeff Bezos), experienced publishing system outages.


Amazon acknowledged that the system failure was exacerbated by the dependencies its various services have on one another. The company had been trying to add capacity to Amazon Kinesis, a service that customers use to process real-time data including video, audio and application logs. Because other Amazon cloud services rely on Kinesis, including its Cognito authentication offering, they failed as well. To resolve the issue, Amazon needed to restart a piece of its system it described as “many thousands of servers,” a lengthy process that had to be done gradually.

And because Amazon uses Cognito itself to let customers know about the status of its cloud operations through its Service Health Dashboard website, it couldn’t immediately update that site. The company has a backup method to update the site, but said “it is a more manual and less familiar tool for our support operators.”
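
The chain of failures described above can be illustrated with a short sketch. It is not Amazon's code or tooling. It is a hypothetical model, assuming only the three relationships named in the company's account (Cognito relies on Kinesis, and the Service Health Dashboard relies on Cognito), that shows how one failing service can reach others that never touch it directly. The names `DEPENDS_ON` and `affected_by` are invented for this illustration.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it relies on.
# Only the relationships named in Amazon's account are modeled here.
DEPENDS_ON = {
    "Kinesis": [],
    "Cognito": ["Kinesis"],
    "Service Health Dashboard": ["Cognito"],
}


def affected_by(failed, depends_on):
    """Return every service that relies, directly or indirectly, on `failed`."""
    # Invert the map so each service points to the services that rely on it.
    dependents = {}
    for service, deps in depends_on.items():
        dependents.setdefault(service, [])
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Walk outward from the failed service, one layer of dependents at a time.
    affected = []
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in affected:
                affected.append(service)
                queue.append(service)
    return affected


print(affected_by("Kinesis", DEPENDS_ON))
# prints ['Cognito', 'Service Health Dashboard']
```

In this toy model the status dashboard goes down even though it has no direct link to Kinesis, which mirrors the indirect failure Amazon described.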

An Amazon spokeswoman didn’t respond Saturday to a request for comment about the outage. In the blog post, the company pledged to do “everything we can to learn from this event.”

The failure underscores the danger of having only a handful of vendors manage global cloud computing. Amazon held 45 percent of the global market in 2019, according to the market research firm Gartner. In addition to Ring and iRobot, Amazon’s customers include Netflix, BP and Capital One, all of which run significant pieces of their computing operations on AWS.