December 7 used to be known as a date that would live in infamy, but in 2021 it was the day that annoyed many Amazon Web Services (AWS) users. It also vexed many more people who had not realized that their most used services all rely on AWS.
Robotic vacuum cleaners could not be summoned. Music and movies stopped streaming. Whole Foods orders were suddenly canceled. Parts of Amazon’s mammoth retail operation slowed to a standstill. With operational and control systems paralyzed, packages waiting to be shipped or delivered sat idle on docks.
Joe Stefani, an Amazon seller in Chicago, said his business, Desert Cactus, could not get inventory into the company’s warehouses due to the outage. Stefani said Amazon handles 90% of his company’s orders, shipping products to customers from its fulfillment centers.
Sellers such as Stefani could not access Seller Central, an internal system Amazon uses to manage customer orders. That meant Stefani was unable to print out shipping labels that are required for any shipments sent to Amazon warehouses.
“We could not send in at least 10,000 to 12,000 items,” including NBA and NHL merchandise, Stefani said. “It will end up costing us money in the long run.”
No AWS, no vast Disney empire, no Unilever or Novartis, no Sony or Hitachi, no Johnson & Johnson, no Delta Air Lines, General Electric, or Siemens. These companies and many more rely on AWS for their cloud services. AWS hosts streaming platforms like Netflix, Disney+, and Amazon Prime Video; games like Valorant, League of Legends, and World of Warcraft; and apps like Salesforce, Venmo, and Coinbase, among many other services. Both public and private environments hosted with AWS were impacted.
There are a lot of digital eggs in the AWS basket, and unfortunately major outages have happened with surprising regularity.
Internet outages continue to happen, which raises the question: Why? And if something is fundamentally wrong with the internet, do we need to re-architect it?
Here’s what Amazon says caused US-East-1’s woes:
“At 7:30 AM PST, an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network. This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks. These delays increased latency and errors for services communicating between these networks, resulting in even more connection attempts and retries. This led to persistent congestion and performance issues on the devices connecting the two networks.”
The outage also affected Amazon’s ability to provide updates, it said.
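The feedback loop in Amazon’s statement, where delays cause retries and retries cause more delays, is the kind of problem that client software usually tries to soften with exponential backoff and random jitter. The sketch below is only an illustration of that general technique, not a description of how AWS clients behave or of how Amazon fixed the problem. The names backoff_delay and call_with_backoff, the default timings, and the operation callable are all assumptions made for the example.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Return a randomized wait in seconds for a 0-based retry attempt.

    The ceiling doubles with each attempt until it reaches `cap`. Picking a
    random point below the ceiling ("full jitter") spreads clients out so
    they do not all retry at the same instant.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


def call_with_backoff(operation, max_attempts=5, base=0.5, cap=30.0,
                      sleep=time.sleep, rng=random):
    """Call `operation` until it succeeds or `max_attempts` is used up.

    `operation` is a hypothetical zero-argument callable that raises
    ConnectionError on a transient failure.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up rather than retry forever
            sleep(backoff_delay(attempt, base, cap, rng))
```

A caller would wrap a network request in a small function and pass it in, for example call_with_backoff(fetch_status), where fetch_status is whatever request the application needs to make. The limit on attempts matters as much as the waiting: a client that eventually gives up stops adding load to a network that is already congested.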
The outage began midmorning on the U.S. East Coast, said Doug Madory, director of internet analysis at Kentik, a network intelligence firm… “AWS is the biggest cloud provider and us-east-1 is their biggest data center, so any disruption there has big impacts to many popular websites and other internet services,” he said.
Madory said he did not believe the outage was anything nefarious. He said a recent cluster of outages at providers that host major websites reflects how the networking industry has evolved. “More and more these outages end up being the product of automation and centralization of administration,” he said. “This ends up leading to outages that are hard to completely avoid due to operational complexity but are very impactful when they happen.”
The most glaring weakness is that AWS operates as a public cloud: all its clients use the same server farms and services, without the distinct or customizable separation that other operating models offer. It also uses a relational database structure, which many companies have now moved away from.
AWS controlled 33% of the global cloud infrastructure market in the second quarter, according to Synergy Research Group, followed by Microsoft at 20% and Google at 10%. Revenue at AWS jumped 39% in the third quarter from a year earlier to $16.1 billion (about $50 per person in the US), outpacing growth of 15% across all of Amazon.
As we discussed before, protocols and business habits, both good and bad, are often the cause of systemic failures. This is also where we need to look for solutions.
Oversight of standardized protocols and workflows is important, yet it is often neglected once routine and automation take over.
You must understand that Amazon (or any provider) is under no real legal obligation to ensure your business continuity. This falls to you.
Disaster recovery is a part of business continuity, not the other way round.
If you do not take a business approach to your IT, your IT will break your business.
Planning for your future often requires the help of experts. C-level decisions require C-level resources to guide your path. Your business, your data, your decisions. You are an expert in your field; sometimes you need an expert in ours.
Conalogix is the resource to turn to for your digital transformation plan.
As we go to post today, December 15, 2021, AWS has suffered a second outage in as many weeks. This one occurred at the West Coast facility, US-WEST-1.