On Monday, October 4, 2021, for just over six hours, a large share of people on the planet lost their primary access point to the Internet: Facebook (now Meta), and all 78 of the other companies under its umbrella, stopped loading new content, essentially going dark.
This is unbelievably huge.
The Scale And Scope Of This Outage Is Massive.
For End Users, It Just Looked Like The Internet Was Down.
For many people, Facebook is the Internet. When Facebook went down, many assumed their own Internet connection had failed. Some troubleshot their own equipment and found it working normally. Many reset their passwords and configurations, assuming they had been hacked. Others went to repair stores or called technical support to ask why their devices were no longer working.
A huge number of people turned to Twitter as a way to communicate, from Facebook's own apologies to users mocking the company for its very public misstep.
For technically savvy individuals, it appeared as a DNS error. DNS (Domain Name System) is essentially the address book of the Internet. It is the reference that translates "www.facebook.com" into the IP address of a location on the Internet (126.96.36.199, as one example), so when resolution failed, mobile phones, laptops, workstations, and applications were unable to find not only Facebook but any of Facebook's companies.
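What users experienced can be reproduced in a few lines. This is a minimal sketch using Python's standard `socket` module; the `.invalid` hostname is a deliberately unresolvable name (reserved by RFC 2606), used here only to imitate the lookup failure people saw during the outage:

```python
import socket

def resolve(hostname):
    """Return the IPv4 address for hostname, or None if the DNS lookup fails."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        # This is roughly what apps and browsers hit during the outage:
        # the name simply would not resolve to an address.
        return None

print(resolve("localhost"))              # normally "127.0.0.1"
print(resolve("www.facebook.invalid"))   # None: the address book has no entry
```

During the outage, `www.facebook.com` behaved like the second lookup: the name existed, but resolvers could no longer reach Facebook's DNS servers to answer for it.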
The Internet Forgot There Is A Facebook.
But what really happened at Facebook was a Border Gateway Protocol (BGP) error, which we will explore in more detail in our next post. BGP provides the map and directions to locations on the Internet, and it lost the routes to all of Facebook's and its companies' IP addresses.
So even if you had the specific IP address, you could not have reached their content anyway. Figuratively and effectively, the Internet forgot where Facebook was, and therefore could not direct you to it or to any other Facebook company.
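The distinction between "no address" (DNS) and "no route to the address" (BGP) can be shown with a toy routing table. This is a minimal sketch, not real BGP: the class, the prefix, and the peer name are all illustrative, and real routers exchange routes over a session rather than a Python dict:

```python
import ipaddress

class RoutingTable:
    """Toy BGP-style table mapping advertised prefixes to next hops."""

    def __init__(self):
        self.routes = {}  # ip_network -> next hop

    def advertise(self, prefix, next_hop):
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Longest-prefix match; None means 'no route', i.e. unreachable."""
        addr = ipaddress.ip_address(address)
        matches = [p for p in self.routes if addr in p]
        if not matches:
            return None
        return self.routes[max(matches, key=lambda p: p.prefixlen)]

table = RoutingTable()
table.advertise("157.240.0.0/16", "peer-A")   # illustrative prefix and peer
print(table.lookup("157.240.22.35"))          # "peer-A": a path exists
table.withdraw("157.240.0.0/16")              # the outage, in miniature
print(table.lookup("157.240.22.35"))          # None: address known, path gone
```

The second lookup is the key point: the IP address is still perfectly valid, but once the routes are withdrawn, no router knows how to get a packet there.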
Bad Process Gets You Every Time.
A big cause of this failure is that Facebook administers all of its companies' Internet locations in-house. All of its apps, site locations, and changes are operated by a centralized team that controls all of Facebook and its companies across the globe, leveraging identical resources for all sites at once. Essentially, all of its eggs are in the same configuration basket.
For their part, FB issued a fairly generic, non-committal response at first:
Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.
The next day, they followed up with:
During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally.
So a giant megacorporation, with more users across its vast collection of servers than most nations have citizens, was brought down by a single faulty command and a failure of the very process intended to prevent this occurrence.
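That process is worth sketching. Facebook's statement implies an audit step that should have caught the command but didn't. The guard below is hypothetical, not Facebook's actual tooling, and the 20% threshold is an arbitrary assumption; it simply illustrates the "blast radius" check such a process is meant to perform:

```python
def audit_change(current_routes, proposed_routes, max_withdrawal_fraction=0.2):
    """Reject any change that withdraws too much of the network at once.

    Hypothetical pre-flight check: if the proposed route set drops more than
    max_withdrawal_fraction of the current routes, refuse to run the change.
    """
    withdrawn = set(current_routes) - set(proposed_routes)
    if current_routes and len(withdrawn) / len(current_routes) > max_withdrawal_fraction:
        raise RuntimeError(
            f"refusing change: would withdraw {len(withdrawn)} of "
            f"{len(current_routes)} routes"
        )
    return True

audit_change(["r1", "r2", "r3", "r4", "r5"], ["r1", "r2", "r3", "r4"])  # small change: allowed
# audit_change(["r1", "r2", "r3"], [])  # would raise: withdraws everything
```

A check like this only helps if it is itself tested and cannot be bypassed; a buggy or skippable audit is exactly a failure of process rather than a failure of any one engineer.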
Facebook (Meta) Lost Money.
Contact ConaLogix and let us help you find the best way to update, manage, configure, and adjust your data systems, to ensure not only long life and secure access but also to prevent this type of error from costing your business money.