(edit: Facebook finally published some information about the outage, corroborating expert suspicions)
Yesterday, Facebook, Instagram, and WhatsApp all went down (i.e. didn’t work) for a good chunk of the day. They haven’t given any official statements about what happened, but experts seem to agree that it was related to DNS and, underneath that, BGP routing. In simpler terms, the infrastructure that connects Facebook’s servers to the broader internet (and thus your laptop) broke. The issue highlights one important truth about the internet: it relies on a bunch of obscure, behind-the-scenes protocols and software that nobody really controls.
Outages are a fact of life: if you work in software they are bound to happen to your company sooner or later. There are a lot of different types of outages: they can be related to your application, your infrastructure, or even the infrastructure that supports your infrastructure.
Teams set up all kinds of monitoring, graphs, and alerts to catch these incidents before they happen. But you simply can’t prevent them all. This particular incident (again, we think) seems to have been related to DNS, so let’s dive into what that is exactly.
Someone famous once said that the internet is really just a bunch of cables, and that’s basically true; it just refers to all the computers in the world, networked together via cables or wireless. When you load a website on your laptop, what you’re really doing behind the scenes is just connecting to another computer – in this case, a server – far away, via a bunch of routers and switches. You ask that server for the web page you want, and it sends it over.
In that interaction between you and the server, there’s a lot going on behind the scenes. As you can probably tell, there’s no single cable that’s going from your laptop to Facebook’s server. There’s an entire set of infrastructure in the internet’s “middle” that takes care of taking your laptop’s request, routing it towards Facebook’s servers, and getting the answer back to you. A big part of that is DNS – the flashy subject of our next section.
If the internet is just a bunch of cables (and it is, as I literally just told you), an obvious question arises – how do all of the connected computers and servers know who each other are? Unlike Lloyd Braun, they do not wear name tags: instead there’s a shared “address book” that maps human-recognizable names to servers and computers. Just like you have contacts in your phone that map a name that’s (at least slightly) easier to remember to a number that’s really hard to remember, the DNS maps a domain name like www.facebook.com to a computer’s or network’s IP address.
Let’s start with an IP address – it’s the basic phone number of the internet. There’s a limited number of them, and each one is unique, just like an ID or a social security number. IP addresses aren’t unique to computers though – they’re unique to internet connected computers. Your computer isn’t actually connected to the internet directly; it runs through a router, which it reaches via a wireless connection if you’re using WiFi. That router is the “parent” of your local network, which might include your laptop, your phone, your internet connected fridge, and your neighbor who figured out your “hella secure” password 7 months ago.
🔍 Deeper Look 🔍
A LAN is a Local Area Network, which is what the set of devices in your home makes up. While each device in your LAN may have a unique IP address within your LAN, those addresses are not unique globally. You can think of the internet as made up of many of these LANs, sometimes grouped together into WANs (Wide Area Networks).
🔍 Deeper Look 🔍
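To make the “unique within your LAN, not unique globally” idea a bit more concrete, here’s a minimal sketch using Python’s built-in ipaddress module. The helper name is_lan_only and the sample addresses are just picked for illustration; the check itself relies on the module’s is_private flag, which is true for the address ranges reserved for private networks (plus a few other special-purpose ranges).

```python
import ipaddress

def is_lan_only(address: str) -> bool:
    """Return True if the address sits in a range reserved for private use.

    Addresses like these can be reused inside millions of different home
    networks, which is why they are not unique on the public internet.
    """
    return ipaddress.ip_address(address).is_private

for addr in ["192.168.1.1", "10.0.0.7", "8.8.8.8"]:
    kind = "private (LAN only)" if is_lan_only(addr) else "public"
    print(f"{addr}: {kind}")
```

The first two addresses come back as private and the third as public, which is roughly the difference between “your router” and “a server out on the internet.”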
It would be really annoying to type an IP address (a string of numbers in the style of ‘192.168.1.1’) into your browser every time you wanted to visit Facebook, so the internet developed something called DNS, which stands for Domain Name System. The DNS is the address book that maps domain names like www.facebook.com to IP addresses. These domain names are unique on the public internet, just like IP addresses; but they’re a lot easier to remember. Logistically, these domain names are sold and managed by various registrars and registries across the web. This is a very complicated situation that you can read more about here.
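If the address book analogy feels abstract, here’s a toy version of it in Python. To be clear about the assumptions: the ADDRESS_BOOK dictionary, the resolve function, and every name and address in it are invented for illustration (the 203.0.113.x range is reserved for documentation). Real DNS is a distributed system of many servers, not one dictionary, so treat this as a sketch of the idea rather than how it’s actually built.

```python
# A toy "address book": domain name -> IP address.
# All entries are made up; 203.0.113.x is a range reserved for documentation.
ADDRESS_BOOK = {
    "www.example.com": "203.0.113.10",
    "mail.example.com": "203.0.113.25",
}

def resolve(domain_name):
    """Return the IP address for a domain name, or None if it isn't listed.

    Domain names are case-insensitive and may carry a trailing dot,
    so both are normalized before the lookup.
    """
    return ADDRESS_BOOK.get(domain_name.lower().rstrip("."))

print(resolve("www.example.com"))    # 203.0.113.10
print(resolve("WWW.Example.com."))   # 203.0.113.10
print(resolve("nope.example.com"))   # None
```

For a real lookup, Python’s standard library can ask the actual DNS for you: socket.gethostbyname("www.facebook.com") returns whatever IP address your resolver hands back at that moment.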
The final piece you need to understand what happened to Facebook is called BGP routing. BGP stands for Border Gateway Protocol, and it’s commonly referred to as the “postal service” of the internet. Its job sounds simple: figure out which route data should take between two points on the internet. When you load www.facebook.com, BGP is responsible for finding a good path to Facebook’s servers (and back), via the crazy, disorganized network of computers that is the internet. In practice, “good” is decided by each network’s own policies and by how many networks a route has to cross, not by raw speed.
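Here’s a heavily simplified sketch of that route-picking idea. It assumes just one rule: among the routes a network has heard about for a destination, prefer the one that crosses the fewest networks (the shortest “AS path”). Real BGP weighs several other attributes and local policies before it gets to path length, so this is an illustration, not the actual algorithm. The Route type, the best_route function, the prefix, and the network numbers are all hypothetical (the numbers come from ranges reserved for documentation).

```python
from typing import List, NamedTuple, Tuple

class Route(NamedTuple):
    prefix: str                # the block of addresses this route leads to
    as_path: Tuple[int, ...]   # the networks (AS numbers) the route crosses

def best_route(routes: List[Route]) -> Route:
    """Pick the route with the shortest AS path.

    Simplification: real BGP checks other attributes and policies first.
    Ties go to whichever route appears first in the list.
    """
    return min(routes, key=lambda route: len(route.as_path))

# Three hypothetical ways to reach the same destination.
candidates = [
    Route("203.0.113.0/24", (64496, 64497, 64498, 64511)),
    Route("203.0.113.0/24", (64496, 64511)),
    Route("203.0.113.0/24", (64499, 64500, 64511)),
]

chosen = best_route(candidates)
print(chosen.as_path)  # (64496, 64511)
```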
So to sum up: the internet is a big network of computers. When you load a site, you’re sending a request to that site’s servers and getting some data back. DNS tells you which IP address a domain name actually points to; BGP takes care of working out how your data gets there and back across all the networks in between.
The wacky thing about BGP is that it’s basically autonomous – there’s no central body controlling it, even though it’s seemingly one of the most important parts of internet bedrock. What that means is that misconfiguring anything related to BGP can take entire swaths of the internet offline, because traffic can’t find them.
This is what many people believe happened to Facebook. Facebook operates its own data centers – thousands (or more) of interconnected servers – that store all of your data, host the app, and also carry internal Facebook services like email and internal tools. They seem to have (accidentally?) removed the BGP routes that connect their DNS – the mapping of Facebook domains to their server IP addresses – to the rest of the web. There was nothing wrong with Facebook’s servers or their apps; it’s just that nobody could reach them via the internet.
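The chain of events described above can be sketched in a few lines of Python. Everything here is a toy model under stated assumptions: the RoutingTable class, the resolve function, the name server address, the zone entry, and the network number are all made up (documentation ranges again), and real routers and resolvers are far more involved. The point is only to show that once the route to the name server is withdrawn, lookups fail even though the address book itself is untouched.

```python
import ipaddress

class RoutingTable:
    """A toy routing table: address prefix -> the network that announced it."""

    def __init__(self):
        self.routes = {}

    def announce(self, prefix, origin):
        self.routes[ipaddress.ip_network(prefix)] = origin

    def withdraw(self, prefix):
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the origin of the most specific matching prefix, or None."""
        ip = ipaddress.ip_address(address)
        matches = [net for net in self.routes if ip in net]
        if not matches:
            return None
        return self.routes[max(matches, key=lambda net: net.prefixlen)]

# Hypothetical name server and the one record it holds.
NAMESERVER_IP = "198.51.100.53"
ZONE = {"www.example.com": "203.0.113.10"}

def resolve(name, table):
    """Resolve a name, but only if the name server is reachable at all."""
    if table.lookup(NAMESERVER_IP) is None:
        return None  # no route to the name server, so nobody can ask it
    return ZONE.get(name)

table = RoutingTable()
table.announce("198.51.100.0/24", "AS64500")
before = resolve("www.example.com", table)

table.withdraw("198.51.100.0/24")  # the (accidental?) route removal
after = resolve("www.example.com", table)

print(before)  # 203.0.113.10
print(after)   # None
```

Notice that ZONE never changes. The record is still sitting there after the withdrawal; there’s just no path to the server that holds it.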
⛓ Related Concepts ⛓
There are a lot of reasons that apps become incommunicado. This particular example relates to public internet networking issues, but that’s just one piece of the puzzle. Even within the category of networking issues there are other reasons things can fail, e.g. your app might be unable to reach your database. Outside of that, apps themselves can be faulty, databases can go down, and so on. It’s rough out there; pat your DevOps team on the back.
⛓ Related Concepts ⛓
This isn’t the first time that BGP issues have led to major outages. Quoting from Cloudflare’s explainer on BGP:
“In 2004 a Turkish Internet service provider (ISP) called TTNet accidentally advertised bad BGP routes to its neighbors. These routes claimed that TTNet itself was the best destination for all traffic on the Internet. As these routes spread further and further to more autonomous systems, a massive disruption occurred, creating a 1-day crisis where many people across the world were not able to access some or all of the Internet.
Similarly, in 2008 a Pakistani ISP attempted to use a BGP route to block Pakistani users from visiting YouTube. The ISP then accidentally advertised these routes with its neighboring AS’s and the route quickly spread across the Internet’s BGP network. This route sent users trying to access YouTube to a dead end, which resulted in YouTube being inaccessible for several hours.”
In other words, this is an uncommon but not unheard-of issue. Nobody is quite sure why these routes weren’t working, but the speculation is that Facebook accidentally misconfigured them. This also may be why it took them so long to fix: their entire intranet (internal internet) may have been down, which blocked people from communicating with each other and even from accessing the infrastructure they needed to fix it.