Introduction to the Cluster Infrastructure and the System Provisioning Engineering teams
1.
2. Cluster Infrastructure & System Provisioning Engineering
Angelo Failla
Production Engineer – ClusterInfra Dublin
supporting rapid infrastructure and user growth
3. What do we do?
Efficiently bring up new capacity and manage the health of core services required to operate our infra.
12. POP TFTP: Asia -> Oregon
RRQ: 150 ms
ACK: 150 ms
GET DATA BLOCK 0: 150 ms
DATA BLOCK 0 PAYLOAD: 150 ms
GET DATA BLOCK N: 150 ms
DATA BLOCK N PAYLOAD: 150 ms
14. Solutions
Solution 1: use iPXE, since it talks TCP/HTTP
- it had a 10-minute watchdog (which we had to patch)
- after the patch it was still taking > 10 minutes
Solution 2: put an fbtftp server in every POP
- our own homemade TFTP server
- have it stream files over HTTP
- cache files locally
- a couple of minutes to download initrd/kernel
Solution 3 (currently investigating): use GRUB2 and download initrd/kernel via HTTP
- configurable TCP window size, patch sent upstream
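Solution 2 boils down to a TFTP front end inside the POP that fetches each file from an HTTP origin once and then serves it from a local cache. Here is a minimal Python sketch of that fetch-and-cache idea only; the names (fetch_cached, CACHE_DIR, the origin URL) are made up for illustration and this is not fbtftp's actual API.

```python
import os
import shutil
import urllib.request

# Hypothetical POP-local cache directory; not fbtftp's real layout.
CACHE_DIR = "/var/cache/tftp"

def fetch_cached(origin_base: str, filename: str) -> str:
    """Return a local path for `filename`, downloading it over HTTP on a cache miss.

    Later reads of the same kernel/initrd are served from the POP-local copy
    instead of crossing the high-latency WAN link again.
    """
    local_path = os.path.join(CACHE_DIR, filename)
    if not os.path.exists(local_path):
        os.makedirs(CACHE_DIR, exist_ok=True)
        url = f"{origin_base}/{filename}"
        # Stream the body to disk so large initrds don't sit in memory.
        with urllib.request.urlopen(url) as resp, open(local_path, "wb") as out:
            shutil.copyfileobj(resp, out)
    return local_path

if __name__ == "__main__":
    # Hypothetical origin; in reality this would be the provisioning web tier.
    path = fetch_cached("http://provisioning.example.com/boot", "initrd.img")
    print("serving", path)
```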
Hi everyone, my name is Angelo and I am a Production Engineer. I have been working at Facebook Dublin for the last 5 years, and I am part of the Cluster Infrastructure team.
Today I am going to talk to you about the Cluster Infrastructure and SPE teams.
As you know, we are serving more than 1 billion daily active users. You post a lot of cat pictures and we need to store them all! The user base is growing every day, and on top of that we keep adding more features to the products to drive engagement.
This takes a toll on the infrastructure. Even though we work hard to achieve performance wins across the physical and software stacks, we still need to add new DCs and clusters to our fleet every so often.
Our job is to help do this efficiently and as fast as possible: we need to be able to install operating systems on huge quantities of boxes with little or no human supervision, we need to service these servers for their roughly three-year life cycle, and so on.
Let's take a look at some of the things my team owns:
We own the internal and external DNS servers, serving both internal and external zones, and we own the DNS configuration pipeline (we also recently presented a talk @ FOSDEM).
We own the physical GPS appliances across our DCs and the different stratum servers, and we make sure all servers' clocks are synchronized.
We deploy a dynamic/stateless DHCP server based on ISC Kea (I have talked about it at SREcon Europe, so you can find the video and slides online); a sketch of the stateless-assignment idea follows this list.
We deploy a dynamic/stateless TFTP server written in Python 3, which we hope to release on GitHub soon.
We develop and support orchestration tools that prepare the infrastructure to receive new hardware.
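On the stateless DHCP point: one common way to achieve stateless assignment is to derive the client's address deterministically from its MAC (modified EUI-64 style), so the server needs no lease database. The sketch below is a generic illustration of that idea only, assuming a hypothetical /64 prefix; it is not necessarily how our Kea setup computes addresses.

```python
# Generic illustration of "stateless" IPv6 assignment: derive a stable address
# from the client's MAC via modified EUI-64, so no lease database is needed.
# The prefix and MAC below are hypothetical.
import ipaddress

def eui64_address(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """Build a modified EUI-64 interface ID from `mac` and append it to `prefix`."""
    octets = [int(x, 16) for x in mac.split(":")]
    octets[0] ^= 0x02                                  # flip the universal/local bit
    iid = octets[:3] + [0xFF, 0xFE] + octets[3:]       # insert FF:FE in the middle
    iid_int = int.from_bytes(bytes(iid), "big")
    return ipaddress.IPv6Network(prefix)[iid_int]

if __name__ == "__main__":
    print(eui64_address("2001:db8:0:1::/64", "52:54:00:12:34:56"))
```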
System Provisioning Engineering owns Cyborg, a tool built on top of the provisioning backend that orchestrates server and TOR provisioning. It follows machines as they reboot and makes sure they perform all the steps from the moment you power them on until they are ready to serve production traffic.
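To make the "follow the machine through every step" idea a bit more concrete, here is a hypothetical sketch of what such an orchestration loop can look like; the step names and helpers are invented for illustration and are not Cyborg's real implementation.

```python
# Hypothetical sketch of an orchestration loop that walks a machine through
# its provisioning steps; step names are invented, not Cyborg's real ones.
import time

STEPS = [
    "power_on",
    "pxe_boot",
    "install_os",
    "reboot_into_prod_kernel",
    "run_health_checks",
    "enable_production_traffic",
]

def run_step(host: str, step: str) -> bool:
    """Placeholder for the real work; returns True when the step succeeded."""
    print(f"{host}: running {step}")
    return True

def provision(host: str, retries_per_step: int = 3) -> bool:
    """Drive `host` through every step in order, retrying each one a few times."""
    for step in STEPS:
        for _attempt in range(retries_per_step):
            if run_step(host, step):
                break
            time.sleep(30)      # back off before retrying the same step
        else:
            return False        # step kept failing: hand off to repairs
    return True
```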
In order to do so, Cyborg needs to hold the parameters used during a provisioning job; that's IMP.
And as stuff breaks, you need ways to manage repairs and check that hardware is healthy.
Your system needs to be able to sustain the provisioning of thousands of machines concurrently, so there are certain assumptions you need to make and certain design decisions you have to take if you want to be able to support that.
But that's easy, right? Because you know what? Provisioning just works, right? It's hands-free!
Well… sort of. The vast majority of the time it works, but we still have to deal with edge cases…
We have a long list of hardware, firmware, kernel and initrd permutations to support. That is a lot of edge cases, and it can cause bugs that are difficult to triage and solve. Sometimes changing one thing to fix one of the edge cases can break other permutations, so we are working on improving our testing infrastructure using A/B testing techniques, continuous integration and so on.
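To get a feel for how fast that permutation matrix grows, here is a toy sketch that enumerates a handful of hardware, firmware, kernel and initrd variants; the version labels are made up.

```python
# Toy sketch: the provisioning test matrix grows multiplicatively with every
# hardware/firmware/kernel/initrd variant we have to support. Labels are made up.
from itertools import product

hardware = ["vendorA-gen8", "vendorA-gen9", "vendorB-gen3"]
firmware = ["fw-1.2", "fw-1.3"]
kernels  = ["4.x", "5.x"]
initrds  = ["initrd-2022", "initrd-2023"]

matrix = list(product(hardware, firmware, kernels, initrds))
print(f"{len(matrix)} permutations to cover")   # 3 * 2 * 2 * 2 = 24

# A fix validated A/B-style would be run against every permutation in `matrix`
# before rollout, instead of only the one that triggered the original bug.
for combo in matrix[:3]:
    print("test:", *combo)
```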
So far I have talked about generic challenges, but let’s describe something a bit more specific now. Let’s talk about TFTP…
I assume most of you know about TFTP: it's this very, very old (1981! I was born when TFTP was standardized!) file transfer protocol that is usually associated with network booting and embedded devices. Historically it has been used for netbooting because it's easy to implement and, due to its design, it can be implemented by small-footprint code that fits in ROMs, etc.
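To give a feel for how small the protocol really is, here is a sketch of a TFTP read request (RRQ) as defined in RFC 1350: a 2-byte opcode, then the filename and transfer mode as NUL-terminated strings, all in one UDP datagram. The file name below is made up.

```python
# Sketch of a TFTP read request (RRQ) per RFC 1350: opcode 1, then the
# filename and mode as NUL-terminated strings, sent as one UDP datagram.
import struct

def build_rrq(filename: str, mode: str = "octet") -> bytes:
    return struct.pack("!H", 1) + filename.encode() + b"\x00" + mode.encode() + b"\x00"

if __name__ == "__main__":
    pkt = build_rrq("initrd.img")   # hypothetical file name
    print(pkt)                      # b'\x00\x01initrd.img\x00octet\x00'
    # The server replies with DATA block 1, and the client then ACKs each block.
```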
In this example we have a POP somewhere in Asia and a DC on the US West Coast (Oregon). The network latency between them is 150 ms.
TFTP is a UDP-based protocol, therefore clients and servers have to implement their own flow control. Every block the client wants to download has to be requested with a special packet, so every block costs a full round trip.
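A quick back-of-the-envelope sketch of why that hurts: assuming 150 ms each way (as in the slide) and one round trip per block, even a modest file takes hours over plain lock-step TFTP. The 30 MB initrd size here is an assumed figure for illustration.

```python
# Back-of-the-envelope: lock-step TFTP over a high-latency link.
# Assumes 150 ms each way (per the slide), so each data block costs ~300 ms,
# no matter how fat the pipe is. The 30 MB initrd size is an assumed figure.
ONE_WAY_LATENCY_S = 0.150
RTT_S = 2 * ONE_WAY_LATENCY_S
FILE_SIZE_BYTES = 30 * 1024 * 1024        # assumed initrd size

for block_size in (512, 1428):            # default block size vs. a blksize option
    blocks = FILE_SIZE_BYTES // block_size + 1
    total_s = blocks * RTT_S
    print(f"block size {block_size:4d} B: {blocks:6d} blocks, "
          f"~{total_s / 3600:.1f} h to transfer")
```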
Another challenge we faced, approximately two summers ago, was bringing up our first IPv6-only cluster.
People and vendors say they support IPv6; the reality is that as soon as you remove v4, stuff is going to break, and badly.
Another challenge we have is being able to bring capacity up and down as fast as possible; having hardware sitting in DCs for too long is not nice (you end up wasting a lot of power).
Being fast at decommissioning and turn-up requires a lot of communication with tier owners to make sure that their services are fully integrated with our DC/cluster automation tooling.