Mahesh talks about the buddha-nature of Erlang/OTP, pointing out how the various features of the language tie together into one seamless Fault Tolerant whole. Mahesh emphasizes that Erlang begins and ends with Fault Tolerance. Fault Tolerance is baked into the very genes of Erlang/OTP - something that ends up being amazingly useful when building any kind of system. Mahesh Paolini-Subramanya is the V.P. of R&D at Ubiquiti Networks - a manufacturer of disruptive technology platforms for emerging markets. He has spent the recent past building out Erlang-based massively concurrent Cloud Services and VoIP platforms. Mahesh was previously the CTO of Vocalocity after its merger with Aptela, where he was a founder and CTO.
17-22. The Big Six
Concurrency
Fault detection
Fault identification
Error encapsulation
Code upgrade
Stable storage
From http://www.erlang.org/download/armstrong_thesis_2003.pdf
94-95. Queues
Can you recover quickly?
Buffer-bloat doesn’t matter, right?
Once up, can you deal with the backlog?
Back-pressure isn’t an issue, right?
NOPE
This is a story about unexpectedness. The only constant is change.
Our story starts on a happy Saturday in February.
(It’s still Friday.)
Just part of one cluster failed, but a threshold had been passed.
No worries, we’ll just bounce that one cluster; it’ll all be good.
Total System Meltdown
All the calls keep retrying, causing memory utilization to go through the roof.
Voicemail conversion was going on independent of everything else, causing CPU utilization to spike.
Eventually, the cache timed out and tried to reload everything from disk.
And then everyone tries the apps, and the Twitters and the Facebooks and the everythings.
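The retry storm above, where every client retries immediately and amplifies the overload, has a standard mitigation: capped exponential backoff with jitter. A minimal sketch in Python; the function name and the numbers are illustrative, not from the talk:

```python
import random

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Capped exponential backoff with 'full jitter'.

    attempt: how many retries have already failed (0-based).
    Returns a delay in seconds, drawn uniformly from
    [0, min(cap, base * 2**attempt)], so retrying clients spread
    out in time instead of hammering the server in lock-step.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Delays grow (on average) with each failed attempt, but never exceed the cap.
delays = [backoff_delay(n) for n in range(10)]
assert all(0 <= d <= 30.0 for d in delays)
```

The jitter is the important part: without it, all the failed callers retry at the same instant and recreate the spike.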
What about testing? Didn’t you check loads? Specs? Capabilities?
There is only so much planning you can do. At some point, the 1000-year flood hits.
The point being, shit will happen. The question is: when shit happens, can you clean up?
There is a formal definition of Fault Tolerance
The Six Essential Characteristics of a Fault Tolerant System
‘Distributed’ problems mean you spend a huge chunk of your time dealing with the administrivia of distribution. With Erlang you get that for free! Processes, messages, immutability. (See “Writing Concurrent Programs in Java”.)
OK, not really true. You still have to deal with ‘deep problems’ (hard-core parallelization issues, etc.). But you’d have to deal with those anyhow!
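The “processes + messages + immutability” point can be illustrated outside Erlang too. A toy sketch in Python (Erlang gets this natively; here two ‘processes’ share nothing and communicate only through queues, a rough stand-in for Erlang mailboxes):

```python
import threading
import queue

def echo_worker(inbox, outbox):
    """A toy 'process': no shared mutable state, only messages in and out."""
    while True:
        msg = inbox.get()
        if msg is None:              # poison pill: shut down cleanly
            break
        outbox.put(("echo", msg))    # reply by message, never by shared state

inbox, outbox = queue.Queue(), queue.Queue()
worker = threading.Thread(target=echo_worker, args=(inbox, outbox))
worker.start()

inbox.put("hello")
reply = outbox.get()
inbox.put(None)
worker.join()
assert reply == ("echo", "hello")
```

Because all interaction is messages, testing a ‘process’ is just feeding it messages and asserting on what comes back, which is exactly the “trivial to simulate” point below.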
Testing is infinitely easier. Trivial to simulate (it’s all messages!). Thank you, immutability!
Let it Crash
The BEAM is insanely reliable; it will last till the heat death of the universe if you leave it alone.
The JVM is not necessarily your friend. Running on the JVM is not necessarily good: do you trust all the other Java code? I don’t. Trust me, I’ve been there.
Mnesia, ETS, gen_servers, etc.
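A toy sketch of the “let it crash” idea, in Python rather than Erlang: the worker does no defensive error handling at all; a supervisor detects the crash and restarts it clean, up to a limit, then escalates. (OTP supervisors implement this for real, with restart strategies, intensity, and period; this is only the core shape of the pattern.)

```python
def supervise(worker, max_restarts=3):
    """Run `worker`; on any crash, restart it up to max_restarts times.

    The worker itself stays simple and crash-happy. The supervisor's
    only job is to notice the failure and restart from a clean state.
    """
    restarts = 0
    while True:
        try:
            return worker()
        except Exception:
            restarts += 1
            if restarts > max_restarts:
                raise  # escalate: let the next layer up handle it

attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("crash")   # fails twice, then succeeds
    return "ok"

assert supervise(flaky) == "ok"
assert attempts["n"] == 3
```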
The bigger they are, the harder they fall
Just connect to a remote node and trace to figure out what is going on
Why wait? Just log on to a node
Soft real-time. Brief discussion of instrumentation and ‘reductions’
I/O (and message passing, basically the same thing) is wicked fast. Not just IPC, but network, web (Cowboy), WebSockets, etc.
The buddha-nature of Erlang
This is pretty much what we’re talking about, right? Systems: Development/Production and Internal/External.
It’s not just us
Let’s talk about systems
Loose coupling, of course, gives us all these benefits.
Loosely coupled systems can operate concurrently. Well, D-UH.
Errors can be contained/constrained.
Keep components/modules/systems ‘loosely coupled’. Connect via specs/APIs/buses. Do this by default, even when you don’t need to!
Builds trust: trust in the stupidity of people, trust that things will fail, trust that you will be affected.
The amount of brainpower we have is limited. Reduce complexity by being able to focus on specific/limited areas.
There are many studies (some not so controversial) showing that the number of bugs per line is roughly constant. Focusing on smaller areas gives you fewer things to tackle.
Isn’t Performance an issue w/ Loose Coupling?
Remember the bit about failure? Well, why optimize if you’re going to fail anyhow? Yeah, yeah, you might fail because you don’t perform, but that is rarely the problem.
Yes, that Minecraft plugin you built might get a million signups. It won’t. Seriously, it doesn’t register statistically.
Dashboards. Otherwise, how do you know what’s going on?
Out-of-band access. Don’t rely on the system to always tell you what’s happening.
Corresponds to how we think, and helps deal with edge cases much, *much* better!
Be polyglot. Everything fails, even Erlang. (Noooo!)
Why polyglot? Because you want to limit your failure modes (increasing diversity can actually reduce systemic risk).
Macro effects matter! Systems span divisions: Finance, Customer Support, Sales, HR, etc.
Helmuth von Moltke
People fall ill
Vendors fail (Amazon)
Fraud: You wonder why your CFO is in Brazil…
Tail Risk (things that can never happen). This deserves its own section. (Financial crisis.)
Ask yourself this. Over and over again…
Yeah, yeah. Understandable lies. But the bottlenecks are pretty far down the road (and much further than you would have gotten before!)
How fast are you? How quickly can you come back up? Can you store enough state to survive?
Is BufferBloat a problem?
Once you are up, can you draw down the queue fast enough? Or at all, for that matter?
Is backpressure going to be a problem?
If the answer is “Yes”, then the talk is over, because it just works.
What if the answer is “No”? (Now we have a story)
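The questions above can be made concrete with a toy queue model. A minimal sketch, with made-up rates, of why bounds and backpressure matter: if arrivals outpace service and the queue is unbounded, the backlog (and your memory) grows without limit; a bound forces you to shed load or push back explicitly.

```python
def simulate(arrival_rate, service_rate, ticks, max_queue=None):
    """Fluid-model queue: returns (final backlog, items dropped).

    With max_queue=None the queue is unbounded, and the backlog grows
    every tick that arrival_rate exceeds service_rate. With a bound,
    the excess is shed here (in a real system it could instead be
    pushed back on the producer as backpressure).
    """
    backlog, dropped = 0, 0
    for _ in range(ticks):
        backlog += arrival_rate
        if max_queue is not None and backlog > max_queue:
            dropped += backlog - max_queue
            backlog = max_queue
        backlog = max(0, backlog - service_rate)
    return backlog, dropped

# Overloaded and unbounded: the backlog just keeps growing.
assert simulate(100, 80, 1000)[0] == (100 - 80) * 1000
# Bounded: the backlog stays capped; the overflow is shed explicitly.
backlog, dropped = simulate(100, 80, 1000, max_queue=500)
assert backlog <= 500 and dropped > 0
```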
Programmable. If you’re lucky, your infrastructure will automagically support ramping.
Fake it. People respond subconsciously to these, and actually wait. You can even get away with dropping the request. (This assumes that you can recover in time.)
This happens inside the airport too! Passengers self-select the best gates to enter (intelligent routing).
The question is: what do you do when you can’t come up in time? A 3-gallon bucket, 5 gallons of water…
Just start dropping when the queue fills up (tail drop). This is pretty bad: global synchronization becomes a problem. Planes don’t take off till they get clearance from the other end.
Slow start, AQM, RED, CoDel, … Why don’t we learn from networks? They certainly don’t learn from us; why do we ignore them?
RED / SRED (RED in a different light: toilet bowl)
The 3rd priority airport always gets the shaft
F(low) RED: RED on a per-flow basis (the entire route map). Kinda the default. (Discard the second request.)
RED with P(referential) D(rop): does RED only for high-bandwidth flows (high-traffic routes). (Throttle spammy clients. Or features.)
W(eighted) RED: different discard probabilities for different flows (transatlantic routes). (Major clients vs. small ones.)
S(tabilized) RED: estimate flows and probabilities.
R(obust) RED: protect against low-rate DoS (with filters) (even unintentional DoS).
A(daptive) RED: modify the drop probability based on the queue.
CHO(ose and) K(eep) or CHO(ose and) K(ill): admit everything below min; drop-tail above max; otherwise, compare the packet to a random queued packet, and if it is from the same flow, drop it with some probability.
Fixed two bugs in RED. Made it feedback-based (self-tuning). (The toilet diagram caused problems.)
Van Jacobson strikes back. Use queue length as the metric (bursts can fill up the queue). Drop probabilistically.
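The classic RED drop decision the slides keep returning to fits in a few lines: keep an exponentially weighted moving average of the queue length (so bursts don’t trigger drops), never drop below a min threshold, always drop above a max threshold, and in between drop with a probability that ramps linearly. A minimal sketch; the thresholds and weight are made-up illustrative values, not tuned recommendations:

```python
import random

class Red:
    """Random Early Detection, stripped to its core idea."""

    def __init__(self, min_th=5, max_th=15, max_p=0.1, weight=0.2):
        self.min_th, self.max_th, self.max_p = min_th, max_th, max_p
        self.weight = weight   # EWMA weight for the average queue length
        self.avg = 0.0

    def should_drop(self, queue_len):
        # Smooth the instantaneous queue length so short bursts pass.
        self.avg += self.weight * (queue_len - self.avg)
        if self.avg < self.min_th:
            return False       # queue healthy: never drop
        if self.avg >= self.max_th:
            return True        # queue saturated: always drop
        # In between: drop probability rises linearly toward max_p.
        p = self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)
        return random.random() < p

red = Red()
assert red.should_drop(0) is False    # empty queue is never dropped
for _ in range(50):
    red.should_drop(100)              # drive the average far past max_th
assert red.should_drop(100) is True   # saturated queue is always dropped
```

The variants above (FRED, WRED, SRED, ARED, CHOKe) all keep this skeleton and change how the flow, the thresholds, or the probability is chosen.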