Erlang - Because s**t Happens by Mahesh Paolini-Subramanya
 


Mahesh talks about the buddha-nature of Erlang/OTP, pointing out how the various features of the language tie together into one seamless Fault Tolerant whole. Mahesh emphasizes that Erlang begins and ends with Fault Tolerance. Fault Tolerance is baked into the very genes of Erlang/OTP - something that ends up being amazingly useful when building any kind of system. Mahesh Paolini-Subramanya is the V.P. of R&D at Ubiquiti Networks - a manufacturer of disruptive technology platforms for emerging markets. He has spent the recent past building out Erlang-based massively concurrent Cloud Services and VoIP platforms. Mahesh was previously the CTO of Vocalocity after its merger with Aptela, where he was a founder and CTO.

  • An overall approach to Preparedness
  • This is a story about unexpectedness. The only constant is change
  • Our story starts on a happy Saturday in February
  • It’s still Friday
  • Just part of one cluster failed, but a threshold had been passed
  • No worries, we’ll just bounce that one cluster, it’ll all be good
  • Total System Meltdown
  • All the calls keep retrying, causing memory utilization to go through the roof
  • Voicemail conversion was going on independent of everything else, causing CPU utilization to spike
  • Eventually, the cache timed out, and tried to reload stuff from the disk.
  • And then everyone tries the Apps, and the Twitters and the facebooks and the everythings.
  • Total System Meltdown
  • What about testing? Didn’t you check loads? Specs? Capabilities?
  • There is only so much planning you can do. At some point, the 1000 year flood hits
  • The point being, Shit will happen. The question is, when Shit happens, can you clean up?
  • There is a formal definition of Fault Tolerance
  • The Six Essential Characteristics of a Fault Tolerant System
  • ‘Distributed’ problems mean you spend a huge chunk of your time dealing with the administrivia of distribution. With Erlang you get that for free! Processes, Messages, Immutability, “Writing Concurrent Programs in Java”
  • Ok, not really true. You still have to deal with ‘deep problems’ (hard core parallelization issues, etc.). But you’d have to deal with that anyhow!
  • Testing is infinitely easier. Trivial to simulate (it’s all messages!). Thank you, immutability!
  • Garbage Collection, Referential Integrity, Testing!!!
  • Let it Crash
  • BEAM: insanely reliable. It will last till the heat death of the universe if you leave it alone
  • JVM is not necessarily your friend. Running on the JVM is not necessarily good: do you trust all the other Java code? I don’t. Trust _me_, I’ve been there
  • Mnesia, ETS, gen_servers, etc.
  • The bigger they are, the harder they fall
  • Just connect to a remote node and trace to figure out what is going on
  • Why wait? Just log on to a node
  • Soft real-time. Brief discussion of instrumentation and ‘reductions’
  • I/O (and message passing, which is basically the same thing) is _wicked_ fast. Not just IPC, but network, web (cowboy), websockets, etc.
  • The Buddha nature of Erlang
  • This is pretty much what we’re talking about, right? Systems – Development/Production and Internal/External
  • It’s not just us
  • Let’s talk about systems
  • Loose Coupling, of course, gives us all these benefits. Loosely coupled systems can operate concurrently (well, D-UH). Errors can be contained/constrained
  • Keep components/modules/systems ‘loosely coupled’. Connect via specs/APIs/buses. Do this by default, even when you don’t need to!
  • Builds trust: trust in the stupidity of people, trust that things will fail, trust that you will be affected
  • The amount of brainpower we have is limited. Reduce complexity by being able to focus on specific / limited areas
  • There are many studies (some not so controversial) that show the number of bugs/line is constant. Focusing on smaller areas gives you fewer things to tackle
  • Isn’t Performance an issue w/ Loose Coupling?
  • Remember the bit about failure? Well, why optimize if you’re going to fail anyhow? Yeah yeah, you might fail because you don’t perform, but that is rarely the problem
  • Yes, that Minecraft plugin you built might get a million signups. It won’t. Seriously – it doesn’t register statistically
  • Dashboards. Otherwise, how do you know what’s going on?
  • Out of band access. Don’t rely on the system to always tell you what’s happening
  • Corresponds to how we think, and helps deal with edge-cases much *much* better!
  • Be Polyglot. Everything fails – even Erlang. (noooo)
  • Why Polyglot? Because you want to limit your failure modes (increasing diversity can actually reduce systemic risk)
  • Macro Effects Matter! Systems span divisions: Finance, Customer Support, Sales, HR, etc.
  • Helmuth von Moltke
  • People fall ill
  • Vendors Fail (Amazon)
  • Fraud: You wonder why your CFO is in Brazil…
  • Tail Risk (things that can never happen). This deserves its own section (financial crisis)
  • Ask yourself this. Over and over again…
  • Yeah, yeah. Understandable lies. But the bottlenecks are pretty far down the road (and much further than you would have gotten before!)
  • Tail Risk. This deserves its own section (financial crisis)
  • How fast are you? How quickly can you come back up? Can you store enough state to survive?
  • Is bufferbloat a problem?
  • Once you are up, can you draw down the queue fast enough? Or at all, for that matter?
  • Is backpressure going to be a problem?
  • If the answer is “Yes”, then the talk is over, because it just works.
  • What if the answer is “No”? (Now we have a story)
  • Programmable. If you’re lucky, your infrastructure will automagically support ramping
  • Fake it. People respond subconsciously to these, and actually wait. You can even get away with dropping the request. (This assumes that you can recover in time)
  • This happens inside the airport too! Passengers self-select the best gates to enter (intelligent routing)
  • The question is, what do you do when you can’t come up in time? 3 gallon bucket, 5 gallons of water…
  • Just start dropping when the queue fills up. This is pretty bad – global synchronization becomes a problem. Planes don’t take off till they get clearance from the other end
  • Slow Start, AQM, RED, CoDel, … Why don’t we learn from networks? They certainly don’t learn from us, so why do we ignore them?
  • RED / SRED (RED in a different light – toilet bowl)
  • The 3rd priority airport always gets the shaft
  • F(low) RED: RED on a per-flow basis (the entire route map). Kinda the default. (Discard second request)
  • RED – P(referential) D(rop): Does RED only for high-bandwidth flows (high-traffic routes). (Throttle spammy clients. Or features.)
  • W(eighted) RED: Different discard probabilities for different flows (transatlantic routes). (Major clients vs small ones)
  • S(tabilized) RED – estimate flows and probabilities
  • R(obust) RED – protect against low-rate DoS (with filters) (even unintentional DoS)
  • A(daptive) RED – modify probability based on queue
  • CHO(ose and) K(eep) or CHO(ose and) K(ill) – open for < min; drop tail for > max; else, compare packet to a random packet. If same flow, drop it w/ prob.
  • Fixed two bugs in RED. Made it feedback based (self-tuning). Toilet diagram caused problems
  • Van Jacobson strikes back. Use queue length as metric (bursts can fill up queue). Drop probabilistically
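The RED notes above (open below a minimum, tail drop above a maximum, drop probabilistically in between) can be sketched roughly as follows. This is a minimal illustration of basic RED applied to a request queue, not code from the talk: the function names, the thresholds `min_th` / `max_th`, the ceiling `max_p`, and the averaging `weight` are all assumed names and values, and the refinement in the original algorithm that spaces drops out by counting packets since the last drop is left out.

```python
import random


def red_drop_probability(avg_queue, min_th, max_th, max_p):
    """Drop probability for an arriving request under basic RED.

    All parameter names are illustrative assumptions:
    avg_queue is the averaged queue length, min_th/max_th the thresholds,
    max_p the drop probability reached just below max_th.
    """
    if min_th >= max_th:
        raise ValueError("min_th must be below max_th")
    if avg_queue < min_th:
        return 0.0  # short queue: always admit
    if avg_queue >= max_th:
        return 1.0  # long queue: behaves like tail drop
    # In between, the probability rises linearly from 0 to max_p.
    return max_p * (avg_queue - min_th) / (max_th - min_th)


def update_average(avg_queue, current_len, weight):
    """Exponentially weighted moving average of the queue length.

    Averaging lets short bursts through without triggering drops.
    """
    return (1.0 - weight) * avg_queue + weight * current_len


def admit(avg_queue, min_th, max_th, max_p, rng=random.random):
    """Return True if the arriving request should be queued."""
    return rng() >= red_drop_probability(avg_queue, min_th, max_th, max_p)
```

Passing `rng` in makes the decision easy to test with a fixed value; in a real system the thresholds would be tuned against how fast the queue can actually be drained.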

Presentation Transcript

  • Erlang: Because S**t Happens. Mahesh Paolini-Subramanya (@dieswaytoofast), V.P., Ubiquiti Networks
  • AGILITY
  • My Vacation
  • (Actually, the day before)
  • A small failure…
  • The Horror! The Horror!
  • Why are my calls failing?
  • You better call me back!
  • I’m still p***ed off!
  • And your stupid Apps don’t work!
  • The Horror! The Horror!
  • Surely you Tested?
  • 1000 year floods
  • Fault Tolerance
  • The Big Six (from http://www.erlang.org/download/armstrong_thesis_2003.pdf): Concurrency, Fault detection, Fault identification, Error encapsulation, Code upgrade, Stable storage
  • erlang…
  • Concurrency Oriented: Concurrency Hell vs. My Blue Heaven; Deep Problems
  • Fault Detection
  • Stack Traces?
  • Immutable Variables: X = 1.  X = 2.  X = X + 1.  Huh?
  • Fault Identification
  • Let It Crash
  • BEAM!
  • JVM is not necessarily your friend! Faster to create
  • Code Upgrade: Live! Hot Swapping
  • The Intangibles
  • 4x – 10x less code
  • Code Size
  • 4x – 10x less code: Faster to create; Easier to reason about; Fewer bugs; Speedy refactoring
  • The Shell is our friend
  • Live Debugging
  • Predictability
  • Performance
  • Fault Tolerance - Systems
  • Romney 2012
  • Fault Tolerance - Systems
  • The Big Six - Systems: Concurrency, Error encapsulation, Fault detection, Fault identification, Code upgrade, Stable Storage
  • The Big Six - Systems: LOOSE COUPLING
  • Loose Coupling?
  • Loose Coupling: Breeds Trust; Devote more brainpower to specific areas; No. of bugs/line is constant
  • Performance
  • Fault Tolerance: 60 - 90% of all SW projects fail; 10 – 25% of all SW projects get abandoned
  • The Big Six - Systems: MONITORING
  • Monitoring?
  • Monitoring?: Dashboards; Out of band systems
  • Supervision
  • Monitoring?: Dashboards; Out of band systems; Polyglot safety
  • The Big Six - Systems: POLYGLOT PERSISTENCE
  • The Big Six - Systems: EVERYWHERE!!!
  • No battle plan survives contact with the enemy
  • Fault Tolerance: Not just about Systems
  • Fault Tolerance: People; Vendors; Fraud
  • The Business: Beware the Black Swan
  • Is It Safe?
  • erlang…
  • Questions: mahesh@dieswaytoofast.com, @dieswaytoofast
  • Coda: Active Queue Management
  • Queues
  • Queues: Can you recover quickly? Buffer-bloat doesn’t matter, right? Once up, can you deal with the backlog? Back-pressure isn’t an issue, right? NOPE
  • Programmable
  • Behavioral
  • Self Managed
  • Something’s gotta give
  • Tail Drop
  • God (category – TCP/IP)
  • RED
  • Newark Airport
  • FRED
  • RED-PD
  • WRED
  • RED – Many many more: SRED, RRED, ARED (and Blue!), CHOKe
  • Special Mention: RED in a different Light
  • SERIOUSLY! RED in a different Light; CoDel and fq_codel