Erlang - Because s**t Happens by Mahesh Paolini-Subramanya
Mahesh talks about the buddha-nature of Erlang/OTP, pointing out how the various features of the language tie together into one seamless Fault Tolerant whole. Mahesh emphasizes that Erlang begins and ends with Fault Tolerance. Fault Tolerance is baked into the very genes of Erlang/OTP - something that ends up being amazingly useful when building any kind of system. Mahesh Paolini-Subramanya is the V.P. of R&D at Ubiquiti Networks - a manufacturer of disruptive technology platforms for emerging markets. He has spent the recent past building out Erlang-based massively concurrent Cloud Services and VoIP platforms. Mahesh was previously the CTO of Vocalocity after its merger with Aptela, where he was a founder and CTO.

Notes
  • An overall approach to Preparedness
  • This is a story about unexpectedness. The only constant is change.
  • Our story starts on a happy Saturday in February
  • It’s still Friday
  • Just part of one cluster failed, but a threshold had been passed
  • No worries, we’ll just bounce that one cluster, it’ll all be good
  • Total System Meltdown
  • All the calls keep retrying, causing memory utilization to go through the roof
  • Voicemail conversion was going on independently of everything else, causing CPU utilization to spike
  • Eventually, the cache timed out and tried to reload stuff from the disk.
  • And then everyone tries the Apps, and the Twitters and the Facebooks and the everythings.
  • Total System Meltdown
  • What about testing? Didn’t you check loads? Specs? Capabilities?
  • There is only so much planning you can do. At some point, the 1000 year flood hits
  • The point being, Shit will happen. The question is, when Shit happens, can you clean up?
  • There is a formal definition of Fault Tolerance
  • The Six Essential Characteristics of a Fault Tolerant System
  • ‘Distributed’ problems mean you spend a huge chunk of your time dealing with the administrivia of distribution. With Erlang you get that for free! Processes, Messages, Immutability, “Writing Concurrent Programs in Java”
  • OK, not really true. You still have to deal with ‘deep problems’ (hard-core parallelization issues, etc.). But you’d have to deal with that anyhow!
  • Testing is infinitely easier. Trivial to simulate (it’s all messages!). Thank you, immutability!
  • Garbage Collection, Referential Integrity, Testing!!!
  • Let it Crash
  • BEAM: insanely reliable. It will last till the heat death of the universe if you leave it alone
  • JVM is not necessarily your friend. Running on the JVM is not necessarily good. Do you trust all the other Java code? I don’t. Trust _me_, I’ve been there.
  • Let it Crash
  • Mnesia, ETS, gen_servers, etc.
  • Testing is infinitely easier. Trivial to simulate (it’s all messages!). Thank you, immutability!
  • The bigger they are, the harder they fall
  • Just connect to a remote node and trace to figure out what is going on
  • Why wait? Just log on to a node
  • Soft real-time. Brief discussion of instrumentation and ‘reductions’
  • I/O (and message passing, which is basically the same thing) is _wicked_ fast. Not just IPC, but network, web (cowboy), websockets, etc.
  • The Buddha nature of Erlang
  • This is pretty much what we’re talking about, right? Systems – Development/Production and Internal/External
  • It’s not just us
  • Let’s talk about systems
  • Loose Coupling, of course, gives us all these benefits. Loosely coupled systems can operate concurrently. Well, D-UH. Errors can be contained/constrained.
  • Keep components/modules/systems ‘loosely coupled’. Connect via specs/APIs/buses. Do this by default, even when you don’t need to!
  • Builds trust: trust in the stupidity of people, trust that things will fail, trust that you will be affected
  • The amount of brainpower we have is limited. Reduce complexity by being able to focus on specific / limited areas
  • There are many studies (some not so controversial) that show the number of bugs/line is constant. Focusing on smaller areas gives you fewer things to tackle
  • Isn’t Performance an issue w/ Loose Coupling?
  • Remember the bit about failure? Well, why optimize if you’re going to fail anyhow? Yeah, yeah, you might fail because you don’t perform, but that is rarely the problem
  • Yes, that Minecraft plugin you built might get a million signups. It won’t. Seriously – it doesn’t register statistically
  • Dashboards: otherwise, how do you know what’s going on?
  • Out of band access: don’t rely on the system to always tell you what’s happening
  • Corresponds to how we think, and helps deal with edge-cases much *much* better!
  • Be Polyglot. Everything fails – even Erlang. (noooo)
  • Why Polyglot? Because you want to limit your failure modes (increasing diversity can actually reduce systemic risk)
  • Macro Effects Matter! Systems span divisions: Finance, Customer Support, Sales, HR, etc.
  • Helmuth von Moltke
  • People fall ill
  • Vendors fail (Amazon)
  • Fraud: You wonder why your CFO is in Brazil…
  • Tail Risk (things that can never happen). This deserves its own section (financial crisis)
  • Ask yourself this. Over and over again…
  • Yeah, yeah. Understandable lies. But the bottlenecks are pretty far down the road (and much further than you would have gotten before!)
  • Tail Risk. This deserves its own section (financial crisis)
  • How fast are you? How quickly can you come back up? Can you store enough state to survive?
  • Is BufferBloat a problem?
  • Once you are up, can you draw down the queue fast enough? Or at all, for that matter?
  • Is backpressure going to be a problem?
  • If the answer is “Yes”, then the talk is over, because it just works.
  • What if the answer is “No”? (Now we have a story)
  • Programmable: if you’re lucky, your infrastructure will automagically support ramping
  • Fake it. People respond subconsciously to these, and actually wait. You can even get away with dropping the request. (This assumes that you can recover in time.)
  • This happens inside the airport too! Passengers self-select the best gates to enter (intelligent routing)
  • The question is, what do you do when you can’t come up in time? 3 gallon bucket, 5 gallons of water…
  • Just start dropping when the queue fills up. This is pretty bad – global synchronization becomes a problem. Planes don’t take off till they get clearance from the other end
  • Slow Start, AQM, RED, CoDel, … Why don’t we learn from networks? They certainly don’t learn from us, so why do we ignore them?
  • RED / SRED (RED in a different light – toilet bowl)
  • The 3rd priority airport always gets the shaft
  • F(low) RED: RED on a per-flow basis (the entire route map). Kinda the default. (Discard second request.)
  • RED – P(referential) D(rop): does RED only for high-BW flows (high traffic routes). (Throttle spammy clients. Or features.)
  • W(eighted) RED: different discard probabilities for different flows (transatlantic routes). (Major clients vs small ones.)
  • S(tabilized) RED – estimate flows and probabilities. R(obust) RED – protect against low-rate DoS (with filters) (even unintentional DoS). A(daptive) RED – modify prob based on queue. CHO(ose and) K(eep) or CHO(ose and) K(ill) – open for < min; drop tail for > max; else, compare packet to random packet. If same flow, drop it w/ prob.
  • Fixed two bugs in RED. Made it feedback based (self-tuning). Toilet diagram caused problems
  • Van Jacobson strikes back. Use queue length as metric (bursts can fill up queue). Drop probabilistically
  • Yeah, yeah. Understandable lies. But the bottlenecks are pretty far down the road (and much further than you would have gotten before!)
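
The RED family in the notes above boils down to one decision per arriving request: admit it or drop it, based on how full the queue has been lately. What follows is a minimal Erlang sketch of the textbook RED decision, simplified (real RED also spaces drops out by counting arrivals since the last drop). The module name `red_sketch`, the thresholds, the maximum drop probability and the averaging weight are all hypothetical values picked for illustration; none of them come from the talk. The random draw is passed in as an argument so the behaviour is repeatable.

```erlang
-module(red_sketch).
-export([new/0, update_avg/3, drop_probability/2, admit/3]).

%% Hypothetical parameters, chosen only for illustration.
-record(red, {min_th = 5.0,    % below this average length, always admit
              max_th = 15.0,   % at or above this average length, always drop
              max_p  = 0.1,    % drop probability just below max_th
              weight = 0.2}).  % weight of the newest sample in the average

new() -> #red{}.

%% Exponentially weighted moving average of the queue length, so that a
%% short burst does not immediately trigger drops.
update_avg(#red{weight = W}, Avg, QueueLen) ->
    (1 - W) * Avg + W * QueueLen.

%% Drop probability rises linearly from 0 at min_th to max_p at max_th.
drop_probability(#red{min_th = Min}, Avg) when Avg < Min -> 0.0;
drop_probability(#red{max_th = Max}, Avg) when Avg >= Max -> 1.0;
drop_probability(#red{min_th = Min, max_th = Max, max_p = MaxP}, Avg) ->
    MaxP * (Avg - Min) / (Max - Min).

%% Rand is a number in [0, 1), for example the result of rand:uniform().
admit(Red, Avg, Rand) ->
    case Rand < drop_probability(Red, Avg) of
        true  -> drop;
        false -> enqueue
    end.
```

With these made-up numbers, an average length of 10 gives a drop probability of about 0.05: most requests still get in, but a few clients are told to back off before the queue is actually full. Tail drop is the degenerate case where nothing is dropped until everything is.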

    1. Erlang: Because S**t happens. Mahesh Paolini-Subramanya (@dieswaytoofast), V.P., Ubiquiti Networks
    2. AGILITY
    3. My Vacation
    4. (Actually, the day before)
    5. A small failure…
    6. The Horror! The Horror!
    7. Why are my calls failing?
    8. You better call me back!
    9. I’m still p***ed off!
    10. And your stupid Apps don’t work!
    11. The Horror! The Horror!
    12. Surely you Tested?
    13. 1000 year floods
    14. Fault Tolerance
    15. The Big Six (from http://www.erlang.org/download/armstrong_thesis_2003.pdf) • Concurrency
    16. The Big Six (from http://www.erlang.org/download/armstrong_thesis_2003.pdf) • Concurrency • Fault detection
    17. The Big Six (from http://www.erlang.org/download/armstrong_thesis_2003.pdf) • Concurrency • Fault detection • Fault identification
    18. The Big Six (from http://www.erlang.org/download/armstrong_thesis_2003.pdf) • Concurrency • Fault detection • Fault Identification • Error Encapsulation
    19. The Big Six (from http://www.erlang.org/download/armstrong_thesis_2003.pdf) • Concurrency • Fault detection • Fault Identification • Error Encapsulation • Code upgrade
    20. The Big Six (from http://www.erlang.org/download/armstrong_thesis_2003.pdf) • Concurrency • Fault detection • Fault Identification • Error Encapsulation • Code upgrade • Stable Storage
    21. erlang…
    22. The Big Six (from http://www.erlang.org/download/armstrong_thesis_2003.pdf) • Concurrency • Fault detection • Fault Identification • Error Encapsulation • Code upgrade • Stable Storage
    23. Concurrency Oriented / Concurrency Hell / My Blue Heaven / My Blue Heaven
    24. Concurrency Oriented / Concurrency Hell / My Blue Heaven / Deep Problems / My Blue Heaven / Deep Problems
    25. The Big Six (from http://www.erlang.org/download/armstrong_thesis_2003.pdf) • Concurrency • Fault detection • Fault Identification • Error Encapsulation • Code upgrade • Stable Storage
    26. Fault Detection
    27. The Big Six (from http://www.erlang.org/download/armstrong_thesis_2003.pdf) • Concurrency • Fault detection • Fault Identification • Error Encapsulation • Code upgrade • Stable Storage
    28. Stack Traces?
    29. Immutable Variables • X = 1.
    30. Immutable Variables • X = 1. • X = 2. Huh?
    31. Immutable Variables • X = 1. • X = 2. • X = X + 1. Huh?
    32. Fault Identification
    33. The Big Six (from http://www.erlang.org/download/armstrong_thesis_2003.pdf) • Concurrency • Fault detection • Fault Identification • Error Encapsulation • Code upgrade • Stable Storage
    34. Let It Crash
    35. BEAM!
    36. JVM is not necessarily your friend! • Faster to create
    37. The Big Six (from http://www.erlang.org/download/armstrong_thesis_2003.pdf) • Concurrency • Fault detection • Fault Identification • Error Encapsulation • Code upgrade • Stable Storage
    38. Code Upgrade
    39. Code Upgrade • Live! Hot Swapping
    40. The Big Six (from http://www.erlang.org/download/armstrong_thesis_2003.pdf) • Concurrency • Fault detection • Fault Identification • Error Encapsulation • Code upgrade • Stable Storage
    41. The Intangibles
    42. 4x – 10x less code
    43. Code Size
    44. 4x – 10x less code • Faster to create
    45. 4x – 10x less code • Faster to create • Easier to reason about
    46. 4x – 10x less code • Faster to create • Easier to reason about • Fewer bugs
    47. 4x – 10x less code • Faster to create • Easier to reason about • Fewer bugs • Speedy refactoring
    48. The Shell is our friend
    49. Live Debugging
    50. Predictability
    51. Performance
    52. Fault Tolerance - Systems
    53. Romney 2012
    54. Fault Tolerance - Systems
    55. The Big Six - Systems • Concurrency • Error encapsulation • Fault detection • Fault identification • Code upgrade • Stable Storage
    56. The Big Six - Systems • Concurrency • Error encapsulation • Fault detection • Fault identification • Code upgrade • Stable Storage
    57. The Big Six - Systems: LOOSE COUPLING • Concurrency • Error encapsulation • Fault detection • Fault identification • Code upgrade • Stable Storage
    58. Loose Coupling?
    59. Loose Coupling • Breeds Trust
    60. Loose Coupling
    61. Loose Coupling • Breeds Trust • Devote more brainpower to specific areas
    62. Loose Coupling
    63. Loose Coupling • Breeds Trust • Devote more brainpower to specific areas • No. of bugs/line is constant
    64. Performance
    65. Fault Tolerance • 60 – 90% of all SW projects fail • 10 – 25% of all SW projects get abandoned
    66. The Big Six - Systems: MONITORING • Concurrency • Error encapsulation • Fault detection • Fault identification • Code upgrade • Stable Storage
    67. Monitoring?
    68. Monitoring? • Dashboards
    69. Monitoring? • Dashboards • Out of band systems
    70. Supervision
    71. Monitoring? • Dashboards • Out of band systems • Polyglot safety
    72. The Big Six - Systems • Concurrency • Error encapsulation • Fault detection • Fault identification • Code upgrade • Stable Storage
    73. The Big Six - Systems: POLYGLOT PERSISTENCE • Concurrency • Error encapsulation • Fault detection • Fault identification • Code upgrade • Stable Storage
    74. The Big Six - Systems: EVERYWHERE!!! • Concurrency • Error encapsulation • Fault detection • Fault identification • Code upgrade • Stable Storage
    75. No battle plan survives contact with the enemy
    76. Fault Tolerance • Not just about Systems
    77. Fault Tolerance
    78. Fault Tolerance • People • Vendors
    79. Fault Tolerance • People • Vendors • Fraud
    80. The Business. Beware the Black Swan
    81. Is It Safe?
    82. erlang…
    83. Questions: mahesh@dieswaytoofast.com, @dieswaytoofast
    84. Coda: Active Queue Management
    85. Queues
    86. Queues
    87. Queues
    88. Queues
    89. Queues • Can you recover quickly? • Buffer-bloat doesn’t matter, right? • Once up, can you deal with the backlog? • Back-pressure isn’t an issue, right?
    90. Queues: NOPE • Can you recover quickly? • Buffer-bloat doesn’t matter, right? • Once up, can you deal with the backlog? • Back-pressure isn’t an issue, right?
    91. Programmable
    92. Behavioral
    93. Self Managed
    94. Something’s gotta give
    95. Tail Drop
    96. God (category – TCP/IP)
    97. RED
    98. RED
    99. Newark Airport
    100. FRED
    101. RED-PD
    102. WRED
    103. RED – Many many more • SRED • RRED • ARED (and Blue!) • CHOKe
    104. Special Mention • RED in a different Light
    105. SERIOUSLY! • RED in a different Light • CoDel and fq_codel
    106. Questions: mahesh@dieswaytoofast.com, @dieswaytoofast
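
Slides 29 to 31 (Immutable Variables) and slide 34 (Let It Crash) can be tried out with a few lines of Erlang. This is a minimal sketch, not code from the talk: the module name `crash_sketch` and both function names are made up for illustration. `rebind/1` shows that `=` on an already-bound variable is a pattern match, not an assignment, and `watch/1` uses the standard `spawn_monitor/1` BIF so that one process learns why another one died.

```erlang
-module(crash_sketch).
-export([rebind/1, watch/1]).

%% X is bound once. A second `X = New` is a match, not an assignment:
%% it succeeds only if New already equals X, and otherwise the process
%% fails with a badmatch error. This is the "Huh?" on slide 30.
rebind(New) ->
    X = 1,
    X = New.

%% Run Fun in a separate, monitored process and return the reason it
%% exited with. The caller shares no state with the worker, so a crash
%% in Fun cannot corrupt the caller; it simply arrives as a message.
watch(Fun) ->
    {Pid, Ref} = spawn_monitor(Fun),
    receive
        {'DOWN', Ref, process, Pid, Reason} -> Reason
    end.
```

`crash_sketch:rebind(1)` returns 1, while `crash_sketch:watch(fun() -> crash_sketch:rebind(2) end)` returns a reason of the form `{{badmatch, 2}, Stacktrace}`: the watcher detects the fault, the reason identifies it, and the error stays encapsulated in the process that crashed. OTP supervisors apply the same idea (using links rather than monitors) and add restart policies on top.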
