Systems that neverstop (and Erlang)      Joe Armstrong
How can we get10 nines reliability?
SIX LAWS
ONEISOLATION
ISOLATION   10 nines = 99.99999999% availability   P(fail) = 10-10   If P(fail | one computer) = 10-3 then       P(fail...
TWOCONCURRENCY
Concurrency   World is concurrent   Need at least TWO computers to make a non-stop    sytem   TWO computer is concurren...
“My first message is that             concurrency   is best regarded as a program        structuring principle”Structured c...
THREE     MUSTDETECT FAILURES
Failure detection   If you can’t detect a failure you can’t fix it   Must work across machine boundaries    the entire ma...
FOUR    FAULTIDENTIFICATION
Failure Identification   Fault detection is not enough - you must no why    the failure occurred   Implies that you have ...
FIVE  LIVE CODEUPGRADE
Live code upgrade   Must upgrade software while it is running   Want zero down time
SIX STABLESTORAGE
Stable storage   Must store stuff forever   No backup necessary - storage just works   Implies multiple copies, distrib...
HISTORY  Those who cannot learn from history aredoomed to repeat it.                        George Santayana
GRAYAs with hardware, the key to software fault-tolerance is tohierarchically decompose large systems into modules, each m...
SCHNEIDERHalt on failure in the event of an error a processorshould halt instead of performing a possibly erroneousoperati...
GRAY   Fault containment through fail-fast software modules.   Process-pairs to tolerant hardware and transient software...
KAYFolks --Just a gentle reminder that I took some pains at the last OOPSLA totry to remind everyone that Smalltalk is not...
GRAYSoftware modularity through processesand messages. As with hardware, the keyto software fault-tolerance is tohierarchi...
Fail FastThe process approach to fault isolation advocates that the processsoftware be fail-fast, it should either functio...
Fail EarlyA fault in a software system can cause one or moreerrors. The latency time which is the interval betweenthe exis...
ARMSTRONG   Processes are the units of error encapsulation. Errors    occurring in a process will not affect other proces...
COMMERCIAL  BREAK
Joe’s 2’nd theorem   Whatever Joe starts talking about, He will end up    talking about Erlang
Erlang was  designed to programfault-tolerant   systems
Concurrent   programming             Functional                          programmingConcurrency Orientedprogramming       ...
Erlang   Very light-weight processes   Very fast message passing   Total separation between processes   Automatic mars...
Properties   No sharing   Hot code replacement   Pure message passing   No locks   Lots of computers (= fault toleran...
What is COP?                    Machine                    Process                    Message    ➡      Large numbers of p...
Thread Safety Erlang programs are automatically thread safe if they dont use an external resource.
Functional        If you call the   same function twice with     the same argumentsit should return the same value        ...
No Mutable State   Mutable state needs locks   No mutable state = no locks = programmers bliss
Multicore ready
The rise of the cores      2 cores wont hurt you      4 cores will hurt a little      8 cores will hurt a bit      16 ...
LAWS
ISOLATION     CONCURRENCYPid = spawn(.....)Pid = spawn(Node, ....)Pid ! Message         receive                          P...
FAULTIDENTIFICATION  link(Pid),       receive            {Pid, ‘EXIT’, Why} ->                  ...       end
LIVE CODE            UPGRADE   Can upgrade code while its running   Existing processes continue to use original code, ne...
STABLE STORAGE   Performed in libraries    mnesia:transaction(        fun() ->              Val = mnesia:read(Key),      ...
Projects   CouchDB   Amazon SimpleDB   Mochiweb (facebook chat)   Scalaris   Nitrogren   Ejabberd (xmpp)   Rabbit M...
Companies   Ericsson   Amazon   Tail-f   Kreditor   Synapse   ...
Books
THE END
Joe armstrong erlanga_languageforprogrammingreliablesystems
Upcoming SlideShare
Loading in …5
×

Joe armstrong erlanga_languageforprogrammingreliablesystems

852 views

Published on

10 nines reliability

Published in: Technology, News & Politics
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
852
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
14
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Joe armstrong erlanga_languageforprogrammingreliablesystems

  1. 1. Systems that neverstop (and Erlang) Joe Armstrong
  2. 2. How can we get10 nines reliability?
  3. 3. SIX LAWS
  4. 4. ONEISOLATION
  5. 5. ISOLATION 10 nines = 99.99999999% availability P(fail) = 10-10 If P(fail | one computer) = 10-3 then P(fail | four computers) = 10-12 Fixed
  6. 6. TWOCONCURRENCY
  7. 7. Concurrency World is concurrent Need at least TWO computers to make a non-stop sytem TWO computer is concurrent and distributed
  8. 8. “My first message is that concurrency is best regarded as a program structuring principle”Structured concurrent programming – Tony Hoare Redmond, July 2001
  9. 9. THREE MUSTDETECT FAILURES
  10. 10. Failure detection If you can’t detect a failure you can’t fix it Must work across machine boundaries the entire machine might fail Implies distributed error handling, no shared state, asynchronous messaging
  11. 11. FOUR FAULTIDENTIFICATION
  12. 12. Failure Identification Fault detection is not enough - you must no why the failure occurred Implies that you have sufficient information for post hock debugging
  13. 13. FIVE LIVE CODEUPGRADE
  14. 14. Live code upgrade Must upgrade software while it is running Want zero down time
  15. 15. SIX STABLESTORAGE
  16. 16. Stable storage Must store stuff forever No backup necessary - storage just works Implies multiple copies, distribution, ... Must keep crash reports
  17. 17. HISTORY Those who cannot learn from history aredoomed to repeat it. George Santayana
  18. 18. GRAYAs with hardware, the key to software fault-tolerance is tohierarchically decompose large systems into modules, each module beinga unit of service and a unit of failure. A failure of a module doesnot propagate beyond the module.... The process achieves fault containment by sharing no state withother processes; its only contact with other processes is via messagescarried by a kernel message system- Jim Gray- Why do computers stop and what can be done about it- Technical Report, 85.7 - Tandem Computers,1985
  19. 19. SCHNEIDERHalt on failure in the event of an error a processorshould halt instead of performing a possibly erroneousoperation.Failure status property when a processor fails,other processors in the system must be informed. Thereason for failure must be communicated.Stable Storage Property The storage of a processorshould be partitioned into stable storage (whichsurvives a processor crash) and volatile storage whichis lost if a processor crashes. Schneider ACM Computing Surveys 22(4):229-319, 1990
  20. 20. GRAY Fault containment through fail-fast software modules. Process-pairs to tolerant hardware and transient software faults. Transaction mechanisms to provide data and message integrity. Transaction mechanisms combined with process-pairs to ease exception handling and tolerate software fault Software modularity through processes and messages.
  21. 21. KAYFolks --Just a gentle reminder that I took some pains at the last OOPSLA totry to remind everyone that Smalltalk is not only NOT its syntax orthe class library, it is not even about classes. Im sorry that I long agocoined the term "objects" for this topic because it gets many people tofocus on the lesser idea.The big idea is "messaging" -- that is what the kernal of Smalltalk/Squeak is all about (and its something that was never quite completedin our Xerox PARC phase)....http://lists.squeakfoundation.org/pipermail/squeak-dev/1998-October/017019.html
  22. 22. GRAYSoftware modularity through processesand messages. As with hardware, the keyto software fault-tolerance is tohierarchically decompose large systemsinto modules, each module being a unit ofservice and a unit of failure. A failure of amodule does not propagate beyond themodule.
  23. 23. Fail FastThe process approach to fault isolation advocates that the processsoftware be fail-fast, it should either function correctly or itshould detect the fault, signal failure and stop operating. Processes are made fail-fast by defensive programming. They checkall their inputs, intermediate results and data structures as a matterof course. If any error is detected, they signal a failure and stop. Inthe terminology of [Cristian], fail-fast software has small faultdetection latency.GrayWhy ...
  24. 24. Fail EarlyA fault in a software system can cause one or moreerrors. The latency time which is the interval betweenthe existence of the fault and the occurrence of theerror can be very high, which complicates thebackwards analysis of an error ...For an effective error handling we must detect errors andfailures as early as possible Renzel - Error Handling for Business Information Systems, Software Design and Management, GmbH & Co. KG, München, 2003
  25. 25. ARMSTRONG Processes are the units of error encapsulation. Errors occurring in a process will not affect other processes in the system. We call this property strong isolation. Processes do what they are supposed to do or fail as soon as possible. Failure and the reason for failure can be detected by remote processes. Processes share no state, but communicate by message passing. Armstrong Making reliable systems in the presence of software errors PhD Thesis, KTH, 2003
  26. 26. COMMERCIAL BREAK
  27. 27. Joe’s 2’nd theorem Whatever Joe starts talking about, He will end up talking about Erlang
  28. 28. Erlang was designed to programfault-tolerant systems
  29. 29. Concurrent programming Functional programmingConcurrency Orientedprogramming Erlang Fault Multicore tolerance
  30. 30. Erlang Very light-weight processes Very fast message passing Total separation between processes Automatic marshalling/demarshalling Fast sequential code Strict functional code Dynamic typing Transparent distribution Compose sequential AND concurrent code
  31. 31. Properties No sharing Hot code replacement Pure message passing No locks Lots of computers (= fault tolerant scalable ...) Functional programming (no side effects)
  32. 32. What is COP? Machine Process Message ➡ Large numbers of processes ➡ Complete isolation between processes ➡ Location transparency ➡ No Sharing of data ➡ Pure message passing systems
  33. 33. Thread Safety Erlang programs are automatically thread safe if they dont use an external resource.
  34. 34. Functional If you call the same function twice with the same argumentsit should return the same value “jolly good” Joe Armstrong
  35. 35. No Mutable State Mutable state needs locks No mutable state = no locks = programmers bliss
  36. 36. Multicore ready
  37. 37. The rise of the cores  2 cores wont hurt you  4 cores will hurt a little  8 cores will hurt a bit  16 will start hurting  32 cores will hurt a lot (2009)  ...  1 M cores ouch (2019)  (complete paradigm shift)  1997 1 Tflop = 850 KW  2007 1 Tflop = 24 W (factor 35,000)  2017 1 Tflop = ?
  38. 38. LAWS
  39. 39. ISOLATION CONCURRENCYPid = spawn(.....)Pid = spawn(Node, ....)Pid ! Message receive Pattern1 -> Actions1; Pattern2 -> Actions2; ... end
  40. 40. FAULTIDENTIFICATION link(Pid), receive {Pid, ‘EXIT’, Why} -> ... end
  41. 41. LIVE CODE UPGRADE Can upgrade code while its running Existing processes continue to use original code, new processes run new code - no mixups of namespaces Sophisticated roll-forward, roll-back, roll-back-on-error functions in OTP libraries Properly designed systems can be rolled-forward and back with no loss of service. Not easy, but possible
  42. 42. STABLE STORAGE Performed in libraries mnesia:transaction( fun() -> Val = mnesia:read(Key), mnesia:write({Key,Val}), ... end)
  43. 43. Projects CouchDB Amazon SimpleDB Mochiweb (facebook chat) Scalaris Nitrogren Ejabberd (xmpp) Rabbit MQ (amqp) ....
  44. 44. Companies Ericsson Amazon Tail-f Kreditor Synapse ...
  45. 45. Books
  46. 46. THE END

×