Fault tolerance 101
Joe Armstrong
Monday, March 3, 2014
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1kS4Bkv.

Joe Armstrong describes the foundations of fault tolerant computation and the basic properties a system should have in order to be able to function in an adequate manner despite the occurrence of hardware and software errors, summarizing the key features of Erlang and showing how they can be used for programming fault-tolerant and scalable systems on multi-core clusters. Filmed at qconlondon.com.

Joe Armstrong is the principal inventor of the Erlang programming language and coined the term "Concurrency Oriented Programming". He has worked for Ericsson where he developed Erlang and was the chief software architect of the project which produced the Erlang OTP system. He is the author of several books, the latest being "Programming Erlang: Software for a Concurrent World", 2nd edition.


Fault Tolerance 101

  1. Fault tolerance 101. Joe Armstrong. Monday, March 3, 2014
  2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/fault-tolerance-101-qcon-london
  3. Presented at QCon London, www.qconlondon.com. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation. Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators. Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  4. Fault • “behaves as per specification” • “does not crash”
  5. Many systems have no specification
  6. Programming is the act of turning an inexact description of something (the specification) into an exact description of the thing (the program)
  7. A program is the most precise description of the problem that we have
  8. What is fault tolerance? • The ability to behave in a sensible manner in the presence of failure. Consumer software, websites, ... • The ability to behave exactly as specified despite failures. Air traffic control, nuclear power station control. Exact specification is extremely difficult. “In a sensible manner” is rather wooly. When there is no spec, “in a sensible manner” means “does not crash”.
  9. • History • Hardware Fault Tolerance • Software Fault Tolerance • Specifications and code • Erlang FT • Demo
  10. We cannot prevent failures
  11. Automata Studies, ed. C. Shannon, Princeton Univ. Press, 1956
  12. Q: Can we make reliable systems that behave reasonably from unreliable components? A: Yes
  13. The Cornerstones of FT • Detect Errors • Correct Errors • Stop Errors from Propagating
  14. Needs > 1 computer. Computer 1 does the job. Computer 2 watches computer 1. Computer 3 watches computer 1. Computer ... watches computer 1. Error detection must work across machine boundaries. Must write distributed programs. Programs run in parallel. Decoupling and separation help stop errors from propagating.
  15. Things to ponder • Hardware can fail • Software either complies with a spec (= works) or does not do what the spec says (= fails) • What should the software do when the system behaves in a way that is not described in the spec? • What do we do when we don’t have a spec? • Can we make reliable systems that behave reasonably from unreliable components? • Detecting or masking errors? • Correcting errors • Propagation of errors • Error firewalls • Self-repairing zones • Static/Dynamic error detection
  16. Hardware fault tolerance • Systems that use redundancy to mask (hide) errors. Examples: RAID disks, error-correcting bits in memory hardware, etc.
  17. Tandem NonStop II (1981)
  18. Tandem ... Tandem Computers, Inc. was the dominant manufacturer of fault-tolerant computer systems for ATM networks, banks, stock exchanges, telephone switching centers, and other similar commercial transaction processing applications requiring maximum uptime and zero data loss. To contain the scope of failures and of corrupted data, these multi-computer systems have no shared central components, not even main memory. Conventional multi-computer systems all use shared memories and work directly on shared data objects. Instead, NonStop processors cooperate by exchanging messages across a reliable fabric, and software takes periodic snapshots for possible rollback of program memory state. Besides handling failures well, this "shared-nothing" messaging system design also scales extremely well to the largest commercial workloads. Each doubling of the total number of processors would double system throughput, up to the maximum configuration of 4000 processors. In contrast, the performance of conventional multiprocessor systems is limited by the speed of some shared memory, bus, or switch. Adding more than 4–8 processors that way gives no further system speedup. NonStop systems have more often been bought to meet scaling requirements than for extreme fault tolerance. They compete well against IBM's largest mainframes, despite being built from simpler minicomputer technology. (All quotes from Wikipedia)
  19. 1.10 on tuesday dec 10
  20. (no text on this slide)
  21. (no text on this slide)
  22. What do we do when we detect an error? • Mask it (try again) • Do nothing (crash later - not a totally brilliant idea) • Or ...
  23. LET IT CRASH
  24. Programming the Ericsson Diavox (1976). If you’re in a three-way call, at any time you can press the # key, then press 1 to talk to party 1, 2 to talk to party 2, or * to enter a conference call.
  25. Defensive programming
        if(state == 3waycall && key == "#"){
          key = get_next_key();
          if(key == "1"){
            park(2);
            connect([self,1]);
          } elseif(key == "2"){
            park(1);
            connect([self,2]);
          } elseif(key == "*"){
            connect([self,1,2]);
          } elseif(key == "onhook"){
            /* Uuugh what do I do here */
          }
        }
  26. • The Spec tells what to do when things happen • The Spec does not say what to do when the behavior goes “off-spec” • The number of ways we can go “off spec” is huge • Most specifications do not include failure analysis, and do not say what to do when you are “off spec”. Oh Dear
  27. Joe: “So what happens if we’re in a 3-way conference, and the guy presses hash and then puts the hook down, and doesn’t press 1, 2 or star?” Bernt: “So what you do is stop the conference, send the phone a ring tone and when they answer go back to the point where you were expecting them to enter 1, 2 or star.” Joe: “But that’s not in the spec.” Bernt: “But everybody knows.” Joe: “I didn’t know.”
  28. Calls are “files” • If a process crashes the OS closes all files opened by the process • If a call crashes the OS closes all calls opened by the process • The OS’s job is to “keep files safe” (i.e. it maintains invariants)
  29. Let it crash philosophy • If a process crashes the OS detects this • The OS protects the resources being used by the process • Programs should crash when going off spec
  30. Defensive programming
        if(state == 3waycall && key == "#"){
          key = get_next_key();
          if(key == "1"){
            park(2);
            connect([self,1]);
          } elseif(key == "2"){
            park(1);
            connect([self,2]);
          } elseif(key == "*"){
            connect([self,1,2]);
          } else{
            exit(out_of_spec1);
          }
        }
  31. Non-defensive programming: there is no error detection or correction code. Failed pattern matching provides the exit.
        confcall("#") ->
            case get_next_key() of
                "1" ->
                    park(2),
                    connect([self,1]);
                "2" ->
                    park(1),
                    connect([self,2]);
                "*" ->
                    connect([self,1,2])
            end.
  32. Are hardware and software faults fundamentally different?
  33. Are there any pure functions?
  34. Class (a) functions: If computing f(X) fails and f is a pure function, computing f(X) will always fail. Class (b) functions: If computing f(X) fails and f is a non-pure function, it might succeed if we call f(X) again.
  35. Is this a pure function?
        function f(){
          int a = 10;
          int b = 2;
          return a/b;
        }
  36. Cosmic ray hits the memory cell where b is stored and changes the 2 into zero. A heisenbug.
        function f(){
          int a = 10;
          int b = 2;
          return a/b;
        }
  37. (no text on this slide)
  38. • Heisenbug - A bug that seems to disappear or alter its behavior when one attempts to study it • Bohrbug - A "good, solid bug". Like the deterministic Bohr atom model, they do not change their behavior and are relatively easily detected. • Mandelbug - (named after Benoît Mandelbrot's fractal) is a bug whose causes are so complex it defies repair, or makes its behavior appear chaotic or even non-deterministic. • Schrödinbug (named after Erwin Schrödinger and his thought experiment) is a bug that manifests itself in running software after a programmer notices that the code should never have worked in the first place. • Hindenbug (named after the Hindenburg disaster) is a bug with catastrophic behavior. Source: Wikipedia
  39. • If a process fails, restart it (fixes many heisenbugs, especially those due to subtle timing errors) • If you have tried restarting a process more than N times in K seconds, then give up. Try to do something simpler instead. • Build trees of processes; if low-level nodes fail and cannot be restarted, fail higher up the tree
  40. Supervision trees: workers, supervisors. Don’t forget the manual backup :-)
  41. The failure model is part of the specification (especially for air-traffic control software etc.). The customer should understand the failure model.
  42. “I want fault tolerant storage.” “That’s impossible.” “We’ll make three copies of your data, on three different machines. We’ll guarantee that if one machine crashes you’ll never lose any data.” “What happens if 2 machines crash at the same time?” “You can still save data on the third machine, but it will be unsafe. Our guarantee will not apply.” “But I want more safety.”
  43. “We’ll make five copies of your data, on five different machines. We’ll guarantee that if two machines crash you’ll never lose any data.” “What happens if 3 machines crash at the same time?” “You can still save data on machines 4 and 5, but it will be unsafe. Our guarantee will not apply.” “Why is it unsafe? It’s stored on two machines.” “Because when machines 1, 2, 3 come back to life they might outvote the changes on machines 4 and 5.”
  44. You have to explain in the contract the failure assumptions and what will happen if these failures occur. If a failure occurs that is not planned for, it is not covered by the contract. “Act of God”
  45. Detecting Errors
  46. Sequential Languages
        function c(){
          ...
          if(...){
            throw ...
          }
        }
        function a(){
          try {
            b();
          } catch (...) {
            ...
            throw ...
          }
        }
        function b(){
          x();
          c();
          y();
        }
      • Function calls put call frames on the stack • Try instructions put catchpoints on the stack • Exceptions unwind the stack to the last catchpoint
  47. Uncaught Exceptions • What happens if the exception gets to the top of the stack and no catchpoint handler is found? Java: print a stack trace and exit. C: core dumped. Erlang: the process dies; some other process on the same or some other machine possibly catches the error.
  48. Sequential Languages. [Diagram: a C program with File 1 and File 2 open crashes; the Operating System closes both files.] When a process crashes the OS notices this and closes any resources owned by the process.
  49. Erlang. [Diagram: Process45 crashes inside the Erlang VM, which runs on the Operating System; Process23 and Process92 each receive “process 45 crashed”.] When an Erlang process crashes the Erlang VM notices this and sends messages to any linked processes.
  50. Erlang. [Diagram: P10 crashes in an Erlang VM on a Unix OS; “process 10 crashed” is delivered to P245 in an Erlang VM on Windows.]
  51. Demo 1. Start a process on one machine. Send it a message so it crashes. 2. Start a process on one machine. Send it a message so it crashes. Detect the crash. 3. Start a process on a remote machine. Send it a message so it crashes. Detect the error on a remote machine.
  52. prog1.erl
        -module(prog1).
        -export([loop/0]).

        loop() ->
            receive
                N ->
                    io:format("node=~p 1/~p = ~p~n", [node(), N, 1/N]),
                    loop()
            end.
  53. One machine
        $ erl
        Eshell V5.10.1 (abort with ^G)
        1> P = spawn(prog1, loop, []).
        <0.34.0>
        2> P ! 12.
        node=nonode@nohost 1/12 = 0.08333333333333333
        12
        3> P ! 0.
        0
        4>
        =ERROR REPORT==== 29-Nov-2013::13:07:26 ===
        Error in process <0.34.0> with exit value: {badarith,[{prog1,loop,0,[{file,"prog1.erl"},{line,7}]}]}
        4> P ! 12.
        12
  54. monitor.erl
        -module(monitor).
        -export([process/1]).

        process(Pid) ->
            spawn(fun() ->
                      process_flag(trap_exit, true),
                      link(Pid),
                      monitor(Pid)
                  end).

        monitor(Pid) ->
            receive
                Any ->
                    io:format("Monitor ~p received ~p~n",[Pid,Any]),
                    monitor(Pid)
            end.
  55. One machine + Monitor
        Eshell V5.10.1 (abort with ^G)
        1> P = spawn(prog1, loop, []).
        <0.34.0>
        2> monitor:process(P).
        <0.36.0>
        3> P ! 12.
        node=nonode@nohost 1/12 = 0.08333333333333333
        12
        4> P ! 0.
        Monitor <0.34.0> received {'EXIT',<0.34.0>,
                                   {badarith,
                                    [{prog1,loop,0,
                                      [{file,"prog1.erl"},{line,7}]}]}}
      The process dies and a message is sent to the monitor process
  56. Two machines and a monitor
        $ erl -sname one
        (one@joe)1> P = spawn('two@joe', prog1, loop, []).
        <6803.43.0>
        (one@joe)2> monitor:process(P).
        <0.47.0>
        (one@joe)4> P ! 10.
        10
        node=two@joe 1/10 = 0.1
        (one@joe)5> P ! 0.
        0
        Monitor <6803.43.0> received {'EXIT',<6803.43.0>,
                                      {badarith,
                                       [{prog1,loop,0,
                                         [{file,"prog1.erl"},{line,7}]}]}}

        $ erl -sname two
        (two@joe)1>
      Or we could kill the machine?
  57. Reminder. [Diagram: Process 200 crashes inside the Erlang VM, which runs on the Operating System; Process300 receives “process 200 crashed”.] When an Erlang process crashes the Erlang VM notices this and tells any linked processes.
  58. (no text on this slide)
  59. Defensive programming is a consequence of a bad concurrency model
  60. We’ve detected an error, what do we do next?
  61. “I’ve detected an error, what should I do?” “Try again - it might be a heisenbug.” “Ok - give up, and tell your boss you gave up. You did your best, nobody will blame you.” “I tried again ten times but it didn’t help.” “.... *@!%$!!**&%%%!!!%$#@*** #$@” “We have a problem, Houston.”
  62. Do not fail silently. If you cannot do exactly what you are supposed to do, crash. Somebody else will fix the problem.
  63. Summary • No shared memory • Pure message passing • Remote Error Detection • Replicated hardware and software on separated machines • Crash when you get an error • Do not fail silently • Some other process fixes the error
  64. Does this strategy work?
  65. • 2002 Alexey Shchepin started building an XMPP server fully in Erlang • 2005 ProcessOne founded • 2007 Facebook Chat (built on ejabberd) "the only chat server with built-in clustering" • 2008 Facebook Chat in Erlang • 2009 Feb 175M active users (dropped and rewritten in C++) • 2009 June 8 Jan Koum gets ejabberd working • 2013 2 Jan - 18B messages/day • 2013 Feb - Chef11 used by Facebook/Google/Amazon • 2014 19 Feb - WhatsApp bought by Facebook for $19B
  66. (no text on this slide)
  67. (no text on this slide)
  68. (no text on this slide)
  69. (no text on this slide)
  70. Finally • Design with small isolated components • Fault Tolerant = Scalable • Small components = Understandable
  71. Questions
  72. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/fault-tolerance-101-qcon-london
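
The restart rule from slide 39 (restart a failed process, but give up after more than N restarts in K seconds) and the link plus trap_exit mechanism from the monitor.erl demo can be combined in one small Erlang module. What follows is only an illustrative sketch, not code from the talk: the module name restart_sketch and the functions supervise/3 and loop/4 are hypothetical, and it assumes a recent Erlang/OTP release in which erlang:monotonic_time(millisecond) is available. Real systems would normally rely on the OTP supervisor behaviour rather than hand-written code like this.

```erlang
-module(restart_sketch).
-export([supervise/3]).

%% Hypothetical helper, not from the slides.
%% Run Fun in a linked worker process and restart it whenever it crashes.
%% Give up once the worker would need more than MaxRestarts restarts
%% within a sliding window of Seconds seconds.
%% Returns ok if the worker eventually exits normally,
%% or {gave_up, Reason} carrying the last crash reason.
supervise(Fun, MaxRestarts, Seconds) ->
    %% Trap exits so that a crashing worker arrives here as an
    %% {'EXIT', Pid, Reason} message instead of killing this process.
    Old = process_flag(trap_exit, true),
    Result = loop(Fun, MaxRestarts, Seconds * 1000, []),
    %% Restore the caller's original trap_exit setting.
    process_flag(trap_exit, Old),
    Result.

loop(Fun, MaxRestarts, WindowMs, RestartTimes) ->
    Pid = spawn_link(Fun),
    receive
        {'EXIT', Pid, normal} ->
            ok;
        {'EXIT', Pid, Reason} ->
            Now = erlang:monotonic_time(millisecond),
            %% Keep only the restarts that happened inside the window.
            Recent = [T || T <- RestartTimes, Now - T < WindowMs],
            case length(Recent) >= MaxRestarts of
                true ->
                    %% Restarting is not helping: report upwards.
                    {gave_up, Reason};
                false ->
                    loop(Fun, MaxRestarts, WindowMs, [Now | Recent])
            end
    end.
```

For example, restart_sketch:supervise(fun() -> exit(always_broken) end, 3, 5) starts the worker four times (the first start plus three restarts) and then returns {gave_up, always_broken}, which the caller can treat as the signal to "fail higher up the tree". A worker that crashes once because of a transient fault and then succeeds makes the call return ok.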
