PASC fault tolerance

A new generic and rigorous approach to tolerating data corruptions. Presentation of the paper "Practical Hardening of Crash-Tolerant Systems", published at USENIX ATC 2012. Video: http://bit.ly/LNc5mc

  • User Impact: >10 Million users unable to use a given service. Revenue Impact: >$100K. Brand Impact: Outage requires press release. Top Tier Revenue Property Impact (see list below)
  • search.yahoo.com sponsored text ads are not displaying in the North placement. Sponsored Ads are instead being moved to the east placement. There was a limit for the number of different data dictionary match types that QP can handle (720 types). The DD built and pushed the night of included an additional 400 types, slowly incrementing over the course of the months, and finally exceeding the l
  • SPEND MORE HERE
  • Meeting notes (6/8/12 16:43): TODO: more detailed figure of what the runtime looks like (event handler, replica state).
  • Meeting notes (6/8/12 15:57): Simple example; no overhead because little computation and network bound.
  • Meeting notes (6/8/12 11:51): too many plots, remove the ones for batching. Meeting notes (6/8/12 15:57): more concrete example. Meeting notes (6/8/12 16:47): stress that PASC is not SMR. Paxos is built on top of PASC. Maybe have a bullet.
  • Meeting notes (6/8/12 11:51): use bars with one value (max tput) per setting.
  • Meeting notes (6/8/12 15:57): Does PASC really detect corruptions?
  • PASC fault tolerance

    1. Taming Data Corruptions in Distributed Systems
       Marco Serafini (Yahoo! Research BCN)
    2. Infrastructure dependability
       o Service availability, data durability
       o In presence of hardware faults
       o Current approaches tolerate crashes
    3. Crashes
       o Assumptions
          o A server (process) suddenly stops
          o Until then, only correct steps
       [Figure: timeline ("Time") with a "Crash" marker]
    4. Data corruptions
       o What if there are data corruptions?
          o The state of a process may be corrupted
          o The process may make incorrect steps before stopping
       [Figure: timeline ("Time") with a "Data corruptions" marker]
    5. Data corruptions
       o What if there are data corruptions?
          o The state of a process may be corrupted
          o The process may make incorrect steps before stopping
       NOT COVERED!
       [Figure: timeline ("Time") with a "Data corruptions" marker]
    6. Sources of data corruptions
       o Commodity disks are known to be unreliable
          o Faulty firmware, bad sectors etc.
       o RAM: ECC errors are frequent
          o Production machines only see detected errors, so coverage is not known
       o Interconnects and CPUs also fail
          o Faulty drivers or bit flips
    7. A horror story
       An 8-hour system-wide outage due to a single hardware fault
    8. What happened?
       o Quoted from the Amazon service health dashboard
          o "A handful of messages had a single bit corrupted"
          o "The message was still intelligible, but the system state information was incorrect"
          o "We used MD5 checksums throughout the system (but not) for this particular internal state information"
          o "(The corruption) spread throughout the system causing the symptoms described above"
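The quoted incident turns on a checksum that did not cover the internal state information. A minimal Python sketch of that gap, assuming a hypothetical message with a `payload` and a `state_info` field (the field names and the `digest` and `flip_bit` helpers are illustrative and not taken from the Amazon system):

```python
import hashlib


def digest(*fields: bytes) -> str:
    """MD5 over the given fields, length-prefixed so field boundaries are unambiguous."""
    h = hashlib.md5()
    for f in fields:
        h.update(len(f).to_bytes(4, "big"))
        h.update(f)
    return h.hexdigest()


def flip_bit(data: bytes, bit: int) -> bytes:
    """Return a copy of data with a single bit flipped."""
    out = bytearray(data)
    out[bit // 8] ^= 1 << (bit % 8)
    return bytes(out)


payload = b"PUT key=42"
state_info = b"server=7;load=low"

# Checksums computed by the sender.
partial = digest(payload)            # state_info is NOT covered
full = digest(payload, state_info)   # both fields are covered

# A single bit of state_info is corrupted in transit.
corrupted_state = flip_bit(state_info, 3)

# Checks performed by the receiver.
partial_detects = digest(payload) != partial
full_detects = digest(payload, corrupted_state) != full

print("partial checksum detects corruption:", partial_detects)
print("full checksum detects corruption:", full_detects)
```

With the partial checksum the receiver accepts the message even though the state information is wrong, which matches the failure mode described in the quotes.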
    9. Error propagation
       [Figure: two processes, i and j, each with an event handler. Labels: input message m_in, output message m_out, state variables u, v (process i) and x, y (process j)]
    10. Common practice
        o Manual placement of ad-hoc error detection checks
           o Application knowledge
           o Time consuming
        o Hard to structure without fault model
        o No error isolation guarantee
    11. Research: Byzantine faults
        o Byzantine model
           o Faulty nodes controlled by an adversary
           o Worst-case model
        [Figure: timeline ("Time") with a "Byzantine fault" marker]
    12. Byzantine fault model
        o Black-box model of faulty processes: adversarial
        o Hardening for error isolation [Nysiad NSDI 2008]
           o Based on state machine replication
           o Replication and performance costs
        [Figure: labels "Agreement on requests", "Servers", "Client"]
    13. Byzantine faults
        o Byzantine hardening covers attacks and bugs…
        o … assuming, e.g., design diversity of replicas
        o Impractical in most systems, hence no real adoption
        [Figure: labels "Attacks", "Bugs", "Data corruptions" and "Security", "V&V", "ASC Hardening"]
    14. A new approach to error isolation
        [Figure: two processes, i and j, each with an event handler. Labels: input message m_in, output message m_out, state variables u, v (process i) and x, y (process j)]
        1. General model of process behavior
        2. Arbitrary State Corruption (ASC) fault model
        3. Guarantee error isolation through hardening
    15. A new approach to error isolation
        with M. Correia, D. Ferro and F. Junqueira, Usenix Annual Technical Conference 2012
        [Figure: two processes, i and j, each with an event handler. Labels: input message m_in, output message m_out, state variables u, v (process i) and x, y (process j)]
        1. General model of process behavior
        2. Arbitrary State Corruption (ASC) fault model
        3. Guarantee error isolation through hardening
    16. Process and fault models
        Defining Arbitrary State Corruptions
    17. Process model
        1) Event dispatching (input message m_in)
        2) Event handling (operates on the process state)
        3) Message sending (output message m_out)
        Example:
           Upon receive message <REQ, r> do
              if v > 5 then u = r + v + 5;
              else u = r + v;
              v = u;
              send <WRITE, v> to process p
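The three steps of the process model can be written out as a small runnable sketch. This is an illustrative Python rendering of the slide's pseudocode, not code from the PASC library; the `Process` class, its `dispatch` and `on_req` methods, and the `outbox` list standing in for "send to process p" are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    v: int = 0                                   # process state
    outbox: list = field(default_factory=list)   # stands in for "send to process p"

    def dispatch(self, msg: tuple) -> None:
        """1) Event dispatching: route an input message to its handler."""
        kind, r = msg
        if kind == "REQ":
            self.on_req(r)

    def on_req(self, r: int) -> None:
        """2) Event handling: update the state as in the slide's pseudocode."""
        if self.v > 5:
            u = r + self.v + 5
        else:
            u = r + self.v
        self.v = u
        # 3) Message sending: emit the output message.
        self.outbox.append(("WRITE", self.v))


p = Process(v=3)
p.dispatch(("REQ", 4))   # v <= 5, so v becomes 4 + 3 = 7
p.dispatch(("REQ", 1))   # v > 5, so v becomes 1 + 7 + 5 = 13
print(p.v, p.outbox)
```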
    18. ASC fault model
        o An Arbitrary State Corruption can make a process
           o Crash
           o Assign an arbitrary value to any variable
           o Start the execution from an arbitrary instruction
        [Figure: example state before and after a fault: v 5 to 12, z 10 to 7, PC 20 to 320]
    19. Fault frequency
        o One fault for every processed input message
        [Figure: the process model example from slide 17 (event dispatching, event handling, message sending)]
    20. Fault diversity
        o A corrupted variable is different from its replica
                 Before fault          After fault
                 original  replica     original  replica
           v         5        5           12        5
           z        10       10            7       41
           PC       20                   320
        o Only holds immediately after the fault
           o Can be invalidated if instructions modify the variable
    21. Error propagation
        o Fault diversity does not hold
        o Hardening preserves diversity
        [Figure: labels "Fault", "Original", "Replica", "diversity", variables u and v, "?"]
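A small sketch may help show why diversity can be lost once instructions touch a corrupted variable. The two update functions below are hypothetical and only illustrate the idea: one computes a result from the original and copies it into the replica, the other updates each copy from its own inputs. The values 5 and 12 are the ones from the fault diversity slide.

```python
def naive_step(original: dict, replica: dict, r: int) -> None:
    """Compute once from the original, then copy the result into the replica."""
    u = r + original["v"]
    original["v"] = u
    replica["v"] = u   # a corrupted value leaks into the replica here


def diverse_step(original: dict, replica: dict, r: int) -> None:
    """Update each copy only from its own inputs."""
    original["v"] = r + original["v"]
    replica["v"] = r + replica["v"]


def still_detectable(step) -> bool:
    """Inject a fault, run one step, and report whether the copies still differ."""
    original, replica = {"v": 5}, {"v": 5}
    original["v"] = 12          # injected fault: v corrupted from 5 to 12
    step(original, replica, 1)
    return original["v"] != replica["v"]


naive_detectable = still_detectable(naive_step)
diverse_detectable = still_detectable(diverse_step)
print("naive update keeps diversity:", naive_detectable)
print("per-copy update keeps diversity:", diverse_detectable)
```

After the naive step both copies hold the same wrong value, so a later comparison can no longer reveal the fault.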
    22. ASC hardening
        From ASC faults to crashes and message omissions
    23. From ASC to crashes
        o Transparent: to the hardened process
        o Local: no process replication on multiple machines
        o Untrusted: can have faults while executing hardening
        [Figure: event handler with input message m_in, output message m_out and state variables u, v, wrapped by the "HARDENING RUNTIME"]
    24. PASC library
        [Figure: labels "Process state", "Replica state", "PASC checks", event handlers EH1, EH2, EH3, "User-defined", "PASC runtime", "Transparent"]
        github.com/yahoo/pasc
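The idea of turning a state corruption into a crash can be sketched as a wrapper around a user-defined event handler. This is a heavily simplified illustration and not the PASC API: the `HardenedProcess` class, the `deliver` method and the `CorruptionDetected` exception are hypothetical names, and the sketch assumes the runtime keeps a replica of the state, runs the handler on both copies and compares the results. It does not address faults that hit the checking code itself, which the slides list as a requirement ("Untrusted").

```python
import copy


class CorruptionDetected(Exception):
    """Raised to crash the process instead of letting an error propagate."""


def handle_req(state: dict, r: int) -> list:
    """User-defined event handler, taken from the process model example."""
    if state["v"] > 5:
        u = r + state["v"] + 5
    else:
        u = r + state["v"]
    state["v"] = u
    return [("WRITE", state["v"])]


class HardenedProcess:
    def __init__(self, handler, initial_state: dict):
        self.handler = handler
        self.state = copy.deepcopy(initial_state)
        self.replica = copy.deepcopy(initial_state)

    def deliver(self, r: int) -> list:
        # Check that the two copies agree before handling the event.
        if self.state != self.replica:
            raise CorruptionDetected("state and replica differ before handling")
        # Run the handler once per copy; each run reads only its own copy.
        out_state = self.handler(self.state, r)
        out_replica = self.handler(self.replica, r)
        # Release output only if both executions agree.
        if out_state != out_replica or self.state != self.replica:
            raise CorruptionDetected("executions on state and replica disagree")
        return out_state


p = HardenedProcess(handle_req, {"v": 3})
print(p.deliver(4))

p.state["v"] = 12            # injected fault, values as in the ASC fault model slide
try:
    p.deliver(1)
except CorruptionDetected as e:
    print("crashed:", e)
```

The handler itself is unchanged, which is the sense in which the hardening is transparent to the hardened process, and no second machine is involved.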
    25. Evaluation
    26. Hardening an echo server
        o Little computation, network bound, no overhead
        o PBFT is a reference (Nysiad not available)
    27. Hardening State Machine Replication
        [Plot: latency in ms against throughput in Kops/s for PBFT, PASC Paxos and Unprot. Paxos; annotations "-15 %" and "+70 %"]
    28. Zookeeper (core)
    29. Memory overhead
    30. Scalability
        [Plot: max. throughput (kops/sec) against number of servers (1, 3, 5, 7) for PASC sKV and Unprot. sKV]
        o SimpleKV: eventually consistent store, no replication
           o Scales similarly with hardening
           o No server "wasted" for replication
    31. PASC fault coverage
        o Injected random bit flips in Paxos
           o Code corruptions: bytecode and binary code
           o State corruptions: pointers and primitive values
                         Code corruptions      State corruptions
                         Unprot     PASC       Unprot     PASC
           Undet.             3        0           93        0
           Det.               -        1            -      330
           Crash           1640     1663         2301     2066
           Not manif.      1213     1193         2843     2841
           Total           2856     2856         5237     5237
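The share of injected faults that go undetected can be read off the table with a few lines of arithmetic. The sketch below uses the counts exactly as transcribed from the slide; the dictionary layout and the `undetected_rate` helper are illustrative. As transcribed, the PASC code-corruption column adds up to 2857, one more than its stated total, so the stated totals are used as denominators.

```python
# Counts as transcribed from the fault coverage slide; None marks a "-" cell.
RESULTS = {
    ("code", "Unprot"): {"undetected": 3, "detected": None, "crash": 1640,
                         "not_manifested": 1213, "total": 2856},
    ("code", "PASC"): {"undetected": 0, "detected": 1, "crash": 1663,
                       "not_manifested": 1193, "total": 2856},
    ("state", "Unprot"): {"undetected": 93, "detected": None, "crash": 2301,
                          "not_manifested": 2843, "total": 5237},
    ("state", "PASC"): {"undetected": 0, "detected": 330, "crash": 2066,
                        "not_manifested": 2841, "total": 5237},
}


def undetected_rate(column: dict) -> float:
    """Fraction of injected faults that led to an undetected corruption."""
    return column["undetected"] / column["total"]


for (kind, system), column in RESULTS.items():
    print(f"{kind:5s} {system:6s} undetected: {100 * undetected_rate(column):.2f}%")
```

For the unprotected runs this gives roughly 0.1% undetected code corruptions and roughly 1.8% undetected state corruptions; both PASC columns show zero undetected corruptions.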
    32. Wrap up
        o Hardware data corruptions are a real danger
        o Proposed new systematic approach
           o BFT not realistic
           o Ad-hoc approaches are not systematic
        o Hardening algorithm for error isolation
           o Local: does not require replication
           o Efficient: PASC-Paxos has up to 70% more throughput than PBFT
           o High fault coverage
    33. Directions
        o Systematic protection of Yahoo! infrastructure against data corruptions
        o ASC just scratched the surface; some todos:
           o Reduce memory footprint
           o Support for external memory (disks/SSDs)
           o Hardening of legacy code
           o Theoretical foundations
    34. Thank you
        serafini@yahoo-inc.com