Deco3

P.77 - 89


    1. 1. The Datacenter as a Computer Chapter 7 2009/12/20 id:daisukebe
    2. 2. Agenda • 7 Dealing with Failures and Repairs • 7.1 Implications of Software-Based Fault Tolerance • 7.2 Categorizing Faults • 7.2.1 Fault Severity • 7.2.2 Causes of Service-Level Faults • 7.3 Machine-Level Failures • 7.3.1 What Causes Machine Crashes? • 7.3.2 Predicting Faults • 7.4 Repairs • 7.5 Tolerating Faults, Not Hiding Them
    3. 3. Section 7: Dealing with Failures and Repairs
    4. 4. • H/W • WSC H/W • With a server MTBF of 30 years (about 10,000 days), a cluster of 10,000 servers sees about 1 failure per day • WSC
    5. 5. Section 7.1: Implications of Software-Based Fault Tolerance
    6. 6. • 1 2 • • •
    7. 7. • RAID => • => • RAID GFS • • OS
    8. 8. • • • => •
    9. 9. Google • DRAM • 2000 • => • 100% • ECC DRAM
    10. 10. ECC DRAM • ECC (Error Correction Code): corrects 1-bit errors and detects 2-bit errors • via http://www.nec.co.jp/products/express/tech/memory/index.shtml
    11. 11. Section 7.2: Categorizing Faults
    12. 12. ~ • • • WSC •
    13. 13. Section 7.2.1: Fault Severity
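    14. 14. • Corrupted: committed data is lost or corrupted • Unreachable: the service is down or cannot be reached by users • Degraded: the service is available but in a degraded mode • Masked: faults occur but are hidden from users by fault-tolerant software or hardware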
    15. 15. • => 99.0% => 99.0% • •
    16. 16. Section 7.2.2: Causes of Service-Level Faults
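    17. 17. • Oppenheimer: study of Internet services with more than 500 servers each • Hardware (H/W) faults: 10-25% of failure events • Gray: Tandem systems • H/W -> under 10%, software -> about 60%, operations/maintenance -> about 20%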
    18. 18. Google • Oppenheimer •
    19. 19. Section 7.3: Machine-Level Failures
    20. 20. H/W • Google • More than 95% of machines reboot less than once a month • About 1% reboot more than once a week
    21. 21. • About 55% of reboots take less than 6 minutes • About 25% take 6 to 30 minutes; about 1% take more than 1 day • 3 • reboot
    22. 22. Section 7.3.1: What Causes Machine Crashes?
    23. 23. • DRAM • ECC • • •
    24. 24. Section 7.3.2: Predicting Faults
    25. 25. • 10 100% • WSC • Pinheiro Google • WSC
    26. 26. Section 7.4: Repairs
    27. 27. WSC • WSC • • => • =>
    28. 28. Google • System Health • •
    29. 29. Section 7.5: Tolerating Faults, Not Hiding Them
    30. 30. • => •
    31. 31. IT • 24 5-15% / • Google • WSC • 40000 5% 200
    32. 32. Thank you!
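
A rough sketch of the arithmetic behind slide 4, assuming its figures mean an MTBF of 30 years (rounded to 10,000 days) per server and a fleet of 10,000 servers. The function name and the constant-failure-rate model are illustrative assumptions, not something the slides state.

```python
def expected_failures_per_day(num_servers: int, mtbf_days: float) -> float:
    """Expected number of server failures per day across a fleet.

    Assumption: servers fail independently, each at a constant
    rate of 1 / MTBF failures per day.
    """
    return num_servers / mtbf_days


# Figures as read from slide 4 (interpretation is an assumption):
fleet_size = 10_000          # servers in the cluster
mtbf_days = 10_000           # about 30 years per server, rounded

print(expected_failures_per_day(fleet_size, mtbf_days))
```

Under these assumptions the fleet sees about one server failure per day, even though each individual server is highly reliable.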
