Deco3
P.77 - 89


Deco3 Presentation Transcript

  • 1. The Datacenter as a Computer, Chapter 7 (2009/12/20, id:daisukebe)
  • 2. Agenda • 7 Dealing with Failures and Repairs • 7.1 Implications of Software-Based Fault Tolerance • 7.2 Categorizing Faults • 7.2.1 Fault Severity • 7.2.2 Causes of Service-Level Faults • 7.3 Machine-Level Failures • 7.3.1 What Causes Machine Crashes? • 7.3.2 Predicting Faults • 7.4 Repairs • 7.5 Tolerating Faults, Not Hiding Them
  • 3. Agenda (current section: 7 Dealing with Failures and Repairs)
  • 4. • => • H/W • WSC H/W • MTBF 30 -> 10000 1 1 • WSC
  • 5. Agenda (current section: 7.1 Implications of Software-Based Fault Tolerance)
  • 6. • 1 2 • • •
  • 7. • RAID => • => • RAID GFS • • OS
  • 8. • • • => •
  • 9. Google • DRAM • 2000 • => • 100% • ECC DRAM
  • 10. ECC DRAM • ECC Error Correction Code 1 2 via http://www.nec.co.jp/products/express/tech/memory/index.shtml
  • 11. Agenda (current section: 7.2 Categorizing Faults)
  • 12. ~ • • • WSC •
  • 13. Agenda (current section: 7.2.1 Fault Severity)
  • 14. • • Corrupted: • Unreachable: • Degraded: • Masked: •
  • 15. • => 99.0% => 99.0% • •
  • 16. Agenda (current section: 7.2.2 Causes of Service-Level Faults)
  • 17. • Oppenheimer 500 • • H/W 10-25% • Gray Tandem • H/W -> 10% -> 60% -> 20% •
  • 18. Google • Oppenheimer •
  • 19. Agenda (current section: 7.3 Machine-Level Failures)
  • 20. H/W • Google • 95% 1 reboot • 1% reboot
  • 21. • reboot 55% 6 • 25% 6 30 1% 1 • 3 • reboot
  • 22. Agenda (current section: 7.3.1 What Causes Machine Crashes?)
  • 23. • DRAM • ECC • • •
  • 24. Agenda (current section: 7.3.2 Predicting Faults)
  • 25. • 10 100% • WSC • Pinheiro Google • WSC
  • 26. Agenda (current section: 7.4 Repairs)
  • 27. WSC • WSC • • => • =>
  • 28. Google • System Health • •
  • 29. Agenda (current section: 7.5 Tolerating Faults, Not Hiding Them)
  • 30. • => •
  • 31. IT • 24 5-15% / • Google • WSC • 40000 5% 200
  • 32. Thank you!
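
Slide 4 pairs an MTBF figure of 30 with a population of 10000, but the units did not survive in this transcript. Assuming they are years and servers (an assumption, not something the transcript states), the arithmetic behind the slide can be sketched as follows. The function name, the 365-day year, and the independence of server failures are illustrative choices, not taken from the slides.

```python
DAYS_PER_YEAR = 365  # assumption: plain calendar year, leap days ignored


def expected_failures_per_day(mtbf_years: float, servers: int) -> float:
    """Expected fleet-wide failures per day.

    Assumes every server fails independently, on average once per
    `mtbf_years`, so the fleet rate is the per-server rate times the
    number of servers.
    """
    if mtbf_years <= 0:
        raise ValueError("mtbf_years must be positive")
    if servers < 0:
        raise ValueError("servers must be non-negative")
    per_server_rate = 1.0 / (mtbf_years * DAYS_PER_YEAR)
    return servers * per_server_rate


if __name__ == "__main__":
    # Figures from slide 4, with the assumed units (years, servers).
    rate = expected_failures_per_day(30, 10_000)
    print(f"{rate:.2f} failures per day")
```

Under these assumptions the result is about 0.91, that is, roughly one failure per day across the fleet even though each individual machine is very reliable.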