The Datacenter as a Computer
          Chapter 7


                       2009/12/20
                      id:daisukebe
Agenda
•   7 Dealing with Failures and Repairs
    •   7.1 Implications of Software-Based Fault Tolerance
    •   7.2 Categorizing Faults
        •   7.2.1 Fault Severity
        •   7.2.2 Causes of Service-Level Faults
    •   7.3 Machine-Level Failures
        •   7.3.1 What Causes Machine Crashes?
        •   7.3.2 Predicting Faults
    •   7.4 Repairs
    •   7.5 Tolerating Faults, Not Hiding Them
•   7 Dealing with Failures and Repairs
•               =>

•              H/W

•   WSC                  H/W


•   Server MTBF of 30 years
    -> with 10,000 servers, about 1 failure per day

•   WSC
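The MTBF arithmetic on this slide can be checked with a short sketch. It assumes the figures mean a per-server MTBF of 30 years and a cluster of 10,000 servers, and that failures are independent and spread evenly over time. The function name and the 365-day year are illustrative choices, not taken from the slides.

```python
def expected_failures_per_day(num_servers: int, mtbf_years: float,
                              days_per_year: float = 365.0) -> float:
    """Expected server failures per day across a cluster.

    Assumes independent failures at a constant rate of 1 / MTBF per server.
    """
    mtbf_days = mtbf_years * days_per_year
    return num_servers / mtbf_days


# Assumed reading of the slide: 10,000 servers, 30-year MTBF each.
rate = expected_failures_per_day(10_000, 30)
print(f"{rate:.2f} failures per day")
```

Under these assumptions the result is close to one failure per day, which is why even very reliable hardware does not remove the need for fault handling at this scale.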
•   7.1 Implications of Software-Based Fault Tolerance
•
        1
        2

•
    •
    •
•                     RAID
    =>

•
    =>

•   RAID        GFS

•
•          OS
•
•
•
    =>

•
Google

•                   DRAM

•        2000

•   =>

•                          100%


•                          ECC DRAM
ECC DRAM

•   ECC: Error Correction Code


                                                                 1
                       2
    via http://www.nec.co.jp/products/express/tech/memory/index.shtml
•   7.2 Categorizing Faults
~




•
•
•   WSC

•
•   7.2.1 Fault Severity
•
    •   Corrupted:

    •   Unreachable:


    •   Degraded:


    •   Masked:


•
•
    => 99.0%
    =>         99.0%


•
•
•   7.2.2 Causes of Service-Level Faults
•   Oppenheimer        500

    •
    •   H/W                  10-25%

•   Gray      Tandem

    •   H/W -> 10%             -> 60%   -> 20%

•
Google
•   Oppenheimer

•
•   7.3 Machine-Level Failures
H/W
•   Google

    •   More than 95% of machines reboot less than once a month

    •   About 1% of machines reboot more than once a week
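•   About 55% of reboots last less than 6 minutes

•   About 25% last 6 to 30 minutes; about 1% last more than 1 day

•   Average downtime is about 3 hours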

•                                  reboot
•   7.3.1 What Causes Machine Crashes?
•   DRAM

    •   ECC

•
    •
    •
•   7.3.2 Predicting Faults
•   10                  100%


•   WSC


•   Pinheiro   Google


•   WSC
•   7.4 Repairs
WSC

•              WSC


•
•
    =>

•
    =>
Google

•    System Health


•

•
•   7.5 Tolerating Faults, Not Hiding Them
•
    =>


•
IT


•   24                   5-15%
          /

•   Google

    •    WSC

    •    40000        5% 200
Thank you!
