SlideShare a Scribd company logo
1 of 17
Download to read offline
Cloud Downtime Due to
Memory Errors
A bigger problem than you might think!
Barbara P Aichinger
Vice President New Business Development
FuturePlus Systems Corporation
Santa Clara, CA
November 20121
 Servers in the Data Center
 Loaded with memory
 Several DDR3 and soon DDR4
channels per processor
 Several DIMMs per channel
 DIMMs becoming denser
 The industry has NO
standardized compliance
tesing
 Price pressures affecting
validation budgets
Cloud Expo
NY 20132
Cloud Hardware
Cloud Expo
NY 20133
Cloud Hardware
24 total DIMM slots
 DRAM Errors in the Wild: A Large-Scale Field Study
 Cosmic Rays Don’t Strike Twice: Understanding the
Nature of DRAM Errors and the Implications for
System Design
 Hard Data on Soft Errors: A Large-Scale Assessment
of Real-World Error Rates in GPGPU
 An Empirical Study of Memory Hardware Errors in a
Server Farm
 A Realistic Evaluation of Memory Hardware Errors
and Software System Susceptibility
Cloud Expo
NY 20134
Studies Show: More Errors than
Expected
 Band-Aids and work arounds
 Chip Kill
 ECC
 Page Retirement
 Mirroring
 Duplication
Cloud Expo
NY 20135
The Result?
You are just rearranging the chairs on the deck
of the Titanic….the ship is still going down
 JEDEC www.JEDEC.com is the international standards
body that governs DRAM Chip Specifications
 JEDEC also specifies DIMM mechanical form factors
and electrical properties of the DIMM etch
 Standardizes connector pinout
 Standardizes speeds and properties that promote
interoperability
 However there is NO standardized Compliance
Testing
Cloud Expo
NY 20136
Memory Standards
Cloud Expo
NY 20137
 Studies show: 2-4% error rate
 If Google has a million servers**
 A 4% error rate per year is 40,000 failures
a year
 3,334 failures per month
 110 failures per day
 4.6 failures per hour
How big of a problem is this?
** Aug 2011 DataKnowledgeCenter.com reported 900,000
Cloud Expo
NY 20138
How is it handled?
DIMM swaps and system replacement
Source: Google
 Motherboard Signal Integrity
 Poor quality Connectors
 Incompatible BIOS Settings
 Protocol Violations by the Memory Controller
 Poor Quality DIMMs
Cloud Expo
NY 20139
What are the causes of these
memory errors
Other than the Cosmic Rays, Aging and Environmental Issues
Design Flaws!
Cloud Expo
NY 201310
Case Study
 Server in the Data Center
 Equipment
 DDR3 Detective®
 Google StressAPP test
Cloud Expo
NY 201311
Failures
 Protocol Compliance is correct timing between events on
the DDR memory bus
 What we found on the server
 Simultaneous access to two DIMMs
 Opening and Closing DRAM Banks too close together
 In adequate Refreshes
 Incorrect protocol during a Calibrate Command
Cloud Expo
NY 201312
Violations Found
All of these can cause the DRAM to lose state
thus resulting in Data Corruption
 Thanks for telling us that. It’s a BIOS
Setting
 Its ok only a few clocks off on the
Refreshes
 We tie ODT high and disable that feature in
the Mode Registers
 Really?
Cloud Expo
NY 201313
Response from the System Vendor
Cloud Expo
NY 201314
Smoking Gun
Cloud Expo
NY 201315
Performance
 Its Cloudy with a chance of Errors
 Ask your vendor about their validation strategy
 Don’t fill your data center with problems!
 The DDR3 Detective can help
 Prequalification before purchase
 Measure effectiveness of your software
 Validate that BIOS changes are compatible
with your DIMMs
Cloud Expo
NY 201316
Summary
www.FuturePlus.com
Barbara Aichinger
Vice President New Business Development
Barb.Aichinger@FuturePlus.com
Office: 603-472-5905
Cell: 603-548-5037
Cloud Expo
NY 201317
FuturePlus Systems
Computer System Validation experts for over 20 years!

More Related Content

Viewers also liked

Viewers also liked (10)

Patrones de modelos de negocios
Patrones de modelos de negociosPatrones de modelos de negocios
Patrones de modelos de negocios
 
Ljud och ljus Jennifer
Ljud och ljus JenniferLjud och ljus Jennifer
Ljud och ljus Jennifer
 
Ljud och ljus Hanna
Ljud och ljus HannaLjud och ljus Hanna
Ljud och ljus Hanna
 
Técnicas para el diseño de modelo de negocio
Técnicas para el diseño de modelo de negocioTécnicas para el diseño de modelo de negocio
Técnicas para el diseño de modelo de negocio
 
Ljud och ljus interaktiv bok emilia
Ljud och ljus   interaktiv bok emiliaLjud och ljus   interaktiv bok emilia
Ljud och ljus interaktiv bok emilia
 
DDR4 Memory Compliance Testing Barbara Aichinger FuturePlus Systems
DDR4 Memory Compliance Testing   Barbara Aichinger FuturePlus SystemsDDR4 Memory Compliance Testing   Barbara Aichinger FuturePlus Systems
DDR4 Memory Compliance Testing Barbara Aichinger FuturePlus Systems
 
Regímenes de transito
Regímenes de transitoRegímenes de transito
Regímenes de transito
 
Boken om ljud och ljus jonatan
Boken om ljud och ljus jonatanBoken om ljud och ljus jonatan
Boken om ljud och ljus jonatan
 
Estrategia en el modelo de negocio
Estrategia en el modelo de negocioEstrategia en el modelo de negocio
Estrategia en el modelo de negocio
 
Ljud och ljus pp lovisa
Ljud och ljus pp lovisaLjud och ljus pp lovisa
Ljud och ljus pp lovisa
 

Similar to Cloud downtime due to memory errors barbara aichinger

OCP Server Memory Channel Testing DRAFT
OCP Server Memory Channel Testing DRAFTOCP Server Memory Channel Testing DRAFT
OCP Server Memory Channel Testing DRAFT
Barbara Aichinger
 
Rob Callaghan_OOW14 IO Performance for Database
Rob Callaghan_OOW14 IO Performance for DatabaseRob Callaghan_OOW14 IO Performance for Database
Rob Callaghan_OOW14 IO Performance for Database
Rob Callaghan
 
Health monitoring & predictive analytics to lower the TCO in a datacenter
Health monitoring & predictive analytics to lower the TCO in a datacenterHealth monitoring & predictive analytics to lower the TCO in a datacenter
Health monitoring & predictive analytics to lower the TCO in a datacenter
Andrei Khurshudov
 
Lessons from Large-Scale Cloud Software at Databricks
Lessons from Large-Scale Cloud Software at DatabricksLessons from Large-Scale Cloud Software at Databricks
Lessons from Large-Scale Cloud Software at Databricks
Matei Zaharia
 
Icse2013 shang
Icse2013 shangIcse2013 shang
Icse2013 shang
SAIL_QU
 
Gluecon miller horizon
Gluecon miller horizonGluecon miller horizon
Gluecon miller horizon
Mike Miller
 

Similar to Cloud downtime due to memory errors barbara aichinger (20)

Test
TestTest
Test
 
Google Study: Could those failures be caused by design flaws
Google Study: Could those failures be caused by design flawsGoogle Study: Could those failures be caused by design flaws
Google Study: Could those failures be caused by design flaws
 
OCP Server Memory Channel Testing DRAFT
OCP Server Memory Channel Testing DRAFTOCP Server Memory Channel Testing DRAFT
OCP Server Memory Channel Testing DRAFT
 
Cloud-Based Disaster Recovery Service Overview
Cloud-Based Disaster Recovery Service OverviewCloud-Based Disaster Recovery Service Overview
Cloud-Based Disaster Recovery Service Overview
 
New Memory Solutions for Enterprise Computing
New Memory Solutions for Enterprise ComputingNew Memory Solutions for Enterprise Computing
New Memory Solutions for Enterprise Computing
 
Rob Callaghan_OOW14 IO Performance for Database
Rob Callaghan_OOW14 IO Performance for DatabaseRob Callaghan_OOW14 IO Performance for Database
Rob Callaghan_OOW14 IO Performance for Database
 
Health monitoring & predictive analytics to lower the TCO in a datacenter
Health monitoring & predictive analytics to lower the TCO in a datacenterHealth monitoring & predictive analytics to lower the TCO in a datacenter
Health monitoring & predictive analytics to lower the TCO in a datacenter
 
"It can always get worse!" – Lessons Learned in over 20 years working with Or...
"It can always get worse!" – Lessons Learned in over 20 years working with Or..."It can always get worse!" – Lessons Learned in over 20 years working with Or...
"It can always get worse!" – Lessons Learned in over 20 years working with Or...
 
Lessons from Large-Scale Cloud Software at Databricks
Lessons from Large-Scale Cloud Software at DatabricksLessons from Large-Scale Cloud Software at Databricks
Lessons from Large-Scale Cloud Software at Databricks
 
Icse2013 shang
Icse2013 shangIcse2013 shang
Icse2013 shang
 
Performance Forensics - Understanding Application Performance
Performance Forensics - Understanding Application PerformancePerformance Forensics - Understanding Application Performance
Performance Forensics - Understanding Application Performance
 
DDR4 Compliance Testing. Its time has come!
DDR4 Compliance Testing.  Its time has come!DDR4 Compliance Testing.  Its time has come!
DDR4 Compliance Testing. Its time has come!
 
Actionable Metrics at Production Scale - LSPE Meetup June 27, 2012
Actionable Metrics at Production Scale - LSPE Meetup June 27, 2012Actionable Metrics at Production Scale - LSPE Meetup June 27, 2012
Actionable Metrics at Production Scale - LSPE Meetup June 27, 2012
 
Business Continuity Presentation[1]
Business Continuity Presentation[1]Business Continuity Presentation[1]
Business Continuity Presentation[1]
 
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
 
Hybrid Memory Cube: Developing Scalable and Resilient Memory Systems
Hybrid Memory Cube: Developing Scalable and Resilient Memory SystemsHybrid Memory Cube: Developing Scalable and Resilient Memory Systems
Hybrid Memory Cube: Developing Scalable and Resilient Memory Systems
 
How AI and ML are driving Memory Architecture changes
How AI and ML are driving Memory Architecture changesHow AI and ML are driving Memory Architecture changes
How AI and ML are driving Memory Architecture changes
 
Gluecon miller horizon
Gluecon miller horizonGluecon miller horizon
Gluecon miller horizon
 
DDR DIMM Design
DDR DIMM DesignDDR DIMM Design
DDR DIMM Design
 
Cloudifying High Availability: The Case for Elastic Disaster Recovery
Cloudifying High Availability: The Case for Elastic Disaster RecoveryCloudifying High Availability: The Case for Elastic Disaster Recovery
Cloudifying High Availability: The Case for Elastic Disaster Recovery
 

Recently uploaded

+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
Health
 
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
HenryBriggs2
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
Epec Engineered Technologies
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ssuser89054b
 

Recently uploaded (20)

Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
 
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best ServiceTamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
 
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
 
Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdf
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network Devices
 
Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086
 
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxA CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
 
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
 
Learn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic MarksLearn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic Marks
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
Rums floating Omkareshwar FSPV IM_16112021.pdf
Rums floating Omkareshwar FSPV IM_16112021.pdfRums floating Omkareshwar FSPV IM_16112021.pdf
Rums floating Omkareshwar FSPV IM_16112021.pdf
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
Bridge Jacking Design Sample Calculation.pptx
Bridge Jacking Design Sample Calculation.pptxBridge Jacking Design Sample Calculation.pptx
Bridge Jacking Design Sample Calculation.pptx
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
 

Cloud downtime due to memory errors barbara aichinger

  • 1. Cloud Downtime Due to Memory Errors A bigger problem than you might think! Barbara P Aichinger Vice President New Business Development FuturePlus Systems Corporation Santa Clara, CA November 20121
  • 2.  Servers in the Data Center  Loaded with memory  Several DDR3 and soon DDR4 channels per processor  Several DIMMs per channel  DIMMs becoming denser  The industry has NO standardized compliance tesing  Price pressures affecting validation budgets Cloud Expo NY 20132 Cloud Hardware
  • 3. Cloud Expo NY 20133 Cloud Hardware 24 total DIMM slots
  • 4.  DRAM Errors in the Wild: A Large-Scale Field Study  Cosmic Rays Don’t Strike Twice: Understanding the Nature of DRAM Errors and the Implications for System Design  Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU  An Empirical Study of Memory Hardware Errors in a Server Farm  A Realistic Evaluation of Memory Hardware Errors and Software System Susceptibility Cloud Expo NY 20134 Studies Show: More Errors than Expected
  • 5.  Band-Aids and work arounds  Chip Kill  ECC  Page Retirement  Mirroring  Duplication Cloud Expo NY 20135 The Result? You are just rearranging the chairs on the deck of the Titanic….the ship is still going down
  • 6.  JEDEC www.JEDEC.com is the international standards body that governs DRAM Chip Specifications  JEDEC also specifies DIMM mechanical form factors and electrical properties of the DIMM etch  Standardizes connector pinout  Standardizes speeds and properties that promote interoperability  However there is NO standardized Compliance Testing Cloud Expo NY 20136 Memory Standards
  • 7. Cloud Expo NY 20137  Studies show: 2-4% error rate  If Google has a million servers**  A 4% error rate per year is 40,000 failures a year  3,334 failures per month  110 failures per day  4.6 failures per hour How big of a problem is this? ** Aug 2011 DataKnowledgeCenter.com reported 900,000
  • 8. Cloud Expo NY 20138 How is it handled? DIMM swaps and system replacement Source: Google
  • 9.  Motherboard Signal Integrity  Poor quality Connectors  Incompatible BIOS Settings  Protocol Violations by the Memory Controller  Poor Quality DIMMs Cloud Expo NY 20139 What are the causes of these memory errors Other than the Cosmic Rays, Aging and Environmental Issues Design Flaws!
  • 10. Cloud Expo NY 201310 Case Study  Server in the Data Center  Equipment  DDR3 Detective®  Google StressAPP test
  • 12.  Protocol Compliance is correct timing between events on the DDR memory bus  What we found on the server  Simultaneous access to two DIMMs  Opening and Closing DRAM Banks too close together  In adequate Refreshes  Incorrect protocol during a Calibrate Command Cloud Expo NY 201312 Violations Found All of these can cause the DRAM to lose state thus resulting in Data Corruption
  • 13.  Thanks for telling us that. It’s a BIOS Setting  Its ok only a few clocks off on the Refreshes  We tie ODT high and disable that feature in the Mode Registers  Really? Cloud Expo NY 201313 Response from the System Vendor
  • 16.  Its Cloudy with a chance of Errors  Ask your vendor about their validation strategy  Don’t fill your data center with problems!  The DDR3 Detective can help  Prequalification before purchase  Measure effectiveness of your software  Validate that BIOS changes are compatible with your DIMMs Cloud Expo NY 201316 Summary
  • 17. www.FuturePlus.com Barbara Aichinger Vice President New Business Development Barb.Aichinger@FuturePlus.com Office: 603-472-5905 Cell: 603-548-5037 Cloud Expo NY 201317 FuturePlus Systems Computer System Validation experts for over 20 years!